The live demo repo for this series is 67ailab/harness-engineering, and for this post I changed the repo before publishing. The new commit is b9a60e8, which adds per-step timing metadata, lightweight workload and token estimates, and performance/cost rollups to the harness traces and summaries.
That change lives mainly in:
- src/harness_engineering/models.py
- src/harness_engineering/runner.py
- src/harness_engineering/tracing.py
- src/harness_engineering/store.py
- tests/test_harness.py
- README.md
The core additions are:
- new timing and metrics fields on StepResult in models.py
- wall-clock measurement inside RetryPolicy.call() in runner.py
- step-level workload estimation in HarnessRunner._estimate_step_metrics()
- aggregated performance and cost rollups in build_trace_summary() in tracing.py
- operator-facing rollups in RunStore.build_summary() in store.py
This is the right place for Post 12 to land, because cost and latency problems in agent systems almost never come from one bad prompt. They come from system shape:
- too many round trips
- too many generated tokens
- no distinction between cheap and expensive steps
- no trace that explains where the time actually went
- no measurement surface for workload growth
- no visibility into retries, approvals, or slow reviewer/model calls
That is harness engineering.
What changed in the repo since the previous post
Post 11 was about policy, auth, and safe boundaries. That was necessary, but it did not yet give the harness a decent answer to another practical operator question:
Where is the time going, and what kind of work volume are we actually asking the system to do?
Before this run, the repo already had:
- explicit tool steps in HarnessRunner.run_until_pause_or_complete()
- persisted run state in RunStore
- raw trace events via add_trace()
- summary surfaces via summary.json and trace_summary.json
But the observability was still more structural than performance-oriented. You could tell what happened. You could not tell much about how expensive or how slow it was, except by eyeballing timestamps or reading raw artifacts.
So I added a small performance layer.
The repo now records, per executed step:
- started_at
- finished_at
- duration_ms
- metrics
Those fields live on StepResult in src/harness_engineering/models.py.
RetryPolicy.call() in src/harness_engineering/runner.py now wraps each tool call with perf_counter() timing and stamps the result with start/finish metadata.
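To make that concrete, here is a minimal sketch of what step-level timing can look like. The field names (started_at, finished_at, duration_ms, metrics) follow the post; the class and helper shown here are simplified assumptions, not the repo's actual StepResult or RetryPolicy code.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class StepResult:
    # Hypothetical shape; the real StepResult lives in models.py.
    tool_name: str
    output: Any = None
    started_at: str | None = None
    finished_at: str | None = None
    duration_ms: int | None = None
    metrics: dict[str, Any] = field(default_factory=dict)


def timed_call(tool_name: str, fn, *args, **kwargs) -> StepResult:
    """Wrap a tool call with wall-clock timing at the step boundary."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    output = fn(*args, **kwargs)
    elapsed_ms = int((time.perf_counter() - t0) * 1000)
    return StepResult(
        tool_name=tool_name,
        output=output,
        started_at=started.isoformat(),
        finished_at=datetime.now(timezone.utc).isoformat(),
        duration_ms=elapsed_ms,
    )
```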
Then HarnessRunner._estimate_step_metrics() derives lightweight engineering metrics from the step output and inputs, for example:
- match_count for search_mock
- fact_count and output chars for extract_facts
- estimated input/output token counts for draft_report
- bytes written for finalize_report
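The heuristic this implies is roughly "characters divided by a constant." A sketch under that assumption; the ratio and helper name are illustrative, not the repo's exact implementation:

```python
CHARS_PER_TOKEN = 4  # coarse heuristic, not a tokenizer


def estimate_draft_metrics(prompt_text: str, draft_text: str) -> dict:
    """Estimate model work volume for a drafting step from character counts."""
    input_tokens = max(1, len(prompt_text) // CHARS_PER_TOKEN)
    output_tokens = max(1, len(draft_text) // CHARS_PER_TOKEN)
    return {
        "estimated_input_tokens": input_tokens,
        "estimated_output_tokens": output_tokens,
        "estimated_total_tokens": input_tokens + output_tokens,
    }
```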
Finally, build_trace_summary() in src/harness_engineering/tracing.py rolls this into:
- total run duration across steps
- per-tool total and average duration
- estimated model token volume
- provider grouping for model-generation steps
- a cost status like local_or_mock or unpriced
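The sketch below shows the general shape of such a rollup over persisted step records. The key names echo the summary fields above; everything else (the function, the input format) is assumed rather than copied from tracing.py.

```python
from collections import defaultdict


def rollup_performance(steps: list[dict]) -> dict:
    """Aggregate per-step timings and token estimates into a trace-level summary."""
    by_tool: dict[str, dict] = defaultdict(lambda: {"count": 0, "total_duration_ms": 0})
    total_ms = 0
    total_tokens = 0
    for step in steps:
        tool = step["tool_name"]
        duration = step.get("duration_ms") or 0
        by_tool[tool]["count"] += 1
        by_tool[tool]["total_duration_ms"] += duration
        total_ms += duration
        total_tokens += step.get("metrics", {}).get("estimated_total_tokens", 0)
    for stats in by_tool.values():
        stats["average_duration_ms"] = stats["total_duration_ms"] // max(1, stats["count"])
    return {
        "total_duration_ms": total_ms,
        "by_tool": dict(by_tool),
        "estimated_total_tokens": total_tokens,
    }
```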
And RunStore.build_summary() in src/harness_engineering/store.py exposes the operator-friendly version:
- performance.total_step_duration_ms
- performance.average_step_duration_ms
- performance.slowest_step
- cost.estimated_input_tokens
- cost.estimated_output_tokens
- cost.estimated_total_tokens
- cost.total_bytes_written
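As a consumer, an operator script can read those fields straight out of summary.json. A hedged sketch, assuming the file sits wherever the harness wrote it for the run:

```python
import json
from pathlib import Path

# Hypothetical path; point this at the summary.json the harness wrote for a run.
summary = json.loads(Path("summary.json").read_text())

perf = summary["performance"]
cost = summary["cost"]
print("total step time (ms):", perf["total_step_duration_ms"])
print("slowest step:", perf["slowest_step"])
print("estimated total tokens:", cost["estimated_total_tokens"])
```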
That is a modest feature set. Good. Modest is better than fake precision.
Why agent performance is mostly a harness problem
When people talk about LLM latency, they often collapse everything into model speed. Model speed matters, obviously. But in agent systems it is only one component.
The real user-visible latency is more like:
- request setup and network round-trip
- retrieval or tool latency
- model inference time
- retries after flaky steps
- serialization and validation costs
- approval pauses
- follow-up model calls caused by poor orchestration
The OpenAI latency optimization guide makes the same point at a broader level: reduce requests, reduce output tokens, parallelize when possible, and do not default to an LLM for everything. That is not prompt advice. That is system design advice.
Likewise, Anthropic’s prompt caching docs are useful not because caching is magic, but because they highlight a harness reality: repeated prefixes and repeated context are infrastructure concerns. If your harness keeps re-sending giant stable prefixes, you have a performance architecture issue.
And on the serving side, projects like vLLM put serious effort into metrics and saturation visibility for exactly the same reason: throughput engineering requires a runtime view, not just a prompt view.
That is why I think “cost/latency/throughput engineering” belongs in a harness series and not in a generic prompting series.
The new measurement model in the repo
There are two choices in this update that I particularly like.
1. The repo measures wall-clock duration at the step boundary
The important unit is not “how long did Python take” in the abstract. It is “how long did this harness step take from the runtime’s point of view?”
That is why timing is attached to StepResult, not just printed in logs.
In RetryPolicy.call(), the harness records start time, finish time, and total elapsed time across retries. That means a flaky step does not look artificially cheap. If a tool succeeds on the second attempt, the latency cost of that retry is reflected in the step result.
That is a subtle but correct choice. Operators care about end-to-end step cost, not just successful-final-attempt cost.
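A minimal sketch of that behavior, assuming a simple retry loop; the point is that the clock starts before the first attempt and stops after the last one, so retries are included in the step's cost:

```python
import time


def call_with_retries(fn, *, max_attempts: int = 3, backoff_s: float = 0.5) -> tuple:
    """Time the full retry envelope so flaky steps do not look artificially cheap."""
    t0 = time.perf_counter()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            elapsed_ms = int((time.perf_counter() - t0) * 1000)
            return result, {"attempts": attempt, "duration_ms": elapsed_ms}
        except Exception as exc:  # broad catch is fine for a demo retry loop
            last_error = exc
            time.sleep(backoff_s * attempt)
    elapsed_ms = int((time.perf_counter() - t0) * 1000)
    raise RuntimeError(f"failed after {max_attempts} attempts ({elapsed_ms} ms)") from last_error
```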
2. The repo estimates workload, but does not pretend to know billing truth
I strongly prefer this over fake dollar math.
HarnessRunner._estimate_step_metrics() computes coarse token estimates for draft_report from character counts. That gives the harness a rough proxy for model work volume without claiming that character count equals provider billing count.
The trace summary then labels the cost state honestly:
- local_or_mock when the draft step used the mock/local path
- unpriced when it used an OpenAI-compatible provider but the harness does not know pricing
And the summary explicitly says these are engineering heuristics, not billing data.
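The labeling decision itself is small. A sketch, with the status names taken from the post and the function purely illustrative:

```python
def cost_status(provider: str | None) -> str:
    """Label the cost state honestly instead of inventing dollar figures."""
    if provider in (None, "mock", "local"):
        return "local_or_mock"
    # Real work happened against a real endpoint, but the harness does not know pricing.
    return "unpriced"
```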
That honesty matters.
Too many demos either track nothing, which is useless, or they invent precision, which is worse.
A real run from the updated repo
Before publishing, I ran the required checks in the repo:
```bash
cd /home/james/.openclaw/workspace/harness-engineering
make check
PYTHONPATH=src python3 -m harness_engineering.cli doctor
```
make check passed, including tests and secret scanning.
doctor also passed against the repo-local OpenAI-compatible endpoint with:
- provider: openai_compatible
- model: gemma4
- base URL: http://192.168.0.16:8080/v1
- status: ok
- message: MODEL_OK
I then ran a repo-backed example with the new instrumentation:
```bash
PYTHONPATH=src python3 -m harness_engineering.cli start \
  --topic "Cost latency throughput engineering for agent harnesses" \
  --source-file sample_data/sources.json \
  --runs-dir .runs-post12
```
That run produced a real and useful result, even though it did not complete successfully.
Run ID: bd6bbf2e-59dc-4688-a629-c808039e9f39
The summary showed:
- status: failed
- current_step: draft_report
- duration_seconds: 29
- performance.total_step_duration_ms: 11397
- performance.slowest_step.tool_name: draft_report
- cost.estimated_total_tokens: 447
The trace summary showed:
- performance.total_duration_ms: 11397
- performance.by_tool.draft_report.total_duration_ms: 11397
- performance.by_tool.draft_report.providers: ["openai_compatible"]
- cost.estimated_input_tokens: 216
- cost.estimated_output_tokens: 231
- cost.estimated_total_tokens: 447
- cost.status: unpriced
That is exactly the kind of evidence I want from a demo. Not just “the model was slow,” but “which step was slow, how much work volume it represented, and whether the run even reached approval.”
The useful surprise: performance instrumentation also clarifies failures
This run also exposed a real limitation that the article should not hide.
The run failed because the local reviewer returned fenced JSON instead of raw JSON, and review_markdown() in src/harness_engineering/reviewer.py treated that as invalid reviewer output.
The saved findings included:
Reviewer returned non-JSON output
That is annoying, but it is also instructive.
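One way a harness can tolerate fenced output without abandoning validation is to strip the Markdown fence before parsing. A sketch of that idea, not the repo's current reviewer code:

```python
import json


def parse_reviewer_json(raw: str) -> dict:
    """Accept raw JSON or JSON wrapped in a Markdown code fence."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("```"):
        # Drop the opening fence line (with or without a language tag) and the closing fence.
        lines = lines[1:]
        if lines and lines[-1].startswith("```"):
            lines = lines[:-1]
    return json.loads("\n".join(lines))
```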
A performance layer is not just for “fast vs slow.” It helps separate:
- model-generation cost
- reviewer-format brittleness
- approval wait time
- policy denials
- filesystem write costs
In other words, once you can attribute cost and latency at the harness-step level, you stop treating every failure as a vague “LLM issue.”
That is a big improvement in operational clarity.
What this means in practice
If I were explaining the engineering lesson in one sentence, it would be this:
Measure work at the harness boundary where the operator can act on it.
That means step-level units like:
- search step duration
- extraction step volume
- draft-generation token estimate
- review pass/fail and latency
- write size and approval boundary
Those are actionable.
By contrast, a single “request took 14 seconds” metric is not very actionable in an agent system. It tells you the user waited. It does not tell you what to change.
Cost engineering is really decision engineering
There are at least four cost decisions that a harness should eventually make explicit.
1. Which steps deserve an expensive model?
In this repo, the expensive-feeling step is clearly draft_report. That is where token volume accumulates. That is also where provider-specific latency showed up in the example run.
That suggests a future design direction: cheap planner/reviewer modes versus richer drafting modes.
2. Which steps should be merged to avoid extra round trips?
The OpenAI guidance is right here: fewer requests often matters more than shaving a small number of input tokens. A harness should know when it is paying orchestration tax for unnecessary decomposition.
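A toy illustration of that orchestration tax, assuming a generic call_model() helper; the prompts are throwaway, the request count is the point:

```python
def draft_two_calls(call_model, topic: str) -> str:
    """Orchestration tax: two round trips where one would do."""
    outline = call_model(f"Write a short outline for a report on {topic}.")
    return call_model(f"Expand this outline into a full report:\n{outline}")


def draft_one_call(call_model, topic: str) -> str:
    """Merged step: one request that asks for outline and report together."""
    return call_model(
        f"Write a short outline for a report on {topic}, "
        "then expand it into the full report in the same response."
    )
```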
3. Which context is stable enough to cache?
Anthropic’s prompt caching docs matter because they point at a harness optimization, not a prose optimization. Stable prefixes, repeated instructions, and repeated examples should be handled deliberately by the runtime.
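A harness-level sketch of that separation: keep the stable prefix byte-identical and first, so a cache-aware provider or proxy can reuse it. The structure is illustrative and not tied to any particular provider's caching API:

```python
STABLE_PREFIX = (
    "You are a careful report-writing assistant.\n"
    "Always cite sources and return Markdown.\n"
    # ... long, unchanging instructions and examples live here ...
)


def build_messages(run_specific_context: str, question: str) -> list[dict]:
    """Keep the stable prefix first and identical across runs so it can be cached."""
    return [
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": f"{run_specific_context}\n\n{question}"},
    ]
```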
4. Where does throughput collapse under load?
This repo is still single-run and local, so it does not answer that question yet. But the vLLM metrics work is a good reminder that throughput engineering is about saturation visibility. Once a serving system approaches saturation, latency and throughput interact in non-linear ways.
A serious harness will eventually need both per-run metrics and fleet/server metrics.
What the demo proves
1. Performance instrumentation belongs in the harness, not beside it
The repo now records timing and workload information as part of StepResult, traces, and summaries. That makes performance a first-class runtime artifact.
2. Lightweight estimates are still useful if they are honest
The token counts are approximate. They are still useful because they let operators compare relative work volume across runs and steps.
3. The slow step is often obvious once you measure at the right boundary
In the repo-backed example, draft_report dominated elapsed step time. That is the kind of fact you can optimize around.
4. Performance visibility improves failure diagnosis too
The reviewer JSON-format failure was easier to reason about because the run summary separated drafting cost from review failure from approval state.
What it still does not solve
This demo is better now, but it still does not solve:
- provider-accurate token accounting
- provider-accurate dollar cost estimation
- queueing and concurrency metrics
- throughput under multiple simultaneous runs
- saturation detection for the model server
- streaming token latency like time-to-first-token
- distinction between compute time and network time
- automatic caching or prompt-prefix reuse
- model routing by budget or latency target
- SLO enforcement
It also does not yet instrument planning and review as separate measured external steps in the same way a production harness would if those were independent model calls with strict budgets.
So this is not a cost platform. It is a practical step toward cost-aware harness design.
Honest limitations
I see four main limitations in the current implementation.
First, duration_ms is step wall-clock time, not distributed tracing. It is good enough for this demo, but it does not break down network, serialization, provider, and local processing separately.
Second, the token estimates are deliberately coarse. They come from character counts, not tokenizer truth.
Third, the repo still lacks a throughput story above one run at a time. You can reason about one run’s shape, but not yet about queueing, saturation, or multi-run contention.
Fourth, the reviewer path still has brittle parsing behavior with fenced JSON. That is a real harness problem, and the instrumentation helps expose it, but it is not fixed by this post’s change set.
The broader lesson
If you want agents that are economically usable, you need to stop asking only, “Did the model answer well?”
You also need to ask:
- How many requests did the harness make?
- Which step burned the most wall-clock time?
- How much generated output did we really ask for?
- Which steps could be cheaper or cached?
- Where did the run fail before the user got value?
- Are we measuring the unit of work that an operator can actually optimize?
That is why I keep coming back to the same thesis in this series.
Prompt engineering matters. But once you are building a real system, the bigger wins usually come from the harness:
- fewer round trips
- clearer step boundaries
- explicit approval pauses
- persisted summaries
- runtime policy
- traceable retries
- honest performance counters
The updated demo repo still keeps things intentionally small. That is a feature, not a bug. It lets the performance story stay legible.
And legibility is underrated.
If your agent stack cannot explain where the time and work went, then you do not really have performance engineering yet. You just have waiting.
References
- Live repo: https://github.com/67ailab/harness-engineering
- OpenAI, “Latency optimization”: https://developers.openai.com/api/docs/guides/latency-optimization
- Anthropic, “Prompt caching”: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- vLLM metrics documentation: https://docs.vllm.ai/en/stable/usage/metrics/