The live demo repo for this series is 67ailab/harness-engineering, and for this post I changed the repo before publishing. The new work shipped in commit 85c762c, which adds two concrete things the repo was missing:

  • a persisted trace-summary surface for every run
  • a lightweight eval runner with trace-aware fixtures

The key changes are in src/harness_engineering/tracing.py, src/harness_engineering/store.py, src/harness_engineering/cli.py, and the new src/harness_engineering/evals.py module, plus starter fixtures in sample_data/evals/basic.json.

That matters because a lot of agent writing still treats observability as an afterthought and evals as a benchmark spreadsheet. In practice, most production pain shows up somewhere else:

  • the agent paused but you do not know why
  • a run failed but the trace is too raw to inspect quickly
  • retries happened but nobody can see where
  • the workflow reached the wrong state even though the text output looked plausible
  • a demo seems fine until you ask whether the same runtime behavior happens every time

My practical claim in this post is simple: for agent systems, tracing and evals should mostly be about runtime behavior, not just output quality. If you cannot see the path a run took, and if you cannot assert that the run emitted the right events and reached the right control boundaries, you are not evaluating the harness yet. You are grading prose while the system underneath stays opaque.

This repo is still intentionally small. Good. Small code makes the observability story visible.

What changed in the repo since the previous post

Post 7 made human approval more legible by adding pending_action_details and a pending CLI command. That improved the operator-facing approval boundary. But it also made the next gap obvious.

The repo already had traces. Specifically:

  • add_trace() in src/harness_engineering/tracing.py
  • raw trace persistence to .runs/<run_id>/trace.json through RunStore.save() in src/harness_engineering/store.py
  • history inspection through cmd_history() in src/harness_engineering/cli.py

That was enough to say, truthfully, that the repo stores execution traces.

It was not enough to support a strong post about observability and evals.

Why? Because raw traces alone are only half the job. Operators need a compact summary, and engineers need a repeatable way to assert that traces contain the events that define correct harness behavior.

So for Post 8 I added two concrete repo capabilities.

1. Compact trace summaries

src/harness_engineering/tracing.py now includes build_trace_summary(state).

That function rolls a saved run into a small observability payload containing:

  • run status and current step
  • total trace-event count
  • step count
  • event counts by event name
  • tool counts by tool name
  • retry/attempt counts by tool
  • first and last event timestamps
  • latest event name
  • approval-gate state
  • reviewer result
  • final artifact presence
  • any failed step results
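
For orientation, here is roughly the shape of payload that list implies for a paused run. The key names below are illustrative reconstructions, not copied from the repo, so treat them as assumptions; the values mirror the approval-pause run described later in this post, and the timestamps are placeholders:

{
  "status": "waiting_approval",
  "current_step": "finalize_report",
  "trace_events": 9,
  "steps": 3,
  "event_counts": {
    "run_created": 1,
    "tool_start": 3,
    "tool_ok": 3,
    "draft_reviewed": 1,
    "approval_required": 1
  },
  "tool_counts": {"search_mock": 1, "extract_facts": 1, "draft_report": 1},
  "retries_by_tool": {},
  "first_event_at": "<iso timestamp>",
  "last_event_at": "<iso timestamp>",
  "latest_event": "approval_required",
  "approval": {"required": true, "approved": false, "pending_action": "finalize_report"},
  "reviewer": {"ok": true},
  "final_report_path": null,
  "failures": []
}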

That summary is persisted on every save because RunStore.save() in src/harness_engineering/store.py now writes:

.runs/<run_id>/trace_summary.json

RunStore also gained trace_summary_path(), and build_summary() now exposes that path alongside state, trace, summary, and memory.
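
A minimal sketch of how that persistence step can hang together, assuming a state object that carries a run_id and a trace list; the real RunStore.save() in the repo differs in detail:

import json
from pathlib import Path

from harness_engineering.tracing import build_trace_summary

def save_run(runs_dir: Path, state) -> None:
    # hypothetical free-function version of the RunStore.save() idea
    run_dir = runs_dir / state.run_id  # run_id attribute is an assumption
    run_dir.mkdir(parents=True, exist_ok=True)

    # raw trace persistence, which the repo already did before this change
    (run_dir / "trace.json").write_text(json.dumps(state.trace, indent=2))

    # new: the compact roll-up written next to the raw trace on every save
    summary = build_trace_summary(state)
    (run_dir / "trace_summary.json").write_text(json.dumps(summary, indent=2))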

On the CLI side, src/harness_engineering/cli.py now exposes:

PYTHONPATH=src python3 -m harness_engineering.cli trace-summary --latest

That is a small addition, but I think it is exactly the right kind of small. Observability should make the runtime legible without forcing every operator to read the full raw event stream first.

2. Lightweight trace-aware evals

The repo now also has a new file: src/harness_engineering/evals.py.

The important functions are:

  • load_eval_fixtures(path)
  • run_eval_case(fixture, fixtures_path, runs_dir)
  • run_eval_suite(fixtures_path, runs_dir)

The fixture format in sample_data/evals/basic.json is deliberately simple. Each case can declare things like:

  • topic
  • source_file
  • auto_approve
  • expected_status
  • expected_current_step
  • required_events
  • min_trace_events
  • expect_final_report
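
As a sketch of that format, a single case exercising the pause boundary could look like this; the top-level list shape and the name key are assumptions, but the per-case fields mirror the list above:

[
  {
    "name": "approval-pause-baseline",
    "topic": "demo topic",
    "auto_approve": false,
    "expected_status": "waiting_approval",
    "expected_current_step": "finalize_report",
    "required_events": ["run_created", "draft_reviewed", "approval_required"],
    "min_trace_events": 5,
    "expect_final_report": false
  }
]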

That means the evals are not trying to grade writing style. They are checking harness behavior.

A fixture can now ask questions like:

  • Did the run pause at waiting_approval?
  • Did it reach done after approval?
  • Did the trace include approval_required, approval_granted, run_resumed, and run_completed?
  • Did the run produce a final artifact when it was supposed to?

That is the kind of eval surface I trust more than vague claims about “the agent looked good.”

The CLI exposes this through:

PYTHONPATH=src python3 -m harness_engineering.cli evals

Again, this is intentionally modest. It is not a full eval platform. It is a correct first move.
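
If you prefer to drive it from Python rather than the CLI (with src on PYTHONPATH), the same functions are importable. A sketch, assuming run_eval_suite() returns a list of per-case results with name and passed fields; that result shape is an assumption, only the function and parameter names come from the repo:

from harness_engineering.evals import run_eval_suite

results = run_eval_suite(
    fixtures_path="sample_data/evals/basic.json",
    runs_dir=".runs",
)
for result in results:
    # "name" and "passed" keys are assumed, not confirmed from the repo
    print(result["name"], "PASS" if result["passed"] else "FAIL")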

Why raw traces are not enough

OpenTelemetry’s tracing docs define traces as the path of a request through an application. I like that framing because it moves attention away from single events and toward the sequence and structure of work.

That matters even more for agent systems than for ordinary request/response apps.

In a normal API service, you might ask:

  • which service handled the request
  • which database query was slow
  • where the error came from

In an agent harness, you also need to ask:

  • which tool calls happened and in what order
  • whether the system paused at the right control boundary
  • whether retries occurred
  • whether the human approval gate actually fired
  • whether the run resumed from the same state or silently restarted
  • whether the final side effect happened only after approval

A raw trace file can contain all of that, but it is still too low-level for fast operational understanding.

That is why build_trace_summary(state) in src/harness_engineering/tracing.py is useful. It does not replace the raw trace. It gives you the smallest roll-up that still answers the operator question:

what path did this run take, and where is it now?

I think that is a good observability design rule generally. Logs are for detail. Summaries are for orientation. You want both.
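
To make the roll-up concrete, here is a deliberately tiny sketch of the kind of aggregation a summary layer does. It is not the repo's build_trace_summary(), and the assumption that each trace entry is a dict with an "event" name is mine:

from collections import Counter

def summarize_events(trace: list[dict]) -> dict:
    # roll a flat event list up into counts plus the most recent event
    names = [entry.get("event", "unknown") for entry in trace]
    return {
        "trace_events": len(names),
        "event_counts": dict(Counter(names)),
        "latest_event": names[-1] if names else None,
    }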

The trace vocabulary in the current demo

The nice thing about this repo is that the trace vocabulary is still small enough to reason about directly.

Across HarnessRunner.create_run(), HarnessRunner._execute(), HarnessRunner.run_until_pause_or_complete(), HarnessRunner.approve(), and HarnessRunner.resume() in src/harness_engineering/runner.py, the runner emits events like:

  • run_created
  • tool_start
  • tool_ok
  • tool_error
  • draft_reviewed
  • approval_required
  • approval_still_required
  • approval_granted
  • run_resumed
  • run_completed

That is already enough to reconstruct the control story of a run.
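
As a sketch of how small the emission side can stay, a tool call site might be wrapped like this. This is not the repo's runner code, and the add_trace() keyword signature shown here is an assumption:

from harness_engineering.tracing import add_trace  # real module; signature assumed below

def call_tool(state, tools: dict, name: str, payload: dict):
    # hypothetical wrapper: emit start/ok/error events around every tool call
    add_trace(state, "tool_start", tool=name)
    try:
        result = tools[name](payload)
    except Exception as exc:
        add_trace(state, "tool_error", tool=name, error=str(exc))
        raise
    add_trace(state, "tool_ok", tool=name)
    return result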

For example, the happy path of an approval-gated run should look roughly like this:

  1. run_created
  2. tool_start / tool_ok for search_mock
  3. tool_start / tool_ok for extract_facts
  4. tool_start / tool_ok for draft_report
  5. draft_reviewed
  6. approval_required
  7. approval_granted
  8. run_resumed
  9. tool_start / tool_ok for finalize_report
  10. run_completed

The trace summary now compresses exactly that kind of story into counts, latest event, approval status, and failures.

That sounds obvious. It is still more honest than a lot of agent demos, because the repo exposes the path instead of skipping from prompt to final output.

A verified run from the live repo

Before writing this post, I verified the repo behavior in /home/james/.openclaw/workspace/harness-engineering.

First I ran the required checks:

make check
PYTHONPATH=src python3 -m harness_engineering.cli doctor

Those passed.

  • make check ran the test suite and the secret scan.
  • The repo now passes 27 tests.
  • scripts/secret_scan.py reported no obvious secrets in tracked files.
  • doctor succeeded against the repo’s configured local OpenAI-compatible endpoint with:
    • provider: openai_compatible
    • model: gemma4
    • base URL: http://192.168.0.16:8080/v1
    • status: ok
    • message: MODEL_OK

Then I exercised the new eval runner:

PYTHONPATH=src python3 -m harness_engineering.cli evals

That ran the default suite in sample_data/evals/basic.json and returned two passing cases.

Eval case 1: approval-pause-baseline

This fixture checks the pause boundary.

It expects:

  • status = waiting_approval
  • current_step = finalize_report
  • required trace events including approval_required
  • at least a minimum trace-event count

The actual trace summary reported:

  • trace_events: 9
  • steps: 3
  • event counts including run_created, tool_start, tool_ok, draft_reviewed, and approval_required
  • approval.required: true
  • approval.pending_action: finalize_report
  • no failures

That is exactly what I want from a harness eval. It is testing the workflow contract.

Eval case 2: end-to-end-complete

This fixture sets auto_approve: true, so it checks the full run.

It expects:

  • status = completed
  • current_step = done
  • events including approval_granted and run_completed
  • a final report artifact on disk

The actual trace summary reported:

  • trace_events: 14
  • steps: 4
  • event counts including approval_required, approval_granted, run_resumed, and run_completed
  • approval.required: false
  • approval.approved: true
  • a concrete final_report_path
  • no failures

That is a much stronger claim than “the agent eventually wrote markdown.” It shows that the runtime traversed the intended states.

Why trace-aware evals matter more than style scoring

There is nothing wrong with output-quality evals. You still care whether the generated text is coherent and accurate.

But a harness-oriented system fails in ways that output-only grading will miss.

A run can produce a plausible final report while still violating the runtime contract:

  • the risky tool might have fired without an approval gate
  • the run might have restarted instead of resumed
  • the retry path might be exploding silently
  • the reviewer might have failed but the operator never saw it
  • the wrong step might have been marked complete

None of those are mainly language-quality problems. They are harness problems.

That is why run_eval_case() in src/harness_engineering/evals.py is shaped around run state and trace evidence.

The internal helper in that function constructs checks like:

  • status
  • current_step
  • event:<event_name>
  • min_trace_events
  • final_report_exists

Then it evaluates pass/fail by aggregating those checks, not by asking a model whether the output “seems good.”
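
A minimal sketch of that check-building pattern, with hypothetical helper names and an assumed summary shape; the repo's run_eval_case() internals are not copied here:

def build_checks(fixture: dict, summary: dict, final_report_exists: bool) -> dict[str, bool]:
    # map each declared expectation onto a named boolean check
    checks: dict[str, bool] = {}
    if "expected_status" in fixture:
        checks["status"] = summary["status"] == fixture["expected_status"]
    if "expected_current_step" in fixture:
        checks["current_step"] = summary["current_step"] == fixture["expected_current_step"]
    for event in fixture.get("required_events", []):
        checks[f"event:{event}"] = summary.get("event_counts", {}).get(event, 0) > 0
    if "min_trace_events" in fixture:
        checks["min_trace_events"] = summary["trace_events"] >= fixture["min_trace_events"]
    if fixture.get("expect_final_report"):
        checks["final_report_exists"] = final_report_exists
    return checks

def case_passed(checks: dict[str, bool]) -> bool:
    # pass/fail is the conjunction of every named check, with no model grading involved
    return all(checks.values())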

This is much closer to how I think agent-system evals should start.

Start by verifying:

  • control boundaries
  • state transitions
  • artifact side effects
  • retry behavior
  • human-review paths

Then add quality evals on top.

Not the other way around.

Where this aligns with bigger frameworks

The repo is still a tiny local harness. It is not pretending to be LangGraph, the OpenAI Agents SDK, or a full OpenTelemetry implementation.

But the direction is consistent with what those systems emphasize.

OpenAI Agents SDK

The OpenAI Agents SDK docs repeatedly frame runs as resumable stateful workflows with built-in tracing, human-in-the-loop interruptions, and evaluation surfaces. Their human-in-the-loop docs make an especially important point: when a tool needs approval, the run pauses, returns interruptions, and later resumes from the same state rather than starting over.

That is the same architectural instinct this repo now demonstrates in miniature.

LangGraph

LangGraph’s overview docs describe the runtime as being about durable execution, human-in-the-loop, persistence, and debugging/observability via LangSmith.

Again, the implementation is far richer than this repo. But the conceptual stack matches:

  • execution path matters
  • state transitions matter
  • operator inspection matters
  • evaluation belongs near runtime behavior, not separate from it

OpenTelemetry

OpenTelemetry’s traces docs describe traces as the path of a request through an application, with structured events and hierarchy giving you the big picture of what happened.

This repo does not have nested spans or cross-service correlation. It should not fake that.

What it does now have is a small, structured trace vocabulary plus a summary layer. That is enough to teach the right habit: a run is not just an output. It is a path.

The practical observability rule for agent systems

If I had to compress this whole post into one engineering rule, it would be this:

observe the control flow, not just the content.

For ordinary software, people eventually learned that “did the endpoint return 200?” was not enough. You also need latency, errors, spans, retries, downstream dependencies, and state transitions.

Agent systems need the same maturity step.

Do not only ask:

  • was the answer fluent?
  • was the final markdown acceptable?

Also ask:

  • did the risky boundary trigger?
  • did the run pause where policy said it should?
  • did resume happen after approval rather than before it?
  • did the expected trace events appear?
  • did the artifact get written only after the gate?
  • did retries happen, and how many?

That is the difference between observing a chatbot and observing a workflow system.

What the demo proves

1. Raw traces become much more useful once you add a compact summary

build_trace_summary(state) is a small function, but it materially improves legibility. Operators can inspect run shape quickly before diving into trace.json.

2. Harness evals should verify runtime contracts

run_eval_case() and run_eval_suite() show that even a simple fixture format can test the things that matter operationally: state, transitions, events, and side effects.

3. A tiny local repo can still demonstrate serious observability ideas

This repo is not enterprise infrastructure. Good. Even so, it now demonstrates a clean progression:

  • raw trace persistence
  • operator-facing summary
  • trace-aware fixtures
  • CLI surfaces for all three

That is enough to make the lesson real.

4. Approval workflows are especially good candidates for trace-based evals

The two starter fixtures prove the point nicely. One checks that the run pauses. One checks that approval plus resume plus finalize actually happen. That is a concrete, testable runtime contract.

What it still does not solve

This part matters more than the happy section.

1. The trace model is still flat and local

The repo stores event lists, not a richer span hierarchy. There is no cross-process correlation, no distributed context propagation, and no external telemetry backend.

2. The trace summary does not yet include timing or cost

build_trace_summary(state) currently reports counts and high-level state, but not per-step durations, model latency, or estimated token/cost data.

That is a real missing piece, and it will matter later in this series when we talk about cost, latency, and throughput engineering.

3. The eval fixtures are still small and deterministic

The current fixtures are good sanity checks. They are not a replay framework, not a regression corpus across many scenarios, and not a statistical eval suite.

4. Output quality is still only lightly covered

These evals check runtime behavior. They do not deeply evaluate whether the generated report is semantically strong, complete, or aligned with a domain rubric.

5. The repo still depends on local file persistence

Everything remains under .runs/. There is no external observability store, no dashboard, and no team-facing run browser.

Again: that is fine. It is a demo repo, not a platform. But it should not pretend otherwise.

What I think most agent teams should do next

If you are building an agent harness today, I think the right order of operations is:

  1. emit structured events for the important control boundaries
  2. persist enough state to inspect and resume runs
  3. build a compact summary over the raw trace
  4. write evals that assert runtime behavior
  5. only then obsess over fancier dashboards

Too many teams jump from “we have logs” straight to “we need a tracing product” without first defining the event vocabulary that actually matters for their harness.

In this repo, that vocabulary is still tiny, which is part of why it is useful:

  • run lifecycle
  • tool lifecycle
  • review result
  • approval gate
  • resume/completion

That is already enough to drive both observability and evals.

The same principle scales upward. Better tools help, but the first job is deciding what should be observable.
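
One cheap way to make that decision explicit is to write the vocabulary down as a constant and refuse to emit anything outside it. The event names below come from this repo's trace vocabulary; the enforcement helper itself is hypothetical, and the add_trace() signature is again an assumption:

from harness_engineering.tracing import add_trace  # real module; signature assumed

ALLOWED_EVENTS = frozenset({
    "run_created",
    "tool_start", "tool_ok", "tool_error",
    "draft_reviewed",
    "approval_required", "approval_still_required", "approval_granted",
    "run_resumed", "run_completed",
})

def emit(state, name: str, **fields) -> None:
    # fail fast when someone invents an event name outside the agreed vocabulary
    if name not in ALLOWED_EVENTS:
        raise ValueError(f"unknown trace event: {name}")
    add_trace(state, name, **fields)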

The practical takeaway

A lot of agent discussions still orbit around prompts, personalities, and output polish.

Those things matter. They are not what saves you when a run pauses forever, resumes incorrectly, skips a gate, or quietly fails its workflow contract.

Tracing and evals become useful when they tell you whether the system behaved correctly, not just whether the last paragraph sounded smart.

That is why I wanted this repo change in place before publishing Post 8. The live demo now has:

  • persisted raw traces
  • persisted trace summaries
  • a trace-summary CLI surface
  • lightweight trace-aware eval fixtures
  • a real example of approval-path verification

That still does not make it a full observability stack. It does make the article honest.

And that is the bar I care about in this series.

References

  1. 67 AI Lab, harness-engineering repository: https://github.com/67ailab/harness-engineering
  2. OpenAI Agents SDK documentation: https://openai.github.io/openai-agents-python/
  3. OpenAI Agents SDK documentation, “Human-in-the-loop”: https://openai.github.io/openai-agents-js/guides/human-in-the-loop/
  4. OpenAI API documentation, “Guardrails and human review”: https://developers.openai.com/api/docs/guides/agents/guardrails-approvals
  5. LangGraph overview: https://docs.langchain.com/oss/python/langgraph/overview
  6. OpenTelemetry documentation, “Traces”: https://opentelemetry.io/docs/concepts/signals/traces/
  7. Temporal documentation: https://docs.temporal.io/