The live demo repo for this series is 67ailab/harness-engineering, and this post stays anchored to the code that exists there today. I did not add a new repo capability for this article. The point of this installment is to dissect the current harness as it actually stands: what lives in src/harness_engineering/, how the pieces fit together, and which parts are carrying the reliability burden.

That matters because “agent” is now a dangerously overloaded word. Many teams use it to mean either a model that can call functions or a prompt loop with some memory and tool wrappers. Those are ingredients, not a runtime anatomy.

If you want a system that can pause before risky actions, survive interruptions, keep a trace, and explain itself after something goes wrong, you need a harness with clearly separated responsibilities. In the demo repo, those responsibilities are visible in code rather than hidden in framework abstraction.

The files that matter most for this post are:

  • src/harness_engineering/cli.py
  • src/harness_engineering/runner.py
  • src/harness_engineering/tools.py
  • src/harness_engineering/store.py
  • src/harness_engineering/models.py
  • src/harness_engineering/provider.py
  • src/harness_engineering/reviewer.py
  • src/harness_engineering/tracing.py

Those files are small enough to read directly, which makes this repo a useful teaching specimen. It is not pretending to be a production platform. It is showing the bones.

The shortest useful definition of a harness

A harness is the runtime system around the model. In this repo, that runtime is not one giant “agent class.” It is a set of boundaries:

  • the CLI starts and controls runs
  • the runner owns workflow progression
  • the tool registry defines available actions
  • the store persists run state and trace artifacts
  • the models define the durable data shape
  • the provider/reviewer layer lets model-backed planning and review plug in without taking over the architecture
  • the tracing helper records what happened as first-class events

That separation is the anatomy. It is also why this code is more instructive than a notebook demo: you can point at each responsibility and ask whether it has the right home.

Start with the command surface, because operations begin there

The entry point in src/harness_engineering/cli.py tells you a lot about the system before you even inspect the workflow logic. build_parser() defines the current subcommands:

  • start
  • inspect
  • approve
  • resume
  • list
  • interactive
  • doctor

That command set already implies a different operational model than “send prompt, get answer.”

A normal prompt-centric demo usually has one entrypoint: run the thing. This repo has separate commands for inspection, approval, and resumption because a run is treated as a long-lived object rather than a disposable response.
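The shape of that command surface is easy to sketch with argparse. This is an illustrative reconstruction of what a build_parser() like the repo's might look like, not a copy of cli.py; the flag names (--topic, --run-id) are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative sketch of a subcommand surface like the repo's build_parser();
    # flag names here are assumptions, not copied from cli.py.
    parser = argparse.ArgumentParser(prog="harness-engineering")
    sub = parser.add_subparsers(dest="command", required=True)

    start = sub.add_parser("start", help="create a run and advance it")
    start.add_argument("--topic", required=True)

    for name in ("inspect", "approve", "resume"):
        cmd = sub.add_parser(name, help=f"{name} a stored run")
        cmd.add_argument("--run-id", default=None)

    sub.add_parser("list", help="list stored runs")
    sub.add_parser("interactive", help="drive a run with inline approval")
    sub.add_parser("doctor", help="check provider/model connectivity")
    return parser

args = build_parser().parse_args(["start", "--topic", "demo"])
```

Even at this level of detail, the point holds: inspection, approval, and resumption are separate verbs, not variations on "run the thing."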

A few functions in cli.py are worth reading directly:

  • cmd_start() loads source documents, creates a run via HarnessRunner.create_run(), then advances it with run_until_pause_or_complete().
  • cmd_inspect() loads a stored run and prints the serialized state JSON.
  • cmd_approve() flips approval state through HarnessRunner.approve().
  • cmd_resume() re-enters the workflow after a pause.
  • cmd_interactive() exposes the approval boundary to a human in one session.
  • cmd_doctor() checks provider/model connectivity.

This is the first design lesson: the CLI is part of the harness contract. If a system claims to support approvals and resumability, there should be an explicit operational surface for both.

RunState is the real center of gravity

If you want to know what a harness truly believes about the world, look at its persisted state model.

In src/harness_engineering/models.py, the key dataclass is RunState:

  • run_id
  • topic
  • status
  • created_at
  • updated_at
  • current_step
  • requires_approval
  • approved
  • pending_action
  • plan
  • source_documents
  • artifacts
  • trace
  • step_results

That is the actual anatomy in data form. If something matters to the control flow, it should usually become explicit state.
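The field list above translates almost directly into a dataclass. This sketch follows the field names from models.py, but the defaults and types are my illustrative guesses, not the repo's exact definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

def _now() -> str:
    return datetime.now(timezone.utc).isoformat()

@dataclass
class RunState:
    # Field names follow the repo's models.py; defaults are illustrative guesses.
    run_id: str
    topic: str
    status: str = "created"
    created_at: str = field(default_factory=_now)
    updated_at: str = field(default_factory=_now)
    current_step: str = "search_mock"
    requires_approval: bool = False
    approved: bool = False
    pending_action: Optional[str] = None
    plan: list = field(default_factory=list)
    source_documents: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)
    step_results: list = field(default_factory=list)

state = RunState(run_id="r-1", topic="Agent harness anatomy")
```

Notice how much of the control flow is legible from the defaults alone: a fresh run is unapproved, has no pending action, and sits at the first step.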

A few aspects are especially important:

1. Approval is state, not chat etiquette

Together, requires_approval, pending_action, and the approved boolean turn human review into something the runtime can reason about. It is no longer an implied social convention.

That sounds almost trivial until you compare it with the usual pattern in thin agent wrappers:

  • model suggests a risky action
  • app displays a message like “Should I proceed?”
  • a human responds somewhere in chat
  • the system tries to reconstruct intent from message history

That is fragile. RunState makes the opposite choice.

2. Step results are distinct from trace events

The repo keeps both step_results and trace:

  • step_results answers: what were the outcomes of concrete tool executions?
  • trace answers: what notable events happened over the life of the run?

3. Artifacts are part of the state contract

The artifacts dictionary stores intermediate and final outputs like matches, facts, draft markdown, review output, planner identity, and the finalized report path. That makes the harness introspectable after the fact.

If you inspect a stored run, you can see not just that it failed or paused, but what it had already produced.

The runner is the workflow brain, not the model

The heart of the repo is HarnessRunner in src/harness_engineering/runner.py.

This class owns the workflow semantics through:

  • create_run()
  • _execute()
  • run_until_pause_or_complete()
  • approve()
  • resume()

That method set is worth paying attention to because it encodes a strong architectural preference: the model can contribute to planning and drafting, but it does not own control flow.

create_run() establishes a run as durable work

create_run() does three important things:

  1. creates a new RunState
  2. derives a plan through create_plan_from_env()
  3. persists the run immediately and records a run_created trace

This is a small design choice with big downstream effects. The run exists before useful work happens, so the system can be inspected even if it fails early.
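That sequence can be condensed into a few lines. In this sketch, plain dicts stand in for RunState and RunStore; the real create_run() derives the plan through create_plan_from_env() rather than taking it as a parameter.

```python
import uuid

def create_run(topic: str, store: dict, plan: list) -> dict:
    # Condensed sketch: the real create_run() builds a RunState and persists
    # through RunStore; plain dicts stand in for both here.
    state = {
        "run_id": str(uuid.uuid4()),
        "topic": topic,
        "status": "created",
        "plan": plan,  # derived via create_plan_from_env() in the repo
        "trace": [{"event": "run_created"}],
    }
    store[state["run_id"]] = state  # persist before any tool work happens
    return state

runs: dict = {}
state = create_run("Agent harness anatomy", runs, ["search_mock", "extract_facts"])
```

The persist-first ordering is the whole point: an early crash still leaves an inspectable run on disk.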

_execute() is the control wrapper around tools

_execute() is where the harness adds operational behavior to plain function calls:

  • look up the tool in the registry
  • emit tool_start
  • call the tool through retry logic
  • append a structured StepResult
  • emit either tool_ok or tool_error
  • save state after execution

Model APIs can tell you that a tool call was requested. They do not automatically give you persistence semantics, retry policy, or artifact storage. _execute() is where the application becomes responsible.
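Here is a sketch of that responsibility: a wrapper that looks up a handler, emits trace events, retries, and records a structured step result. The retry count and event names mirror the text above, but the exact signatures are assumptions, not the repo's.

```python
def execute(tool_name, registry, state, args, retries=2):
    # Sketch of _execute(): registry lookup, trace events, a retry loop,
    # and a structured step result; signatures and retry counts are illustrative.
    handler = registry[tool_name]
    state["trace"].append({"event": "tool_start", "tool": tool_name})
    last_err = None
    for attempt in range(1, retries + 2):
        try:
            output = handler(**args)
        except Exception as err:
            last_err = err
            continue
        state["step_results"].append(
            {"tool": tool_name, "attempt": attempt, "ok": True, "output": output})
        state["trace"].append({"event": "tool_ok", "tool": tool_name})
        return output  # the real runner also saves state here
    state["step_results"].append(
        {"tool": tool_name, "ok": False, "error": str(last_err)})
    state["trace"].append({"event": "tool_error", "tool": tool_name})
    raise last_err

calls = {"n": 0}
def flaky_echo(text):
    # Fails once, then succeeds, in the spirit of the repo's retry-testing tool.
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return text

run = {"trace": [], "step_results": []}
result = execute("flaky_echo", {"flaky_echo": flaky_echo}, run, {"text": "hi"})
```

The flaky tool succeeds on the second attempt, and the step result records which attempt won. That is exactly the information a bare model API will never give you.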

run_until_pause_or_complete() is the state machine in plain sight

This method is the cleanest anatomy diagram in the whole repo. The workflow currently moves through:

  • search_mock
  • extract_facts
  • draft_report
  • finalize_report
  • done

What matters is how the transitions are handled. At the draft_report stage, the runner stores draft_markdown, then calls review_from_env() and records a draft_reviewed event. If review passes, the runner does not immediately write the final artifact. Instead it sets:

  • current_step = "finalize_report"
  • requires_approval = True
  • pending_action = "finalize_report"
  • status = "waiting_approval"

Then it stops.

The write boundary is explicit, durable, and inspectable. When approval is later granted, approve() updates the state and resume() re-enters run_until_pause_or_complete().

The tool registry is where action contracts become legible

In src/harness_engineering/tools.py, the central abstractions are:

  • Tool
  • ToolRegistry
  • default_registry()

The Tool dataclass includes:

  • name
  • description
  • input_schema
  • risky
  • handler

A good harness should make tool availability and risk boundaries inspectable without needing to parse prompt text. Here, default_registry() registers:

  • search_mock
  • extract_facts
  • draft_report
  • finalize_report
  • flaky_echo

finalize_report is marked risky=True. Everything else is non-risky. That matters because it is the start of policy. Not a full policy engine, but the right shape.
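The contract shape is small enough to reproduce. This sketch mirrors the repo's Tool fields; the two handlers registered here are stand-ins, not the repo's actual implementations.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    # Mirrors the repo's Tool fields; the handlers below are stand-ins.
    name: str
    description: str
    input_schema: dict
    risky: bool
    handler: Callable

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def get(self, name: str) -> Tool:
        return self._tools[name]

def default_registry() -> ToolRegistry:
    reg = ToolRegistry()
    reg.register(Tool("draft_report", "draft markdown from facts",
                      {"facts": "array"}, risky=False,
                      handler=lambda facts: "# Draft"))
    reg.register(Tool("finalize_report", "write the report to disk",
                      {"path": "string"}, risky=True,
                      handler=lambda path: path))
    return reg

registry = default_registry()
```

The runner never has to guess which actions deserve a pause: it can ask the registry whether a tool is risky before executing it.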

The individual tool handlers are intentionally simple:

  • search_mock() scores local documents lexically against topic words.
  • extract_facts() slices early sentences from matched documents into short fact strings.
  • draft_report() calls build_report_markdown() with or without a configured model client.
  • finalize_report() writes markdown to disk.
  • flaky_echo() exists to test retry behavior.

Two practical observations follow.

First, boring tools are a feature in a teaching repo. This repo keeps the tool layer simple enough that the harness is still the main character.

Second, tools are application functions first, model affordances second. Even if future posts add an MCP-facing adapter or provider-neutral tool interface, these handlers will still need application-side semantics around retries, writes, persistence, and policy.

Storage is what turns a run into durable work

src/harness_engineering/store.py is short, but it has outsized importance.

RunStore provides:

  • run_dir(run_id)
  • state_path(run_id)
  • trace_path(run_id)
  • save(state)
  • load(run_id)
  • list_runs()
  • latest_run_id()

The implementation writes state.json and trace.json under .runs/<run_id>/.

That directory layout is humble, but it is exactly why the demo is inspectable. You can run the CLI, pause before approval, and open the saved artifacts directly.
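A store like this is genuinely small. The sketch below reproduces the described layout under an assumed root; method names follow the list above, but the exact serialization details are illustrative.

```python
import json
import tempfile
from pathlib import Path

class RunStore:
    # Sketch of the store layout: state.json and trace.json under .runs/<run_id>/.
    def __init__(self, root: Path):
        self.root = root

    def run_dir(self, run_id: str) -> Path:
        return self.root / run_id

    def save(self, state: dict) -> None:
        d = self.run_dir(state["run_id"])
        d.mkdir(parents=True, exist_ok=True)
        (d / "state.json").write_text(json.dumps(state, indent=2))
        (d / "trace.json").write_text(json.dumps(state.get("trace", []), indent=2))

    def load(self, run_id: str) -> dict:
        return json.loads((self.run_dir(run_id) / "state.json").read_text())

    def list_runs(self) -> list:
        return sorted(p.name for p in self.root.iterdir() if p.is_dir())

store = RunStore(Path(tempfile.mkdtemp()) / ".runs")
store.save({"run_id": "r-1", "status": "waiting_approval", "trace": []})
loaded = store.load("r-1")
```

Everything downstream (resumption, auditing, approval visibility) rides on these few dozen lines.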

Without a store like this, all of the following become much harder:

  • resuming after interruption
  • auditing prior actions
  • exposing pending approval clearly
  • comparing multiple runs
  • writing tests around run lifecycle behavior

You can also see how little code it takes to get meaningful checkpointing. You do not need a giant orchestration platform to start behaving like a real system.

Tracing is separate on purpose

src/harness_engineering/tracing.py contains one small helper: add_trace().

At first glance, that file looks almost too small to mention. But the design choice behind it is good: trace creation is centralized, timestamped, and distinct from general artifact updates.

The events currently emitted include:

  • run_created
  • tool_start
  • tool_ok
  • tool_error
  • draft_reviewed
  • approval_required
  • approval_granted
  • run_resumed
  • run_completed

That event vocabulary is already enough to reconstruct a causal story about a run.
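A centralized add_trace() helper needs very little code to deliver that guarantee. This sketch, assuming dict-shaped state, shows the essential property: every event passes through one function that timestamps it.

```python
from datetime import datetime, timezone

def add_trace(state: dict, event: str, **details) -> dict:
    # Centralized, timestamped trace creation, kept apart from artifact updates.
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **details}
    state.setdefault("trace", []).append(entry)
    return entry

run = {}
add_trace(run, "tool_start", tool="draft_report")
add_trace(run, "tool_ok", tool="draft_report")
```

Because every caller goes through the same chokepoint, the event vocabulary stays consistent and every entry carries a timestamp for free.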

And the repo gives a useful real-world example of why that matters.

A verified limitation from this run: model-backed review is still brittle

Before writing this article, I verified the demo with:

  • PYTHONPATH=src python3 -m harness_engineering.cli doctor
  • PYTHONPATH=src python3 -m harness_engineering.cli start --topic 'Agent harness anatomy' --source-file sample_data/sources.json

The doctor command succeeded against the repo-local OpenAI-compatible endpoint, with provider: openai_compatible, model_name: gemma4, and a successful MODEL_OK response.

But the live run did not reach approval. It failed in the review stage.

The stored run under .runs/e6c5f518-57d9-4f56-9c38-1f5ae029fa74/ showed why: review_from_env() in src/harness_engineering/reviewer.py expects raw JSON from the reviewer model, but the configured local model returned JSON wrapped in a markdown code fence. The runner then recorded:

  • status: "failed"
  • current_step: "draft_report"
  • a draft_reviewed trace event with passed: false
  • findings that included Reviewer returned non-JSON output

That is not a reason to discard the harness. It is a good example of why the harness exists. The failure is visible, persisted, and explainable.

A thin “agent framework” demo might just shrug and tell you the agent failed. Here, you can inspect the exact review artifact and see the mismatch between expected and returned structure.
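One plausible fix, sketched here rather than taken from the repo, is a tolerant parser that accepts raw JSON or JSON wrapped in a markdown code fence before handing it to the reviewer logic:

```python
import json
import re

def parse_review(raw: str) -> dict:
    # Tolerant-parser sketch: accept plain JSON, or JSON wrapped in a markdown
    # code fence, which is what the local reviewer model actually returned.
    text = raw.strip()
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

review = parse_review('```json\n{"passed": true, "findings": []}\n```')
```

Whether the repo adopts this exact approach or a stricter retry-on-malformed-output policy, the important part is that the failure mode was diagnosable from the stored artifacts in the first place.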

The provider boundary is intentionally narrow

src/harness_engineering/provider.py handles model configuration and OpenAI-compatible interaction through:

  • ModelConfig
  • load_dotenv()
  • load_model_config()
  • OpenAICompatibleClient
  • build_report_markdown()
  • create_client_from_env()
  • doctor_check()

This file demonstrates one of the healthier patterns in the repo: provider code is boxed in.

The provider layer knows how to:

  • load environment-based configuration
  • talk to /models
  • send a chat completion request
  • build a report draft from supplied facts
  • validate connectivity through doctor_check()

It does not own workflow transitions, trace persistence, approval logic, or artifact policy.

That is the right relationship.

A lot of projects accidentally let the provider abstraction swell until it starts dictating application semantics. This repo keeps the provider as an adapter, which makes the rest of the runtime easier to reason about.
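The environment-driven configuration half of that adapter is easy to sketch. The variable names below (HARNESS_BASE_URL and so on) are illustrative assumptions, not necessarily the keys the repo's load_model_config() reads; the shape is what matters.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    base_url: str
    model_name: str
    api_key: Optional[str] = None

def load_model_config(env=None) -> Optional[ModelConfig]:
    # Environment variable names here are illustrative assumptions, not
    # necessarily the keys the repo's load_model_config() reads.
    env = os.environ if env is None else env
    base_url = env.get("HARNESS_BASE_URL")
    model_name = env.get("HARNESS_MODEL")
    if not base_url or not model_name:
        return None  # unconfigured: the harness falls back to mock mode
    return ModelConfig(base_url, model_name, env.get("HARNESS_API_KEY"))

cfg = load_model_config({"HARNESS_BASE_URL": "http://localhost:8000/v1",
                         "HARNESS_MODEL": "gemma4"})
```

Returning None instead of raising is the design choice that keeps provider use optional: the rest of the runtime branches on "is there a client?" rather than assuming one exists.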

The reviewer module is a quiet preview of future scaling problems

src/harness_engineering/reviewer.py contains:

  • build_plan()
  • review_markdown()
  • create_plan_from_env()
  • review_from_env()

The repo already separates:

  • planning from execution
  • review from drafting
  • provider-backed behavior from mock behavior

That separation is useful, but it also creates contract pressure. As soon as one component returns output that the next component expects in a stricter format, reliability starts depending on how carefully you define interface boundaries.

The fenced-JSON review failure is the obvious example. In later, larger systems, this is exactly where teams end up needing:

  • stronger schemas
  • output validation
  • tolerant parsers
  • typed envelopes for agent-to-agent handoff
  • retry policies specific to structured-output failures

What the demo proves

This repo is still compact, but it proves several important design points cleanly.

1. A harness should have named parts, not one mystical loop

The split across CLI, runner, tools, store, models, tracing, provider, and reviewer makes the design inspectable. That is already better than many agent demos, where everything is buried inside framework callbacks.

2. The state model should be durable and explicit

RunState is not elegant theater. It is operational truth. Because approval, current step, artifacts, and prior outcomes are stored explicitly, the system can pause and resume without improvising.

3. Tool contracts belong in application code

Tool, ToolRegistry, and default_registry() make the available actions visible and reviewable. The risky flag on finalize_report shows the start of a policy boundary.

4. Traces are not a luxury add-on

The .runs/<run_id>/trace.json artifact means failed runs are diagnosable. That matters more than prompt cleverness once real side effects appear.

5. Provider optionality is healthier than provider centrality

The harness works in mock mode and can optionally use the repo-local OpenAI-compatible endpoint. That lets the system remain runnable and testable even when external model behavior is noisy.

What it still does not solve

The anatomy is solid for a demo, but it has clear limits.

1. The workflow is a hand-written linear state machine

That is fine here, and arguably better than premature graph complexity. But it does mean alternate branches, compensating actions, and richer orchestration policies are still manual.

2. Tool schemas are descriptive, not enforced

input_schema in Tool is informative, but the repo is not yet validating tool inputs or outputs with a runtime schema system. That becomes more important as tool count rises.

3. Approval policy is binary and local

The harness can pause before a risky write, but it does not yet support richer policies like role-based authorization, multiple approval classes, environment-specific restrictions, or sandbox enforcement.

4. Reviewer output handling is brittle

The current review_markdown() path expects clean JSON from the model. Real providers frequently return fenced JSON or extra prose. The verified failure in this run shows that the repo still needs more robust structured-output handling.

5. Persistence is local-only

Saving runs under .runs/ is excellent for clarity and local durability, but it is not a distributed execution substrate. There is no external datastore, leasing, queueing, or multi-worker coordination.

6. Observability is file-based, not platform-grade

The traces are useful, but there is no metrics backend, no dashboarding, no span correlation, and no evaluation harness tied into trace replay yet.

Why this anatomy is more important than most agent marketing

The industry spends a lot of energy on labels like autonomous agent, multi-agent architecture, AI workflow platform, and MCP-native runtime. Those terms are not useless, but they blur the engineering questions that actually matter.

When a run fails, can you inspect it? When a risky action is pending, can you prove the system is waiting? When a tool misbehaves, can you see the attempt count and artifact state? When a provider returns malformed structure, do you get a visible failure mode instead of a ghost story?

This repo gives good answers to those questions precisely because its anatomy is plain.

Before talking about MCP adapters, richer orchestration, durable execution upgrades, or policy engines, it helps to identify the minimum organs of a useful harness. In this demo, those organs are already there:

  • command surface
  • state model
  • runner
  • tool contracts
  • persistence
  • traces
  • provider boundary
  • review boundary
  • approval gate

You can argue about implementation details. You can extend every one of them. But if one of those organs is missing entirely, the system is usually still in demo land.

What changed in the repo for this post

No repo code changed for this article. The current public baseline was sufficient for an anatomy post.

That said, verification for this run surfaced a meaningful limitation in the existing code: the reviewer path in src/harness_engineering/reviewer.py is currently too strict about raw JSON-only responses from the local model. I am documenting that limitation here rather than pretending the run succeeded cleanly. That issue is likely worth addressing before a later post leans harder on model-backed review behavior.

The practical takeaway

If you are building an agent system and cannot quickly point to:

  • where run state lives
  • how approvals are represented
  • how traces are recorded
  • how tool contracts are defined
  • how failures are resumed or diagnosed

then you probably do not have a harness yet. You have an agent-shaped demo.

The good news is that the anatomy does not need to be massive before it becomes useful. This repo proves that a few carefully separated files can already buy you clarity, durability, and operational honesty.

That is the shift. Not “make the prompt smarter.” Make the runtime legible.

References

  1. 67 AI Lab, harness-engineering repository: https://github.com/67ailab/harness-engineering
  2. Model Context Protocol Specification (2025-06-18): https://modelcontextprotocol.io/specification/2025-06-18
  3. OpenAI API docs, “Using tools”: https://developers.openai.com/api/docs/guides/tools
  4. Anthropic API docs, “Messages”: https://docs.anthropic.com/en/api/messages