The live demo repo for this series is 67ailab/harness-engineering, and I did change the repo before publishing this post. The new capability shipped in commit 9612b58, which adds persisted run summaries plus replay-oriented history inspection to the existing approval-gated harness. The key changes are in src/harness_engineering/store.py and src/harness_engineering/cli.py.

That addition matters because durable execution is where most agent demos quietly stop being honest. It is easy to show a model calling tools in one uninterrupted run. It is much harder to explain what happens when execution pauses for approval, the process dies, the machine reboots, the reviewer returns malformed output, or an operator needs to understand what state the run is actually in.

My practical claim in this post is simple: if an agent cannot pause, survive interruption, and resume with inspectable state, you do not have an operational system yet. You have a good-looking foreground task.

The current repo is still intentionally small. Good. Small code makes the durability story visible instead of hiding it inside framework abstractions.

What changed in the repo since the previous post

Post 4 focused on orchestration shape and added workflow export in src/harness_engineering/workflow.py. For Post 5, the repo needed a more concrete durable-execution surface.

The main new logic is in RunStore inside src/harness_engineering/store.py:

  • summary_path()
  • build_summary()
  • history()
  • updated save() logic that now writes summary.json alongside state.json and trace.json (sketched just below)
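
To make that concrete, here is a minimal sketch of the updated save() shape. I am assuming RunStore holds a base directory and that RunState exposes run_id, status, trace, and a to_dict() serializer; treat the names as illustration, not as the repo's exact API.

import json
from pathlib import Path

class RunStore:
    def __init__(self, base_dir: str = ".runs") -> None:
        self.base_dir = Path(base_dir)

    def run_dir(self, run_id: str) -> Path:
        d = self.base_dir / run_id
        d.mkdir(parents=True, exist_ok=True)
        return d

    def build_summary(self, state) -> dict:
        # Fuller version sketched later in the post.
        return {"run_id": state.run_id, "status": state.status}

    def save(self, state) -> None:
        d = self.run_dir(state.run_id)
        # One snapshot, one event log, one operator-facing digest.
        (d / "state.json").write_text(json.dumps(state.to_dict(), indent=2))
        (d / "trace.json").write_text(json.dumps(state.trace, indent=2))
        (d / "summary.json").write_text(json.dumps(self.build_summary(state), indent=2))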

The command surface changed in src/harness_engineering/cli.py too. The CLI now includes:

  • summary
  • history

Those commands sit next to the existing start, inspect, approve, and resume flow.

This sounds small. It is not. It changes the demo from “yes, state is technically on disk” to “yes, an operator can inspect run status, next actions, trace history, and resume metadata without spelunking raw JSON.” That is a real step toward a durable harness.

What durable execution actually means

This term gets abused.

Durability does not mean “we wrote a file once.” It means the runtime has enough persisted state and execution history to continue coherently after interruption. Depending on the system, that can mean different levels of rigor.

Temporal’s workflow docs are useful here because they are explicit: a workflow execution emits commands, processes events, and records an event history. On resume, Temporal replays workflow code against that history to reconstruct the same state, which is why determinism matters so much in Temporal-style systems.

LangGraph’s persistence docs describe a somewhat different but related model: checkpoints are saved at execution boundaries, tied to a thread, so a graph can be interrupted, inspected, resumed, and even replayed or forked later.

Those are more sophisticated durability models than this repo implements. But they make the right engineering point: pause/resume is not a UI flourish. It is a runtime property built on persisted state plus a coherent recovery model.

The harness-engineering repo is still a lightweight local harness. It is not pretending to be Temporal. It is not a general graph scheduler. It is not a distributed workflow engine. But it now demonstrates the minimum ideas cleanly enough to discuss them without hand-waving.

Where the current demo’s durability lives

The core execution logic is still in HarnessRunner inside src/harness_engineering/runner.py.

The important functions are:

  • HarnessRunner.create_run()
  • HarnessRunner.run_until_pause_or_complete()
  • HarnessRunner.approve()
  • HarnessRunner.resume()
  • HarnessRunner._execute()

And the persisted model is RunState in src/harness_engineering/models.py.

That structure already mattered before this post. A RunState carries fields like:

  • run_id
  • status
  • current_step
  • requires_approval
  • approved
  • pending_action
  • artifacts
  • trace
  • step_results

That is durable-execution material, not just app state. Those fields encode where the run is, what it already did, what it is waiting on, and what artifacts it produced.
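
As a rough illustration, that model could be written as the dataclass below. The field names come straight from the list above; the types and defaults are my guesses, since the actual repo may use a different representation.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunState:
    run_id: str
    status: str = "running"               # running / waiting_approval / failed / ...
    current_step: Optional[str] = None    # next step the runner will execute
    requires_approval: bool = False       # true while blocked on a human
    approved: bool = False                # flipped by approve()
    pending_action: Optional[str] = None  # names the blocked operation
    artifacts: dict = field(default_factory=dict)     # artifact name -> path
    trace: list = field(default_factory=list)         # append-only event log
    step_results: list = field(default_factory=list)  # per-step attempts and outcomes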

The actual pause point is the approval gate in HarnessRunner.run_until_pause_or_complete(). After draft_report succeeds and review_from_env() passes, the runner does not execute finalize_report immediately. Instead it sets:

  • current_step = "finalize_report"
  • requires_approval = True
  • pending_action = "finalize_report"
  • status = "waiting_approval"

Then it persists state and stops.
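
Distilled into a standalone helper against the sketched RunState and RunStore above, the pause looks roughly like this. The repo inlines this logic in the runner; this is illustration, not repo code.

def pause_for_approval(state, store, step: str) -> None:
    # Represent the pause as explicit runtime state, persist it, then stop.
    state.current_step = step
    state.requires_approval = True
    state.pending_action = step
    state.status = "waiting_approval"
    store.save(state)

# e.g. pause_for_approval(state, store, "finalize_report")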

That is the right design instinct. Approval is represented as explicit runtime state, not as a polite suggestion in chat.

Why the new summary/history surface matters

Before this repo change, the harness already persisted state.json and trace.json under .runs/<run_id>/. That was real, but slightly too raw for the story this post needs to tell.

A durable system should make three questions cheap to answer:

  1. What happened?
  2. What state is the run in now?
  3. What should the operator do next?

The new RunStore.build_summary() function in src/harness_engineering/store.py is meant to answer exactly those questions.

For every saved run, the repo now writes .runs/<run_id>/summary.json containing:

  • run status and current step
  • whether approval is still required
  • planner and reviewer used
  • step counts and total attempts
  • pause/resume/approval counters
  • final artifact paths if present
  • can_resume
  • concrete next_commands

That last part is underrated. Durable systems are not just about storing state; they are about storing operationally useful state.
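
Here is a rough sketch of what a build_summary() like that might compute. The keys mirror the list above, but the resumability rule and the command strings are my assumptions, not the repo's literal code.

# Method on the sketched RunStore above.
def build_summary(self, state) -> dict:
    # Condense the snapshot into the three operator questions:
    # what happened, where is the run now, what should I do next.
    can_resume = state.status in {"waiting_approval", "failed"}  # rule assumed
    next_commands = []
    if state.status == "waiting_approval":
        # Placeholder invocations; the repo's actual flags may differ.
        next_commands = [
            f"python3 -m harness_engineering.cli approve {state.run_id}",
            f"python3 -m harness_engineering.cli resume {state.run_id}",
        ]
    return {
        "run_id": state.run_id,
        "status": state.status,
        "current_step": state.current_step,
        "requires_approval": state.requires_approval,
        "pending_action": state.pending_action,
        "steps_recorded": len(state.step_results),
        "artifacts": state.artifacts,
        "can_resume": can_resume,
        "next_commands": next_commands,
    }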

The new CLI entry points in src/harness_engineering/cli.py make that visible:

PYTHONPATH=src python3 -m harness_engineering.cli summary --latest
PYTHONPATH=src python3 -m harness_engineering.cli history --latest
PYTHONPATH=src python3 -m harness_engineering.cli history --latest --event approval_required
PYTHONPATH=src python3 -m harness_engineering.cli history --latest --tail 5

That is not enterprise workflow software. But it is exactly the kind of concrete inspection surface most toy agent demos omit.

The difference between resumable state and real replay

This is the most important design distinction in the post.

The repo is now resumable.

It is not fully replay-deterministic in the Temporal sense.

Those are related, but not identical, properties.

In this repo, resume works because the runner persists enough explicit state to know what step comes next. HarnessRunner.resume() loads the saved RunState, appends a run_resumed trace event, and hands control back to run_until_pause_or_complete(). Since current_step and approval flags are already saved, the harness can continue from the right place.
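
Sketched against the store above, and assuming a load() counterpart to save(), the resume path is short. add_trace() is sketched later in the post.

class HarnessRunner:  # resume path only; other methods elided
    def __init__(self, store) -> None:
        self.store = store

    def resume(self, run_id: str):
        state = self.store.load(run_id)  # rehydrate from .runs/<run_id>/state.json
        add_trace(state, "run_resumed")  # the resume itself is a traced event
        # current_step and the approval flags are already in the snapshot,
        # so the normal loop picks up from the right place.
        return self.run_until_pause_or_complete(state)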

That is good. It is practical. It is enough for a local approval-gated demo.

But it is not the same as event-sourced replay where the whole workflow is reconstructed from an authoritative event history and replay-safe code path. The repo still contains normal process-time behavior and non-deterministic edges that a stricter workflow engine would isolate more aggressively.

Examples:

  • RunStore.save() writes snapshots, not an append-only command history that drives all replay.
  • finalize_report() in src/harness_engineering/tools.py performs direct file I/O as a tool side effect.
  • flaky_echo() exists precisely to model retryable non-deterministic behavior.
  • model-backed planning/review in src/harness_engineering/reviewer.py and src/harness_engineering/provider.py are external calls, not replay-safe pure workflow logic.

That does not make the repo wrong. It just means the right claim is modest: the demo shows checkpointed pause/resume with inspectable state, not full deterministic workflow replay.

The most useful durable-execution lesson in the repo

The approval gate is still the best lesson.

A lot of agent systems treat human approval as something outside the runtime. The model says, “Should I continue?” and then the application sort of remembers the answer.

That is weak design.

In this repo, approval is represented as a durable workflow state transition:

  • a run enters waiting_approval
  • pending_action names the blocked operation
  • approve() flips approval state and records approval_granted
  • resume() continues execution from the stored step

That is the right pattern because the system remains legible even if the process exits between those moments.
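
The approve() half of that handshake, sketched in the same style. The guard condition and the error type are my choices, not necessarily the repo's.

# Method on the sketched HarnessRunner above.
def approve(self, run_id: str) -> None:
    state = self.store.load(run_id)
    if not state.requires_approval:
        raise ValueError(f"run {run_id} is not waiting for approval")
    state.approved = True
    state.requires_approval = False
    add_trace(state, "approval_granted")  # record the decision as a durable event
    self.store.save(state)                # persisted before anything resumes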

An operator can come back later and answer:

  • Is this run still blocked?
  • What action is pending?
  • Has it already been approved?
  • Can it resume?
  • Where will the final artifact go?

Those are system questions, not prompt questions.

A real run that shows both value and limitation

I ran the repo locally against its configured OpenAI-compatible endpoint before publishing this post:

cd /home/james/.openclaw/workspace/harness-engineering
PYTHONPATH=src python3 -m harness_engineering.cli doctor

That check succeeded against the local provider with status: "ok" and message: "MODEL_OK".

Then I ran a real demo topic for this article. The interesting result was not a clean success path. It was a failure path that proves why persisted summaries matter.

The run drafted the markdown successfully using the local model provider, but the reviewer path in review_markdown() inside src/harness_engineering/reviewer.py got back fenced JSON instead of clean JSON. Because that function currently expects plain JSON from the reviewer model, the run was marked failed.

That is annoying. It is also honest and useful.

The new summary output made the failure legible immediately:

  • status: "failed"
  • current_step: "draft_report"
  • planner: "openai_compatible"
  • reviewer: "openai_compatible"
  • review_passed: false
  • review_findings contained the truncated non-JSON reviewer output

And the new history output showed the exact progression:

  • tool_ok for search_mock
  • tool_ok for extract_facts
  • tool_ok for draft_report
  • draft_reviewed with the reviewer failure payload

That is exactly what durable-execution surfaces are for. Not just happy paths. Postmortems.

Without that summary/history layer, the failure would still exist, but understanding it would require more manual inspection of raw state and trace files. With the new layer, the failure becomes an inspectable system event.

Why durability changes the architecture

Once you care about pause/resume, several design choices stop being optional.

1. State must be explicit

You cannot resume cleanly from vibes. The RunState model exists because the system needs named fields for status, step, pending actions, artifacts, and prior step results.

2. Risk boundaries must be explicit

The repo treats finalize_report as risky in src/harness_engineering/tools.py, and the runner treats that risk as a state boundary. Good. Durable systems need named gates.

3. Operator inspection must be first-class

This is what the new summary and history commands add. If only the original developer can understand whether a run is safe to resume, the durability story is weak.

4. Traces matter as much as checkpoints

A snapshot tells you where the system is. A trace helps explain how it got there. The repo uses add_trace() in src/harness_engineering/tracing.py to record events like tool_start, tool_ok, draft_reviewed, approval_required, approval_granted, and run_resumed.

That event vocabulary is small, but it is enough to reason about the run.
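
A tracing helper in that spirit can be tiny. This is a sketch with an assumed schema, not the repo's exact add_trace():

import time

def add_trace(state, event: str, **payload) -> None:
    # Append a timestamped event to the run's trace. The snapshot says where
    # the run is; this log says how it got there.
    state.trace.append({"ts": time.time(), "event": event, **payload})

# e.g. add_trace(state, "tool_ok", tool="draft_report", attempts=1)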

5. Retries must be visible

RetryPolicy.call() in src/harness_engineering/runner.py already records attempts into StepResult. That matters because retry state changes the interpretation of failures and latency. Durable systems should not hide repeated attempts inside a black box.
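
One plausible shape for that, assuming StepResult is dict-like; the repo's actual signature and attempt schema may differ.

import time

class RetryPolicy:
    def __init__(self, max_attempts: int = 3, delay_s: float = 0.5) -> None:
        self.max_attempts = max_attempts
        self.delay_s = delay_s

    def call(self, fn, step_result: dict):
        # Record every attempt, success or failure, so retries stay visible
        # in persisted state instead of disappearing into a black box.
        for attempt in range(1, self.max_attempts + 1):
            try:
                value = fn()
                step_result.setdefault("attempts", []).append({"n": attempt, "ok": True})
                return value
            except Exception as exc:
                step_result.setdefault("attempts", []).append(
                    {"n": attempt, "ok": False, "error": str(exc)}
                )
                if attempt == self.max_attempts:
                    raise
                time.sleep(self.delay_s)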

What the demo proves

The current repo proves a few narrow but important things.

1. Pause/resume can be modeled cleanly without a giant framework

A small explicit runner plus a persisted RunState is enough to demonstrate coherent approval-gated resumption.

2. Approval works better as workflow state than as chat etiquette

waiting_approval, pending_action, approve(), and resume() make the control boundary durable and inspectable.

3. Persisted summaries materially improve operator usability

Writing summary.json next to state.json and trace.json, and exposing that via cmd_summary() and cmd_history(), makes the demo much more honest as a durability example.

4. Failures become actionable when trace and state are easy to inspect

The reviewer JSON-format failure from the local provider path was not pleasant, but the new summary/history surfaces made it obvious what happened.

What it still does not solve

This is the part many agent articles skip. I do not think you should skip it.

1. It is not a deterministic replay engine

This repo checkpoints state and records traces, but it does not replay workflow code against an authoritative event history the way Temporal does.

2. It does not provide distributed fault tolerance

Everything is local and file-based under .runs/. That is fine for a demo, but not the same thing as resilient multi-worker execution.

3. Side effects are not isolated with strong idempotency guarantees

finalize_report() writes directly to disk. In a bigger system, you would want stronger side-effect semantics, especially across retries and restarts.
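
For contrast, one conventional pattern for making a file-writing side effect safer across retries and restarts is skip-if-done plus temp-file-and-rename. This is a generic sketch, not something the repo implements:

import os
import tempfile
from pathlib import Path

def write_artifact_idempotently(path: Path, content: str) -> None:
    # Skip work a prior attempt already completed (only safe if the content
    # is deterministic), and write via temp-file-plus-rename so a crash
    # mid-write never leaves a half-written artifact behind.
    if path.exists():
        return
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, path)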

4. Reviewer robustness is still weak

The local-model path in review_markdown() can still fail when the reviewer returns fenced JSON or extra prose. That is a real repo limitation today.
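
A common mitigation is to strip markdown fences before parsing. Something like the sketch below would handle the exact failure from the demo run; I am showing one possible fix, not a change the repo has shipped:

import json
import re

def parse_reviewer_json(text: str) -> dict:
    # Tolerate markdown-fenced model output like ```json ... ``` by
    # extracting the fenced body before parsing.
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text.strip())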

5. Resume semantics are step-based, not arbitrary time travel

You can resume from the saved run state. You cannot fork alternate trajectories from arbitrary checkpoints the way richer checkpointing systems can.

The practical engineering rule

If you are building an agent system, ask a blunt question early:

What happens if the process stops halfway through?

If the answer is:

  • “we lose the run”
  • “the operator has to start over”
  • “approval was only in chat history”
  • “I need to read logs manually to know what step failed”

then the system is not ready for real work, no matter how impressive the prompt looks.

A durable harness does not need to start life as Temporal or Cadence or a full graph platform. But it does need explicit state, restart-aware boundaries, visible traces, and an inspection surface an operator can actually use.

That is why this repo change was worth making before publishing the post. It raises the demo from “yes, it stores files” to “yes, it exposes a coherent pause/resume story with operator-readable metadata.”

That is not the end of durable execution. It is the beginning of it.

References

  1. 67 AI Lab, harness-engineering repository: https://github.com/67ailab/harness-engineering
  2. Temporal docs, “Temporal Workflow”: https://docs.temporal.io/workflows
  3. LangGraph docs, “Persistence”: https://docs.langchain.com/oss/python/langgraph/persistence
  4. Anthropic, “Building effective agents”: https://www.anthropic.com/engineering/building-effective-agents