The live demo repo for this series is 67ailab/harness-engineering, and for this post I changed the repo before publishing. The new capability shipped in commit 352fba2, which adds a first-class pending-approval inspection surface to the existing approval-gated harness. The key changes are in src/harness_engineering/runner.py, src/harness_engineering/cli.py, and src/harness_engineering/store.py.

That matters because most writing about “human in the loop” in agent systems is still weirdly sloppy. A model says “should I proceed?”, a human types “yes”, and the demo declares the governance problem solved. It is not solved. In production, approval is not a vibe, not a chat convention, and not a magical hidden boolean inside the runtime. It is a workflow boundary with state, context, inspection, and recovery semantics.

My practical claim in this post is simple: human-in-the-loop only becomes real when approval is treated as an explicit runtime primitive with a durable pending action, inspectable details, and a clean resume path.

The harness-engineering repo now demonstrates that idea a little more honestly than it did yesterday.

What changed in the repo since the previous post

Post 6 separated working context, session state, and retrieval memory. That was useful groundwork, but it also exposed a gap: the repo could pause for approval, yet the operator-facing approval surface was still too implicit.

Before this update, the harness already had:

  • a risky step: finalize_report
  • a pause state: waiting_approval
  • explicit workflow flags on RunState in src/harness_engineering/models.py
  • CLI commands to approve and resume
  • an interactive mode in src/harness_engineering/cli.py

That was enough to show the shape of approval gating. It was not enough to show good operator ergonomics.

So for Post 7 I added one small but important capability: structured pending-action inspection.

The concrete repo changes are:

  • HarnessRunner.run_until_pause_or_complete() in src/harness_engineering/runner.py now builds and persists state.artifacts["pending_action_details"] before entering the approval gate.
  • That payload includes:
    • the pending action name
    • the tool name
    • whether the tool is risky
    • the step that requested it
    • the reason approval is required
    • the proposed output path
    • a draft preview with line count, character count, and excerpt
    • the reviewer result
    • the next CLI commands to inspect, approve, and resume
  • RunStore.build_summary() in src/harness_engineering/store.py now includes pending_action_details in the saved summary surface.
  • src/harness_engineering/cli.py gained a new pending command so operators can inspect the current approval request without dumping raw state.
  • The interactive command now prints the approval reason, output path, and draft size before asking for confirmation.
  • tests/test_harness.py now verifies the new summary payload and the new CLI command.
  • README.md documents the new inspection flow.

That is not a huge framework feature. It is better than that. It is a sharp improvement in the one place where a human actually meets the harness.

Why “human in the loop” is often implemented badly

There are a few common failure modes.

1. Approval is treated as chat text, not system state

The agent says, “I’m about to do something risky. Continue?” The user replies “yes.” The app then has to infer:

  • which action that “yes” applies to
  • whether it is still the current pending action
  • whether the pending action changed after another model turn
  • whether the response should approve, edit, reject, or just answer a question

That is fragile.

2. The human cannot see enough to make a decision

Real approval requires context. Not infinite context, but enough context.

If the operator cannot inspect:

  • what tool is about to run
  • why it was classified as risky
  • what side effect will happen
  • what artifact will be written or changed
  • what draft or arguments are being approved

then approval is just theater.

3. Pause and resume are bolted on after the fact

Many agent demos pause by accident rather than by design. The process blocks on input, or a web UI waits for a click, but the runtime has no durable model of “there is a pending approval request with these exact parameters.” That breaks as soon as the process exits or the reviewer comes back later.

4. Approval has no audit trail

If the only record is chat history, you have a compliance problem, an operator problem, and a debugging problem.

Good HITL design needs traceable state transitions.

What the current demo does

The live repo is still intentionally small. That is part of why it is useful. The approval flow is easy to see.

The runtime backbone remains HarnessRunner in src/harness_engineering/runner.py plus RunState in src/harness_engineering/models.py.

The main path is still:

  1. search_mock
  2. extract_facts
  3. draft_report
  4. reviewer check
  5. approval gate before finalize_report
  6. finalize_report writes the markdown file

The important design choice is that finalize_report in src/harness_engineering/tools.py is marked risky in default_registry(). That means the harness has a named side-effect boundary instead of pretending file writes are ordinary model output.
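
To make that concrete, here is a minimal sketch of what a risk-flagged registry can look like. It is not the repo's exact code; the ToolSpec shape and field names are assumptions, but the principle is the point: riskiness lives on the tool definition, not in prompt text.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class ToolSpec:
    # Hypothetical tool descriptor; the repo's actual registry may differ.
    name: str
    fn: Callable[..., str]
    risky: bool = False
    reason: str = ""

def finalize_report(draft: str, output_path: str) -> str:
    # The side effect that makes this tool risky: it writes to disk.
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(draft)
    return output_path

def default_registry() -> Dict[str, ToolSpec]:
    # Only the side-effecting tool carries the flag; pure tools need no gate.
    return {
        "finalize_report": ToolSpec(
            name="finalize_report",
            fn=finalize_report,
            risky=True,
            reason="Writes the reviewed markdown report to disk.",
        ),
    }

The runner can then check the risky flag before calling a tool, which is what turns a polite warning into an enforceable gate.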

Inside HarnessRunner.run_until_pause_or_complete(), once draft_report succeeds and the reviewer passes, the runner does not immediately call finalize_report. Instead it does the following (sketched in code after the list):

  • computes the future output path
  • creates a structured pending-action payload
  • stores that payload under state.artifacts["pending_action_details"]
  • sets state.current_step = "finalize_report"
  • sets state.requires_approval = True
  • sets state.pending_action = "finalize_report"
  • sets state.status = "waiting_approval"
  • emits an approval_required trace event via add_trace() from src/harness_engineering/tracing.py
  • saves the run through RunStore.save()
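
Here is a hedged sketch of that gate as a standalone function. It mirrors the description above rather than the repo's actual code: the RunState stub is trimmed to the fields the gate touches, the helper name pause_for_approval is invented, the exact string values and the 400-character excerpt limit are illustrative, and the trace and save steps are left as comments.

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class RunState:
    # Stub with only the fields the gate touches; the real model has more.
    status: str = "running"
    current_step: str = ""
    requires_approval: bool = False
    pending_action: str = ""
    artifacts: Dict[str, Any] = field(default_factory=dict)

def pause_for_approval(state: RunState, draft: str, output_path: str,
                       review: Dict[str, Any]) -> RunState:
    # Build the structured payload the operator will later inspect.
    state.artifacts["pending_action_details"] = {
        "action": "finalize_report",
        "tool_name": "finalize_report",
        "tool_risky": True,
        "requested_by_step": "draft_report",
        "reason": "finalize_report writes the reviewed markdown report to disk.",
        "proposed_output_path": output_path,
        "draft_preview": {
            "line_count": draft.count("\n") + 1,
            "char_count": len(draft),
            "excerpt": draft[:400],
        },
        "review": review,
        "next_commands": [
            "PYTHONPATH=src python3 -m harness_engineering.cli pending --latest",
        ],
    }
    # Flip the run into the explicit waiting state.
    state.current_step = "finalize_report"
    state.requires_approval = True
    state.pending_action = "finalize_report"
    state.status = "waiting_approval"
    # The real runner also emits an approval_required trace event via add_trace()
    # and persists the run through RunStore.save().
    return state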

That is the heart of the post. The human approval is not outside the machine. It is represented inside the runtime model.

The new operator-facing approval surface

The new CLI command is the cleanest example:

PYTHONPATH=src python3 -m harness_engineering.cli pending --latest

That command is implemented as cmd_pending() in src/harness_engineering/cli.py. It loads the run and prints a structured payload containing the current approval state and the stored pending_action_details.

A real output now includes fields like:

  • run_id
  • status
  • requires_approval
  • pending_action
  • details.action
  • details.tool_name
  • details.tool_risky
  • details.requested_by_step
  • details.reason
  • details.proposed_output_path
  • details.draft_preview
  • details.review
  • details.next_commands

That is much closer to what an actual operator panel should expose.
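
For comparison, here is a hedged sketch of what such a pending command can look like. It is not the repo's cmd_pending(): the --latest resolution is simplified to "newest directory name", and it assumes the approval fields live in the saved summary.json described just below.

import json
from pathlib import Path
from typing import Optional

def cmd_pending(runs_dir: str = ".runs", run_id: Optional[str] = None) -> None:
    # Resolve the run: an explicit id, or (simplified) the latest run directory.
    root = Path(runs_dir)
    run_dir = root / run_id if run_id else sorted(p for p in root.iterdir() if p.is_dir())[-1]

    # Load the persisted summary rather than raw internal state.
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))

    # Print only the slice an operator needs to make the approval decision.
    print(json.dumps({
        "run_id": summary.get("run_id"),
        "status": summary.get("status"),
        "requires_approval": summary.get("requires_approval"),
        "pending_action": summary.get("pending_action"),
        "details": summary.get("pending_action_details"),
    }, indent=2))

The point is the shape: load persisted state, project out the decision-relevant slice, print it.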

The same details also flow into RunStore.build_summary() in src/harness_engineering/store.py, which means .runs/<run_id>/summary.json now carries approval context instead of only generic run metadata.

That may sound minor. I think it is exactly the right kind of minor. Human-in-the-loop systems usually fail on this boring connective tissue, not on the headline demo step.

Why approval is a workflow primitive, not a UI prompt

The best recent agent frameworks point in the same direction, even when they implement it differently.

LangChain’s HITL middleware docs describe interrupts that pause execution based on tool-specific policy, with graph state persisted via LangGraph’s persistence layer. The important part is not the exact API. The important part is the model: tool calls are checked against policy, execution interrupts, state is saved, and the run resumes after a human decision. That is runtime behavior, not chat etiquette.

The OpenAI Agents SDK docs make a similar point from another angle. When a tool requires approval, the run pauses, returns interruptions, and later resumes from the same RunState. Their docs also make clear that approvals can surface at the outer run even when the tool belongs to a nested agent-as-tool execution. Again, the key lesson is architectural: approvals belong to run state.

Temporal is the stricter systems reference. Temporal is not an agent framework, but its documentation is useful because it is uncompromising about durable execution. If your workflow pauses for human review, that pause has to be part of the workflow semantics, not an accidental property of a foreground process.

This repo is much smaller and less rigorous than Temporal. Good. It should not pretend otherwise. But it is now teaching the right instinct.

A real verified flow in the repo

Before writing this post, I verified the live repo state and local-model connectivity.

First I ran the required checks in /home/james/.openclaw/workspace/harness-engineering:

make check
PYTHONPATH=src python3 -m harness_engineering.cli doctor

Those passed.

  • make check ran the test suite and the secret scan.
  • The updated test suite passed with 23 tests.
  • scripts/secret_scan.py reported no obvious secrets in tracked files.
  • doctor succeeded against the repo’s configured local OpenAI-compatible endpoint with:
    • provider: openai_compatible
    • model: gemma4
    • base URL: http://192.168.0.16:8080/v1
    • status: ok
    • message: MODEL_OK

I also exercised the new approval surface through the CLI-driven tests and observed a real pending payload that included:

  • risky tool: finalize_report
  • reason text explaining that it writes the reviewed markdown report to disk
  • a concrete proposed output path under .runs/<run_id>/final_report.md
  • a draft preview with line count and excerpt
  • reviewer status
  • exact follow-up commands

That matters because it means the article is describing the repo that exists now, not a hypothetical future branch.

The practical design rule: approvals need enough context, but not too much

One subtle design problem with HITL systems is overexposure.

If the approval surface is too thin, the human cannot make a responsible decision. If the approval surface is too thick, operators drown in irrelevant data.

The current repo aims for a middle ground.

The pending_action_details payload includes the context needed to answer:

  • What is about to happen?
  • Why is this risky?
  • What file will be written?
  • What content is being approved?
  • Did the reviewer already pass this draft?
  • What do I run next?

But it does not dump every trace event, every source document, or every intermediate result into the approval response.

That is the right instinct. Approval UIs should be decision-oriented.

What the demo proves

1. Approval can be represented as explicit durable state

This repo no longer relies on a human remembering what they were approving. RunState plus pending_action_details make the approval boundary explicit and persistent.

2. Operator inspection should be first-class

cmd_pending() in src/harness_engineering/cli.py is a small addition, but it models an important design principle: do not make operators parse raw internal state just to answer a simple workflow question.

3. Human review is more than approve-or-block

The current demo still implements a binary approval step, but the structure now makes the richer future obvious. Once a pending action is a named object with context, the system can evolve toward:

  • approve
  • reject
  • edit arguments
  • annotate with reason
  • delegate to policy or role-based approvers

That is much easier when approval is already modeled as a runtime object.
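
None of this exists in the repo today, but a sketch shows why the evolution is cheap once the pending action is plain data; the decision vocabulary and the resolve() helper below are hypothetical.

from enum import Enum
from typing import Any, Dict, Optional

class ApprovalDecision(str, Enum):
    # Hypothetical vocabulary; the current harness only needs the first two.
    APPROVE = "approve"
    REJECT = "reject"
    EDIT_ARGS = "edit_args"

def resolve(pending: Dict[str, Any], decision: ApprovalDecision,
            edited_args: Optional[Dict[str, Any]] = None,
            note: str = "") -> Dict[str, Any]:
    # Fold a reviewer decision into the pending-action payload as data,
    # so the runner can act on it without re-parsing chat text.
    resolved = dict(pending)
    resolved["decision"] = decision.value
    resolved["note"] = note
    if decision is ApprovalDecision.EDIT_ARGS and edited_args:
        resolved["args"] = {**resolved.get("args", {}), **edited_args}
    return resolved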

4. HITL belongs next to tracing and durability, not separate from them

Because the repo persists state through RunStore.save() and appends trace events through add_trace(), the approval story is inspectable after the fact. That is operationally meaningful.

What it still does not solve

This section matters more than the fancy one.

1. The demo still has only one approval mode

Right now the practical decision is still yes or no around finalize_report. The system does not yet support editing tool arguments, rejecting with operator feedback to the model, or multi-party approval chains.

2. Policy is still mostly hard-coded

The risky boundary is encoded in the tool registry and runner behavior. That is fine for a teaching repo, but a broader system would want configurable policy files, role mapping, and environment-specific approval rules.

3. There is no dedicated web approval UI

The CLI surface is honest and useful, but it is still a CLI surface. Many real systems need a queue, audit dashboard, and reviewer identity model.

4. It is not a full deterministic workflow runtime

The repo persists state and resumes cleanly, but it is not Temporal-style deterministic replay. It is a local file-backed harness, not a distributed workflow engine.

5. Reviewer quality still depends on provider behavior

As earlier posts in this series already showed, model-backed review remains a real failure surface. If the reviewer returns malformed structure or drifts stylistically, the approval boundary may never be reached, or the run may fail before human review even begins.

6. The approval object is informative, but not yet identity-aware

The repo records the pending action and now annotates approval timing, but it does not yet track who approved, under what role, under which policy version, or from which interface.

That will matter as soon as the harness is used by more than one operator.

What I think the industry gets wrong here

There is too much focus on whether the model “asks for permission” and not enough focus on whether the runtime can prove what was pending, what was approved, and what resumed.

If I had to reduce this whole post to one engineering test, it would be this:

Could another operator, arriving later and cold, inspect the system and understand exactly what decision is being requested?

If the answer is no, then your human-in-the-loop design is probably cosmetic.

That is why I like the new repo change even though it is modest. It moves the harness toward operator legibility.

Not toward a demo where the model sounds polite.

The practical takeaway

When you add human review to an agent system, do not start with the prompt. Start with the runtime object model.

Define (a minimal sketch follows the list):

  • what action is pending
  • why it needs review
  • what side effect it would cause
  • what context the reviewer needs
  • how that request is persisted
  • how approval changes run state
  • how execution resumes afterward
  • how the entire thing is traced
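
A minimal sketch of that object model, assuming nothing beyond the list above; every field name here is an assumption, not the repo's schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

@dataclass
class PendingApproval:
    action: str                        # what action is pending
    reason: str                        # why it needs review
    side_effect: str                   # what it would cause (file write, API call, ...)
    context: Dict[str, Any]            # what the reviewer needs in order to decide
    requested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    persisted_path: Optional[str] = None   # where the request is persisted
    decision: Optional[str] = None         # how approval changes run state
    decided_by: Optional[str] = None       # reviewer identity, once the system tracks it
    trace_events: List[str] = field(default_factory=list)  # how the whole thing is traced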

Then add the UI or chat layer on top.

That sequence matters. If you reverse it, you usually end up with a conversational illusion wrapped around a weak workflow.

The live harness-engineering repo is still small, still local, and still incomplete. But with pending_action_details, cmd_pending(), and the updated summary surface, it now demonstrates a more mature truth:

human-in-the-loop is not a message. It is a state transition.

That is the difference between a friendly demo and a system you can begin to trust.

References

  1. 67 AI Lab, harness-engineering repository: https://github.com/67ailab/harness-engineering
  2. LangChain documentation, “Human-in-the-loop”: https://docs.langchain.com/oss/python/langchain/human-in-the-loop
  3. OpenAI Agents SDK documentation, “Human-in-the-loop”: https://openai.github.io/openai-agents-js/guides/human-in-the-loop/
  4. OpenAI API documentation, “Guardrails and human review”: https://developers.openai.com/api/docs/guides/agents/guardrails-approvals
  5. Temporal documentation: https://docs.temporal.io/