The live demo repo for this series is 67ailab/harness-engineering, and for this post I added a real new capability before publishing. The repo now includes a small MCP-style adapter layer in src/harness_engineering/mcp.py, plus CLI entry points to inspect tool descriptors and call tools through that boundary. The exact repo change shipped in commit e21f361.

That addition matters because this is the first point in the series where the demo has to answer a question the broader ecosystem now forces on every agent builder: what exactly is the boundary between your harness and the tool protocol?

Function calling got the industry comfortable with the idea that models can select structured operations instead of only emitting prose. MCP pushed that conversation further by standardizing how tools can be described and called across hosts and servers. Both are useful. Neither is the whole system.

The practical claim of this article is simple: tool schemas and MCP are interface improvements, not replacements for harness engineering. They help standardize discovery and invocation. They do not solve approvals, durability, retries, audit trails, or state transitions. In the demo repo, the new adapter layer makes that distinction easier to see in code.

What changed in the repo since the previous post

The main new file is src/harness_engineering/mcp.py. It introduces five functions that form a protocol-friendly boundary around the existing tool registry, sketched briefly after the list:

  • tool_to_mcp_descriptor()
  • registry_to_mcp_tools()
  • validate_tool_arguments()
  • call_tool()
  • call_tool_mcp()
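
For orientation, here is a rough signature sketch of that surface. The function names come from the repo; the parameter lists, return shapes, and docstrings are my assumptions, and later sections sketch the bodies.

    # Assumed signatures for the adapter surface in mcp.py; names are
    # real, parameters and return shapes are my reading, not verbatim.
    from typing import Any

    def tool_to_mcp_descriptor(tool: Any) -> dict:
        """Map one internal Tool to an MCP-style descriptor."""
        ...

    def registry_to_mcp_tools(registry: Any) -> list[dict]:
        """Export every registered tool as a list of descriptors."""
        ...

    def validate_tool_arguments(tool: Any, arguments: dict) -> list[str]:
        """Check required fields, unknown fields, and rough types."""
        ...

    def call_tool(registry: Any, name: str, arguments: dict) -> Any:
        """Validate arguments, then dispatch to the tool handler."""
        ...

    def call_tool_mcp(registry: Any, name: str, arguments: dict) -> dict:
        """Like call_tool(), but return an MCP-style result payload."""
        ...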

Two more changes make that boundary real rather than decorative.

First, src/harness_engineering/cli.py now includes two commands, sketched after the list:

  • mcp-tools to print MCP-style tool descriptors
  • mcp-call to invoke a tool through the adapter with JSON arguments
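
Here is a minimal sketch of how those two commands could be wired, assuming argparse and the adapter functions above. The import paths follow the repo layout, but the flag handling is my own, not verbatim cli.py code.

    # Hypothetical wiring for the two subcommands; import paths are
    # assumptions based on the repo layout.
    import argparse
    import json

    from harness_engineering.mcp import call_tool_mcp, registry_to_mcp_tools
    from harness_engineering.tools import default_registry

    def main() -> None:
        parser = argparse.ArgumentParser(prog="harness-engineering")
        sub = parser.add_subparsers(dest="command", required=True)

        sub.add_parser("mcp-tools", help="print MCP-style tool descriptors")

        call = sub.add_parser("mcp-call", help="invoke a tool through the adapter")
        call.add_argument("tool_name")
        call.add_argument("arguments", help="tool arguments as a JSON object")

        args = parser.parse_args()
        registry = default_registry()

        if args.command == "mcp-tools":
            print(json.dumps(registry_to_mcp_tools(registry), indent=2))
        else:
            payload = json.loads(args.arguments)
            print(json.dumps(call_tool_mcp(registry, args.tool_name, payload), indent=2))

    if __name__ == "__main__":
        main()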

Second, HarnessRunner._execute() in src/harness_engineering/runner.py no longer calls the raw tool handler directly. It routes execution through call_tool(), which means the harness now validates arguments against the declared tool schema before execution.

That is the right kind of upgrade for this stage of the series: small, concrete, and architecturally meaningful.

Why tool calling changed agent design

The older pattern in LLM applications was mostly “generate text, then parse intent out of it.” That works until it doesn’t. Once the model is expected to interact with files, APIs, or application state, free-form text becomes a bad control surface.

Tool calling improved that by letting models emit a structured request: call this function, with these arguments. OpenAI’s API docs frame tools as a way to give models access to external capabilities, including function calls and remote MCP servers. Anthropic describes a similar split between model-produced tool requests and application-side execution for client tools. In both designs, the model decides when a tool may be helpful, but your application still decides how execution really happens.

That distinction is easy to miss when demos are too thin. If all you see is a model choosing a function name and some JSON, it can feel as if the protocol solved the hard problem. It didn’t. It solved one important problem: make actions legible enough to be mediated in software.

That is already a big deal. Once tool inputs are structured, you can:

  • validate arguments
  • reject unknown fields
  • reason about risk classes
  • log requests cleanly
  • build stable adapters across providers
  • write tests against tool contracts

Those are harness-friendly properties. Tool calling makes them possible. Harness engineering decides what you do with them.

What MCP adds beyond ordinary function calling

The MCP specification pushes this idea toward interoperability. Instead of every app inventing its own opaque tool definition format, MCP describes a common shape for listing and calling tools, illustrated after the list:

  • tools have names and descriptions
  • they expose inputSchema
  • they can optionally expose annotations and output schemas
  • clients can call them through a standard request/response pattern
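
To make that shape concrete, here is an illustrative descriptor in that format. The tool name echoes the demo repo, but the field values are invented for this example, not pulled from a real server.

    # Illustrative MCP-style tool descriptor; values are made up.
    descriptor = {
        "name": "draft_report",
        "description": "Draft a short report from extracted facts.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "topic": {"type": "string"},
                "facts": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["topic", "facts"],
            "additionalProperties": False,
        },
        "annotations": {"readOnlyHint": True},
    }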

Just as important, the spec is explicit about trust boundaries. The tools section recommends human-in-the-loop confirmation for sensitive operations and says servers must validate inputs. That is not accidental wording. It is an admission that standardized tools still live inside risky workflows.

OpenAI’s remote MCP documentation makes the same point in a more productized form. The Responses API can import tools from a remote MCP server, surface mcp_list_tools, and create mcp_call entries when the model uses them. But OpenAI also exposes approval controls such as require_approval, because merely having a standard tool transport does not mean a tool call should run without review.

This is the exact reason I wanted post three to add an adapter layer rather than a full server transport. The important design lesson comes earlier than the transport: a harness should be able to describe and validate its tool contracts in a protocol-friendly way without surrendering orchestration to the protocol.

The repo’s original tool model was already close

Before this change, the repo already had a decent internal abstraction in src/harness_engineering/tools.py:

  • Tool
  • ToolRegistry
  • default_registry()

Each Tool includes:

  • name
  • description
  • input_schema
  • risky
  • handler

That was always one small step away from an MCP-style view. The internal schema format was intentionally minimal—just a mapping like {"topic": "str", "facts": "list[str]"}—but the semantics were already there.
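
For readers who have not opened the repo, a hedged reconstruction of that model looks roughly like this. The field names come from the list above; the dataclass form and the registry methods are my assumptions.

    # Hedged reconstruction of the internal tool model, not verbatim
    # tools.py code; register() and get() are assumed method names.
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Tool:
        name: str
        description: str
        input_schema: dict[str, str]  # e.g. {"topic": "str", "facts": "list[str]"}
        risky: bool
        handler: Callable[..., Any]

    class ToolRegistry:
        def __init__(self) -> None:
            self._tools: dict[str, Tool] = {}

        def register(self, tool: Tool) -> None:
            self._tools[tool.name] = tool

        def get(self, name: str) -> Tool:
            return self._tools[name]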

The new mcp.py layer turns that internal representation into something closer to a protocol boundary. tool_to_mcp_descriptor() maps each internal field type into a JSON-Schema-like fragment and emits a descriptor with:

  • name
  • description
  • inputSchema
  • annotations
  • meta

The interesting design choice is that the adapter does not pretend internal and external semantics are identical. The repo still keeps a local meta.risky flag and still uses annotations.readOnlyHint as a convenience signal. In other words, the harness exports a protocol-shaped description without claiming that the protocol alone captures every policy decision.

That is healthy. Protocols should travel. Risk policy usually needs local context.
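
A minimal sketch of that export, assuming the Tool shape above. The type-mapping table and the choice to mark every declared field as required are my assumptions; the output keys match the list above.

    # Sketch of the descriptor export; mapping table is an assumption.
    _TYPE_MAP = {
        "str": {"type": "string"},
        "bool": {"type": "boolean"},
        "int": {"type": "integer"},
        "float": {"type": "number"},
        "dict": {"type": "object"},
        "list": {"type": "array"},
        "list[str]": {"type": "array", "items": {"type": "string"}},
        "list[dict]": {"type": "array", "items": {"type": "object"}},
    }

    def tool_to_mcp_descriptor(tool: Tool) -> dict:
        properties = {
            # Unknown internal types fall back to an unconstrained field.
            name: dict(_TYPE_MAP.get(decl, {}))
            for name, decl in tool.input_schema.items()
        }
        return {
            "name": tool.name,
            "description": tool.description,
            "inputSchema": {
                "type": "object",
                "properties": properties,
                "required": list(tool.input_schema),  # assumes all fields required
                "additionalProperties": False,
            },
            # Local policy travels alongside the schema, not inside it.
            "annotations": {"readOnlyHint": not tool.risky},
            "meta": {"risky": tool.risky},
        }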

The most important new behavior is argument validation

The most meaningful function in src/harness_engineering/mcp.py is probably not the descriptor export. It is validate_tool_arguments().

That function checks three things before a tool executes:

  1. required arguments are present
  2. unexpected arguments are rejected
  3. declared types roughly match actual values

The type system is intentionally small. _matches_type() currently recognizes values like str, bool, int, float, dict, list, list[str], and list[dict]. This is not a full JSON Schema validator, and the post would be dishonest if it pretended otherwise. But it does something important: it upgrades the registry from “descriptive metadata” to “executable contract.”
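
Under those constraints, the validator can stay small. This sketch is my reconstruction of the described behavior, not the repo's exact code; whether the real function raises or returns a list of errors is an assumption.

    def _matches_type(declared: str, value: object) -> bool:
        simple = {"str": str, "bool": bool, "int": int, "float": float,
                  "dict": dict, "list": list}
        if declared in simple:
            # Note: isinstance(True, int) is True in Python; a stricter
            # "int" check would also exclude bool.
            return isinstance(value, simple[declared])
        if declared == "list[str]":
            return isinstance(value, list) and all(isinstance(v, str) for v in value)
        if declared == "list[dict]":
            return isinstance(value, list) and all(isinstance(v, dict) for v in value)
        return False

    def validate_tool_arguments(tool: Tool, arguments: dict) -> list[str]:
        errors = []
        for name in tool.input_schema:
            if name not in arguments:                    # 1. required present
                errors.append(f"missing argument: {name}")
        for name, value in arguments.items():
            declared = tool.input_schema.get(name)
            if declared is None:                         # 2. unknown rejected
                errors.append(f"unexpected argument: {name}")
            elif not _matches_type(declared, value):     # 3. rough type match
                errors.append(f"wrong type for {name}: expected {declared}")
        return errors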

That in turn changes the runner.

In src/harness_engineering/runner.py, HarnessRunner._execute() now calls:

result = self.retry.call(tool_name, call_tool, self.registry, tool_name, kwargs)

instead of invoking tool.handler(**kwargs) directly.
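
A plausible shape for that mediation, assuming the validator sketched above; how validation failures surface to the retry layer is my guess.

    def call_tool(registry: ToolRegistry, name: str, arguments: dict):
        tool = registry.get(name)
        errors = validate_tool_arguments(tool, arguments)
        if errors:
            # Fail before the handler runs, so bad kwargs never reach it.
            raise ValueError(f"invalid arguments for {name}: {errors}")
        return tool.handler(**arguments)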

This is a subtle but meaningful shift. The runner now treats tool invocation as a mediated act rather than a direct function call. That is a precondition for a lot of future work:

  • stricter schema enforcement
  • MCP transport exposure
  • policy hooks before execution
  • richer per-tool logging
  • provider-neutral adapters

It is also just safer. If a caller passes the wrong shape, the harness now records a tool error rather than letting arbitrary mismatched kwargs drift into the handler.

The CLI makes the interface visible

A good adapter should not live only in library code. It should be inspectable from the command surface.

That is why src/harness_engineering/cli.py now exposes two commands.

mcp-tools

This prints the default registry in an MCP-style descriptor format. In the current repo, the output includes tools like:

  • search_mock
  • extract_facts
  • draft_report
  • finalize_report
  • flaky_echo

Each descriptor includes an inputSchema with type: object, named properties, required fields, and additionalProperties: false.

That last bit matters. It is a small but practical signal that the tool contract is closed by default. If the model or caller invents an extra field, the harness can reject it.

mcp-call

This invokes a registered tool through the adapter by passing JSON on the command line. The result is returned in an MCP-style shape with:

  • content
  • structuredContent
  • isError

That mirrors an important part of the MCP tools spec: structured content is helpful, but for compatibility the result should often also be serialized into text content. The repo now does exactly that in call_tool_mcp().
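
A sketch of that shaping, assuming the call_tool() sketch above. Mirroring structuredContent into a text block follows the spec's compatibility advice; treating every exception as an isError result matches what the post describes, though the exact serialization is my guess.

    import json

    def call_tool_mcp(registry: ToolRegistry, name: str, arguments: dict) -> dict:
        try:
            result = call_tool(registry, name, arguments)
        except Exception as exc:
            # The repo reports failures as tool results, not protocol errors.
            return {"content": [{"type": "text", "text": str(exc)}],
                    "isError": True}
        return {
            # Text mirror of the structured payload, for client compatibility.
            "content": [{"type": "text", "text": json.dumps(result)}],
            "structuredContent": result,
            "isError": False,
        }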

Again, the main point is not that the demo became a full MCP server overnight. It didn’t. The point is that the harness now has a protocol-shaped interface boundary that can be inspected and exercised independently of the orchestration loop.

What the demo proves

This repo is still deliberately small, but the new capability proves several things cleanly.

1. Tool schemas are more useful when they drive execution

Many demos attach schemas to tools as passive documentation. Here, the schema now directly mediates calls through validate_tool_arguments() and call_tool(). That is a real step toward reliable execution.

2. MCP-style interoperability does not require surrendering the harness

The repo can export MCP-style descriptors and results while keeping approvals, retries, traces, and state transitions firmly inside HarnessRunner, RunStore, and RunState. That is the architectural line I wanted the demo to show.

3. Risk is orthogonal to protocol shape

finalize_report is still risky even if its descriptor is perfectly structured. A clean schema does not magically make a disk write safe. That is why the harness continues to gate it through requires_approval, pending_action, and the waiting_approval state in src/harness_engineering/models.py and src/harness_engineering/runner.py.
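
Compressed into a hedged sketch, the gating idea looks like this; the real logic spans runner.py and models.py, and the state object here is deliberately schematic.

    def requires_approval(tool: Tool) -> bool:
        # Local policy: the descriptor schema says nothing about this.
        return tool.risky

    def maybe_gate(state, tool: Tool, arguments: dict) -> bool:
        """Return True if the run must pause for a human decision."""
        if requires_approval(tool):
            state.pending_action = {"tool": tool.name, "arguments": arguments}
            state.status = "waiting_approval"
            return True
        return False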

4. The adapter layer is a better place for future portability work

If the repo later exposes a real JSON-RPC MCP server or imports external MCP-backed tools, mcp.py is the natural place to extend. That is better than scattering protocol logic across every tool handler.

What MCP still does not solve

This is the section too many agent posts skip.

1. MCP does not orchestrate your workflow

The protocol can describe tools and standardize calls. It does not decide whether your harness should be a loop, a graph, a queue-backed workflow, or a pause/resume state machine. In this repo, run_until_pause_or_complete() still owns that logic.

2. MCP does not make approvals disappear

If anything, MCP makes approval design more visible by making tool invocations clearer. But a host still needs a policy about when to ask, how to pause, how to resume, and what to log.

3. MCP does not give you durability

A tools/call operation is not the same thing as resumable execution. The repo’s durability still comes from .runs/<run_id>/state.json, .runs/<run_id>/trace.json, RunStore.save(), and RunStore.load().
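
That boundary is small enough to sketch, assuming JSON state files under .runs/; the real RunStore also persists trace.json, which this sketch omits.

    import json
    from pathlib import Path

    class RunStore:
        def __init__(self, root: Path = Path(".runs")) -> None:
            self.root = root

        def save(self, run_id: str, state: dict) -> None:
            run_dir = self.root / run_id
            run_dir.mkdir(parents=True, exist_ok=True)
            (run_dir / "state.json").write_text(json.dumps(state, indent=2))

        def load(self, run_id: str) -> dict:
            return json.loads((self.root / run_id / "state.json").read_text())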

4. MCP does not automatically validate everything you care about

The spec encourages validation, but you still have to implement it. The repo now has a lightweight validator, but it is not a complete JSON Schema engine and it does not yet validate tool outputs.

5. MCP does not erase trust boundaries

Remote MCP servers are powerful because they make new capabilities discoverable quickly. They are also risky because they bring external trust domains into model context. OpenAI’s connector/MCP docs warn developers to trust remote servers carefully, and the MCP spec itself treats tool annotations as untrusted unless the server is trusted.

That is exactly why local harness policy still matters.

Honest limitations of the current repo implementation

The new adapter layer is real, but it is intentionally modest.

First, src/harness_engineering/mcp.py emits MCP-style descriptors and results, not a full MCP server transport. There is no JSON-RPC endpoint, no pagination, and no capability negotiation.

Second, the schema conversion is lossy and minimal. Internal types like list[str] map nicely, but the repo is not yet expressing richer nested structures, enums, or output schemas.

Third, call_tool_mcp() currently represents all failures as tool execution results with isError: true. That is useful for demo clarity, but a full MCP server would also need proper protocol-level errors for unknown tools or malformed requests.

Fourth, the harness still uses a local risky flag rather than a deeper policy engine. That is fine for now. It is not a comprehensive authorization model.

Fifth, the reviewer path in src/harness_engineering/reviewer.py is still stricter than I would want for production, because it expects cleaner raw JSON than many local models reliably produce. That issue is separate from MCP, but it matters because protocol cleanliness at one boundary does not rescue brittleness at another.

Why this matters more than the hype cycle around MCP

MCP is getting talked about as if it were either the universal substrate of agents or just another thin standard destined to be abstracted away. I think both views miss the useful middle.

The useful middle is this:

  • tool schemas matter a lot
  • standard transport matters a lot
  • provider-neutral discovery matters a lot
  • but the runtime around those things still determines whether the system is governable

That is why I wanted the repo’s first MCP-related step to be an adapter layer rather than a flashy full-server demo. The conceptual lesson is easier to see when the code stays small:

  • tools.py defines local actions
  • mcp.py exports and validates protocol-shaped contracts
  • runner.py still owns retries and workflow progression
  • models.py still owns approval state
  • store.py still owns durability
  • tracing.py still owns run history

In other words, MCP fits into the harness. The harness does not collapse into MCP.

The practical takeaway

If you are building agents right now, you probably should care about MCP. You should care about tool schemas. You should want provider-neutral boundaries. But you should not confuse a clean tool interface with a complete system.

A good harness should be able to answer all of these questions even after adopting MCP-style tools:

  • What step is the run on?
  • What arguments were actually passed?
  • Was the tool call approved?
  • Is the action risky locally even if the schema looks harmless?
  • Can the run resume after interruption?
  • Is there a trace explaining what happened?
  • Did validation fail before the handler ran?

If your answer to those questions is still “the model probably knows,” then the problem is not your protocol. It is your harness.

MCP is real progress. It is just not the whole stack.

References

  1. 67 AI Lab, harness-engineering repository: https://github.com/67ailab/harness-engineering
  2. Model Context Protocol Specification, Tools: https://modelcontextprotocol.io/specification/2025-06-18/server/tools
  3. OpenAI API docs, “Using tools”: https://developers.openai.com/api/docs/guides/tools
  4. OpenAI API docs, “MCP and Connectors”: https://developers.openai.com/api/docs/guides/tools-connectors-mcp
  5. Anthropic docs, “Tool use with Claude”: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview