The live demo repo for this series is 67ailab/harness-engineering, and for this post I did change the repo before publishing. The new capability shipped in commit d20e352, which adds an explicit memory-layer model to the demo instead of treating every stored value as one blurry thing called “memory.”
The core addition is src/harness_engineering/memory.py, plus wiring in src/harness_engineering/store.py and src/harness_engineering/cli.py so every run now emits a memory.json snapshot and the CLI exposes a memory command.
That matters because “memory” is one of the most overloaded words in agent systems. People use it to mean at least four different things:
- the prompt context sent to the model right now
- chat or thread history across turns
- durable workflow state for pause/resume
- retrieved knowledge from documents or prior runs
Those are not the same problem. If you collapse them into one bucket, your architecture gets confused fast.
My practical claim in this post is simple: agent memory is not one store. It is at least three different engineering concerns with different failure modes: working context, session state, and retrieval memory.
The harness-engineering repo now demonstrates that separation directly.
What changed in the repo since the previous post
Post 5 focused on durable execution and added summary.json, trace history inspection, and CLI commands for summary and history.
For Post 6, the repo needed a memory model that made a sharper distinction between:
- what the runner needs right now
- what the system must persist across interruption
- what can be looked up on demand
So I added a new module: src/harness_engineering/memory.py.
The key functions are:
- build_working_context(state)
- build_session_state(state)
- retrieve_memory(state, query, top_k)
- build_memory_snapshot(state, query, top_k)
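To make the layering concrete, here is a minimal sketch of how a snapshot builder along these lines could compose the layers, assuming the three builder functions above are available in the same module. It is illustrative, not the repo's exact code.

```python
# Illustrative sketch only; assumes the three builder functions listed above
# are defined in the same module. Not the repo's exact implementation.
def build_memory_snapshot(state, query=None, top_k=3):
    """Compose the three memory layers into one serializable snapshot."""
    return {
        "working_context": build_working_context(state),   # what the next step needs now
        "session_state": build_session_state(state),       # durable, operator-visible run state
        "retrieval_memory": (                               # fetched only when a query is given
            retrieve_memory(state, query, top_k) if query else []
        ),
    }
```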
Then RunStore.save() in src/harness_engineering/store.py was extended to write a per-run memory.json file next to state.json, trace.json, and summary.json. RunStore.memory_path() and the summary payload now expose that path. And src/harness_engineering/cli.py gained a new memory subcommand:
```bash
PYTHONPATH=src python3 -m harness_engineering.cli memory --latest
PYTHONPATH=src python3 -m harness_engineering.cli memory --latest --query "approval state" --top-k 3
```
I also updated README.md and added tests in tests/test_harness.py to verify that:
- the snapshot contains separate layers
- retrieval returns matching entries
- memory.json is persisted on save
- the CLI command works
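A test for the first two checks might look roughly like this; the helper name and exact assertions are hypothetical, not copied from tests/test_harness.py.

```python
# Hypothetical sketch of the kind of check described above; make_demo_run_state
# and the exact assertions are illustrative, not the repo's actual test code.
def test_memory_snapshot_has_separate_layers():
    state = make_demo_run_state()   # assumed helper that builds a populated RunState
    snapshot = build_memory_snapshot(state, query="approval", top_k=3)
    assert set(snapshot) == {"working_context", "session_state", "retrieval_memory"}
    assert snapshot["retrieval_memory"], "the query should match at least one stored entry"
```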
That is a small repo change by line count, but it clarifies the whole architecture.
Why “memory” is such a bad umbrella term
LangGraph’s memory docs make a useful first cut: short-term memory is thread-scoped state, while long-term memory is cross-session data recalled later from a store. The OpenAI Agents SDK session docs make another useful distinction: session memory is the stored conversation history that gets prepended between runs. Temporal’s workflow docs make a different but equally important distinction: workflow recovery depends on durable event history and replay-safe state reconstruction, not vague “memory.”
Those systems differ in design, but they agree on one thing: you should not pretend that all remembered information is the same kind of thing.
In practical harness design, I think the minimum useful split is this:
1. Working context
This is the small slice of information the next step needs right now.
It should be:
- current
- bounded
- cheap to assemble
- intentionally pruned
If it grows without discipline, latency rises, cost rises, and model quality usually drops.
2. Session state
This is the durable state of the run itself.
It should answer questions like:
- What run is this?
- What step is it on?
- Is approval pending?
- What artifacts already exist?
- What trace events already happened?
This is the state you need for pause/resume, auditing, inspection, and operations.
3. Retrieval memory
This is not always “memory” in the human sense. Often it is just a searchable pool of relevant material that can be fetched when needed.
It should be:
- queryable
- relevance-ranked
- optional to pull into the working context
- separable from durable workflow state
The repo now models those three layers explicitly.
Where the three layers live in the current demo
The durable run model is still RunState in src/harness_engineering/models.py. That is the backbone of the harness. It holds fields like:
run_id, status, current_step, requires_approval, approved, pending_action, plan, source_documents, artifacts, trace, and step_results.
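For orientation, a hypothetical dataclass with roughly those fields might look like this; the types and defaults are my assumptions, not the repo's actual model definition.

```python
# Hypothetical sketch with roughly those fields; types and defaults are
# assumptions, not the repo's actual model definition.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class RunState:
    run_id: str
    status: str = "pending"                 # e.g. pending / running / paused / complete / failed
    current_step: str | None = None
    requires_approval: bool = False
    approved: bool = False
    pending_action: str | None = None
    plan: list[str] = field(default_factory=list)
    source_documents: list[dict[str, Any]] = field(default_factory=list)
    artifacts: dict[str, Any] = field(default_factory=dict)
    trace: list[dict[str, Any]] = field(default_factory=list)
    step_results: list[dict[str, Any]] = field(default_factory=list)
```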
But RunState itself is not yet a good explanation of memory architecture. It is just the raw data model.
The explanation now lives in src/harness_engineering/memory.py.
Working context: build_working_context()
build_working_context(state) extracts the minimum operator-visible context for the next execution moment:
- topic
- current_step
- status
- requires_approval
- pending_action
- plan_outline
- a preview of extracted facts
- a compact preview of the draft markdown
- review pass/fail state
That is intentionally not everything in the run. It is the “what matters right now” layer.
In other words: this is the layer you might actually hand to a model or an operator before the next step.
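As a rough sketch, assuming the topic, draft, facts, and review results live on the state object and under state.artifacts (an assumption, not the repo's exact layout), the function might look something like this:

```python
# Illustrative sketch; where the topic, draft, facts, and review results live
# on the state object is an assumption here, not the repo's exact layout.
def build_working_context(state) -> dict:
    facts = state.artifacts.get("facts", [])
    draft = state.artifacts.get("draft", "")
    return {
        "topic": getattr(state, "topic", None),
        "current_step": state.current_step,
        "status": state.status,
        "requires_approval": state.requires_approval,
        "pending_action": state.pending_action,
        "plan_outline": state.plan[:5],            # bounded outline, not the full plan
        "facts_preview": facts[:3],                # small preview of extracted facts
        "draft_preview": draft[:400],              # compact preview of the draft markdown
        "review": state.artifacts.get("review"),   # pass/fail state from the review step
    }
```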
Session state: build_session_state()
build_session_state(state) is the opposite. It is not trying to be small; it is trying to be durable and legible.
It includes:
- run metadata
- approval flags
- the full plan outline
- artifact keys
- a compact list of prior step_results
- counts of trace events
- last_error
This is the layer that makes pause/resume explainable.
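A minimal sketch of such a session-state builder, again with assumed field locations rather than the repo's exact code, might look like this:

```python
# Illustrative sketch; field locations and the step_results shape are
# assumptions, not the repo's exact code.
def build_session_state(state) -> dict:
    return {
        "run_id": state.run_id,
        "status": state.status,
        "current_step": state.current_step,
        "requires_approval": state.requires_approval,
        "approved": state.approved,
        "plan_outline": list(state.plan),                  # the full plan, not a preview
        "artifact_keys": sorted(state.artifacts),
        "step_results": [
            {"step": r.get("step"), "ok": r.get("ok")}     # compact, audit-friendly entries
            for r in state.step_results
        ],
        "trace_event_count": len(state.trace),
        "last_error": getattr(state, "last_error", None),
    }
```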
It lines up naturally with the rest of the durable execution code in:
- HarnessRunner.run_until_pause_or_complete() in src/harness_engineering/runner.py
- HarnessRunner.approve()
- HarnessRunner.resume()
- RunStore.save() in src/harness_engineering/store.py
The session-state layer is what lets the repo treat approval as workflow state instead of chat etiquette.
Retrieval memory: retrieve_memory()
retrieve_memory(state, query, top_k) is the most important conceptual addition.
It searches over two things already present in the run:
- state.source_documents
- extracted facts under state.artifacts["facts"]
This is deliberately simple. There are no embeddings, no vector database, and no external retrieval service yet. Matching is just lightweight token overlap. But the architectural point still lands: retrieval memory is fetched by query instead of shoved wholesale into the working context.
That distinction matters more than the sophistication of the ranking algorithm.
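A minimal sketch of that kind of token-overlap retrieval, with assumed document and fact shapes, looks roughly like this:

```python
# Minimal sketch of token-overlap retrieval over in-run material.
# The document and fact shapes, and the scoring, are assumptions.
def retrieve_memory(state, query: str, top_k: int = 3) -> list[dict]:
    query_tokens = set(query.lower().split())

    candidates = []
    for doc in state.source_documents:
        text = doc.get("text", "") if isinstance(doc, dict) else str(doc)
        candidates.append({"kind": "source_document", "text": text})
    for fact in state.artifacts.get("facts", []):
        candidates.append({"kind": "fact", "text": str(fact)})

    # Score by simple token overlap with the query and keep only matches.
    scored = []
    for entry in candidates:
        overlap = len(query_tokens & set(entry["text"].lower().split()))
        if overlap:
            scored.append((overlap, entry))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]
```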
In a real system, this layer might be backed by:
- a vector store
- BM25 or hybrid search
- a profile store
- a document namespace
- prior-run trace summaries
- external knowledge bases
The demo does not have those yet. Good. It keeps the interface visible.
A real run that shows the separation
I ran the updated repo locally after adding the new memory layer.
First I verified model connectivity using the repo’s preferred local-provider setup:
```bash
cd /home/james/.openclaw/workspace/harness-engineering
PYTHONPATH=src python3 -m harness_engineering.cli doctor
```
That succeeded against the configured OpenAI-compatible endpoint with:
- provider: openai_compatible
- model: gemma4
- status: ok
- message: MODEL_OK
Then I ran the harness and inspected memory:
```bash
HARNESS_MODEL_PROVIDER=mock PYTHONPATH=src python3 -m harness_engineering.cli start \
  --topic "memory architecture for approval-gated agent harnesses" \
  --source-file sample_data/sources.json

HARNESS_MODEL_PROVIDER=mock PYTHONPATH=src python3 -m harness_engineering.cli memory --latest \
  --query "approval state" --top-k 3
```
The output is exactly the kind of thing I want from a teaching repo.
The working_context layer showed:
- the topic
- current_step
- current run status
- a short facts preview
- a compact preview of the generated markdown
- review status
The session_state layer showed:
- durable run identifiers and timestamps
- approval booleans
- artifact keys
- compact step results
- trace event counts
The retrieval_memory layer returned only the entries relevant to the query "approval state", including both extracted facts and source documents.
That is the point. The operator does not have to pretend every remembered thing belongs in one list.
The subtle but important design rule
The best way to say this is bluntly:
Do not use your durable session state as your model context window.
That mistake is everywhere.
If you pour all run metadata, traces, old tool outputs, documents, and user history into one growing prompt, you get the worst of all worlds:
- token bloat
- degraded model attention
- fragile prompt construction
- unclear audit boundaries
- poor operator ergonomics
The new repo structure resists that.
RunState remains the source of durable truth for execution. build_working_context() constructs a narrow operational slice. retrieve_memory() pulls relevant material only when asked. That is a healthier architecture than “the model can see everything.”
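The contrast is easy to show in code. Neither function below is the repo's prompt code; they just illustrate the anti-pattern and the healthier shape:

```python
# Illustrative contrast only; neither function is the repo's prompt code.
import json

def prompt_from_everything(session_state: dict) -> str:
    # Anti-pattern: durable session state serialized straight into the context window.
    return "You are the harness agent.\n" + json.dumps(session_state, default=str)

def prompt_from_layers(working_context: dict, retrieved: list[dict]) -> str:
    # Healthier shape: narrow operational slice plus only the material retrieved for this step.
    parts = ["You are the harness agent.", json.dumps(working_context, indent=2)]
    for entry in retrieved:
        parts.append(f"[{entry['kind']}] {entry['text']}")
    return "\n".join(parts)
```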
Where this aligns with broader agent tooling
LangGraph’s memory model is helpful here because it separates thread-scoped state from long-term stores. That is close to the split between working/session state and retrieval memory in this repo, even though the implementation is much lighter.
The OpenAI Agents SDK sessions docs are helpful for a different reason: they make session history a first-class runtime concern rather than something the developer manually reassembles each turn. That is the same instinct behind the repo’s session_state layer, even though this demo is not a chat agent.
Temporal is useful as the stricter systems reference point. Temporal’s workflow docs emphasize that event history is the source of truth and that replay-safe recovery is not the same thing as restoring a vague snapshot. That matters here because the repo’s session-state layer is useful, but it is still not full deterministic replay.
That distinction is worth keeping clear:
- this repo now demonstrates memory-layer separation
- it already demonstrated checkpointed pause/resume
- it still does not implement Temporal-style replay semantics
That is fine. Honest scope is a strength.
What the demo proves
The updated repo proves a few things clearly.
1. “Memory architecture” can be demonstrated without a giant framework
You do not need a massive orchestration platform to make the distinctions visible. A small local harness with explicit functions is enough.
2. Working context and durable session state are different layers
build_working_context() and build_session_state() are small functions, but they make a big conceptual correction. The data you need to operate the workflow is not the same as the data you should stuff into the next model call.
3. Retrieval memory is best treated as an explicit query surface
Even with simple lexical matching, retrieve_memory() demonstrates the right interface boundary: ask for relevant items when needed instead of serializing everything into the run context.
4. Memory architecture becomes much easier to inspect when persisted
Because RunStore.save() now writes memory.json, the three-layer view survives process exit just like state.json, trace.json, and summary.json do.
That is small, but operationally useful.
What it still does not solve
This section matters more than the flashy one.
1. Retrieval is still primitive
retrieve_memory() uses simple token overlap on in-run documents and facts. There is no embedding index, hybrid retrieval, namespace isolation, or recency weighting.
2. The demo does not yet separate cross-run long-term memory from per-run retrieval
Right now retrieval works over source_documents and extracted facts attached to a run. That is useful, but it is not the same thing as a cross-session memory store.
3. Working-context budgeting is still crude
The repo now exposes the concept, but it does not yet measure token budgets or automatically compress context based on model constraints.
4. Reviewer robustness is still weak on the local-model path
This is a real observed limitation from the live repo. During a local-model run, review_markdown() in src/harness_engineering/reviewer.py still failed when the reviewer returned fenced JSON instead of raw JSON. The harness correctly exposed that failure in the saved memory and state, but it did not gracefully recover from it.
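One possible mitigation, which the repo does not implement yet, is to tolerate a Markdown code fence around the reviewer's output before parsing it:

```python
# One possible mitigation, not something the repo does yet: tolerate a
# Markdown code fence around the reviewer's JSON before parsing it.
import json
import re

def parse_reviewer_json(raw: str) -> dict:
    text = raw.strip()
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)
```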
5. This is not a full memory operating system for agents
There is no user-profile layer, no episodic memory store across runs, no semantic recall over large corpora, and no background memory-writing pipeline.
Again: that is fine. The repo is a teaching system, not a claims machine.
The practical engineering takeaway
If your agent system has a single field called memory and you keep pouring more things into it, you probably have an architecture problem.
Ask three separate questions instead:
- What does the next step need right now?
- What must survive interruption and remain operator-visible?
- What can be looked up only when relevant?
If you answer those separately, your harness gets better almost immediately:
- prompts shrink
- pause/resume becomes clearer
- audits get easier
- retrieval gets more disciplined
- the system becomes easier to extend
That is why this repo change was worth making before publishing the post. The concept now exists as real code rather than blog abstraction.
The repo still has a long way to go. It should eventually add token-budget accounting, stronger retrieval, and maybe a true cross-run memory store. But the architecture is better now because it stopped pretending memory is one thing.
That is the useful lesson.
References
- 67 AI Lab, harness-engineering repository: https://github.com/67ailab/harness-engineering
- LangGraph documentation, "Memory overview": https://docs.langchain.com/oss/python/concepts/memory
- OpenAI Agents SDK documentation, “Sessions”: https://openai.github.io/openai-agents-python/sessions/
- Temporal documentation, “Workflows”: https://docs.temporal.io/workflows