Chapter 10 — Long-Horizon Task Memory

Tenth post of the chapter-by-chapter walkthrough of LLM Primer IV: Designing AI Cognition with MCP. In which the question stops being "how much fits" and becomes "what to remember and what to forget," and the seven-figure context windows shipping today turn out to postpone the wall by an hour rather than remove it.

Why this chapter exists

An agent that runs for thirty seconds can carry everything it needs in its prompt. An agent that runs for three hours cannot. The work it does in the first hour will not fit alongside the work it does in the third, and the question of what to remember and what to forget becomes the central engineering problem. The context window is no longer a budget to be managed; it is a working surface that must be continuously refreshed against a deeper store. This chapter is about the architecture of remembering — short-term memory for immediate reasoning, long-term memory for persistence across sessions, and the compaction and externalisation techniques that connect them.

One line: short-term memory is not the model's memory but the agent loop's memory, materialised as text and injected on every call — which means every decision about what the model remembers is a decision the loop makes explicitly, in code, with no hidden state to debug.

10.1 Short-term memory: windows, scratchpads, ReAct

Short-term memory is whatever sits inside the current context window and is available without external lookup. The simplest policy is the sliding window: keep the system prompt and tool descriptions at the top, keep the most recent N turns at the bottom, drop everything in between. It works as long as the relevant context is recent, which is true for short conversations and false for almost everything else. The failure mode is clean — once a turn is dropped, it is gone — and the agent will visibly forget the user's instructions at the predictable point where the window first fills.

The next layer is the scratchpad, a structured region of context the model writes to deliberately. Internal scratchpads carry intermediate reasoning forward within the loop; external scratchpads write notes via a tool call into a stored buffer that future contexts inject. The pattern that gave scratchpads their canonical form is ReAct — Reason and Act — introduced by Yao and colleagues in 2022. The loop interleaves thought, action, observation, until the model decides it has the answer. The structure externalises reasoning into explicit textual artefacts the model can refer back to, and gives the agent loop visible scaffolding for memory operations: thoughts can be summarised, actions deduplicated, observations compacted. Agents built without ReAct or a close variant tend to entangle reasoning and action in ways that make their state opaque.

A practical complement is Reflexion, which adds an explicit reflection step in which the model evaluates its recent actions and writes a critique into the scratchpad for the next attempt. Modern agent frameworks blend the two into a single configurable loop, with reflection triggered by a failure signal rather than fired on every cycle.

10.2 Long-term memory: episodic and semantic

When short-term memory ends, long-term memory begins. The cognitive-science distinction between episodic (specific events) and semantic (general facts) memory has turned out to be useful for agents. Episodic memory is the record of specific past interactions; semantic memory is the distilled knowledge that survived — that this user prefers metric units, that this project's deploy command is make ship, that this API returns errors that look like success.

Episodic memory is, in current practice, almost always a vector database. Each past interaction is embedded, stored with metadata, and retrieved at query time by semantic similarity. The pattern is RAG applied to the agent's own past rather than to a corpus of documents, and the engineering — chunking, embedding choice, retrieval evaluation — is largely identical to what Volume III covers.

Semantic memory is less standardised. The two dominant substrates are structured key-value stores and knowledge graphs. Key-value stores are simple, fast, easy to inspect; graphs support multi-hop queries like "what is the deploy command for the project the user is currently working on" but require maintenance and a query language. Most production agents start with key-value and graduate to a graph only when the queries actually require joins. Many never do.

The update policy is where most teams get into trouble. A fact extracted from a single conversation is not necessarily true in general. A naive policy that promotes every assertion to semantic memory will produce a corrupted store that contradicts itself. The discipline that has emerged is to weight assertions by context, to version facts with timestamps and provenance, and — for high-stakes domains — to gate updates through explicit user confirmation. A pattern that has emerged under names like MemGPT is to give the agent explicit memory-management tools so the model itself decides what to save, retrieve, and forget. The win is that the model often knows things about which memories matter that no rules-based extractor would catch. The cost is that the model also gets things wrong, and a memory store curated by the model needs guardrails against runaway growth.

10.3 Surviving the context limit: compaction and structured notes

Even with episodic and semantic memory in place, the agent's current session still hits its window. The most common remedy is summarisation-based compaction: when context approaches sixty to eighty percent of the window, a background step summarises older turns and replaces them. The failure modes are summary drift (the gist survives but specific facts that turn out to matter are lost) and recursive smoothing (each pass summarises a summary, and cumulative loss is severe). The remedies are structured summarisation prompts that preserve named entities, decisions, and open questions, and summarising from originals when possible rather than from earlier summaries.

Tool result clearing evicts the bulk of tool returns after a few intervening turns, replacing them with brief notes like "queried users table, 47 rows returned, found user 12345." Structured note-taking requires the agent to maintain an authoritative notes file capturing the current goal, steps completed, steps remaining, and open questions — treated as the source of truth, not as a transcript. Externalisation moves produced artefacts to the filesystem or database with the context holding only references. The unifying principle is that the context window is for active work, not for archive. Larger windows make external storage more important, not less, because they enable longer sessions in which the externalisation architecture has more time to either work or fail.

Worth holding onto: long-horizon agents are not just longer short-horizon agents. They are a different engineering problem, with different failure modes — researcher, engineering, operations, and background patterns each compose the primitives differently. Make memory state inspectable in human-readable form, log every read and write, and test session resumption and high memory load as routine cases, not edge cases.

What Chapter 10 sets up

Chapters 9 and 10 together close Part IV with two complementary mental models: context as a finite budget within a single call, and memory as the architecture for selective remembering across sessions. What neither chapter contended with is adversarial pressure. Every memory write is a place an attacker can poison. Every tool call is a place an attacker can intercept. Every retrieved memory is a place an attacker can inject instructions the agent will treat as its own thoughts. The architectures of the last two chapters were designed for correctness and efficiency, not for survival under attack.

Next — Chapter 11: Attack Surfaces and Protocol Vulnerabilities. Confused Deputy, Token Passthrough, Session Hijacking, Capability Escalation, Unauthenticated Sampling, and the implicit trust propagation that makes context poisoning so hard to fix.

Want the full picture? The book walks the four canonical patterns — researcher, engineering, operations, background agents — with their characteristic failure modes, the checkpoint discipline that long-running coding agents have converged on, and the deletion architecture that separates a memory system that grows wiser with use from one that grows louder. View LLM Primer IV on Amazon →