Chapter 9 — Managing the Attention Budget

Ninth post of the chapter-by-chapter walkthrough of LLM Primer IV: Designing AI Cognition with MCP. In which a million-token context window turns out to be a ceiling value rather than an operating point, and a remarkable share of "the model got worse" turns out to be "the model got buried."

Why this chapter exists

A context window looks like free space. It is not. Every token an agent reads costs latency, money, and — less obviously but more importantly — quality. The illusion that a million-token window means "fit everything" is one of the most expensive misreadings in current practice, and it accounts for a large share of the production failures that get diagnosed as model regressions. The model did not get worse. It got buried. This chapter is about treating context as a finite budget rather than a free resource: what eats the budget, what alternatives exist when the budget is the wrong tool, and how to land in the productive zone where the agent has exactly what it needs and nothing more.

One line: context is a cost center, not a free input — and a team that adds tools without removing them, accumulates history without compaction, and stuffs every retrieved chunk into the window in the hope that more can only help is operating in the part of the curve where each addition is making things worse.

9.1 Context rot and the non-linear cliff

The relationship between context length and quality is not linear. Doubling the prompt does not produce a proportional loss in quality; past a point, the loss is steeper than proportional. The name that has stuck, context rot, is informal but accurate. The classic Stanford study by Liu and colleagues ("Lost in the Middle") showed that models asked to find information in a list of documents performed dramatically worse when the relevant document sat in the middle than when it sat at either end. The U-shaped curve has been reproduced across model families and context lengths. The middle of a long prompt is, in a meaningful sense, attentionally poorer than the boundaries, even though the architecture can in principle attend to every position equally.
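One practical response to the U-shaped curve is to order retrieved documents so the strongest candidates sit at the two ends of the prompt and the weakest fall in the middle. The sketch below is a minimal illustration under that assumption; the function name edge_first_order and the sample document labels are hypothetical, and the chapter itself does not prescribe this exact ordering.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def edge_first_order(ranked: Sequence[T]) -> List[T]:
    """Arrange items ranked best-first so the strongest sit at the two ends.

    Items alternate between the front and the back of the prompt, which
    leaves the weakest candidates in the middle, where recall was worst.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        # Even ranks go to the front, odd ranks to the back.
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item lands in the last slot.
    return front + back[::-1]


docs = ["d1", "d2", "d3", "d4", "d5"]  # hypothetical labels, d1 most relevant
print(edge_first_order(docs))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The ranking itself is assumed to come from whatever retriever or reranker is already in place; this only changes where each document lands.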

The "needle in a haystack" benchmarks that became standard in 2023 and 2024 initially seemed to refute this picture — near-perfect retrieval at 100K, 200K, even 1M tokens. The more careful follow-up work showed the benchmarks were too easy. A conspicuous needle in a homogeneous haystack is a different problem from finding a relevant fact buried among twenty topically related distractors. MCP-Universe and BIG-Bench-Long, released in late 2025, built in that adversarial structure, and the numbers are sobering: at 100K tokens, frontier models lose ten to twenty points compared to the same task at 8K, and at 500K the gap can reach forty.

There is a second form of rot specific to MCP agents. As tools accumulate in the system prompt, the model's accuracy at selecting the right tool degrades. MCP-Universe showed tool-selection accuracy dropping from roughly ninety percent with five tools to below sixty with forty. Practitioners now call this tool-loadout rot, and it is the single most common cause of "the agent got dumber after we added more capabilities." The mechanism is the same in both cases: attention is finite, and as the prompt grows, the share each token receives shrinks.
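The dilution argument can be made concrete with a toy calculation. The sketch below assumes a single softmax over the prompt in which one relevant token scores a fixed margin above every distractor. Real attention is multi-headed and learned, so this is an illustration of the arithmetic, not a model of any particular system. The function name relevant_share and the default margin are hypothetical.

```python
import math


def relevant_share(n_tokens: int, logit_gap: float = 4.0) -> float:
    """Softmax weight on one relevant token among n_tokens - 1 distractors.

    Toy assumption: every distractor scores 0 and the relevant token
    scores logit_gap. The weights sum to 1, so more distractors means
    a smaller share for the token that matters.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    boost = math.exp(logit_gap)
    return boost / (boost + (n_tokens - 1))


for n in (1_000, 10_000, 100_000):
    print(n, f"{relevant_share(n):.5f}")
```

With the margin held constant, the share falls roughly in proportion to prompt length, which is the sense in which attention is a finite budget.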

9.2 Three answers to the same question: MCP, RAG, fine-tuning

When a model lacks the knowledge it needs, there are three architectural answers, and confusing one for another is the cause of a remarkable share of misallocated effort. MCP fits when the knowledge is operational: current inventory, today's calendar, the status of a build. These have an authoritative source and change continuously, so no pre-loaded context can keep them current. The win is not just freshness but accountability: when the model says "the build is green," the user can ask "according to what" and the answer is "the build server, queried at this timestamp."

RAG fits when the knowledge is documentary — a corpus too large for the window but stable enough that a retrieval index is feasible. Internal docs, support articles, contracts, large codebases. Volume III of this series was entirely about the engineering of RAG and remains the canonical reference. Fine-tuning fits when the gap is behaviour — consistent format, particular voice, reliable refusal of a class of request. The misallocation that recurs in industry is using fine-tuning to inject factual knowledge that changes, which produces a model that is briefly impressive and then progressively wrong as the world drifts from its frozen snapshot.

The three are not exclusive. A mature agent typically combines them: fine-tuning for behaviour, RAG for documentary knowledge, MCP for operational knowledge. The framing that helps is to match the substrate to the freshness requirement. Behaviour is stable on the scale of model generations; bake it into weights. Documentary knowledge changes on the scale of days; index it. Operational knowledge changes on the scale of seconds; reach for it through tools. Architectures that mismatch the substrate (frozen weights for fast-changing facts, retrieval indexes for live state) pay a cost in correctness, latency, or both.
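The substrate rule can be written down as a small decision function. This is a sketch under stated assumptions: the names Substrate and choose_substrate are hypothetical, and the one-hour cut-off between "live" and "documentary" knowledge is an illustrative placeholder, not a figure from the book.

```python
from datetime import timedelta
from enum import Enum


class Substrate(Enum):
    FINE_TUNING = "fine-tuning"
    RAG = "rag"
    MCP = "mcp"


# Hypothetical cut-off: anything that changes faster than this is treated
# as live operational state rather than indexable documents.
LIVE_CUTOFF = timedelta(hours=1)


def choose_substrate(change_interval: timedelta, is_behaviour: bool = False) -> Substrate:
    """Pick a substrate from how fast the knowledge changes.

    Behaviour (format, voice, refusals) goes into weights. Facts never do:
    fast-changing facts are fetched through tools, slower ones are indexed.
    """
    if is_behaviour:
        return Substrate.FINE_TUNING
    if change_interval < LIVE_CUTOFF:
        return Substrate.MCP
    return Substrate.RAG


print(choose_substrate(timedelta(seconds=5)))                    # Substrate.MCP
print(choose_substrate(timedelta(days=3)))                       # Substrate.RAG
print(choose_substrate(timedelta(days=365), is_behaviour=True))  # Substrate.FINE_TUNING
```

Note that the function never returns fine-tuning for factual knowledge, however slowly it changes, which mirrors the misallocation warning above.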

9.3 The Goldilocks Zone: enough context, not too much

The day-to-day question is how much context to pass on each call. The productive zone in the middle is narrower than most teams initially assume. The most consequential lever is the system prompt. A good one is short, specific, and stable. A bad one is the defensive prompt that grows by accretion, with a clause added every time the model misbehaves, until it is a thousand-word rules document the model can no longer reliably follow. Teams that audit quarterly, with explicit removal as a goal, end up with prompts that are shorter than they were a year earlier and that produce better behaviour.

The second lever is the tool roster. The corrective to tool-loadout rot is progressive disclosure: register a small number of high-level tools and let the model drill into specifics through a discovery tool. Forty narrow tools become four broad ones with internal dispatch, and tool-selection accuracy recovers most of what was lost. The third lever is conversation history: compact from turn one rather than waiting until the window is ninety percent full. The fourth is tool results: return the fields the model needs, not the entire row. The discipline is deliberate inclusion: for every element, the team should be able to answer "what would happen if this were not there?" If the answer is "the agent would behave the same," it should be removed.
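Progressive disclosure and field-trimmed tool results can be sketched in a few lines. Everything here is hypothetical: the REGISTRY contents, the tool and operation names, and the helper functions are stand-ins for illustration and do not correspond to any real MCP SDK API.

```python
from typing import Any, Callable, Dict, List

# Hypothetical narrow operations, grouped under one broad tool each.
REGISTRY: Dict[str, Dict[str, Callable[..., Any]]] = {
    "calendar": {
        "list_events": lambda day: [{"id": 1, "title": "Standup", "day": day}],
        "create_event": lambda title, day: {"id": 2, "title": title, "day": day},
    },
    "builds": {
        "status": lambda branch: {
            "branch": branch,
            "state": "green",
            "log": "thousands of lines the model rarely needs",
        },
    },
}


def top_level_tools() -> List[str]:
    """What the model sees up front: broad tools plus one discovery tool."""
    return sorted(REGISTRY) + ["discover"]


def discover(tool: str) -> List[str]:
    """Discovery tool: list the operations behind one broad tool on demand."""
    return sorted(REGISTRY[tool])


def call(tool: str, operation: str, **kwargs: Any) -> Any:
    """Internal dispatch from a broad tool to one narrow operation."""
    return REGISTRY[tool][operation](**kwargs)


def project(row: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Return only the fields the model needs, not the entire row."""
    return {key: row[key] for key in fields if key in row}


print(top_level_tools())     # ['builds', 'calendar', 'discover']
print(discover("calendar"))  # ['create_event', 'list_events']
print(project(call("builds", "status", branch="main"), ["branch", "state"]))
```

The model's initial roster stays at three entries no matter how many operations sit behind each broad tool, and the build log never reaches the window unless a caller asks for it by name.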

Worth holding onto: context is no longer a place to put things; it is a place to spend things. Measure tokens spent per role, budget at design time rather than debug time, run quality regressions across context lengths, treat prefix stability as a cache-discipline requirement, and put stable content first and variable content last. The disciplines that make a single inference call succeed are the same disciplines that make a long-running session sustainable.
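Two of these practices, per-role token accounting and stable-first ordering, are easy to sketch. The budget figures, role names, and ordering below are hypothetical placeholders, and rough_tokens is a crude word count standing in for a real tokenizer.

```python
from typing import Dict, List, Tuple

# Hypothetical per-role budgets, in tokens, fixed at design time.
BUDGET: Dict[str, int] = {"system": 500, "tools": 1500, "history": 3000, "retrieved": 2000}

# Assumed ordering from most stable to most variable, so the prefix stays
# identical across calls and remains cache-friendly.
STABLE_ORDER: List[str] = ["system", "tools", "history", "retrieved"]


def rough_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def spend_by_role(parts: List[Tuple[str, str]]) -> Dict[str, int]:
    """Sum the approximate token spend for each role in the prompt."""
    spend: Dict[str, int] = {}
    for role, text in parts:
        spend[role] = spend.get(role, 0) + rough_tokens(text)
    return spend


def over_budget(spend: Dict[str, int], budget: Dict[str, int]) -> List[str]:
    """Roles that exceed their budget; unknown roles have a budget of zero."""
    return sorted(role for role, used in spend.items() if used > budget.get(role, 0))


def assemble(parts: List[Tuple[str, str]]) -> str:
    """Join prompt parts with stable roles first and variable roles last."""
    rank = {role: i for i, role in enumerate(STABLE_ORDER)}
    ordered = sorted(parts, key=lambda part: rank.get(part[0], len(rank)))
    return "\n\n".join(text for _, text in ordered)


parts = [
    ("retrieved", "chunk about the billing API"),
    ("system", "You are a concise support agent."),
    ("history", "User asked about invoices."),
]
print(spend_by_role(parts))
print(over_budget(spend_by_role(parts), BUDGET))
print(assemble(parts))
```

Running the over-budget check in CI or at request time is one way to move budgeting from debug time to design time, in the sense the paragraph above describes.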

What Chapter 9 sets up

This chapter framed context as a finite budget within a single inference call. What it did not cover is the question of time. An agent that runs for thirty seconds has a budget problem that fits in a single window. An agent that runs for thirty minutes, three hours, or three days has a memory problem that no window of any practical size can hold. The strategies for that scale of work are different in kind, not just in degree.

Next — Chapter 10: Long-Horizon Task Memory. Short-term mechanisms through sliding windows and ReAct scratchpads, long-term mechanisms through episodic vectors and semantic stores, and the compaction techniques that let an agent operate over hours and days.

Want the full picture? The book walks the MCP-Universe and BIG-Bench-Long numbers in detail, develops the cost and latency signatures of each substrate, and includes seven operational practices that production teams have converged on, from per-role token telemetry to position-aware prompt construction to per-call budget allocation across the agent loop.