Chapter 8 — Architectural Deployment Layouts

Eighth post of the chapter-by-chapter walkthrough of LLM Primer IV: Designing AI Cognition with MCP. In which two systems with identical orchestration logic turn out to behave very differently in production, because the question "where does the model actually run" is answered three different ways and each answer sets a different latency, cost, and security ceiling.

Why this chapter exists

Chapters 6 and 7 specified how agents coordinate. They were silent on a question that decides as much of the system's real-world behaviour: where does each agent run, and how are language models, MCP servers, and tools arranged across the network? The architectural decisions in this chapter touch every request and every dollar. Three broadly distinct layouts have emerged in the MCP ecosystem through 2025 and into 2026, and none of the three is universally right. Each is optimised for a different combination of constraints, and the choice has consequences engineers should be able to articulate before they make it.

One line: the three layouts — reusable agent, strict purity, hybrid — trade encapsulation against transparency, and the right answer is the one whose strengths align with the constraint that actually binds the project.

8.1 The reusable AI agent: model packaged with the server

The first layout packages a language model together with an MCP server and exposes the combination as a single black-box capability. The client calls "review this pull request" the way it would call a tool; under the hood, a full agentic loop runs on the server side, invoking the model, calling internal tools, and returning only the final output. The pattern emerged when organisations built specialised agents (research, code review, financial analysis) that they wanted to expose to many different host applications without each one having to understand the internals.
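A minimal sketch of that black-box shape, in plain Python rather than any real MCP SDK. Everything here is hypothetical and for illustration only: `stub_model` stands in for a language model, `INTERNAL_TOOLS` for the agent's private tools, and `review_pull_request` for the single capability the server exposes. The point is only that the loop and its transcript stay on the server side and the caller sees one string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelStep:
    """One decision from the (hypothetical) model: call a tool or finish."""
    tool: Optional[str] = None
    argument: str = ""
    final: Optional[str] = None


def stub_model(transcript: list[str]) -> ModelStep:
    # Toy deterministic policy standing in for a real model call:
    # fetch the diff, then lint, then produce the final review.
    if not any(line.startswith("diff:") for line in transcript):
        return ModelStep(tool="fetch_diff", argument=transcript[0])
    if not any(line.startswith("lint:") for line in transcript):
        return ModelStep(tool="run_linter")
    return ModelStep(final="Review: 1 lint warning; otherwise looks fine.")


# Internal tools are private to the agent; clients never see them.
INTERNAL_TOOLS: dict[str, Callable[[str], str]] = {
    "fetch_diff": lambda pr: f"diff: +3 -1 lines in {pr}",
    "run_linter": lambda _: "lint: 1 warning",
}


def review_pull_request(pr_id: str, max_steps: int = 5) -> str:
    """The only entry point a client can call."""
    transcript = [pr_id]
    for _ in range(max_steps):
        step = stub_model(transcript)
        if step.final is not None:
            # Only the final output crosses the boundary; the transcript does not.
            return step.final
        transcript.append(INTERNAL_TOOLS[step.tool](step.argument))
    raise RuntimeError("agent did not converge within max_steps")
```

Because the transcript is discarded on return, the client has no way to reconstruct why the agent reached its conclusion. The sketch therefore shows the debugging cost of this layout as well as the encapsulation.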

The strengths are real. Encapsulation lets the agent's authors swap models or restructure the orchestration without any change visible to clients. Reusability across host applications means a single agent serves Claude Code, an enterprise app, and a customer-facing chatbot through the same interface. The costs are equally real. Opacity makes debugging hard: when the agent makes a wrong decision, the client cannot see why. Latency stacks because the server-side loop runs on every call. And double-budgeting is structural: the agent operator pays for its model invocations and the client pays for its own, and the total per-interaction cost can be higher than either side initially realises.
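The double-budgeting point is easier to see with numbers. The following sketch uses invented figures (call counts and per-call prices are assumptions, not measurements) and integer micro-dollars to avoid floating-point noise. It only illustrates that each side sees its own line item and neither sees the total.

```python
def interaction_cost(
    client_calls: int,
    client_cost_per_call: int,
    agent_calls: int,
    agent_cost_per_call: int,
) -> dict[str, int]:
    """Per-interaction model spend in micro-dollars, split by who pays."""
    client_side = client_calls * client_cost_per_call
    agent_side = agent_calls * agent_cost_per_call
    return {
        "client": client_side,              # visible on the client's bill
        "agent": agent_side,                # visible on the agent operator's bill
        "total": client_side + agent_side,  # visible to neither by default
    }


# Hypothetical interaction: the client's own model runs twice, and the
# server-side agentic loop makes six model calls to answer one request.
example = interaction_cost(
    client_calls=2, client_cost_per_call=4_000,
    agent_calls=6, agent_cost_per_call=3_000,
)
```

With these assumed numbers the client sees 8,000 micro-dollars, the agent operator sees 18,000, and the interaction as a whole costs 26,000.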

The multi-tenant variant — one agent process serving many client organisations — substantially improves operational economics by amortising the fixed costs across tenants. It also introduces cross-tenant data leakage as a first-class failure mode. Several disclosed incidents in 2025 involved exactly this, and operators should design tenancy into the runtime rather than relying on code review to catch leaks between requests.

8.2 Strict MCP purity: the model in the client

The second layout puts the language model strictly in the client. MCP servers expose only data and tools — no inference. The host runs the model, makes the orchestration decisions, and calls servers to gather context and execute actions. The motivation is the protocol's original design rationale: MCP exists to solve the N-times-M integration problem by providing a uniform interface between models and tools, and servers that run their own models reintroduce the problem the protocol was meant to solve.

The strengths are control and transparency. Every model call is observable to the client's instrumentation. The client picks the model per request, can fall back across providers, and bears all model cost directly — no double-budgeting because there is no second model in the chain. For regulated industries where audit logs at the client must capture the full reasoning chain, strict purity satisfies the requirement without ambiguity.
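The control and transparency described above can be sketched as a client-side call wrapper that falls back across providers and records every attempt. The providers here are stubs and all names (`ProviderError`, `call_with_fallback`, `flaky_primary`, `steady_secondary`) are hypothetical; a real client would wrap actual provider SDK calls in the same shape.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider stub when a call fails."""


def call_with_fallback(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
    call_log: list[dict],
) -> str:
    """Try providers in order; log every attempt on the client side."""
    for name, call in providers:
        try:
            reply = call(prompt)
        except ProviderError as exc:
            call_log.append({"provider": name, "ok": False, "error": str(exc)})
            continue
        call_log.append(
            {"provider": name, "ok": True, "prompt": prompt, "reply": reply}
        )
        return reply
    raise ProviderError("all providers failed")


def flaky_primary(prompt: str) -> str:
    # Stub that always fails, to exercise the fallback path.
    raise ProviderError("rate limited")


def steady_secondary(prompt: str) -> str:
    # Stub that echoes, standing in for a successful model call.
    return f"echo: {prompt}"
```

Because there is no second model behind the servers in this layout, the `call_log` list is the complete record of inference for the request, which is the property an audit requirement at the client relies on.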

The costs are equally clear. Every piece of intelligence has to be reimplemented by every client, which is a real burden for organisations with many host applications. Some capabilities, such as multi-step research and iterative refinement, are awkward to factor as single-shot tools, and exposing them either pushes complex orchestration into the client or quietly violates purity by letting the server run its own model. And the client has to be capable of running orchestration, which in practice means leaning on a framework such as Strands, LangChain, Semantic Kernel, or the Microsoft Agent Framework rather than rolling the loop from scratch.

8.3 Hybrid: client-side orchestration with server-side execution

The third layout splits the difference. The client orchestrates and runs primary inference; certain MCP servers run their own model invocations for specialised subtasks. The clearest example is a long-running deep-research tool. A client that needs to fold a research result into a broader workflow does not want to manage every step of the research itself; it wants to call "research X and tell me what you found" and receive a synthesis. The research is multi-step, multi-source, and benefits from its own internal orchestration.

The hybrid pattern works when the boundary is drawn at a meaningful seam — a subtask with clear inputs and outputs, self-contained enough that server-side intelligence can complete it without client coordination, and specialised enough that running it as a black box beats running it through the general orchestrator. Research, code analysis, document synthesis fit the profile. General conversation does not.

The cost is architectural complexity. There are now two places where intelligence runs, observability has to span the boundary, and cost attribution falls between the two pure patterns, because model spend is divided between the client's primary inference and each intelligent server's own invocations. A useful discipline is to limit the number of intelligent servers and keep each focused on a well-defined capability. Two or three is operable; twelve becomes a federation no team understands. Hybrid deployments also tend to evolve toward either pure end of the spectrum over time, and architects should be honest about whether their hybrid is a stable design or a transitional state.
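A rough sketch of the hybrid discipline, under stated assumptions: the cap of three intelligent servers is taken from the "two or three is operable" rule of thumb, and the `Server`, `Orchestrator`, `search_docs`, and `deep_research` names are hypothetical. Each handler returns its result plus the number of server-side model calls it made, so the client can keep one trace and one cost view across the boundary.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_INTELLIGENT_SERVERS = 3  # assumption: the upper end of "two or three"

Handler = Callable[[str, str], tuple[str, int]]  # (request, trace_id) -> (result, model_calls)


@dataclass
class Server:
    name: str
    runs_model: bool  # True for an "intelligent" server
    handler: Handler


@dataclass
class Orchestrator:
    servers: dict[str, Server] = field(default_factory=dict)
    spans: list[dict] = field(default_factory=list)

    def register(self, server: Server) -> None:
        intelligent = sum(s.runs_model for s in self.servers.values())
        if server.runs_model and intelligent >= MAX_INTELLIGENT_SERVERS:
            raise ValueError("too many intelligent servers for one team to operate")
        self.servers[server.name] = server

    def call(self, name: str, request: str, trace_id: str) -> str:
        # The same trace_id is passed across the boundary and recorded here,
        # so client and server activity can be joined later.
        result, model_calls = self.servers[name].handler(request, trace_id)
        self.spans.append(
            {"trace_id": trace_id, "server": name, "server_model_calls": model_calls}
        )
        return result


def search_docs(request: str, trace_id: str) -> tuple[str, int]:
    return f"3 documents about {request}", 0  # plain tool, no inference


def deep_research(request: str, trace_id: str) -> tuple[str, int]:
    return f"synthesis of {request}", 4  # stub: pretends four model calls ran
```

Summing `server_model_calls` over the spans for one `trace_id` gives the server-side share of inference for that request, which is the figure that otherwise goes missing when intelligence runs in two places.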

Worth holding onto: read the three layouts as answers to four binding constraints — latency, cost, operational complexity, security posture. Strict purity gives the lowest minimum latency and the cleanest cost attribution. Reusable agents amortise expertise across host applications. Hybrid encapsulates intelligence-intensive subtasks without giving up overall orchestration. The right layout is the one whose strengths align with the constraint that actually binds, not the one whose name is most fashionable.

What Chapter 8 sets up

The three chapters of Part III have moved through the design space of multi-agent systems from the perspective of mechanism. What they do not yet cover is the substrate underneath: the context every model invocation receives and the memory that persists across invocations. An agent's effectiveness depends on what it can see when it acts and what it can recall about what it has done before. The context window is finite. The memory architecture has to be designed. The patterns of attention budget management, scratchpad use, episodic and semantic memory are the substrate that turns a coordinated set of agents into a system that does useful work over hours, days, or longer.

Next — Chapter 9: Managing the Attention Budget. Why a million-token window is a ceiling value rather than an operating point, what eats the budget, and how MCP, RAG, and fine-tuning each fit a different shape of gap.

Want the full picture? The book walks each layout with deployment topologies, the multi-tenant agent failure modes that have appeared in disclosed 2025 incidents, the migration paths between layouts, and the operational realities (observability spanning the MCP boundary, cost reporting aligned with architecture) that determine whether a deployment survives contact with production.