Chapter 9 — Performance, Scaling, and Costs
This is Part 9 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we walked through the application patterns. Today we talk about what those applications cost to actually run — the operational realities that determine whether an LLM feature ships or quietly dies in pilot.
Bigger is not always better
The instinct that larger models are always more capable is mostly correct — and mostly misleading. Larger models do, on average, perform better on most benchmarks. But the gains shrink as model size grows. A model with twice as many parameters is rarely twice as good. Often it's only marginally better, while costing several times more to run.
For many real applications, a smaller model with good prompting, good retrieval, and good evaluation outperforms a larger model that's been dropped into a less-engineered system. This is the single most important practical insight in Chapter 9, and the book takes time to defend it because it cuts against the marketing narrative of every frontier lab.
Latency and throughput pull in opposite directions
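Two metrics dominate LLM operations. Latency is how long a single request waits for its response; for streamed output it is usually split into time to first token and total generation time. Throughput is how many requests (or tokens) per second the system can handle.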
The trade-off between them is structural. To get high throughput, you batch requests together, running many of them through the GPU in parallel. To get low latency, you process each request individually as soon as it arrives. Push hard toward either extreme and you pay for it with the other.
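To make the trade-off concrete, here is a minimal sketch of a toy batching model. It assumes one batched forward pass costs a fixed overhead plus a small per-request cost, and that requests arrive at a steady rate. The function name `batch_stats` and every number in it are illustrative assumptions, not measurements from the book or from any real system.

```python
def batch_stats(batch_size, arrival_rate, fixed_ms=40.0, per_request_ms=2.0):
    """Toy model of static batching (all constants are hypothetical).

    One forward pass over a batch is assumed to take
    fixed_ms + per_request_ms * batch_size milliseconds.
    arrival_rate is in requests per second. Queueing effects are ignored.
    """
    step_ms = fixed_ms + per_request_ms * batch_size
    # Requests completed per second when every batch is full.
    throughput = batch_size / (step_ms / 1000.0)
    # The first request in a batch waits for the other batch_size - 1 arrivals.
    fill_wait_ms = (batch_size - 1) / arrival_rate * 1000.0
    worst_latency_ms = fill_wait_ms + step_ms
    return throughput, worst_latency_ms


for size in (1, 8, 32):
    throughput, latency = batch_stats(size, arrival_rate=100.0)
    print(f"batch={size:>2}  throughput={throughput:7.1f} req/s  "
          f"worst-case latency={latency:6.1f} ms")
```

In this toy model, larger batches spread the fixed overhead over more requests, so throughput climbs, while the first request in each batch waits longer for the batch to fill.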
Most real systems land somewhere in the middle using a technique called continuous batching, which adds new requests to a batch already in flight. Streaming generation — returning tokens to the user as they're produced rather than waiting for the full response — helps perceived latency even when the actual generation time is unchanged. The book walks through what design choices are appropriate for what kinds of products.
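The streaming point comes down to simple arithmetic. The sketch below assumes a fixed prompt-processing time and a constant time per generated token; `perceived_latency_ms` and its default values are hypothetical and only there to show the shape of the effect.

```python
def perceived_latency_ms(num_tokens, prefill_ms=200.0, per_token_ms=30.0,
                         streaming=True):
    """Return (time until the user sees anything, total generation time).

    Assumes a fixed prefill time and a constant per-token time.
    Both defaults are illustrative, not measured.
    """
    total_ms = prefill_ms + per_token_ms * num_tokens
    if streaming:
        # The first token is shown as soon as it is produced.
        first_visible_ms = prefill_ms + per_token_ms
    else:
        # Nothing is shown until the whole response is ready.
        first_visible_ms = total_ms
    return first_visible_ms, total_ms


print(perceived_latency_ms(300, streaming=True))
print(perceived_latency_ms(300, streaming=False))
```

Total generation time is identical in both calls. Only the wait before the first visible token changes.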
The actual cost equation
Running an LLM is not like running normal software, and one of the most expensive lessons in deploying AI is learning this the hard way. With traditional software, once you've built the system, additional users are nearly free; the server runs whether one user or one thousand are hitting it. With an LLM, every request consumes real compute, so costs scale roughly linearly with usage.
The dominant cost drivers are easy to enumerate. Longer inputs cost more (every input token has to be processed, and the attention computation grows with input length). Longer outputs cost more (each output token requires another forward pass through the model). Larger models cost more (more parameters means more FLOPs per token). Reasoning models cost more (they generate many intermediate tokens before the final answer). These don't add; they multiply: a larger model raises the cost of every token, while longer prompts, longer outputs, and reasoning raise the number of tokens.
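A back-of-the-envelope version of that equation can be sketched in a few lines. This is not the book's cost table. The per-million-token prices are placeholders, and the sketch assumes reasoning tokens are billed like output tokens, which you should confirm against your vendor's pricing page.

```python
def request_cost(input_tokens, output_tokens, reasoning_tokens=0,
                 price_in_per_million=1.0, price_out_per_million=4.0):
    """Estimated dollar cost of one request.

    Prices are hypothetical placeholders. Assumption: intermediate
    reasoning tokens are billed at the output-token price.
    """
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * price_in_per_million
            + billed_output * price_out_per_million) / 1_000_000


def monthly_cost(requests_per_day, days=30, **request_kwargs):
    """Cost scales with volume: requests per day x days x cost per request."""
    return requests_per_day * days * request_cost(**request_kwargs)


plain = monthly_cost(10_000, input_tokens=2_000, output_tokens=500)
reasoning = monthly_cost(10_000, input_tokens=2_000, output_tokens=500,
                         reasoning_tokens=3_000)
print(f"plain: ${plain:,.2f}/month   with reasoning: ${reasoning:,.2f}/month")
```

A larger model shows up here as higher per-token prices, which is why the drivers compound: price per token, tokens per request, and requests per month all multiply together.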
The book includes a worked-example cost table showing how an application's monthly bill changes across realistic scenarios — small chatbot, mid-tier RAG, agentic system with reasoning. The exact numbers move; the shape of the curve is durable.
Quantization, in plain language
One of the most powerful engineering tricks for reducing cost is quantization. The model's parameters are normally stored as floating-point numbers — often 16-bit or 32-bit precision. Quantization stores them with less precision, often 8 bits or even 4 bits per number, with carefully chosen scaling factors that limit the information loss.
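Here is a minimal sketch of the idea, using symmetric 8-bit quantization with a single scaling factor for the whole list of weights. It is one simple scheme among many, not necessarily the one the book describes, and the example weights are made up. Production libraries typically use per-channel or per-group scales and more careful rounding.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127].

    A single scale is chosen so the largest-magnitude weight maps to 127.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.33]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("stored integers:", q)
print("largest round-trip error:", max_error)
```

Each weight now needs one byte instead of two or four, and the round-trip error is bounded by half the scale. That bound is the information loss the scaling factor is chosen to limit.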
The effect is dramatic. A quantized model uses much less memory, which means it fits on smaller hardware and runs faster (because memory bandwidth, not raw computation, is often the bottleneck). The quality loss is small if done well — typically a few percentage points on benchmarks, sometimes imperceptible in practice.
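The memory saving is easy to estimate yourself. The sketch below counts weight storage only, for a hypothetical 7-billion-parameter model. It ignores activations, the key-value cache, and the small overhead of storing scaling factors, so treat the results as a lower bound.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # hypothetical model size, chosen only for illustration
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Halving the bits halves both the memory footprint and the bytes that have to move through memory for each generated token, which is where the speedup on bandwidth-bound hardware comes from.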
Several related techniques — pruning, distillation, sparsity — push efficiency further. Distillation in particular is how you get a small model that mimics a much larger one. The book covers each technique and when each is appropriate.
Edge and on-device deployment
Most production LLMs run in cloud data centers. Some applications can't. Real-time voice assistants, applications with hard privacy constraints, devices that need to work without network connectivity — these benefit from running a model directly on the user's hardware.
Edge deployment is its own engineering discipline. Memory is constrained. Power is precious. The available compute is far smaller than a data center's. The book walks through how heavy compression, careful model selection, and hybrid architectures (small local model for routine work, cloud model for hard cases) make this possible.
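One way to picture the hybrid pattern is a small router that tries the local model first and escalates only when it has to. Everything in this sketch is hypothetical: the `HybridRouter` class, the stub models, and the assumption that the local model can report a confidence score. Real systems use a variety of escalation signals.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HybridRouter:
    # Hypothetical interfaces: the local model returns (answer, confidence in [0, 1]).
    local_model: Callable[[str], Tuple[str, float]]
    cloud_model: Callable[[str], str]
    min_confidence: float = 0.8
    online: bool = True

    def answer(self, prompt: str) -> Tuple[str, str]:
        """Return (answer, which model produced it)."""
        text, confidence = self.local_model(prompt)
        # Stay local when the small model is confident, or when there is no network.
        if confidence >= self.min_confidence or not self.online:
            return text, "local"
        return self.cloud_model(prompt), "cloud"


def tiny_local_model(prompt):
    # Stand-in for an on-device model: pretend short prompts are easy.
    confidence = 0.95 if len(prompt.split()) <= 8 else 0.4
    return f"[local] {prompt}", confidence


def big_cloud_model(prompt):
    # Stand-in for a call to a hosted model.
    return f"[cloud] {prompt}"


router = HybridRouter(tiny_local_model, big_cloud_model)
print(router.answer("set a timer for ten minutes"))
print(router.answer("compare these three contracts and summarize every clause that differs"))
```

The fallback to the local answer when `online` is false reflects the offline requirement: a weaker answer on the device beats no answer at all.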
What Chapter 9 sets up
By the end of Chapter 9, you can model the cost and performance of any proposed LLM application before you build it. You know which design choices matter most for cost, which for latency, and which trade-offs are negotiable versus fundamental. You can read vendor pricing pages and understand what's actually being charged for.
This sets up the natural next question — once you can run a system at scale, how do you make sure you should?
Next up is Chapter 10: Safety, Ethics, & Trust. Tomorrow we look at hallucinations, bias, prompt safety, explainability, and the governance frameworks that turn LLM applications into responsible products, including the controls that high-stakes deployments actually use.