Chapter 9 — Performance, Scaling, and Costs: The Real Engineering Trade-offs

Published on: 2026-02-26 | Last updated on: 2026-06-05 | Version: 3

This is Part 9 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we walked through the application patterns. Today we talk about what those applications cost to actually run — the operational realities that determine whether an LLM feature ships or quietly dies in pilot.


Bigger is not always better

The instinct that larger models are always more capable is mostly correct — and mostly misleading. Larger models do, on average, perform better on most benchmarks. But the gains shrink as model size grows. A model with twice as many parameters is rarely twice as good. Often it's only marginally better, while costing several times more to run.

For many real applications, a smaller model with good prompting, good retrieval, and good evaluation outperforms a larger model that's been dropped into a less-engineered system. This is the single most important practical insight in Chapter 9, and the book takes time to defend it because it cuts against the marketing narrative of every frontier lab.

Key idea: Match model size to the actual problem. A smaller model that's correctly engineered into your system almost always beats a frontier-tier model running in a generic pipeline.

Latency and throughput pull in opposite directions

Two metrics dominate LLM operations. Latency is how long a single request waits for its response, usually tracked as both time to first token and time to the full completion. Throughput is how many requests per second the system can handle.

The trade-off between them is structural. To get high throughput, you batch requests together, running many of them through the GPU in parallel. To get low latency, you process each request individually as soon as it arrives. Push all the way toward one and you sacrifice the other.
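To make the trade-off concrete, here is a minimal toy model, not a benchmark. It assumes each GPU step has a fixed overhead plus a small per-request cost, and that a batch waits for its last request to arrive before it runs. The constants and the `batch_stats` helper are illustrative assumptions, not figures from the book.

```python
# Toy model of static batching. All constants are made-up assumptions.
FIXED_STEP_MS = 40.0    # assumed fixed cost of one pass over the model
PER_REQUEST_MS = 2.0    # assumed extra cost per request in the batch
ARRIVAL_GAP_MS = 5.0    # assumed gap between incoming requests


def batch_stats(batch_size: int) -> tuple[float, float]:
    """Return (worst-case latency in ms, peak throughput in requests/s)."""
    # The first request in a batch waits for the remaining ones to arrive.
    wait_ms = (batch_size - 1) * ARRIVAL_GAP_MS
    step_ms = FIXED_STEP_MS + PER_REQUEST_MS * batch_size
    latency_ms = wait_ms + step_ms
    throughput_rps = batch_size / step_ms * 1000.0
    return latency_ms, throughput_rps


for size in (1, 4, 16, 64):
    latency, throughput = batch_stats(size)
    print(f"batch={size:>2}  latency={latency:7.1f} ms  "
          f"throughput={throughput:7.1f} req/s")
```

Under these assumptions both numbers rise with batch size: the fixed overhead is shared across more requests, but each request waits longer before it is served.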

Most real systems land somewhere in the middle using a technique called continuous batching, which adds new requests to a batch already in flight. Streaming generation — returning tokens to the user as they're produced rather than waiting for the full response — helps perceived latency even when the actual generation time is unchanged. The book walks through what design choices are appropriate for what kinds of products.
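The streaming point can be sketched with a simulated clock. This is an illustration under assumed timings (`PREFILL_MS` and `PER_TOKEN_MS` are invented constants); it covers streaming only, not continuous batching.

```python
from typing import Iterator

PREFILL_MS = 300.0    # assumed time to process the prompt
PER_TOKEN_MS = 30.0   # assumed time to generate each output token


def stream_tokens(tokens: list[str]) -> Iterator[tuple[str, float]]:
    """Yield (token, simulated ms since the request) as tokens are produced."""
    elapsed = PREFILL_MS
    for token in tokens:
        elapsed += PER_TOKEN_MS
        yield token, elapsed


def timings(tokens: list[str]) -> tuple[float, float]:
    """Return (time to first token, time to full response) in ms."""
    stamps = [stamp for _, stamp in stream_tokens(tokens)]
    return stamps[0], stamps[-1]


first_ms, total_ms = timings(["tok"] * 100)
print(f"first token after {first_ms:.0f} ms, full response after {total_ms:.0f} ms")
```

With streaming the user starts reading at the first timestamp. Without it they wait for the second, even though the total generation time is identical.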

The actual cost equation

Running an LLM is not like running normal software, and one of the most expensive lessons in deploying AI is learning this the hard way. With traditional software, once you've built the system, additional users are nearly free; the server runs whether one user or one thousand are hitting it. With an LLM, every request costs real computing power. Costs scale linearly with usage.

The dominant cost drivers are easy to enumerate. Longer inputs cost more (every prompt token has to be processed, and attention cost grows with sequence length). Longer outputs cost more (each output token requires another forward pass through the model). Larger models cost more (more parameters means more FLOPs per token). Reasoning models cost more (they generate many intermediate tokens before the final answer). These factors don't add; they multiply: a larger model producing longer outputs for more requests pays for all of those increases at once.
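A rough way to see these drivers interact is a back-of-the-envelope cost function. The sketch below assumes per-token pricing with separate input and output rates, and assumes reasoning tokens are billed as output tokens. The `Pricing` class and every number are hypothetical, not any vendor's actual rates and not the book's table.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(p: Pricing, input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Cost of one request in USD."""
    # Assumption: intermediate reasoning tokens are billed like output tokens.
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * p.input_per_mtok
            + billed_output * p.output_per_mtok) / 1_000_000


def monthly_cost(p: Pricing, requests_per_day: int, input_tokens: int,
                 output_tokens: int, reasoning_tokens: int = 0,
                 days: int = 30) -> float:
    """Request volume multiplies the per-request cost."""
    per_request = request_cost(p, input_tokens, output_tokens, reasoning_tokens)
    return requests_per_day * days * per_request


example = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)
plain = monthly_cost(example, 10_000, input_tokens=1_000, output_tokens=500)
reasoning = monthly_cost(example, 10_000, input_tokens=1_000, output_tokens=500,
                         reasoning_tokens=5_000)
print(f"plain: ${plain:,.2f}/month, with reasoning: ${reasoning:,.2f}/month")
```

Swapping in a pricier model changes `Pricing`, longer prompts change `input_tokens`, and growth changes `requests_per_day`, so the increases compound.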

The book includes a worked-example cost table showing how an application's monthly bill changes across realistic scenarios — small chatbot, mid-tier RAG, agentic system with reasoning. The exact numbers move; the shape of the curve is durable.

Quantization, in plain language

One of the most powerful engineering tricks for reducing cost is quantization. The model's parameters are normally stored as floating-point numbers — often 16-bit or 32-bit precision. Quantization stores them with less precision, often 8 bits or even 4 bits per number, with carefully chosen scaling factors that limit the information loss.
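As a minimal sketch of the idea, the snippet below applies symmetric 8-bit quantization with a single scaling factor for the whole list of weights. Production schemes are more elaborate (per-channel or per-group scales, 4-bit formats), and the function names here are illustrative rather than taken from any library.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.30, 0.07, 0.88, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"worst error={worst_error:.5f}")
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is bounded by about half the scale.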

The effect is dramatic. A quantized model uses much less memory, which means it fits on smaller hardware and runs faster (because memory bandwidth, not raw computation, is often the bottleneck). The quality loss is small if done well — typically a few percentage points on benchmarks, sometimes imperceptible in practice.
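The memory saving is simple arithmetic. The helper below counts weight storage only and ignores the overhead of scaling factors, the KV cache, and activations. The 7-billion-parameter model is a hypothetical example.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # hypothetical 7-billion-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):5.1f} GB")
```

Under these assumptions the same model drops from about 14 GB at 16 bits to about 3.5 GB at 4 bits, which is the difference between needing a data-center GPU and fitting on consumer hardware.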

Several related techniques — pruning, distillation, sparsity — push efficiency further. Distillation in particular is how you get a small model that mimics a much larger one. The book covers each technique and when each is appropriate.
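One common way to set up distillation, shown here as an assumption rather than the book's exact recipe, is to train the small model to match the large model's softened output probabilities. The sketch computes that matching loss for a single prediction in plain Python. The function names and the temperature value are illustrative.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, [3.5, 1.2, 0.4]))  # student close to teacher
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # student far from teacher
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge, so minimizing it over many examples pushes the small model toward the large model's behavior.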

Edge and on-device deployment

Most production LLMs run in cloud data centers. Some applications can't. Real-time voice assistants, applications with hard privacy constraints, devices that need to work without network connectivity — these benefit from running a model directly on the user's hardware.

Edge deployment is its own engineering discipline. Memory is constrained. Power is precious. The available compute is far smaller than a data center's. The book walks through how heavy compression, careful model selection, and hybrid architectures (small local model for routine work, cloud model for hard cases) make this possible.
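The hybrid pattern can be sketched as a small router. It assumes the local model returns a confidence score alongside its answer, which is a simplification; real systems may use a separate classifier or heuristics. `make_router`, `fake_local`, and `fake_cloud` are hypothetical stand-ins, not real APIs.

```python
from typing import Callable

LocalModel = Callable[[str], tuple[str, float]]  # returns (answer, confidence)
CloudModel = Callable[[str], str]


def make_router(local_model: LocalModel, cloud_model: CloudModel,
                min_confidence: float = 0.8) -> Callable[[str], tuple[str, str]]:
    """Build a function that answers locally when confident, else escalates."""
    def route(prompt: str) -> tuple[str, str]:
        answer, confidence = local_model(prompt)
        if confidence >= min_confidence:
            return answer, "local"
        return cloud_model(prompt), "cloud"
    return route


def fake_local(prompt: str) -> tuple[str, float]:
    # Stand-in: pretend short prompts are easy and long ones are hard.
    confidence = 0.95 if len(prompt) < 40 else 0.3
    return f"local answer to: {prompt}", confidence


def fake_cloud(prompt: str) -> str:
    return f"cloud answer to: {prompt}"


route = make_router(fake_local, fake_cloud)
print(route("What time is it?"))
print(route("Summarize the attached forty-page contract and flag unusual clauses."))
```

The threshold is the main tuning knob: raising it sends more traffic to the cloud model, trading cost and connectivity for quality.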

Important: Reasoning models that scale compute at inference time have a different cost shape from traditional LLMs. Per request, costs can vary by an order of magnitude depending on how much "thinking" the model decides to do. Budgeting and rate-limiting need to account for this variance.
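One simple way to bound that variance is to cap tokens per request and track spend against a daily allowance. The `TokenBudget` class below is a hypothetical policy sketch, not a feature of any particular API. It assumes the serving layer lets you pass a maximum token count per request and reports actual usage afterward.

```python
class TokenBudget:
    """Tracks token spend against a daily limit with a per-request cap."""

    def __init__(self, daily_limit: int, per_request_cap: int) -> None:
        self.daily_limit = daily_limit
        self.per_request_cap = per_request_cap
        self.used = 0

    def reserve(self, requested_max_tokens: int) -> int:
        """Return the token ceiling for this request; 0 means reject it."""
        remaining = self.daily_limit - self.used
        return max(0, min(requested_max_tokens, self.per_request_cap, remaining))

    def record(self, tokens_used: int) -> None:
        """Record what the request actually consumed."""
        self.used += tokens_used


budget = TokenBudget(daily_limit=10_000, per_request_cap=4_000)
ceiling = budget.reserve(8_000)   # capped by per_request_cap
budget.record(ceiling)            # assume the model used its full allowance
print(ceiling, budget.reserve(8_000), budget.used)
```

Capping each request limits the worst case, while the daily limit keeps one burst of long reasoning chains from consuming the whole month's budget.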

What Chapter 9 sets up

By the end of Chapter 9, you can model the cost and performance of any proposed LLM application before you build it. You know which design choices matter most for cost, which for latency, and which trade-offs are negotiable versus fundamental. You can read vendor pricing pages and understand what's actually being charged for.

This sets up the natural next question — once you can run a system at scale, how do you make sure you should?


Next up, Chapter 10: Safety, Ethics, & Trust. Tomorrow we look at hallucinations, bias, prompt safety, explainability, and the governance frameworks that turn LLM applications into responsible products, including the controls that high-stakes deployments actually use.

Want the full picture? The book includes a realistic cost-modeling table and a careful treatment of when each efficiency technique applies. Grab LLM Primer I on Amazon →

SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.