Chapter 12 — Real-World Applications of LLMs

Twelfth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The first chapter of Part IV — where the book turns from how models are built and measured to what they actually do, read through the lens of the math we now have.

The same math, different jobs

Chapter 12 has a particular shape. It does not introduce new mathematics. It revisits the applications most readers have already used — and asks, for each one, what the math from the previous eleven chapters implies about how that application works, why it sometimes fails, and where the room for improvement actually lies.

It is a short chapter on purpose. Once you can see the math underneath, the applications make more sense than any user-facing description could.

12.1 Text generation and summarization

Section 12.1 takes the two most familiar uses of LLMs and shows that they are, mathematically, the same task with different prompts.

Generation is sampling from the next-token distribution, one token at a time, conditioned on whatever context you've assembled. Summarization is generation where the context is a long document and the prompt asks for a shorter version. There is no separate "summarization architecture." The model is doing what it was always doing — predicting the next token — and the prompt is the only thing telling it what to predict.
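To make the loop concrete, here is a minimal sketch of token-by-token sampling. The "model" is a hypothetical bigram table over a five-word vocabulary (the names VOCAB, BIGRAM_LOGITS, and generate are assumptions made for illustration, not anything from the book). A real LLM conditions on the whole context rather than just the last token, but the loop has the same shape.

```python
import math
import random

# Hypothetical toy "model": logits over the vocabulary, keyed by the last token.
# A real LLM would compute these logits from the entire context.
VOCAB = ["the", "cat", "sat", "down", "<eos>"]
BIGRAM_LOGITS = {
    "<bos>": [2.0, 0.5, 0.0, 0.0, -2.0],
    "the": [-2.0, 2.0, 0.5, 0.0, -1.0],
    "cat": [-1.0, -2.0, 2.0, 0.5, 0.0],
    "sat": [0.0, -1.0, -2.0, 2.0, 1.0],
    "down": [-1.0, -1.0, -1.0, -2.0, 3.0],
}


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(context):
    """Next-token distribution given the context (here: only its last token)."""
    return softmax(BIGRAM_LOGITS[context[-1]])


def generate(prompt, max_new_tokens=10, seed=0):
    """Sample one token at a time, appending each sample to the context."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        context.append(token)  # the sample becomes part of the next context
    return context


print(generate(["<bos>"]))
```

Swapping the prompt for a long document followed by a request for a shorter version changes nothing in this loop. Only the context changes.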

This has consequences. It explains why summaries can hallucinate: the model has no internal mechanism that distinguishes "in the document" from "consistent with the document." It explains why temperature and top-p, which we rederived more carefully in Chapter 4, change the character of summaries as much as they change the character of fiction.
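As a rough illustration of those two knobs, the sketch below applies temperature scaling and top-p (nucleus) filtering to a made-up logit vector. The function names and the numbers are assumptions for the example. The definitions are the standard ones, not necessarily the exact presentation used in Chapter 4.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.0, -1.0]           # made-up logits for four tokens
cold = softmax(logits, temperature=0.5)  # sharper: mass concentrates on the top token
hot = softmax(logits, temperature=2.0)   # flatter: unlikely tokens gain mass
nucleus = top_p_filter(softmax(logits), top_p=0.9)  # low-probability tail removed
```

Low temperature pushes a summary toward its single most likely phrasing, while high temperature and a wide nucleus let in continuations that are merely plausible, which is where unsupported details tend to enter.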

The section then introduces constrained decoding: the family of techniques that force the model to produce only outputs that match some structural rule. Valid JSON. A grammar. A regular expression. A function signature. The math is small but elegant: at each token, you mask the distribution to zero out the candidates that would violate the constraint, renormalize what remains so it sums to one, then sample from it. The same mechanism that lets the model improvise is what lets it be reined in.
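The masking step can be sketched in a few lines. This is a toy, under stated assumptions: the vocabulary is single characters, the constraint is "the output must be the JSON boolean true or false", and allowed() checks it by prefix matching. Production systems compile a grammar or regular expression into a token-level automaton instead. All names here are hypothetical.

```python
import math
import random

VOCAB = ["t", "r", "u", "e", "f", "a", "l", "s", "<eos>"]
VALID_OUTPUTS = {"true", "false"}  # hypothetical constraint: a JSON boolean


def allowed(prefix, token):
    """True if emitting `token` after `prefix` can still lead to a valid output."""
    if token == "<eos>":
        return prefix in VALID_OUTPUTS
    return any(v.startswith(prefix + token) for v in VALID_OUTPUTS)


def constrained_distribution(logits, prefix):
    """Mask violating tokens to -inf, then softmax (which renormalizes the rest)."""
    masked = [x if allowed(prefix, tok) else -math.inf
              for x, tok in zip(logits, VOCAB)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


def constrained_sample(logit_fn, seed=0, max_steps=10):
    """Sample token by token from the masked distribution until <eos>."""
    rng = random.Random(seed)
    prefix = ""
    for _ in range(max_steps):
        probs = constrained_distribution(logit_fn(prefix), prefix)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        prefix += token
    return prefix


# A "model" with no preferences at all still yields only valid outputs.
uniform = lambda prefix: [0.0] * len(VOCAB)
print(constrained_sample(uniform, seed=1))
```

The model's logits are never changed, only filtered, so among the valid continuations the model's own relative preferences are preserved.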

And the section covers code, tools, and the agent loop: generation where the model's output is intercepted and run, and the result is fed back into the context for the next step. This is the foundation that Book IV will spend a volume on. The math is in this chapter; the engineering is in that book.
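A bare-bones version of that loop might look like the sketch below. Everything in it is a stand-in chosen for illustration: toy_model is a scripted function rather than an LLM, the CALL / RESULT / ANSWER strings are an invented protocol, and TOOLS holds a single hypothetical tool.

```python
def toy_model(context):
    """Hypothetical stand-in for an LLM: asks for a tool once, then answers."""
    last = context[-1]
    if last.startswith("RESULT "):
        return "ANSWER " + last[len("RESULT "):]
    return "CALL add 2 3"


# Hypothetical tool registry: name -> Python callable.
TOOLS = {"add": lambda a, b: a + b}


def agent_loop(question, model=toy_model, max_steps=5):
    """Generate, intercept tool calls, run them, and feed results back."""
    context = [question]
    for _ in range(max_steps):
        output = model(context)
        context.append(output)
        if output.startswith("CALL "):
            _, name, *args = output.split()
            result = TOOLS[name](*(int(a) for a in args))
            context.append(f"RESULT {result}")  # the result becomes new context
        else:
            return output, context
    return None, context  # step budget exhausted without a final answer


answer, trace = agent_loop("What is 2 + 3?")
print(answer)
```

The point of the sketch is that the tool result is just more context: from the model's side, the next step is ordinary next-token prediction.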

One line: there is no "summarization mode." There is only next-token prediction in a context that asks for a summary. Every property of summarization output traces back to that single fact.

12.2 Question answering, translation, and reasoning tasks

Section 12.2 takes three more applications that feel very different from the outside and shows that they are, again, the same machinery wearing different clothes.

Question answering is generation conditioned on a context (the question, possibly retrieved documents) where the most probable continuation is an answer. The chapter shows how the closed-book version (model knowledge only) and the open-book version (with retrieval, the topic of Book III) differ mathematically — what changes is the conditioning, not the model. The geometry of Chapter 3 — embedding queries and passages as vectors, finding nearest neighbors by inner product — is exactly what makes the open-book version possible.
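The difference between the two versions can be shown with a small sketch. The passages, their three-dimensional vectors, and the query vector are hand-made for the example (a real system would get them from a trained embedding model), and the function names are assumptions.

```python
# Hypothetical passages with hand-made 3-d embeddings.
PASSAGES = {
    "p1": "Paris is the capital of France.",
    "p2": "Mitochondria produce most of a cell's ATP.",
    "p3": "France shares a border with Spain.",
}
PASSAGE_VECS = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.0, 0.2, 0.9],
    "p3": [0.7, 0.3, 0.1],
}


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, k=2):
    """Return the ids of the k passages with the largest inner product."""
    ranked = sorted(
        PASSAGE_VECS,
        key=lambda pid: inner_product(query_vec, PASSAGE_VECS[pid]),
        reverse=True,
    )
    return ranked[:k]


def closed_book_context(question):
    """Closed book: the model conditions on the question alone."""
    return f"Question: {question}\nAnswer:"


def open_book_context(question, query_vec, k=2):
    """Open book: same model, but retrieved passages are added to the context."""
    docs = "\n".join(PASSAGES[pid] for pid in retrieve(query_vec, k))
    return f"{docs}\nQuestion: {question}\nAnswer:"


question = "What is the capital of France?"
query_vec = [1.0, 0.0, 0.0]  # made-up embedding of the question
print(open_book_context(question, query_vec))
```

Both functions produce a string for the same model to continue. Retrieval changes what the model is conditioned on and nothing else.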

Translation is generation where the context is a sentence in one language and the most probable continuation is the same sentence in another. The math is identical. What makes translation work is that the training data contained enough aligned bilingual text that the joint distribution over the two languages is well-represented.
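Written out, "the math is identical" means the translation is scored by the usual autoregressive factorization. The notation is an assumption made here for illustration: x is the context (the instruction plus the source sentence) and y = (y_1, ..., y_n) is the target sentence in tokens.

```latex
p(y \mid x) \;=\; \prod_{t=1}^{n} p\bigl(y_t \mid x,\; y_{<t}\bigr)
```

Replace x with a document and a request for a summary, or with a question, and the formula does not change.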

Reasoning tasks are where the chapter gets careful. Chain-of-thought prompting, which asks the model to write out its reasoning steps before answering, works because of a mathematical fact this book has been building toward: each token the model writes becomes part of the context for the next. Every intermediate token is one more forward pass, and every later token can condition on what it produced, so writing out the steps literally gives the model more computation to spend before it commits to an answer. Reasoning is not a separate capability bolted onto generation. It is an emergent consequence of generation done at length, with the right structure.
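One way to write this down, using notation assumed here rather than taken from the book: let q be the question, r = (r_1, ..., r_T) the written reasoning, and a the final answer. By the law of total probability,

```latex
p(a \mid q) \;=\; \sum_{r} p(r \mid q)\, p(a \mid q, r),
\qquad
p(r \mid q) \;=\; \prod_{t=1}^{T} p\bigl(r_t \mid q,\; r_{<t}\bigr)
```

In practice the model samples a single r rather than summing over all of them, and the answer is then drawn from p(a | q, r). Producing that r costs T additional forward passes, which is the extra computation the paragraph above refers to.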

The section closes by showing how the same tokens-and-context machinery generalizes beyond text — images, audio, video — once you can convert any modality into tokens. The mathematical reason multimodal models work is the same reason text generation works: the transformer does not know it is reading text in the first place.

Worth holding onto: the variety of LLM applications is the variety of contexts, not the variety of mechanisms. For every token: one forward pass, one next-token distribution, one sample. The diversity is in the prompt and the data, not in what the model is doing.

Why this chapter is short

Most books about LLMs spend their longest chapters on applications. This one spends a short chapter, because if you've followed the math, you do not need long descriptions. You can read about a new application and immediately ask the right questions: what is the context? what does the next-token distribution look like? what kinds of failure does that suggest? what would change about the answer if you changed temperature, the prompt, the retrieval source?

That is the test the book is trying to pass for you. Chapter 12 is where you check that it has.

What Chapter 12 sets up

You finish Chapter 12 with the same toolkit you finished Chapter 11 with — but now you have watched it explain four families of real applications. From here, the book turns to the limitations that are not solved by more math, more data, or more compute.

Next — Chapter 13: Limitations, Risks, and Open Challenges. The honest chapter. The energy and compute ceilings that constrain where the field can go. The biases that scale with the data. The ethical questions that math cannot answer alone. The list of problems waiting to be solved.

Want the full picture? The book walks through each application with a concrete worked example, including the failure modes the math predicts and the empirical fixes the field has converged on.