Chapter 3 — Mathematical Tools for Language Models

Published on: 2026-03-05 | Last updated on: 2026-06-05 | Version: 3

Third post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The final preparation chapter — the moment we lay the tools out on the bench, in the order we'll use them.


The last chapter before the math turns on

Chapter 3 is brisk. It assumes you've made it through Chapter 1's entropy and Chapter 2's vocabulary, and now it stocks the workbench. Two slices of mathematics — probability and linear algebra — get exactly as much depth as the rest of the book will need.

That is less depth than a textbook on either subject would give you, and more than a popular article would. The book takes the time to derive the moves it will actually use. It does not take the time to derive the moves it never will.

3.1 Probability and statistics in language modeling

The first section sharpens the probability of Chapter 1 into the specific operations the rest of Book II depends on.

You meet conditional probability again, but this time written in the form a model is actually computing: the probability of token x_t given all the tokens before it, parameterized by a function with weights θ. Maximum likelihood estimation shows up — the principle behind training, written as one clean optimization problem before any optimizer touches it.
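Written out, the two ideas above fit in two lines. This is a sketch in standard notation rather than a quotation from the book: T (the sequence length) and x_{<t} (all tokens before position t) are symbols I'm assuming here, while x_t and θ are as in the paragraph above.

```latex
% Autoregressive factorization: a sequence's probability is a product of
% next-token conditionals, each computed by the model with weights \theta.
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta(x_t \mid x_{<t})

% Maximum likelihood estimation: choose the weights that make the observed
% text most probable. Taking logs turns the product into a sum.
\hat{\theta} = \arg\max_{\theta} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
```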

The chapter then walks you through cross-entropy as the loss function. Not as a definition handed down from the sky, but as the natural consequence of asking: "if my model assigns probability p̂ to what actually happened, how surprised was it?" Average that surprise across the training data, and you've derived the loss yourself.
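To make "average that surprise" concrete, here is a minimal Python sketch. The probability lists are made-up numbers for illustration, not output from any real model, and the function name cross_entropy is my own choice.

```python
import math

def cross_entropy(probs_of_actual):
    """Average surprise, -log(p), over the tokens that actually occurred (in nats)."""
    return -sum(math.log(p) for p in probs_of_actual) / len(probs_of_actual)

# Hypothetical probabilities a model assigned to the tokens that really came next.
confident = [0.9, 0.8, 0.95]   # the model mostly saw it coming
unsure = [0.2, 0.1, 0.3]       # the model was surprised each time

loss_confident = cross_entropy(confident)
loss_unsure = cross_entropy(unsure)

print(f"confident model loss: {loss_confident:.3f}")
print(f"unsure model loss:    {loss_unsure:.3f}")
```

The less probability the model put on what actually happened, the larger the loss. Minimizing this average is the same thing as maximizing the likelihood of the training text.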

By the end of 3.1, the training objective is no longer a black box. It's a single line of probability theory.

One line: training an LLM is maximum likelihood estimation on text. The vast machinery is in service of that simple statistical principle.

3.2 Vector spaces, embeddings, and linear algebra intuition

The second section opens the geometric side of the book. Words, tokens, sentences, even attention scores — all of them eventually become points and directions in high-dimensional vector spaces. Section 3.2 introduces the linear algebra that makes that geometry usable.

The chapter takes a deliberate shortcut: it teaches by intuition first. A vector is a list of numbers. A vector space is "all the lists of numbers of a given length, plus the rules for adding and scaling them." Dot products measure alignment. Norms measure length. Matrix multiplication is a particularly orderly way of computing many dot products at once.
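The three operations named above fit in a few lines of plain Python. This is a sketch with arbitrary small vectors chosen so the arithmetic is easy to check by hand. The helper names (dot, norm, matvec) are mine.

```python
import math

def dot(u, v):
    """Dot product: multiply matching entries and add them up."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length: the square root of a vector's dot product with itself."""
    return math.sqrt(dot(v, v))

def matvec(M, v):
    """Matrix-vector product: one dot product per row of M."""
    return [dot(row, v) for row in M]

u = [3.0, 4.0]
v = [4.0, 3.0]
M = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

print(dot(u, v))     # 3*4 + 4*3 = 24.0
print(norm(u))       # sqrt(9 + 16) = 5.0
print(matvec(M, u))  # [3.0, 4.0, 7.0]
```

Multiplying by a full matrix instead of a single vector just repeats matvec for every column, which is the "many dot products at once" reading of matrix multiplication.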

Then it shows what each of those means for language. Two word vectors count as similar when their dot product is large relative to their lengths, which is what cosine similarity measures. The famous "king − man + woman ≈ queen" example is unpacked carefully, as one of several legitimate but often overhyped consequences of training embeddings on enough text.
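The analogy arithmetic itself is easy to try on a toy. In the sketch below the two-dimensional vectors are invented so that the analogy holds exactly (one axis loosely "royalty", the other loosely "gender"). Learned embeddings only satisfy it approximately, which is part of why the example gets overhyped.

```python
import math

# Hypothetical hand-picked embeddings, not taken from any trained model.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(u, v):
    """Cosine similarity: the dot product divided by the product of the lengths."""
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# king - man + woman, computed entry by entry.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Pick the vocabulary word whose vector points most nearly the same way.
nearest = max(emb, key=lambda word: cosine(target, emb[word]))
print(target, nearest)
```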

The worked example here is small: a four-word vocabulary, a hand-built embedding table, and a computation you can do on paper to see why "cat" and "dog" come out close while "cat" and "concrete" come out far. The same shape, scaled up, is what's running inside every embedding layer in every modern LLM.
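A paper-sized version of that computation might look like the following. The vocabulary and every number in the table are my own stand-ins, not the book's: the fourth word ("brick") and the three-dimensional vectors are assumptions made purely for illustration.

```python
import math

# Hypothetical hand-built embedding table: one row per word.
embedding_table = {
    "cat":      [0.9, 0.8, 0.1],
    "dog":      [0.8, 0.9, 0.2],
    "concrete": [0.1, 0.0, 0.9],
    "brick":    [0.0, 0.1, 0.8],
}

def cosine(u, v):
    """Dot product of u and v, divided by the product of their lengths."""
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

sim_cat_dog = cosine(embedding_table["cat"], embedding_table["dog"])
sim_cat_concrete = cosine(embedding_table["cat"], embedding_table["concrete"])

print(f"cat vs dog:      {sim_cat_dog:.2f}")
print(f"cat vs concrete: {sim_cat_concrete:.2f}")
```

"cat" and "dog" share large values in the same positions, so their similarity is close to 1. "cat" and "concrete" barely overlap, so theirs is close to 0.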

The bridge to Part II

Chapter 3 ends by joining the two halves of itself. Probabilities are sums and products and logs. Embeddings are dot products and matrix multiplications. Attention, which arrives in the very next chapter, is exactly what you get when you let these two languages run into each other inside a single layer.

Specifically: attention takes vectors (the linear algebra), computes their similarities, runs them through a softmax to turn similarities into a probability distribution (the probability), and uses that distribution to take a weighted average of other vectors. Every one of these moves rests on tools this chapter just introduced. Attention is the assembly.
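Those steps can be sketched for a single query in plain Python. This is a bare-bones illustration with made-up vectors, not the derivation Chapter 4 gives. It uses raw dot products as the similarities, whereas the common "scaled" variant also divides each score by the square root of the vector dimension. The names attend, keys, and values are my own labels.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    scores = [dot(query, k) for k in keys]   # similarities (linear algebra)
    weights = softmax(scores)                # probability distribution
    dim = len(values[0])
    # Weighted average of the value vectors, one coordinate at a time.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy inputs: the query lines up with the first key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The key most aligned with the query gets the largest weight, so the output is pulled toward that key's value vector.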

Worth holding onto: if you can comfortably compute a dot product, take a log, and read a probability table, you have the equipment to follow every derivation in this book. Chapter 3 is the chapter where I make sure of that.

What Chapter 3 sets up

The end of Chapter 3 is the end of Part I. The vocabulary is set, the object is named, the toolkit is sharpened, and the bench is clear. From here we move into the heart of the book: The Mathematics of Transformers.


Next — Chapter 4: Attention — The Core Mechanism. The first chapter of Part II, and the chapter the rest of the book builds on. We'll derive self-attention from intuition, look at the geometry of queries-keys-values, unpack multi-head structure and softmax, and end with a surprising perspective: attention as a kernel method.

Want the full picture? The book unpacks every move in this chapter with a worked example and a diagram, and keeps a math cheat sheet in the back appendix so you never have to flip far for a symbol. View LLM Primer II on Amazon →

SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.