Chapter 3 — Mathematical Tools for Language Models
Third post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The final preparation chapter — the moment we lay the tools out on the bench, in the order we'll use them.
The last chapter before the math turns on
Chapter 3 is brisk. It assumes you've made it through Chapter 1's entropy and Chapter 2's vocabulary, and now it stocks the workbench. Two slices of mathematics — probability and linear algebra — get exactly as much depth as the rest of the book will need.
Importantly, that is less than a textbook on either subject would give you, and more than a popular article would. The book takes the time to derive the moves it will actually use. It does not take the time to derive the moves it never will.
3.1 Probability and statistics in language modeling
The first section sharpens the probability of Chapter 1 into the specific operations the rest of Book II depends on.
You meet conditional probability again, but this time written in the form a model is actually computing: the probability of token x_t given all the tokens before it, parameterized by a function with weights θ. Maximum likelihood estimation shows up — the principle behind training, written as one clean optimization problem before any optimizer touches it.
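Written out, that optimization problem looks roughly like the following. This is a sketch in conventional notation rather than the book's own typesetting: the shorthand x_{<t} for "all the tokens before x_t", the sequence length T, and the hat on θ are assumptions made here for illustration.

```latex
% Chain rule: the probability of a whole sequence factors into next-token predictions.
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

% Maximum likelihood: pick the weights that make the observed text most probable.
% Taking the log turns the product into a sum without moving the maximizer.
\hat{\theta} = \arg\max_{\theta} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

The log matters in practice as well as on paper: a product of many small probabilities underflows quickly, while a sum of logs stays well behaved.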
The chapter then walks you through cross-entropy as the loss function. Not as a definition handed down from the sky, but as the natural consequence of asking: "if my model assigns probability p̂ to what actually happened, how surprised was it?" Measure that surprise as −log p̂, average it across the training data, and you've derived the loss yourself.
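To make "average surprise" concrete, here is a minimal sketch in plain Python. The three probabilities are invented for illustration and do not come from the book; the function names are hypothetical too.

```python
import math

def surprise(p_hat):
    """Surprise, in nats, of an outcome the model gave probability p_hat."""
    return -math.log(p_hat)

def cross_entropy(p_hats):
    """Average surprise over the tokens that actually occurred."""
    return sum(surprise(p) for p in p_hats) / len(p_hats)

# Hypothetical probabilities a model assigned to three observed tokens.
p_hats = [0.5, 0.25, 0.125]
loss = cross_entropy(p_hats)

# A confident, correct model is barely surprised; an unsure one is surprised a lot.
print(surprise(0.99), surprise(0.01))
print(loss)
```

Minimizing this average is the same thing as maximizing the average log-probability of the data, which is why cross-entropy and maximum likelihood end up being one objective seen from two sides.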
By the end of 3.1, the training objective is no longer a black box. It's a single line of probability theory.
3.2 Vector spaces, embeddings, and linear algebra intuition
The second section opens the geometric side of the book. Words, tokens, sentences, even attention scores — all of them eventually become points and directions in high-dimensional vector spaces. Section 3.2 introduces the linear algebra that makes that geometry usable.
The chapter takes a deliberate shortcut: it teaches by intuition first. A vector is a list of numbers. A vector space is "all the lists of numbers of a given length, plus the rules for adding and scaling them." Dot products measure alignment. Norms measure length. Matrix multiplication is a particularly orderly way of computing many dot products at once.
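Those three ideas fit in a few lines of plain Python. This is a sketch with made-up numbers, written with lists rather than a numerical library so that every step stays visible; the helper names are illustrative.

```python
import math

def dot(u, v):
    """Alignment of two vectors: multiply matching entries, then add."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of a vector: the square root of its dot product with itself."""
    return math.sqrt(dot(v, v))

def matmul(A, B):
    """Matrix product: entry (i, j) is the dot product of row i of A and column j of B."""
    columns = list(zip(*B))
    return [[dot(row, col) for col in columns] for row in A]

u = [3.0, 4.0]
v = [4.0, 3.0]
print(dot(u, v))   # alignment of u and v
print(norm(u))     # length of u

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))  # four dot products, arranged in a grid
```

Note that `matmul` contains no arithmetic of its own. It only decides which rows meet which columns and hands the work to `dot`.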
Then it shows what each of those means for language. Two word vectors are similar when their dot product is large relative to their lengths, which is what cosine similarity measures. The famous "king − man + woman ≈ queen" example is unpacked carefully, as one of several legitimate but often overhyped consequences of training embeddings on enough text.
The worked example here is small: a four-word vocabulary, a hand-built embedding table, and a computation you can do on paper to see why "cat" and "dog" come out close while "cat" and "concrete" come out far. The same shape, scaled up, is what's running inside every embedding layer in every modern LLM.
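A computation in that spirit can be sketched as follows. To be clear about what is assumed: the embedding values below are invented for this post and are not the book's table, the fourth word ("brick") is a hypothetical stand-in, and cosine similarity is used as the closeness measure.

```python
import math

# Hypothetical hand-built embedding table: three made-up dimensions per word.
embeddings = {
    "cat":      [0.9, 0.8, 0.1],
    "dog":      [0.8, 0.9, 0.1],
    "concrete": [0.1, 0.0, 0.9],
    "brick":    [0.0, 0.1, 0.8],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by both lengths, so only direction matters."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

cat_dog = cosine(embeddings["cat"], embeddings["dog"])
cat_concrete = cosine(embeddings["cat"], embeddings["concrete"])

print(f"cat vs dog:      {cat_dog:.3f}")
print(f"cat vs concrete: {cat_concrete:.3f}")
```

With these numbers, "cat" and "dog" point in almost the same direction while "cat" and "concrete" are close to perpendicular. A trained embedding layer learns its table from data instead of having it typed in, but the lookup and the comparison work the same way.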
The bridge to Part II
Chapter 3 ends by joining the two halves of itself. Probabilities are sums and products and logs. Embeddings are dot products and matrix multiplications. Attention, which arrives in the very next chapter, is exactly what you get when you let these two languages run into each other inside a single layer.
Specifically: attention takes vectors (the linear algebra), computes their similarities, runs them through a softmax to turn similarities into a probability distribution (the probability), and uses that distribution to take a weighted average of other vectors. Three of these moves were just introduced. Attention is the assembly.
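Here is that assembly as a minimal single-query sketch in plain Python. It follows the description above literally and makes several assumptions: the names `query`, `keys`, and `values` anticipate Chapter 4's vocabulary, the numbers are invented, and the division by the square root of the dimension used in the standard Transformer formulation is left out to keep the steps visible.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    scores = [dot(query, k) for k in keys]   # similarities (linear algebra)
    weights = softmax(scores)                # distribution (probability)
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, output                   # output is a weighted average of the values

# Hypothetical toy inputs: the query lines up with the first key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The first key matches the query better, so it gets the larger weight, and the output lands closer to the first value vector than to the second.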
What Chapter 3 sets up
The end of Chapter 3 is the end of Part I. The vocabulary is set, the object is named, the toolkit is sharpened, and the bench is clear. From here we move into the heart of the book: The Mathematics of Transformers.
Next — Chapter 4: Attention — The Core Mechanism. The first chapter of Part II, and the chapter the rest of the book builds on. We'll derive self-attention from intuition, look at the geometry of queries-keys-values, unpack multi-head structure and softmax, and end with a surprising perspective: attention as a kernel method.