Chapter 8 — How Models Learn
Eighth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The chapter where the book gets honest about what's still mysterious.
The chapter that begins Part III
Up to here, Book II has been building a clean machine: probability, attention, transformer blocks, efficiency. Chapter 8 is where the math gets uncomfortable. We turn from "how is this object computed?" to "why does training it work at all?" The honest answer is that, even now, several pieces of that story are not fully understood.
This is one of my favorite chapters in the book to teach. It is also one of the most humbling.
8.1 Generalization in over-parameterized models
Classical statistical learning theory has a clear story: a model with more parameters than training examples will overfit. It will memorize the training set and fail on new data. This was the bedrock of the field for decades.
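Modern LLMs have hundreds of billions of parameters and are trained on a few trillion tokens of text. By raw count the tokens outnumber the parameters, but tokens within a document are far from independent examples, and models at this scale have enough capacity to memorize large parts of their training data. Classical theory gives little reason to expect them to generalize. And yet they do. Often, they generalize better as they grow larger.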
Section 8.1 lays out this puzzle and walks through the modern attempts to explain it. The double descent curve gets a clean derivation: the surprising observation that, as model size grows, test error falls, rises to a peak near the interpolation threshold (the point where the model can just fit the training set exactly), and then falls again beyond it. So does the neural tangent kernel perspective, which approximates very wide networks as kernel methods (and reconnects, satisfyingly, with the kernel view of attention from Chapter 4).
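To make the shape of the curve concrete, here is a minimal sketch of double descent in a toy setting. It is not the book's derivation: it assumes a random-feature regression (a fixed random ReLU layer with a least-squares readout), and every name and constant in it (random_feature_test_error, sweep, the sample sizes, the noise level) is hypothetical. In setups like this, the test error typically spikes when the number of features is close to the number of training examples and comes back down as the model gets much wider.

```python
import numpy as np


def random_feature_test_error(n_features, n_train=40, n_test=500, dim=10,
                              noise=0.5, seed=0):
    """Test MSE of a least-squares readout on random ReLU features."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=dim) / np.sqrt(dim)   # hypothetical linear target
    x_train = rng.normal(size=(n_train, dim))
    x_test = rng.normal(size=(n_test, dim))
    y_train = x_train @ beta + noise * rng.normal(size=n_train)
    y_test = x_test @ beta                        # noiseless test targets

    # Fixed random first layer; only the linear readout is fitted.
    w = rng.normal(size=(dim, n_features)) / np.sqrt(dim)
    phi_train = np.maximum(x_train @ w, 0.0)
    phi_test = np.maximum(x_test @ w, 0.0)

    # pinv gives the ordinary least-squares fit when n_features < n_train
    # and the minimum-norm interpolating fit when n_features > n_train.
    coef = np.linalg.pinv(phi_train) @ y_train
    return float(np.mean((phi_test @ coef - y_test) ** 2))


def sweep(widths, n_trials=20):
    """Mean test error over several seeds for each feature count."""
    return {p: float(np.mean([random_feature_test_error(p, seed=s)
                              for s in range(n_trials)]))
            for p in widths}


if __name__ == "__main__":
    # The interpolation threshold in this toy is n_features == n_train == 40.
    for p, err in sweep([5, 10, 20, 40, 80, 400]).items():
        print(f"features={p:4d}  mean test MSE={err:.3f}")
```

The sweep averages over seeds because the error right at the threshold is heavy-tailed: a single run can land on an unusually well-conditioned or badly conditioned feature matrix.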
8.2 Implicit bias of gradient-based optimization
Section 8.2 introduces a piece of the answer. When you train a neural network with stochastic gradient descent, you are not just minimizing the loss — you are minimizing it in a particular way, following a particular path through the loss landscape. And that path is biased.
The chapter shows that gradient descent has implicit preferences. Among all the parameter settings that fit the training data equally well, gradient descent tends to find ones with smaller weight norms, or flatter minima, or other properties that correlate with generalization. This is "implicit regularization" — the optimizer is helping you in ways you didn't ask for, and the math is starting to unpack why.
The section walks through the simplest case where implicit bias has been proven cleanly — linear models trained with logistic loss — and shows how the intuition extends to deep networks.
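A small numerical sketch can make "implicit preference" tangible. Note that it swaps in an even simpler setting than the logistic-loss case the section covers: squared loss on an under-determined linear model, where gradient descent started from zero is known to converge to the minimum-norm solution among all exact fits. The function name and the problem sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def gradient_descent_least_squares(x, y, steps=2000):
    """Plain gradient descent on 0.5 * ||x @ w - y||^2, starting from w = 0."""
    w = np.zeros(x.shape[1])
    # Step size 1 / sigma_max^2 keeps the iteration stable for this loss.
    lr = 1.0 / np.linalg.norm(x, 2) ** 2
    for _ in range(steps):
        w -= lr * x.T @ (x @ w - y)
    return w


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 50))        # 5 examples, 50 parameters
y = rng.normal(size=5)

w_gd = gradient_descent_least_squares(x, y)
w_min_norm = np.linalg.pinv(x) @ y  # minimum-norm interpolating solution

# Another exact fit: add a component from the null space of x.
null_dir = rng.normal(size=50)
null_dir -= np.linalg.pinv(x) @ (x @ null_dir)  # remove the row-space part
w_other = w_min_norm + null_dir

if __name__ == "__main__":
    print("gap to min-norm solution:", np.linalg.norm(w_gd - w_min_norm))
    print("norm found by GD:        ", np.linalg.norm(w_gd))
    print("norm of the other fit:   ", np.linalg.norm(w_other))
```

Both w_gd and w_other fit the five training points exactly, but nothing in the loss distinguishes them. The preference for the smaller-norm one comes entirely from the optimizer and its starting point, which is the sense in which the bias is "implicit".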
8.3 Scaling laws: data, parameters, and compute
Section 8.3 introduces the most practically important set of empirical results in the field. The scaling laws — Kaplan et al. in 2020, then Chinchilla in 2022 — describe how a transformer's loss decreases as you grow its parameters, its training data, and its compute budget. They describe it accurately enough to predict what a model not yet trained will achieve.
The chapter derives the form of these laws (power-law decay in each of the three variables) and shows what the empirical exponents imply. The famous Chinchilla result: as the compute budget grows, parameters and training tokens should grow at roughly the same rate. Earlier large models had been too large for the amount of data they saw, and so undertrained. The math made the correction obvious.
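The allocation argument can be sketched in a few lines. The sketch assumes the parametric form L(N, D) = E + A / N^alpha + B / D^beta for loss as a function of parameters N and tokens D, together with the common approximation that training compute is about 6 * N * D FLOPs. The constants below are illustrative placeholders, not the fitted values from either paper; they are chosen with alpha equal to beta so that the "same rate" conclusion falls out exactly.

```python
def loss(n_params, n_tokens, e=1.7, a=400.0, b=400.0, alpha=0.3, beta=0.3):
    """Assumed form: irreducible term plus one power law per variable."""
    return e + a / n_params**alpha + b / n_tokens**beta


def compute_optimal(compute, a=400.0, b=400.0, alpha=0.3, beta=0.3,
                    flops_per_param_token=6.0):
    """Minimize the loss above subject to compute = 6 * N * D.

    Substituting D = k / N with k = compute / 6 and setting the derivative
    with respect to N to zero gives
        N_opt = (alpha * a / (beta * b))**(1 / (alpha + beta))
                * k**(beta / (alpha + beta)).
    """
    k = compute / flops_per_param_token
    n_opt = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta)) \
        * k ** (beta / (alpha + beta))
    d_opt = k / n_opt
    return n_opt, d_opt


if __name__ == "__main__":
    for c in (1e21, 1e23):
        n, d = compute_optimal(c)
        print(f"C={c:.0e}  N={n:.3e}  D={d:.3e}  loss={loss(n, d):.4f}")
```

With these placeholder exponents, a 100x larger budget multiplies both the optimal parameter count and the optimal token count by 10. With unequal exponents the split tilts toward whichever variable has the smaller exponent, which is why the empirical values matter so much.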
The section is careful about one thing. Scaling laws describe loss, not capability. The relationship between low loss and the kinds of behaviors we care about — reasoning, instruction-following, code generation — is empirical and noisier than the loss curves themselves.
8.4 Open mathematical questions in LLM theory
The chapter closes with a clear-eyed list of what is not understood. Why do scaling laws have the exponents they have? When and why do new capabilities "emerge" with scale? What is the right notion of generalization for autoregressive language models? Why does in-context learning work? Why does fine-tuning a tiny number of parameters (LoRA) capture so much of what full fine-tuning does?
The chapter does not pretend to answer these. It says, here are the questions, here is what current theory can say, and here is what it cannot. For me, this is the section that makes Book II feel different from a textbook. It treats you as someone who deserves to know where the open ground is.
What Chapter 8 sets up
You finish Chapter 8 with a much sharper picture of training as a mathematical process — and with a list of honest questions the field is still answering. From here we turn from the theory of training to its engineering: how training is actually carried out at the scale of frontier models.
Next — Chapter 9: Training at Scale. The companion chapter to Chapter 8. How data preprocessing quietly shapes everything that follows. The mathematics of mini-batch learning, parallelism, and efficiency. And the question that turns out to be unexpectedly subtle: how do you keep a training run numerically stable across thousands of GPUs?