Chapter 8 — How Models Learn

Eighth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The chapter where the book gets honest about what's still mysterious.

The chapter that begins Part III

Up to here, Book II has been building a clean machine: probability, attention, transformer blocks, efficiency. Chapter 8 is where the math gets uncomfortable. We turn from "how is this object computed" to "why does training it work at all?" — and the honest answer is that, even now, several pieces of that story are not fully understood.

This is one of my favorite chapters in the book to teach. It is also one of the most humbling.

8.1 Generalization in over-parameterized models

Classical statistical learning theory has a clear story: a model with more parameters than training examples will overfit. It will memorize the training set and fail on new data. This was the bedrock of the field for decades.

Modern LLMs have hundreds of billions of parameters and are trained on a few trillion tokens of text. By raw count the tokens still outnumber the parameters, but these models have more than enough capacity to memorize large parts of their training data, which is exactly the regime where the classical picture predicts trouble. And yet they generalize. Often, they generalize better as they grow larger.

Section 8.1 lays out this puzzle and walks through the modern attempts to explain it. The double descent curve gets a clean derivation: test error first falls as model size grows, rises to a peak near the interpolation threshold (the point where the model can fit the training data exactly), and then falls again as the model grows past it. The neural tangent kernel perspective gets the same treatment; it approximates very wide networks as kernel methods (and reconnects, satisfyingly, with the kernel view of attention from Chapter 4).
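To make the shape of that curve concrete, here is a minimal sketch of double descent in a toy setting. It is not the book's derivation: I am assuming plain linear regression with Gaussian features, a true signal whose coefficients decay like 1/j, and a model that only sees the first p features. The function name and every constant are my own choices for illustration.

```python
import numpy as np

def test_risk_curve(n_train=40, n_features=200, noise_std=0.5,
                    widths=(1, 5, 20, 40, 80, 200), n_trials=20, seed=0):
    """Average test MSE of least squares that uses only the first p features."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: coefficients decay like 1/j, scaled to unit norm.
    beta = 1.0 / np.arange(1, n_features + 1)
    beta /= np.linalg.norm(beta)
    risks = {p: [] for p in widths}
    for _ in range(n_trials):
        X = rng.normal(size=(n_train, n_features))
        y = X @ beta + noise_std * rng.normal(size=n_train)
        for p in widths:
            # lstsq gives the ordinary least-squares fit when p < n_train and
            # the minimum-norm interpolating solution when p > n_train.
            beta_hat, *_ = np.linalg.lstsq(X[:, :p], y, rcond=None)
            # Features are i.i.d. standard normal, so the expected squared
            # test error has a closed form and no test set is needed.
            risk = (np.sum((beta[:p] - beta_hat) ** 2)
                    + np.sum(beta[p:] ** 2) + noise_std ** 2)
            risks[p].append(risk)
    return {p: float(np.mean(r)) for p, r in risks.items()}

curve = test_risk_curve()
for p, risk in curve.items():
    print(f"p = {p:3d}  test MSE = {risk:.3f}")
```

With 40 training points, p = 40 is the interpolation threshold in this sketch. The printed error should dip for small p, spike around p = 40, and come back down for larger p. The second descent here does not beat the first minimum, which is a property of this particular toy problem and not a general claim.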

One line: classical learning theory predicts that big models should overfit. They don't. Why they don't is one of the most active open questions in machine learning.

8.2 Implicit bias of gradient-based optimization

Section 8.2 introduces a piece of the answer. When you train a neural network with stochastic gradient descent, you are not just minimizing the loss — you are minimizing it in a particular way, following a particular path through the loss landscape. And that path is biased.

The chapter shows that gradient descent has implicit preferences. Among all the parameter settings that fit the training data equally well, gradient descent tends to find ones with smaller weight norms, or flatter minima, or other properties that correlate with generalization. This is "implicit regularization" — the optimizer is helping you in ways you didn't ask for, and the math is starting to unpack why.
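The "smaller weight norms" preference can be seen exactly in a case much simpler than anything in an LLM. The sketch below is my own illustration, not the book's example: it assumes an underdetermined least-squares problem (more unknowns than equations) and plain gradient descent started from zero. Under those assumptions, gradient descent lands on the minimum-norm solution among the infinitely many that fit the data perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 100          # more unknowns than equations
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Plain gradient descent on 0.5 * ||X w - y||^2, started at w = 0.
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
w = np.zeros(n_features)
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)

# The minimum-Euclidean-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another perfect fit: add a direction that X maps to zero.
null_dir = rng.normal(size=n_features)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)   # remove the row-space part
w_other = w_min_norm + null_dir

print("train residual (GD):  ", np.linalg.norm(X @ w - y))
print("distance to min-norm: ", np.linalg.norm(w - w_min_norm))
print("norms (GD, other fit):", np.linalg.norm(w), np.linalg.norm(w_other))
```

Nothing in the loss asks for a small norm. The preference comes from the start point and the update rule: every gradient step lies in the row space of X, so an iterate that starts at zero never picks up a null-space component.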

The section walks through the simplest case where implicit bias has been proven cleanly — linear models trained with logistic loss — and shows how the intuition extends to deep networks.
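For reference, the result usually stated for this setting looks like the following. This is my paraphrase of the standard statement, not a quotation from the book, and it assumes linearly separable data with labels y_i in {-1, +1} and a sufficiently small step size.

```latex
% Gradient descent on the logistic loss
L(w) = \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i\, w^\top x_i}\bigr),
\qquad
w_{t+1} = w_t - \eta\, \nabla L(w_t)

% On separable data the loss has no finite minimizer, so \|w_t\| \to \infty,
% but the direction of w_t converges to the max-margin separator:
\lim_{t \to \infty} \frac{w_t}{\|w_t\|} = \frac{\hat{w}}{\|\hat{w}\|},
\qquad
\hat{w} = \arg\min_{w} \|w\|_2^2
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 \ \ \text{for all } i
```

In words: the loss never mentions margins, yet the optimizer ends up pointing in the same direction as the hard-margin SVM solution.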

8.3 Scaling laws: data, parameters, and compute

Section 8.3 introduces the most practically important set of empirical results in the field. The scaling laws — Kaplan et al. in 2020, then Chinchilla in 2022 — describe how a transformer's loss decreases as you grow its parameters, its training data, and its compute budget. They describe it accurately enough to predict what a model not yet trained will achieve.

The chapter derives the form of these laws, a power-law decay in each of the three variables, and shows what the empirical exponents imply. The famous Chinchilla result: as the compute budget grows, parameters and training tokens should be scaled up at roughly the same rate. Many earlier large models had too many parameters for the number of tokens they saw; they were undertrained. The math made the correction obvious.
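Here is a small sketch of the compute-optimal argument. It assumes the commonly used parametric form L(N, D) = E + A / N^alpha + B / D^beta for N parameters and D tokens, plus the rule of thumb that training costs about 6 * N * D FLOPs. The constants below are placeholders I picked for illustration, not fitted values from the book or from any paper.

```python
def loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Assumed parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

def compute_optimal(flops, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Minimize L(N, D) subject to flops = 6 * N * D, in closed form.

    Substituting D = flops / (6 N) and setting dL/dN = 0 gives
    N = g * (flops / 6) ** (beta / (alpha + beta)),
    with g = (alpha * A / (beta * B)) ** (1 / (alpha + beta)).
    """
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_params = g * (flops / 6.0) ** (beta / (alpha + beta))
    n_tokens = (flops / 6.0) / n_params
    return n_params, n_tokens

for flops in (1e21, 1e23):
    n, d = compute_optimal(flops)
    print(f"C = {flops:.0e}: N = {n:.3e}, D = {d:.3e}, loss = {loss(n, d):.3f}")
```

When alpha and beta are close, both exponents beta / (alpha + beta) and alpha / (alpha + beta) sit near 0.5, which is the "grow both at roughly the same rate" conclusion. Spending the same budget on a model twice as large with half the tokens, or the reverse, gives a higher loss under this form.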

The section is careful about one thing. Scaling laws describe loss, not capability. The relationship between low loss and the kinds of behaviors we care about — reasoning, instruction-following, code generation — is empirical and noisier than the loss curves themselves.

8.4 Open mathematical questions in LLM theory

The chapter closes with a clear-eyed list of what is not understood. Why do scaling laws have the exponents they have? When and why do new capabilities "emerge" with scale? What is the right notion of generalization for autoregressive language models? Why does in-context learning work? Why does fine-tuning a tiny number of parameters (LoRA) capture so much of what full fine-tuning does?
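To give a sense of what "a tiny number of parameters" means in the LoRA question, here is a minimal NumPy sketch of a low-rank adapter on a single weight matrix. The layer sizes and rank are arbitrary choices of mine, there is no training loop, and I have left out the scaling factor that practical implementations usually apply to the update.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))             # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init

def forward(x):
    """Apply the adapted layer (W + B A) x without forming W + B A."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
print("output unchanged at init:", np.allclose(forward(x), W @ x))
print(f"trainable fraction: {lora_params / full_params:.4f}")
```

The trainable count is rank * (d_in + d_out) against d_in * d_out for the full matrix, so the fraction shrinks further as the layer gets wider. Why an update restricted this severely recovers so much of full fine-tuning is the open question.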

The chapter does not pretend to answer these. It says: here are the questions, here is what current theory can say, and here is what it cannot. For me, this is the section that makes Book II feel different from a textbook. It treats you as someone who deserves to know where the open ground is.

Worth holding onto: Book II is rigorous because rigor is where understanding lives. But rigor about a partly understood phenomenon also means being precise about which pieces are not understood. Chapter 8 is that precision.

What Chapter 8 sets up

You finish Chapter 8 with a much sharper picture of training as a mathematical process — and with a list of honest questions the field is still answering. From here we turn from the theory of training to its engineering: how training is actually carried out at the scale of frontier models.

Next — Chapter 9: Training at Scale. The companion chapter to Chapter 8. How data preprocessing quietly shapes everything that follows. The mathematics of mini-batch learning, parallelism, and efficiency. And the question that turns out to be unexpectedly subtle: how do you keep a training run numerically stable across thousands of GPUs?

Want the full picture? The book derives double descent with a clean small-model example, walks through the Chinchilla compute-optimal scaling argument in full, and closes Chapter 8 with a list of open problems that doubles as a research roadmap.