Chapter 9 — Training at Scale
Ninth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The companion to Chapter 8 — and the chapter where the math meets the engineering of a real training run.
Where Chapter 8 left us
Chapter 8 told the theory side of training: why it works, where it's mysterious, what scaling laws predict. Chapter 9 is the part most papers gloss over and most engineers spend their careers on — what actually happens inside the cluster while a model is being trained.
The chapter has three sections, and each one is about a kind of math that's invisible from a distance and load-bearing up close.
9.1 Data preprocessing and its mathematical consequences
Section 9.1 opens with a quiet claim: the choices made in data preprocessing affect the final model at least as much as the choice of architecture. Most discussions skip past this, treating preprocessing as plumbing. Chapter 9 treats it as math.
The chapter walks through the major decisions and what each one mathematically implies. Tokenization — byte-pair encoding, SentencePiece, the choice of vocabulary size — shapes the distribution the model is trying to learn. Deduplication changes which patterns are reinforced. Filtering by quality changes the support of the training distribution. Mixing data sources in different ratios changes which distributions the model converges toward.
The section gives one extended example: the difference between training on raw web text and training on deduplicated, quality-filtered text. The same number of tokens. The same architecture. Two very different models at the end.
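To make the "same tokens, different distribution" point concrete, here is a minimal sketch of exact deduplication plus a crude length filter. It is not the book's pipeline: the helper names, the toy documents, and the `min_words` threshold are all assumptions chosen for illustration, and real systems typically use fuzzy deduplication and learned quality classifiers instead.

```python
import hashlib
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def dedup_and_filter(docs, min_words=4):
    """Drop exact duplicates (after normalization), then drop very short docs.

    min_words is a hypothetical stand-in for a real quality filter.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate: would otherwise reinforce the same pattern
        seen.add(key)
        if len(doc.split()) < min_words:
            continue  # filtered out: removed from the distribution's support
        kept.append(doc)
    return kept


def unigram_distribution(docs):
    """Empirical word distribution, a toy proxy for what the model sees."""
    counts = Counter(word for doc in docs for word in normalize(doc).split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy corpus (invented for this sketch): boilerplate repeated three times.
raw_docs = [
    "Click here to subscribe now",
    "Click here to subscribe now",
    "click here  to subscribe now",
    "Gradient descent updates parameters along the negative gradient",
    "buy now",
]

clean_docs = dedup_and_filter(raw_docs)
raw_p = unigram_distribution(raw_docs)
clean_p = unigram_distribution(clean_docs)

print("docs kept:", len(clean_docs), "of", len(raw_docs))
print("P('click') raw vs clean:", raw_p["click"], clean_p["click"])
print("'buy' still in support:", "buy" in clean_p)
```

Deduplication lowers the probability mass on the repeated boilerplate, and the filter removes "buy" from the support entirely. Those are the two effects the section describes, just at toy scale.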
9.2 Mini-batch learning, parallelism, and efficiency
Section 9.2 is about how you train when "one example at a time" is too slow and "the whole dataset at once" doesn't fit anywhere. The answer is mini-batches — and the math of mini-batches is more interesting than it looks.
The chapter derives the relationship between batch size and gradient noise: for a batch of B independently sampled examples, the variance of the gradient estimate falls roughly as 1/B. Smaller batches give noisier gradient estimates, which (counterintuitively) can help generalization through a kind of implicit regularization. Larger batches give cleaner gradients and let you use higher learning rates, up to a point. There is a critical batch size beyond which returns diminish: each step costs more compute but no longer reduces the number of steps by much, so the extra parallelism is mostly wasted. The section walks through this tradeoff with the math made explicit.
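The batch-size and noise relationship is easy to check empirically. The sketch below is an illustration under stated assumptions, not the book's derivation: it uses a one-parameter least-squares model, synthetic data, and hypothetical helper names, and it measures how the variance of the mini-batch gradient shrinks as the batch grows.

```python
import random
import statistics


def per_example_gradient(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x


def minibatch_gradient(w, data, batch_size, rng):
    """Average gradient over one randomly drawn mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(per_example_gradient(w, x, y) for x, y in batch) / batch_size


def gradient_variance(w, data, batch_size, trials, rng):
    """Variance of the mini-batch gradient estimate across many draws."""
    estimates = [minibatch_gradient(w, data, batch_size, rng) for _ in range(trials)]
    return statistics.pvariance(estimates)


rng = random.Random(0)

# Synthetic data (assumption): y = 3x + noise.
data = []
for _ in range(10_000):
    x = rng.gauss(0.0, 1.0)
    y = 3.0 * x + rng.gauss(0.0, 1.0)
    data.append((x, y))

variances = {
    b: gradient_variance(0.0, data, b, trials=2000, rng=rng) for b in (4, 16, 64)
}

for b, v in variances.items():
    # If variance scales as 1/B, then B * variance stays roughly constant.
    print(f"B={b:3d}  variance={v:.4f}  B*variance={v * b:.3f}")
```

Quadrupling the batch size should cut the variance by roughly a factor of four, so the last column should stay roughly flat. What this toy cannot show is the critical batch size itself, which depends on how the noise compares to the true gradient over the course of training.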
From there, the chapter steps up to data parallelism, model parallelism, pipeline parallelism, and tensor parallelism — the four primary axes along which a real training run is split across GPUs. Each one has a clean mathematical description. Each one has a non-obvious cost. The combination, in modern training systems, is called "3D parallelism" or "4D parallelism" and gets a full diagram in the book.
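Two of these axes can be sketched in a few lines of plain Python. This is a single-process simulation with hypothetical function names, not a distributed implementation: the "workers" and "devices" are just list slices, and the averaging and concatenation steps stand in for the collective communication a real system would perform.

```python
def mean_gradient(w, examples):
    """Mean gradient of 0.5 * (w*x - y)**2 over a list of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)


def data_parallel_gradient(w, batch, num_workers):
    """Data parallelism: each worker gets a shard, gradients are averaged."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must divide evenly across workers")
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    local_grads = [mean_gradient(w, shard) for shard in shards]
    # Stand-in for an all-reduce that averages gradients across workers.
    return sum(local_grads) / num_workers


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_parallel_matmul(x, w, num_devices):
    """Tensor parallelism: split W by columns, multiply, then concatenate."""
    num_cols = len(w[0])
    if num_cols % num_devices != 0:
        raise ValueError("columns must divide evenly across devices")
    width = num_cols // num_devices
    partial_outputs = []
    for d in range(num_devices):
        w_shard = [row[d * width:(d + 1) * width] for row in w]
        partial_outputs.append(matmul(x, w_shard))
    # Stand-in for an all-gather along the column dimension.
    return [[v for out in partial_outputs for v in out[i]] for i in range(len(x))]


batch = [(float(i), 2.0 * i + 1.0) for i in range(1, 9)]
full_grad = mean_gradient(0.5, batch)
parallel_grad = data_parallel_gradient(0.5, batch, num_workers=4)
print("gradients match:", abs(full_grad - parallel_grad) < 1e-9)

x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]
print("outputs match:", column_parallel_matmul(x, w, num_devices=2) == matmul(x, w))
```

The math is exact in both cases, which is the clean description. The non-obvious cost hides in the two commented lines: every averaged gradient and every concatenated activation has to cross the network between devices.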
9.3 Numerical precision, stability, and large-scale optimization
Section 9.3 is the most technical of the three, and the most quietly important. Modern training runs use mixed precision — most computation happens in bfloat16 or float16, with selected operations promoted to float32. The reasons are clean: lower precision is faster, fits more in memory, and uses less power. The risks are clean too: lower precision can lose enough information that training silently diverges.
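The chapter walks through the mathematics of floating point representation, why bfloat16 (8 exponent bits, 7 mantissa bits) tends to be safer than float16 (5 exponent bits, 10 mantissa bits) even though it is less precise, and the specific operations (softmax, layer normalization, loss accumulation) where precision matters enough to keep float32. The wider exponent gives bfloat16 roughly the dynamic range of float32, so it is far less prone to overflow and underflow.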
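The range-versus-precision tradeoff can be seen directly with Python's standard library. The sketch below rounds values to float16 using the `struct` module's half-precision format, and simulates bfloat16 by keeping the top 16 bits of a float32 with round-to-nearest-even. The helper names are assumptions, and the bfloat16 helper ignores NaN handling for brevity.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE float16; values too large to represent become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bfloat16(x: float) -> float:
    """Simulate bfloat16: keep the top 16 bits of the float32 encoding.

    Uses round-to-nearest-even on the discarded bits. NaN inputs are not
    handled specially (a simplification for this sketch).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


# Range: float16 tops out at 65504, bfloat16 keeps float32's exponent range.
big = 70000.0
print("float16 :", to_float16(big))
print("bfloat16:", to_bfloat16(big))

# Precision: float16 resolves steps of 2**-10 near 1.0, bfloat16 only 2**-7.
small_step = 1.001
print("float16 :", to_float16(small_step))
print("bfloat16:", to_bfloat16(small_step))
```

In this sketch float16 overflows to infinity on the large value while bfloat16 survives with a rounding error, and the roles reverse on the small step: float16 keeps it, bfloat16 rounds it away to exactly 1.0.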
The section also covers gradient scaling, loss scaling, and the optimizer states. Adam keeps two moving averages per parameter (running estimates of the gradient's first and second moments); in mixed precision those moving averages can underflow, and modern systems have to handle that carefully. A real training run that doesn't think about these things will, on a long enough timeline, simply break.
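A small numerical sketch shows why loss scaling exists. It assumes a float16 gradient buffer, simulated here with the `struct` module's half-precision format; the gradient value and the scale factor of 65536 are illustrative choices, not values from the book.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE float16; values too large to represent become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


true_grad = 1e-8  # hypothetical small gradient, below float16's smallest subnormal

# Without scaling, the gradient is flushed to zero when stored in float16.
unscaled = to_float16(true_grad)

# With loss scaling, the loss (and so every gradient) is multiplied by a
# constant before the float16 store, then divided out in higher precision.
loss_scale = 65536.0
scaled = to_float16(true_grad * loss_scale)
recovered = scaled / loss_scale

# The same underflow threatens Adam's second moment, which tracks squared
# gradients: a gradient of 1e-4 has a square of about 1e-8.
second_moment_fp16 = to_float16((1e-4) ** 2)

print("stored without scaling:", unscaled)
print("recovered with scaling:", recovered)
print("squared gradient in float16:", second_moment_fp16)
```

The unscaled gradient and the squared gradient both vanish, while the scaled path recovers the original value to within float16's rounding error. This is one reason mixed-precision setups commonly keep optimizer states in float32.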
What Chapter 9 sets up
You finish Chapter 9 with a much clearer sense of what's actually happening during pretraining — and why the engineering and the math cannot be separated at this scale. That closes Part III. From here, the book pulls back to applications, limitations, and what's still ahead.
Next — Chapter 10: Post-Training and Alignment Mathematics. The chapter where a brilliant but feral next-word predictor is civilized into a helpful assistant — supervised fine-tuning, reward modeling, RLHF on a KL leash, and the elegant DPO derivation that collapses the whole pipeline into one supervised loss.