Chapter 9 — Training at Scale

This is the ninth post in the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. It is the companion to Chapter 8, and the chapter where the math meets the engineering of a real training run.

Where Chapter 8 left us

Chapter 8 covered the theory side of training: why it works, where it's mysterious, what scaling laws predict. Chapter 9 is the part most papers gloss over and most engineers spend their careers on: what actually happens inside the cluster while a model is being trained.

The chapter has three sections, and each one is about a kind of math that's invisible from a distance and load-bearing up close.

9.1 Data preprocessing and its mathematical consequences

Section 9.1 opens with a quiet claim: the choices made in data preprocessing affect the final model at least as much as the choice of architecture. Most discussions skip past this, treating preprocessing as plumbing. Chapter 9 treats it as math.

The chapter walks through the major decisions and what each one mathematically implies. Tokenization — byte-pair encoding, SentencePiece, the choice of vocabulary size — shapes the distribution the model is trying to learn. Deduplication changes which patterns are reinforced. Filtering by quality changes the support of the training distribution. Mixing data sources in different ratios changes which distributions the model converges toward.
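A toy illustration of the deduplication point, not taken from the book: the sketch below compares the empirical token distribution of a tiny made-up corpus before and after exact-match deduplication. The corpus, the whitespace "tokenizer", and the function names are all assumptions chosen for brevity.

```python
from collections import Counter

def unigram_distribution(docs):
    """Empirical distribution over whitespace-separated tokens."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def deduplicate(docs):
    """Exact-match deduplication that keeps the first copy of each document."""
    return list(dict.fromkeys(docs))

# Hypothetical corpus: one spam line repeated 8 times plus two ordinary lines.
raw = ["buy now"] * 8 + ["the cat sat", "the dog ran"]

p_raw = unigram_distribution(raw)
p_dedup = unigram_distribution(deduplicate(raw))

print(f"P('buy') raw:   {p_raw['buy']:.3f}")
print(f"P('buy') dedup: {p_dedup['buy']:.3f}")
print(f"P('the') raw:   {p_raw['the']:.3f}")
print(f"P('the') dedup: {p_dedup['the']:.3f}")
```

The model never sees "the data" in the abstract, only this empirical distribution, so removing duplicates shifts probability mass away from the repeated line and toward everything else. Real pipelines typically use near-duplicate detection rather than exact matching; exact matching is used here only to keep the sketch short.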

The section gives one extended example: the difference between training on raw web text and training on deduplicated, quality-filtered text. The same number of tokens. The same architecture. Two very different models at the end.

One line: the loss function is what the model optimizes; the preprocessed data distribution is what defines the problem. Both are choices. Most people only notice the first one.
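One way to write that down, using notation that is an assumption of this post rather than the book's own: the standard next-token objective is an expectation over the preprocessed data distribution, and a mix of sources is a weighted sum of per-source distributions.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{x \sim p_{\text{data}}}
    \left[ \sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right],
\qquad
p_{\text{data}} = \sum_{k} w_k \, p_k,
\quad w_k \ge 0, \quad \sum_{k} w_k = 1 .
```

Here p_k stands for the distribution of source k after tokenization, deduplication, and filtering, and w_k for its mixing weight. The form of the loss stays fixed; every preprocessing decision changes p_data, and with it the minimizer.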

9.2 Mini-batch learning, parallelism, and efficiency

Section 9.2 is about how you train when "one example at a time" is too slow and "the whole dataset at once" doesn't fit anywhere. The answer is mini-batches — and the math of mini-batches is more interesting than it looks.

The chapter derives the relationship between batch size and gradient noise. Smaller batches give noisier gradient estimates, which (counterintuitively) can help generalization through a kind of implicit regularization. Larger batches give cleaner gradients and let you use higher learning rates, up to a point. There is a critical batch size beyond which a larger batch barely reduces the number of optimization steps needed, so adding more parallelism mostly wastes compute. The section walks through this tradeoff with the math made explicit.
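The core scaling fact can be checked numerically. For independently sampled examples, the variance of a mini-batch gradient is the per-example variance divided by the batch size B. The sketch below is an illustration under assumed settings (a one-parameter linear model, synthetic data, hypothetical function names), not the book's derivation.

```python
import random
import statistics

def minibatch_gradient(data, w, batch_size, rng):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one random mini-batch."""
    batch = rng.sample(data, batch_size)  # sample without replacement
    return sum((w * x - y) * x for x, y in batch) / batch_size

def gradient_variance(data, w, batch_size, trials=2000, seed=0):
    """Empirical variance of the mini-batch gradient across many random batches."""
    rng = random.Random(seed)
    grads = [minibatch_gradient(data, w, batch_size, rng) for _ in range(trials)]
    return statistics.pvariance(grads)

# Synthetic data: y = 3x + noise (all values here are illustrative assumptions).
data_rng = random.Random(42)
data = []
for _ in range(10_000):
    x = data_rng.gauss(0.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 1.0)))

var_small = gradient_variance(data, w=0.0, batch_size=8)
var_large = gradient_variance(data, w=0.0, batch_size=128)
ratio = var_small / var_large  # expected to be near 128 / 8 = 16

print(f"B=8: {var_small:.4f}  B=128: {var_large:.4f}  ratio: {ratio:.1f}")
```

Going from B=8 to B=128 costs 16 times more computation per step and buys a roughly 16 times smaller gradient variance. Whether that trade is worth it depends on how large the noise is relative to the gradient itself, which is the question the critical batch size answers.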

From there, the chapter steps up to data parallelism, model parallelism, pipeline parallelism, and tensor parallelism, which it presents as the four primary axes along which a real training run is split across GPUs. (In common usage, "model parallelism" is often an umbrella term that covers the pipeline and tensor variants.) Each one has a clean mathematical description. Each one has a non-obvious cost. The combination, in modern training systems, is called "3D parallelism" or "4D parallelism" and gets a full diagram in the book.
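Two of those "clean mathematical descriptions" reduce to identities that fit in a few lines. The sketch below, with made-up numbers and hypothetical helper names, checks that averaging gradients over equal-sized shards reproduces the full-batch gradient (data parallelism), and that splitting a weight matrix by columns and concatenating the partial outputs reproduces the full matrix product (one form of tensor parallelism). There is no real communication here; plain Python lists stand in for devices.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def grad_mse(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over a batch, for scalar w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

# Data parallelism: two hypothetical workers each take an equal shard.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
shards = [batch[:2], batch[2:]]
full_grad = grad_mse(0.5, batch)
averaged_grad = sum(grad_mse(0.5, s) for s in shards) / len(shards)

# Tensor parallelism: each hypothetical worker holds half the columns of w.
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
partials = [matmul(x, block) for block in split_columns(w, 2)]
gathered = [sum((p[i] for p in partials), []) for i in range(len(x))]

print(full_grad, averaged_grad)
print(gathered == matmul(x, w))
```

The identities are exact; the non-obvious costs live in what the sketch leaves out. The gradient average becomes an all-reduce across workers, and the concatenation becomes an all-gather, both paid on every step.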

9.3 Numerical precision, stability, and large-scale optimization

Section 9.3 is the most technical of the three, and the most quietly important. Modern training runs use mixed precision — most computation happens in bfloat16 or float16, with selected operations promoted to float32. The reasons are clean: lower precision is faster, fits more in memory, and uses less power. The risks are clean too: lower precision can lose enough information that training silently diverges.

The chapter walks through the mathematics of floating-point representation, why bfloat16 (with its wider exponent: 8 exponent bits and 7 mantissa bits) tends to be safer than float16 (with its wider mantissa: 5 exponent bits and 10 mantissa bits), and the specific operations (softmax, layer normalization, loss accumulation) where precision matters enough to keep float32.
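The range-versus-precision tradeoff is easy to see with the standard library alone. In the sketch below, float16 conversion uses the `struct` module's half-precision format, and bfloat16 is simulated by keeping only the top 16 bits of a float32. That truncation is an assumption made for brevity; real hardware usually rounds to nearest. The helper names are hypothetical.

```python
import struct

def to_float16(x):
    """Round x to IEEE half precision; values beyond the range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

def to_bfloat16(x):
    """Simulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>9}  float16={to_float16(value)!r}  bfloat16={to_bfloat16(value)!r}")
```

A large activation (70000.0) overflows float16, whose largest finite value is 65504, but stays finite in bfloat16. A tiny gradient (1e-8) flushes to zero in float16 but survives in bfloat16. A small relative change (1.001) is kept by float16 but lost by the simulated bfloat16, which returns 1.0. Range failures tend to end a run; precision failures tend to add noise, which is one reading of why the wider exponent is the safer default.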

The section also covers gradient and loss scaling (multiplying the loss by a constant so that small gradients stay representable in low precision) and the optimizer states. Adam keeps two moving averages per parameter; in low precision the second of these, which is built from squared gradients, is especially prone to underflow, and modern systems have to handle that carefully. A real training run that doesn't think about these things will, on a long enough timeline, simply break.
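Both failure modes can be reproduced in a few lines. The sketch below is a toy, not a training loop: the gradient values and the scale factor of 2^16 are assumptions picked to make the effect visible, the helper name is hypothetical, and ordinary Python floats stand in for the higher-precision copy.

```python
import struct

def fp16(x):
    """Round x to IEEE half precision via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Loss scaling: a small gradient vanishes in float16 unless scaled first.
grad = 1e-8                       # hypothetical true gradient value
loss_scale = 2.0 ** 16            # hypothetical static loss scale

unscaled = fp16(grad)             # underflows to zero
scaled = fp16(grad * loss_scale)  # representable in float16
recovered = scaled / loss_scale   # unscale in higher precision

# Adam's second moment: the square of a small gradient underflows in float16.
g = 1e-4
v_low = fp16(fp16(g) ** 2)        # stored in float16: underflows to zero
v_high = g ** 2                   # stored in higher precision: survives

print(f"unscaled={unscaled!r}  recovered={recovered:.3e}")
print(f"v_low={v_low!r}  v_high={v_high:.3e}")
```

The gradient that would have been silently zeroed comes back to within a fraction of a percent once it is scaled before the cast and unscaled after. The second-moment example shows why optimizer states are commonly kept in float32 even when the forward and backward passes are not: a zero in the denominator term of an Adam update is exactly the kind of error that does not announce itself.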

Worth holding onto: training a frontier model is not "running the same loop bigger." Every increment of scale exposes a new failure mode. The math of Chapter 9 is the math of staying alive at that scale.

What Chapter 9 sets up

You finish Chapter 9 with a much clearer sense of what's actually happening during pretraining — and why the engineering and the math cannot be separated at this scale. That closes Part III. From here, the book pulls back to applications, limitations, and what's still ahead.

Next — Chapter 10: Post-Training and Alignment Mathematics. The chapter where a brilliant but feral next-word predictor is civilized into a helpful assistant — supervised fine-tuning, reward modeling, RLHF on a KL leash, and the elegant DPO derivation that collapses the whole pipeline into one supervised loss.

Want the full picture? The book walks through the precise math of each parallelism strategy, derives the critical batch size formula, and includes a clean explanation of mixed-precision training that has saved real engineers from real production incidents.