Chapter 5 — Training Large Models

This is Part 5 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we opened up the Transformer. Today we look at what it takes to actually fill in the billions of numerical knobs inside it — the process that turns a randomly-initialized architecture into a usable language model.

What "training" actually means

It's easy to skim over the word "training" and miss what it refers to. Training a large language model is the process of slowly adjusting every one of its parameters — billions of numbers — so that, on the training data, the model's next-token predictions get better and better.

The arithmetic of each individual adjustment is small. You feed in some text. The model predicts the next token. You compare the prediction to the actual next token. You compute a number that captures how wrong the prediction was (this is the loss). You compute how each parameter contributed to that wrongness (this is the gradient). You nudge each parameter a tiny amount in the direction that would have produced a better prediction.
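To make that loop concrete, here is a minimal sketch in plain Python. It assumes a toy "model" that is nothing more than one adjustable score per token in a three-token vocabulary; the vocabulary size, learning rate, and function names are made up for illustration and are not taken from the book.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(logits, target, lr=0.5):
    """One update: predict, measure the loss, compute the gradient, nudge."""
    probs = softmax(logits)              # the prediction
    loss = -math.log(probs[target])      # how wrong the prediction was
    # For softmax plus cross-entropy, the gradient for each score is
    # (probability - 1) for the correct token and (probability) otherwise.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - lr * g for x, g in zip(logits, grads)]  # the nudge
    return new_logits, loss

# Toy setup (assumed): the correct next token is always index 2.
logits = [0.0, 0.0, 0.0]
losses = []
for _ in range(20):
    logits, loss = training_step(logits, target=2)
    losses.append(loss)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss shrinks with every pass because each nudge raises the score of the correct token and lowers the others. A real model does the same thing with billions of parameters and a different target at every position.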

Repeat that loop millions of times (each step processes a batch of millions of tokens), over trillions of tokens in total, on thousands to tens of thousands of accelerator chips operating in parallel, for weeks or months, and you have a frontier model. There is no trick to it, conceptually. The difficulty is in the engineering.

Key idea: Training is one tiny update, repeated unfathomably many times. Every impressive thing a model can do is the cumulative result of those updates. There is no magic step.

The data pipeline is half the model

One of the most underrated facts about modern LLMs is how much of the work goes into the data. Chapter 5 spends real time on this because it's where many production models live or die.

Pretraining text is collected from the web, books, code repositories, and other sources, totaling anywhere from hundreds of billions to well over ten trillion tokens for a modern model. The raw collection is then aggressively cleaned: duplicates are removed, low-quality material is filtered out, harmful or copyrighted material is screened, and the result is rebalanced so that no single source dominates. Each of these steps requires its own engineering and policy work.
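The cleaning steps can be sketched in a few lines of Python. This is a deliberately crude illustration, not a description of any real pipeline: the function name `clean_corpus`, the word-count quality filter, and the per-source cap are all assumptions, and production systems use fuzzy deduplication, learned quality classifiers, and token-level reweighting instead.

```python
import hashlib
from collections import defaultdict

def clean_corpus(docs, min_words=5, max_share=0.5):
    """docs is a list of (source, text) pairs; returns the cleaned list."""
    seen = set()
    kept = []
    for source, text in docs:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:                       # duplicate after normalization
            continue
        seen.add(digest)
        if len(normalized.split()) < min_words:  # crude stand-in for a quality filter
            continue
        kept.append((source, text))

    # Rebalance: allow each source at most max_share of the documents
    # that survived filtering (a simplification of real reweighting).
    cap = max(1, int(max_share * len(kept)))
    per_source = defaultdict(int)
    balanced = []
    for source, text in kept:
        if per_source[source] < cap:
            per_source[source] += 1
            balanced.append((source, text))
    return balanced

docs = [
    ("web", "The quick brown fox jumps over the lazy dog."),
    ("web", "the quick  brown fox jumps over the lazy dog."),  # duplicate
    ("web", "Click here"),                                     # too short
    ("web", "Gradient descent adjusts parameters to reduce the loss."),
    ("web", "Transformers process tokens with attention layers in parallel."),
    ("books", "It was the best of times, it was the worst of times."),
]
cleaned = clean_corpus(docs)
print(len(docs), "documents in,", len(cleaned), "documents out")
```

Even in this toy version, half of the input is discarded, which hints at why the data work is such a large share of the overall effort.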

The mix and quality of the data shape the resulting model far more than people realize. A model trained on a curated, well-balanced corpus can outperform a model with twice the parameters trained on raw scraped data. This is one reason why open-weights models from well-resourced labs continue to improve even as parameter counts plateau — the data work is improving.

Loss functions, in plain language

The loss function is the mathematical scorecard that tells the training process how well the model is doing. For language models, the standard choice is cross-entropy loss — a measure that punishes confident-wrong predictions much more than uncertain-wrong predictions.

You don't need to follow the math to use the intuition. A model that's mostly right with low confidence has a moderate loss. A model that's mostly right with high confidence has a low loss. A model that's confidently wrong has a very high loss. The training process is designed to drive the loss down, which in effect teaches the model to be confident only when it should be.
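For readers who do want to see numbers, here is a small sketch. For a single prediction, cross-entropy loss is the negative logarithm of the probability the model gave to the correct token; the three probabilities below are invented for illustration.

```python
import math

def cross_entropy(prob_of_correct_token):
    """Loss for one prediction: -log of the probability given to the right answer."""
    return -math.log(prob_of_correct_token)

cases = {
    "right, high confidence (p = 0.95)": 0.95,
    "right, low confidence  (p = 0.40)": 0.40,
    "confidently wrong      (p = 0.01)": 0.01,
}
for label, p in cases.items():
    print(f"{label}: loss = {cross_entropy(p):.2f}")
# right, high confidence (p = 0.95): loss = 0.05
# right, low confidence  (p = 0.40): loss = 0.92
# confidently wrong      (p = 0.01): loss = 4.61
```

The last case is the one to notice: giving the correct token only a 1% chance costs roughly ninety times as much as giving it 95%.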

Chapter 5 explains why cross-entropy is the right choice, what alternatives exist, and what the loss curve actually looks like during a training run (spoiler: it goes down sharply at first, then slowly for a long time, with periodic bumps as the learning rate changes).
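Since the learning rate comes up here, a sketch of one common schedule may help: a short linear warmup followed by a cosine decay. This is only one widely used shape, not necessarily the schedule the book describes, and the peak rate, warmup length, and floor below are assumed values.

```python
import math

def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up gradually so early, noisy gradients don't destabilize training.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {learning_rate(step, total):.2e}")
```

Large early steps are part of why the loss falls quickly at the start, and the shrinking steps later are part of why the curve flattens into a long, slow decline.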

Why training takes months and costs millions

The numerical operations that make up a training step — matrix multiplications, additions, normalizations — are individually fast on a single GPU. The catch is that one GPU isn't enough to hold a frontier model in memory, let alone train it in a reasonable time. So training is spread across thousands of accelerators wired together with high-bandwidth interconnects.
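A back-of-the-envelope calculation shows why one device is hopeless. The sketch below uses a widely quoted rule of thumb for dense Transformers: training takes roughly 6 floating-point operations per parameter per token. Every number plugged in (model size, token count, chip speed, utilization, cluster size) is a hypothetical round figure, not a description of any particular model or accelerator.

```python
SECONDS_PER_DAY = 86_400

def training_days(params, tokens, chips, peak_flops_per_chip, utilization):
    """Estimate wall-clock training time in days using the ~6 * N * D rule of thumb."""
    total_flops = 6 * params * tokens                       # approximate total work
    sustained = chips * peak_flops_per_chip * utilization   # FLOP/s actually achieved
    return total_flops / sustained / SECONDS_PER_DAY

# Hypothetical run: 400 billion parameters, 15 trillion tokens,
# chips with 1e15 peak FLOP/s running at 40% utilization.
common = dict(params=4e11, tokens=1.5e13, peak_flops_per_chip=1e15, utilization=0.4)

single_chip_days = training_days(chips=1, **common)
cluster_days = training_days(chips=16_000, **common)

print(f"one chip:     about {single_chip_days / 365:,.0f} years")
print(f"16,000 chips: about {cluster_days:.0f} days")
```

Under these assumptions a single chip would need thousands of years, while a cluster of 16,000 finishes in about two months, and that is before accounting for restarts, failed runs, and experiments.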

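Three flavors of parallelism are typically combined. Data parallelism puts a full copy of the model on every device (or group of devices) and feeds a different slice of each batch to each copy, averaging the gradients across copies before every update. Model parallelism splits the model itself across devices; in its most common form, tensor parallelism, each layer's weight matrices are sliced so that every device holds only a piece of each layer. Pipeline parallelism assigns consecutive groups of layers to different devices and streams small micro-batches through them, so that devices don't sit idle waiting for one another.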
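The gradient averaging behind data parallelism is easy to check on a toy problem. The sketch below simulates two "devices" in ordinary Python, using a one-parameter model y = w * x with mean squared error. The data, the model, and the function names are assumptions for illustration, and it requires the batch to divide evenly across devices.

```python
def mean_gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_gradient(w, batch, num_devices):
    """Shard the batch, compute one gradient per shard, then average them."""
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]
    local_grads = [mean_gradient(w, shard) for shard in shards]  # one per device
    return sum(local_grads) / num_devices  # what an all-reduce average would return

batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

single = mean_gradient(w, batch)
parallel = data_parallel_gradient(w, batch, num_devices=2)
print(single, parallel)  # the two values agree up to floating-point rounding
```

Because the averaged result matches what a single device would compute on the whole batch, adding devices changes how fast a step runs, not what the step does. The cost is the communication needed to average gradients on every step.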

Each of these is its own engineering discipline, with its own failure modes. Devices fail in the middle of training and have to be replaced, with the run resumed from its most recent checkpoint. Network congestion shows up as training stalls. Numerical instabilities cause runs to diverge. Frontier-scale training is more about industrial reliability than algorithmic cleverness.

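Important: The cost of a frontier training run is dominated by accelerator hardware (bought and depreciated, or rented), the people who build and run the system, and electricity. Public estimates differ on the exact split, but hardware and staff usually outweigh the energy bill. The mathematics itself is simple; the expense comes from the scale at which it has to run.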

Overfitting and the balance you have to strike

The chapter closes by discussing two failure modes that every training run navigates. Overfitting means the model memorizes its training data instead of learning the patterns underneath; it produces a model that performs well on the training data but poorly on anything new. Underfitting means the model is too small, or hasn't been trained enough, to capture the structure in the data; it produces a model that's bad at everything, including its own training data.

The space between them is narrow, and several standard tools are used to keep training inside it: dropout, weight decay, careful learning rate schedules, and early stopping. Most of these fall under the heading of regularization. None of these are exotic. All of them are essential.
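Two of these tools are simple enough to sketch directly. The code below shows weight decay as an extra pull toward zero inside a gradient step, and early stopping as a rule that halts training once validation loss has failed to improve for a set number of checks. The function names, the patience value, and the validation losses are all invented for illustration.

```python
def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """Ordinary gradient step plus a small pull of the weight toward zero."""
    return w - lr * (grad + weight_decay * w)

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0   # new best: reset the counter
        else:
            bad_epochs += 1              # no improvement on held-out data
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1           # never triggered: ran to the end

# Hypothetical validation losses: improving at first, then getting worse
# as the model starts to memorize its training data.
val_losses = [2.0, 1.5, 1.2, 1.1, 1.15, 1.3, 1.6]
stop_at = early_stopping_epoch(val_losses)
print("stop at epoch", stop_at)  # stop at epoch 5

# With a zero gradient, weight decay alone shrinks the weight slightly.
print(sgd_step_with_weight_decay(1.0, grad=0.0))
```

In practice the checkpoint that gets kept is the one from the best validation epoch (epoch 3 in this example), not the one where the stop was triggered.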

What Chapter 5 sets up

By the end of Chapter 5, you have a clear picture of what a frontier model is, materially. You can read a press release about a new training run and place its claims accurately. You understand why the engineering of these systems is now a national-security-scale concern in some countries, and why the public conversation about AI is increasingly a conversation about data, power, and infrastructure.

Next up — Chapter 6: Fine-Tuning & Adaptation. Tomorrow we look at how a pretrained model becomes useful. Fine-tuning, instruction tuning, parameter-efficient methods like LoRA, and the alignment techniques (RLHF and its descendants) that turn raw next-token predictors into helpful assistants.

Want the full picture? The book breaks down the full training pipeline, including the data curation steps that most introductions skip, with diagrams of the parallelism strategies used in real frontier runs. Grab LLM Primer I on Amazon →
