Chapter 10 — Post-Training and Alignment Mathematics

Tenth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. In which a brilliant but feral next-word predictor is civilized into a helpful assistant — and an entire reinforcement-learning pipeline collapses, through one elegant derivation, into something you can train like an ordinary classifier.

Why this chapter exists

Chapters 8 and 9 produced a pretrained model. It has read much of the internet, and it can continue any text with uncanny fluency. It also has no particular inclination to be helpful. Ask it a question and it might generate more questions, because that is a pattern it saw. It is brilliant and feral at once.

Chapter 10 is the bridge between that creature and the assistant you actually interact with. It is also one of the more mathematically beautiful chapters in the book, because the engineering of alignment turns out to rest on three clean ideas in a row — and the third of them is unreasonably elegant.

One line: post-training in three movements — supervised fine-tuning teaches the model to imitate good answers, a reward model learns human preferences, and preference optimization tunes the model to satisfy them, with a KL leash that keeps it close to the original.

10.1 Supervised fine-tuning: learning to imitate

The first and gentlest step is supervised fine-tuning (SFT), and mathematically there is nothing new in it at all — which is itself illuminating. You gather (prompt, ideal-response) pairs, written or curated by humans, and you train the model on them with exactly the cross-entropy loss of Chapter 1: maximize the probability the model assigns to the ideal response, given the prompt.
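As a minimal sketch of that objective, the snippet below computes the mean negative log-likelihood of one training sequence. Two things here are assumptions rather than anything the chapter specifies: the per-token log-probabilities are made-up numbers standing in for model outputs, and the loss is averaged over response tokens only (a common convention, masking out the prompt).

```python
import math

def sft_loss(token_log_probs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model gave each actual token.
    response_mask:   1 for response tokens, 0 for prompt tokens.
    """
    picked = [lp for lp, m in zip(token_log_probs, response_mask) if m]
    if not picked:
        raise ValueError("mask selects no response tokens")
    return -sum(picked) / len(picked)

# Hypothetical log-probabilities for a 4-token sequence:
# two prompt tokens followed by two response tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
mask = [0, 0, 1, 1]

loss = sft_loss(log_probs, mask)
print(f"SFT loss: {loss:.4f}")
```

Minimizing this quantity is the same as maximizing the probability of the ideal response given the prompt; nothing about the loss itself distinguishes SFT from pretraining, only the data does.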

This single change — curating the data — does a remarkable amount. By imitating thousands of examples of a helpful assistant answering questions, refusing harmful requests, and formatting responses well, the model learns to behave like that assistant rather than like the average internet page.

But SFT has a ceiling, and the ceiling is the reason the next two movements exist. Imitation can only ever reproduce its demonstrators — it cannot exceed them, and it gives you no way to express that one response is a little better than another. Worse, high-quality demonstrations are expensive: writing the ideal answer to every kind of question is far harder than recognizing a good one when you see it.

10.2 Reward models and the mathematics of preference

If writing the perfect answer is hard but comparing two answers is easy, then let us collect comparisons. Show a human two responses to the same prompt and ask only: which is better? This produces preference data — pairs labeled "winner over loser" — far more cheaply and reliably than demonstrations.

The bridge from a pile of noisy human comparisons to a smooth scoring function is a classical piece of statistics from the 1950s: the Bradley–Terry model. It assigns each item a hidden strength and says the probability one beats another is governed by the difference in their strengths through a logistic function. The reward model is trained to make this probability match the human labels, by minimizing the negative log-likelihood of the observed preferences.
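In symbols, the model and its training loss take roughly the following form. The notation is an assumption chosen for this post, not necessarily the book's: $x$ is the prompt, $y_w$ and $y_l$ are the winning and losing responses, $r_\phi$ is the reward model with parameters $\phi$, $\sigma$ is the logistic function, and $\mathcal{D}$ is the preference dataset.

$$
P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) = \frac{e^{r_\phi(x, y_w)}}{e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}}
$$

$$
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \Big[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \Big]
$$

Only the difference of rewards enters, so adding a constant to every reward for a given prompt changes nothing.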

Recognize the structure: it is just logistic regression on differences of rewards. The output is a learned function that distills a crowd of fuzzy, inconsistent human judgments into a single smooth number, a proxy for "how much would a human like this answer?", usable wherever raw human labels are not.
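To make the "logistic regression on differences" reading concrete, here is a small sketch of the pairwise loss on scalar rewards. The function names and the example scores are hypothetical; a real reward model would produce the scores from a neural network rather than from a hand-written list.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_prob(r_winner, r_loser):
    """Bradley-Terry probability that the first response beats the second."""
    return math.exp(log_sigmoid(r_winner - r_loser))

def reward_model_loss(pairs):
    """Mean negative log-likelihood over (winner_reward, loser_reward) pairs."""
    return -sum(log_sigmoid(rw - rl) for rw, rl in pairs) / len(pairs)

# Hypothetical scores the reward model gave to three labeled pairs.
pairs = [(2.0, 0.0), (0.5, 0.5), (-1.0, 1.0)]
print(f"P(win) for first pair: {preference_prob(2.0, 0.0):.3f}")
print(f"loss: {reward_model_loss(pairs):.3f}")
```

The third pair, where the model scores the human-preferred response lower, contributes most of the loss; gradient descent on this quantity pushes the winner's score up and the loser's down.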

10.3 RLHF: optimizing against the reward, on a leash

With a reward model in hand, the goal seems obvious: adjust the language model (now called the policy) to produce responses the reward model scores highly. This is reinforcement learning from human feedback (RLHF), and the naive objective is just to maximize expected reward. But the naive objective is a trap, and understanding the trap is the key to the whole enterprise.

The reward model is only a proxy, trained on limited data, with blind spots. A policy optimized hard enough against it will find and exploit those blind spots — producing degenerate text that scores absurdly high on the reward model while being gibberish to a human. The field calls this reward hacking, and it is exactly the specification gaming the book warned about in earlier chapters, in its most concrete form.

The fix is to add a penalty: a KL-divergence term that pulls the policy back toward a fixed reference model, in practice usually the SFT model from 10.1 rather than the raw pretrained one. The objective now has two terms. The expected reward pulls the policy toward high-scoring responses; the KL penalty, scaled by a coefficient β, keeps it out of the strange high-reward regions where the proxy is wrong. The whole art is in the balance: with too small a β the model hacks the reward, and with too large a β it never improves. The chapter walks through what optimization on this penalized objective looks like in practice: PPO and its cousins.
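Written out, the penalized objective usually takes the following standard form. The symbols are assumptions for this post: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $r_\phi$ is the reward model, and $\beta > 0$ sets the length of the leash.

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \; - \; \beta \, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
$$

As $\beta \to 0$ this reduces to naive reward maximization; as $\beta \to \infty$ the policy is pinned to the reference and never moves.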

10.4 DPO: when the reinforcement learning melts away

Here is one of the prettiest results in recent machine learning, and a genuine "aha". The RLHF objective above looks like it requires the whole apparatus — a separate reward model, a reinforcement-learning loop, sampling from the policy. Direct Preference Optimization (DPO) showed that it does not. Through a clean derivation, the entire pipeline collapses into a single supervised loss you can train with the same tools as any other classifier.

The derivation hinges on a known fact: the KL-constrained reward-maximization objective has a closed-form optimal solution — the reference policy reweighted by the exponentiated reward. The DPO insight is to run this backwards: solve that relationship for the reward in terms of the optimal policy, substitute back into the Bradley–Terry preference loss, and watch the reward model disappear. What remains is a loss expressed entirely in terms of the policy's own log-probabilities, on the chosen and rejected responses, against the reference model.
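The three steps can be sketched as follows, in assumed notation: $\pi^*$ is the optimal policy, $\pi_{\mathrm{ref}}$ the reference, $Z(x)$ a normalizing constant that depends only on the prompt, and $y_w$, $y_l$ the chosen and rejected responses. First, the closed-form optimum:

$$
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big)
$$

Second, the same relationship solved for the reward:

$$
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
$$

Third, substitution into the Bradley–Terry loss. The preference probability depends only on the difference $r(x, y_w) - r(x, y_l)$, so the intractable $\beta \log Z(x)$ term cancels, leaving

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The cancellation of $Z(x)$ is the step that makes the whole thing trainable: it is the one quantity nobody could have computed.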

The resulting loss has the same form as the reward-model loss, but now the "reward" is implicit: it is β times the policy's own log-probability ratio against the reference. Training simply increases the model's relative preference for the chosen response over the rejected one, with the reference ratio and the coefficient β playing the role of the leash automatically. No separate reward model. No RL loop. Just supervised learning on preference pairs.
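A per-example version of this loss fits in a few lines. This is a sketch under assumptions: the four inputs are sequence-level log-probabilities (summed over response tokens) that a real implementation would obtain from the policy and the frozen reference model, and the default β of 0.1 is only an illustrative value.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is log pi(y | x) for the chosen (w) or rejected (l)
    response under the policy or the reference model.
    """
    implicit_reward_w = beta * (policy_logp_w - ref_logp_w)
    implicit_reward_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)

# At initialization the policy equals the reference, so both implicit
# rewards are zero and the loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy shifts probability toward the chosen response and
# away from the rejected one, the loss drops.
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"start: {start:.4f}, later: {later:.4f}")
```

Note that the absolute log-probabilities never matter on their own, only how far the policy has moved relative to the reference on each response.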

Worth holding onto: DPO is the kind of result that makes the field briefly feel small and elegant. A whole moving zoo of components — reward model, policy, PPO, KL penalty, sampling — folds neatly into one supervised loss. The same mathematics, less machinery.

10.5 Best-of-n, the alignment tax, and honest cautions

A simpler alternative deserves mention because it is widely used: rejection sampling, or best-of-n. Generate n candidate responses, score them all with the reward model, keep the best. It requires no policy training at all — just extra inference. It is a strong, dead-simple baseline, at the cost of n× generation compute.
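The procedure is short enough to write down directly. In this sketch `generate` and `reward` are hypothetical callables standing in for a sampling call to the policy and a call to the reward model; the toy versions below exist only so the example runs, and the length-based toy reward is deliberately crude.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))

# Toy stand-ins: a "model" that replays canned responses and a "reward
# model" that scores by length.
_canned = iter(["maybe", "Paris is the capital of France.", "I don't know"])

def toy_generate(prompt: str) -> str:
    return next(_canned)

def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))

best = best_of_n("What is the capital of France?", toy_generate, toy_reward, n=3)
print(best)
```

The policy's weights are never touched; all the improvement comes from spending n generations and n reward-model calls per answer.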

The chapter closes with two honest cautions. First, alignment can exact an alignment tax: a model tuned hard for helpfulness and safety sometimes loses a little raw capability, becoming more cautious or more verbose. Second, and more fundamentally, every method here optimizes for human approval, which is not the same as truth or goodness. A model can learn to be liked without learning to be right.

The chapter also covers a topic that has become important fast: RLAIF — using one model's judgment to align another — and Constitutional AI, where the values are written down in plain language and the model is trained to follow them. Both gesture at the deeper problem of scalable oversight: how humans supervise systems that may eventually exceed humans at the tasks being judged.

What Chapter 10 sets up

You leave this chapter with a clear picture of how a pretrained model becomes the assistant on your screen — three movements, two beautiful pieces of statistics (Bradley–Terry, DPO), and a set of honest limitations the field is still working on. From here, the book turns to a related and equally mathematical question: now that we have built and aligned a model, how do we know if it's any good?

Next — Chapter 11: Evaluation, Calibration, and Inference. Perplexity, calibration, the error bars that every benchmark score should carry, and the mathematics of measuring hallucination. The chapter where we ask how anyone can measure a machine that can say anything.

Want the full picture? The book includes the full Bradley–Terry derivation, the DPO closed-form solution and its substitution proof, and the three-model RLHF choreography drawn out diagrammatically, plus the appendix's worked derivation of DPO from scratch.