Chapter 11 — Evaluation, Calibration, and Inference
Eleventh post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. In which we ask how anyone can possibly measure a machine that can say anything, and discover that a confident model is often a poorly calibrated one.
The question that turns out to be mathematical
We have built a model in Part II, trained it in Part III, and aligned it in Chapter 10. How do we know if any of it actually worked? This sounds like a soft question. It is one of the hardest and most mathematical in the field, because a language model can produce essentially any text, and "good" resists definition.
Chapter 11 is the chapter that equips you to measure rigorously — to separate genuine progress from noise, hype, and self-deception. It is also the chapter where the book is most direct about how much of the public conversation around AI rests on numbers that should never have been reported without error bars.
11.1 Perplexity, likelihood, and the intrinsic yardstick
The most fundamental measure of a language model needs no human at all, because it falls straight out of the training objective. A good model assigns high probability to real text, so we can measure the probability it assigns to a held-out test set it never trained on. Expressed per token and exponentiated, this is perplexity — which Chapter 1 introduced and which we can now place at the center of evaluation.
Perplexity is the exponentiated cross-entropy: the model's average surprise, expressed as an effective branching factor. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 tokens at each step. Lower is better, and because it is computed directly from the test-set likelihood, it is objective, automatic, and cheap.
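To make the definition concrete, here is a minimal sketch of the computation. It assumes you already have the natural-log probability the model assigned to each held-out token; the function name `perplexity` and the toy inputs are mine, not the book's.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probability of each held-out token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Cross-entropy: the average negative log-probability per token (in nats).
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    # Exponentiating turns average surprise into an effective branching factor.
    return math.exp(cross_entropy)

# Toy case: a model that spreads probability uniformly over 10 options
# at every one of 50 steps.
uniform_over_ten = [math.log(1 / 10)] * 50
print(perplexity(uniform_over_ten))  # approximately 10

# Toy case: a model that always gives the true token probability 0.5.
coin_flip = [math.log(0.5)] * 50
print(perplexity(coin_flip))  # approximately 2
```

The uniform case recovers the "choosing among 10" reading directly: the branching-factor interpretation is just the definition evaluated on a uniform distribution.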
It has two great virtues and two limitations. Its virtues: it requires no labels and no humans, and it is mathematically clean. Its main limitation: it is blind to almost everything we care about in practice, such as usefulness, truth, and safety. A second, more technical limitation: it is not comparable across tokenizers, because the unit ("per token") is not the same in every model.
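The tokenizer caveat is easy to see with numbers. The sketch below assumes two hypothetical tokenizers that split the same text into 400 and 600 tokens, and a model that assigns the text the same total log-likelihood either way. All figures are invented for illustration. The per-character normalization at the end is one common workaround and is my addition, not something the chapter summary names.

```python
import math

def per_token_perplexity(total_log_prob, n_tokens):
    """Perplexity when the total log-likelihood is averaged per token."""
    return math.exp(-total_log_prob / n_tokens)

def bits_per_character(total_log_prob, n_chars):
    """Tokenizer-independent alternative: average bits spent per character."""
    return -total_log_prob / (n_chars * math.log(2))

# Hypothetical numbers: one text, one total natural-log likelihood.
total_log_prob = -1200.0
n_chars = 2000

# The same likelihood looks very different depending on the token count.
ppl_coarse = per_token_perplexity(total_log_prob, n_tokens=400)  # e^3
ppl_fine = per_token_perplexity(total_log_prob, n_tokens=600)    # e^2
print(ppl_coarse, ppl_fine)

# Normalizing by characters gives one number regardless of tokenization.
print(bits_per_character(total_log_prob, n_chars))
```

Both "models" here assign the text exactly the same probability, yet the finer tokenizer reports a much lower perplexity. That is the whole problem in one comparison.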
The chapter also surveys the small zoo of output-quality metrics that grew up to fill that gap — BLEU and ROUGE for translation and summarization, code-execution rates for code, judge-model scores for open-ended generation — each a different answer to "how do we score text against a reference?" Each of them has known failure modes the chapter is explicit about.
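As a rough illustration of what reference-based scoring computes, here is a sketch of clipped unigram overlap, the simplest ingredient behind BLEU-style precision and ROUGE-style recall. It is deliberately not full BLEU or ROUGE (no higher-order n-grams, no brevity penalty, whitespace tokenization only), and the function name is mine.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand or not ref:
        raise ValueError("candidate and reference must be non-empty")
    # Counter intersection keeps the minimum count per word ("clipping").
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values())  # BLEU-flavored
    recall = overlap / sum(ref.values())      # ROUGE-flavored
    return precision, recall

reference = "the cat sat on the mat"
print(unigram_overlap("the cat lay on the mat", reference))
# A scrambled sentence with a different meaning still scores perfectly:
print(unigram_overlap("the mat sat on the cat", reference))
```

The second call shows one failure mode of this kind of metric: word overlap cannot tell who sat on whom.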
11.2 Calibration: does confidence match reality?
Here is a property that matters enormously and is widely overlooked. A model is well-calibrated if its confidence matches its accuracy. When a calibrated model says it is 80% sure, it should be right about 80% of the time.
This is not the same as being accurate. A model can be accurate yet overconfident, or even inaccurate yet honest about it. For any high-stakes use of an LLM, calibration is as important as accuracy — perhaps more so, because a well-calibrated model that occasionally says "I don't know" is far more useful than an accurate model that never does.
Calibration is measured by binning predictions by their stated confidence and checking, within each bin, whether the actual accuracy matches. Plotting accuracy against confidence gives a reliability diagram: a perfectly calibrated model traces the diagonal; a model whose curve falls below it is overconfident (the common case); a model whose curve rises above it is underconfident. The gap between confidence and accuracy, averaged over bins with each bin weighted by its share of the predictions, is summarized by Expected Calibration Error (ECE): a single number that tells you how much your model's confidence can be trusted.
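The binning procedure translates almost line for line into code. This is a minimal sketch assuming equal-width confidence bins and a list of per-prediction (confidence, was-it-correct) pairs. The choice of 10 bins is a common default, not something the chapter prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: stated confidence of each prediction, in [0, 1]
    correct:     parallel list of booleans, True if the prediction was right
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the fraction of predictions it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Says 80%, right 8 times out of 10: essentially zero error.
print(expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2))
# Says 90%, right only 6 times out of 10: a gap of about 0.3.
print(expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4))
```

The per-bin `avg_conf` and `accuracy` pairs are exactly the points you would plot to draw the reliability diagram.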
The chapter also shows what to do about it. Temperature scaling, which divides the logits by a single scalar learned on held-out data, is a simple post-hoc calibration that is often effective. RLHF, interestingly, often worsens calibration: a model trained to sound confident becomes overconfident in ways that ECE catches and perplexity does not.
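Here is a small sketch of temperature scaling under simplifying assumptions. The temperature is chosen by a plain grid search that minimizes negative log-likelihood on a validation set. In practice a gradient-based optimizer is the usual choice, and the grid is my substitution to keep the example dependency-free. The validation logits are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.25 + 0.05 * i for i in range(96)]  # 0.25 to 5.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical overconfident model: always about 98% sure of class 0,
# but class 0 is correct only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

t = fit_temperature(val_logits, val_labels)
print(t)                                  # a temperature above 1
print(softmax([4.0, 0.0])[0])             # raw confidence, about 0.98
print(softmax([4.0, 0.0], t)[0])          # scaled confidence, near 0.75
```

Note that dividing by a positive scalar never changes which class has the largest logit, so accuracy is untouched; only the stated confidence moves.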
11.3 Benchmark uncertainty: why a score needs an error bar
When a new model "scores 87% on a benchmark," the number invites a question almost no one asks: 87% plus or minus what? A benchmark score is an estimate, computed on a finite sample of test questions, and like any estimate from a sample it has uncertainty.
The chapter walks through the math. For a benchmark of n questions, the standard error of an accuracy p is √(p(1−p)/n), which is largest near 50%, where it is roughly 1/(2√n). For n = 1000, that worst case is about 1.6 percentage points, meaning a model scoring 87% and a model scoring 85% are not necessarily different at all. Two hazards compound this: multiple comparisons (evaluate enough models on enough benchmarks and some will look better by chance) and contamination (if a benchmark's questions leaked into the model's training data, the score measures memorization rather than capability).
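The arithmetic is short enough to check by hand, or with the sketch below. It uses the standard binomial standard error and makes two assumptions worth flagging. The z-score treats the two models as evaluated on independent samples (on a shared test set a paired test would be more appropriate), and the multiple-comparisons function treats the comparisons as independent. The function names are mine.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

def difference_z_score(acc_a, acc_b, n):
    """z-score for the gap between two accuracies (independent samples assumed)."""
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n) ** 2
        + accuracy_standard_error(acc_b, n) ** 2
    )
    return (acc_a - acc_b) / se_diff

def chance_of_spurious_win(n_comparisons, alpha=0.05):
    """Probability that at least one of several independent null comparisons
    crosses the significance threshold by luck alone."""
    return 1 - (1 - alpha) ** n_comparisons

print(accuracy_standard_error(0.50, 1000))   # about 0.016, i.e. 1.6 points
print(accuracy_standard_error(0.87, 1000))   # about 0.011
print(difference_z_score(0.87, 0.85, 1000))  # roughly 1.3, below the usual 1.96
print(chance_of_spurious_win(20))            # well over one half
```

Under these assumptions the 87% versus 85% gap does not clear the conventional 95% threshold, and twenty independent comparisons are more likely than not to produce at least one "significant" result from noise alone.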
In my opinion, this is the section the AI press most needs to read.
11.4 Hallucination and the mathematics of retrieval
The failure mode that most defines an LLM's limits — hallucination, the confident assertion of falsehood — is also one of the hardest to measure, precisely because it requires judging truth. Perplexity cannot see it.
Measuring hallucination means checking generated claims against trusted sources, either by human annotation or by automated faithfulness metrics that test whether each statement in an answer is entailed by the supplied context. The chapter walks through what these metrics actually compute.
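The shape of an automated faithfulness metric can be sketched as "fraction of claims entailed by the context". Everything hard lives inside the entailment check, which in a real system would be a trained model or a human annotator. The `toy_entails` function below is a crude word-containment placeholder that I made up purely so the example runs. It is not a real entailment test and not the book's method.

```python
def faithfulness_score(claims, context, entails):
    """Fraction of claims that the supplied context supports.

    entails: any callable (context, claim) -> bool. Hypothetical interface.
    """
    if not claims:
        raise ValueError("need at least one claim")
    supported = sum(1 for claim in claims if entails(context, claim))
    return supported / len(claims)

def toy_entails(context, claim):
    """Placeholder only: True when every word of the claim appears in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())

context = "the report was published in 2019 by the city council"
claims = [
    "the report was published in 2019",       # supported by the context
    "the report was published by the mayor",  # "mayor" appears nowhere
]
print(faithfulness_score(claims, context, toy_entails))  # 0.5
```

Swapping `toy_entails` for a stronger judge changes the quality of the metric but not its structure, which is why the reliability of the judge matters so much.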
And the leading tool for reducing hallucination returns us, fittingly, to the geometry of Part I. Retrieval-augmented generation grounds the model in real documents, and its core operation is mathematical: find the passages most relevant to a query. This is maximum inner-product search — embed the query and every candidate passage as vectors (Chapter 3), and find the passages whose embeddings have the highest dot product with the query embedding. The geometry of Chapter 3 is suddenly load-bearing for production.
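Maximum inner-product search in its most naive form is a few lines: score every passage by its dot product with the query and keep the top few. The three-dimensional "embeddings" below are invented for illustration; real ones come from an embedding model and have hundreds or thousands of dimensions. This exhaustive scan is exact but linear in the number of passages, which is why large systems typically use approximate indexes instead.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_by_inner_product(query, passages, k=2):
    """Names of the k passages whose embeddings score highest against the query.

    passages: dict mapping a passage name to its embedding vector.
    """
    ranked = sorted(
        passages.items(),
        key=lambda item: dot(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical toy embeddings.
query = [1.0, 0.0, 1.0]
passages = {
    "passage_a": [0.9, 0.1, 0.8],  # inner product 1.7
    "passage_b": [0.0, 1.0, 0.0],  # inner product 0.0
    "passage_c": [0.5, 0.5, 0.5],  # inner product 1.0
}
print(top_k_by_inner_product(query, passages))  # ['passage_a', 'passage_c']
```

The retrieved passages are what gets placed in the model's context, so the quality of this ranking bounds how well the generation step can be grounded.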
What Chapter 11 sets up
You leave this chapter with the toolkit of honest measurement: perplexity as the intrinsic yardstick, calibration as the question that often matters more than raw accuracy, error bars as the antidote to benchmark theater, and the retrieval geometry that grounds hallucination control. That closes Part III. The book turns from how models are built and measured to what we actually do with them.
Next — Chapter 12: Real-World Applications of LLMs. The first chapter of Part IV. Text generation and summarization, question answering, translation, reasoning — what each one looks like through the math we now have.