Chapter 1 — Mathematical Intuition for Language Models
First post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. We begin where the book begins — by removing the wall that math usually puts between you and the idea.
The problem this chapter solves
Most books that promise to explain LLMs "with the math" do one of two things. They either skip the math entirely and call the result intuition, which is fine until you hit a real claim and have nothing to check it against. Or they open with twenty equations and let you sort yourself out, which is fine if you already know the math.
Chapter 1 refuses both. It says: before any equation, we agree on what symbols are for, what probability really is in the context of language, and what entropy is measuring. After that, every later chapter is allowed to write things down.
Getting comfortable with mathematical notation
Section 1.1 takes the symbols you will meet again and again — summations, expectations, conditional probabilities, vectors, the occasional log — and treats each one as a piece of compressed English. A summation is "add these up." An expectation is "the average you'd see if you kept drawing samples." A conditional probability is "given that this is true, how likely is that."
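To make the "compressed English" reading concrete, here is a minimal sketch of those three symbols. The fair six-sided die is my own illustration, not an example taken from the book.

```python
from fractions import Fraction

# Hypothetical toy distribution: a fair six-sided die (an assumption for illustration).
outcomes = [1, 2, 3, 4, 5, 6]
p = {x: Fraction(1, 6) for x in outcomes}

# Summation: "add these up." The probabilities of all outcomes add up to 1.
total = sum(p[x] for x in outcomes)

# Expectation: "the average you'd see if you kept drawing samples."
# Each outcome is weighted by how likely it is.
expected = sum(x * p[x] for x in outcomes)

# Conditional probability: "given that this is true, how likely is that."
# P(roll is 6 | roll is even) = P(roll is 6 and even) / P(roll is even)
p_even = sum(p[x] for x in outcomes if x % 2 == 0)
p_six_and_even = p[6]
p_six_given_even = p_six_and_even / p_even

print(total, expected, p_six_given_even)
```

Each line of code is the English sentence with the words swapped for operations, which is roughly the attitude toward notation described above.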
The chapter is careful about one thing in particular: it never asks you to be impressed by a symbol. Every piece of notation is introduced after the idea it stands for is already in your head.
Probability for language generation
Section 1.2 brings probability into the room and shows what it has to do with language. The setup is the one Book I introduced and the rest of Book II will assume: a language model assigns a probability to every possible next token, given the tokens that came before. Generation is sampling from that distribution.
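A minimal sketch of "generation is sampling," assuming a made-up context and made-up probabilities. A real model produces a distribution over tens of thousands of tokens, but the shape is the same.

```python
import random

# Hypothetical next-token distribution for the context "the cat sat on the".
# The tokens and numbers are invented for illustration.
next_token_probs = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}

def sample_next_token(probs, rng):
    """Draw one token, with each token chosen in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(next_token_probs, rng) for _ in range(10_000)]

# Over many draws, the observed frequencies approach the probabilities.
freq = {t: samples.count(t) / len(samples) for t in next_token_probs}
print(freq)
```

Run it enough times and "mat" comes up about 60% of the time, which is all the distribution ever promised.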
So everything else in the book (attention, transformers, training, scaling) exists in service of one task: estimating that distribution well. Chapter 1 makes that explicit, and starts you with the basic moves of probability you need to talk about it: joint and conditional probability, independence, the chain rule, and why log-probabilities show up everywhere.
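One way to see why the chain rule and log-probabilities travel together is the sketch below. The per-token probabilities are hypothetical values, chosen only to make the arithmetic visible.

```python
import math

# Chain rule: P(w1, w2, w3, w4) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * P(w4 | w1, w2, w3)
# Hypothetical conditional probabilities for a four-token sequence.
step_probs = [0.2, 0.5, 0.9, 0.4]

joint = math.prod(step_probs)                      # multiply the steps
log_joint = sum(math.log(p) for p in step_probs)   # or add their logs

# Both routes describe the same number.
assert math.isclose(math.exp(log_joint), joint)

# Why logs win in practice: a long sequence of small probabilities
# underflows to zero as a product, but stays well behaved as a sum of logs.
long_probs = [0.01] * 200
product = math.prod(long_probs)                    # too small for a float
log_sum = sum(math.log(p) for p in long_probs)

print(product, log_sum)
```

The product collapses to `0.0` while the log-sum is an ordinary negative number, which is one practical reason sequence probabilities are almost always handled in log space.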
You will also meet a worked example: a toy "language" with a handful of tokens, a tiny training corpus, and a probability table you can compute by hand. It is small enough to fit in your head, and large enough that the same shape will be recognizable inside a transformer later.
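The book's own example is not reproduced here. As a stand-in, this is a sketch of what such a table could look like, using a three-sentence corpus I made up and the simplifying assumption that the next token depends only on the previous one.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical toy corpus (not the book's example).
corpus = ["the cat sat", "the dog sat", "the cat ran"]

# Count how often each token follows each previous token.
pair_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1

# Turn counts into conditional probabilities P(next | previous).
table = {
    prev: {nxt: Fraction(c, sum(counts.values())) for nxt, c in counts.items()}
    for prev, counts in pair_counts.items()
}

for prev, row in table.items():
    print(prev, {nxt: str(prob) for nxt, prob in row.items()})
```

After "the", the table gives "cat" a probability of 2/3 and "dog" 1/3, numbers you can check by counting on your fingers. A transformer replaces the counting with learned parameters and a much longer context, but it still outputs one such row per step.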
Entropy and information: measuring uncertainty
Section 1.3 is the quiet hinge of the chapter. Entropy, in the Shannon sense, is the number this book will reach for whenever it needs to say "how uncertain is this distribution?", and the book reaches for it constantly. Cross-entropy will become the loss function. KL divergence will become the measure of how far one distribution sits from another (not a true distance, since it is asymmetric). Perplexity will become the headline evaluation metric. All three are entropy in slightly different clothes.
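A small sketch of how the three quantities relate, using the standard definitions measured in bits. The distributions `p` (the "true" one) and `q` (the model's guess) are hypothetical, and the function names are mine.

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log2 p(x): the uncertainty in p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x): the cost of describing p using q's beliefs."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p): the extra bits paid for using q instead of p."""
    return cross_entropy(p, q) - entropy(p)

def perplexity(p, q):
    """2 ** H(p, q): the cross-entropy re-expressed as an effective number of choices."""
    return 2 ** cross_entropy(p, q)

p = [0.5, 0.25, 0.25]   # hypothetical true distribution
q = [1/3, 1/3, 1/3]     # hypothetical model: no idea, so uniform over 3 tokens

print(entropy(p))            # 1.5 bits
print(cross_entropy(p, q))   # log2(3), about 1.585 bits
print(kl_divergence(p, q))   # the gap between the two
print(perplexity(p, q))      # about 3: the model is "choosing among 3 tokens"
print(kl_divergence(q, p))   # differs from KL(p || q): the measure is asymmetric
```

Cross-entropy is entropy plus a penalty, KL divergence is that penalty on its own, and perplexity is cross-entropy exponentiated back into a count of choices.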
The section derives entropy gently. You meet Shannon's question — how many yes/no questions, on average, would you need to identify a sample? — and watch the formula arrive as the answer. By the end, log-probabilities are not a mystery. They are bits of information.
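The book's exact derivation is not reproduced here. One common route from the yes/no question to the formula, sketched under the assumption that every probability is a power of 1/2, runs like this:

```latex
\begin{align*}
% N equally likely outcomes: each yes/no question halves the candidates.
\text{questions}(N) &= \log_2 N, \qquad \text{e.g. } \log_2 8 = 3 \\
% An outcome with probability p(x) behaves like one of 1/p(x) equally likely outcomes.
I(x) &= \log_2 \frac{1}{p(x)} = -\log_2 p(x) \\
% Entropy is the expected number of questions, i.e. the average of I(x) under p.
H(X) &= \sum_x p(x)\, I(x) = -\sum_x p(x) \log_2 p(x)
\end{align*}
```

For a distribution with probabilities 1/2, 1/4, 1/4, this gives H = 0.5(1) + 0.25(2) + 0.25(2) = 1.5 bits. When the probabilities are not powers of 1/2, no questioning strategy hits the formula exactly, and entropy is the lower bound on the average number of questions.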
The story underneath
The chapter also does something the rest of the book repeats: it introduces ideas through the people and moments that produced them. Claude Shannon shows up here — a young engineer at Bell Labs in the 1940s asking how much information a telegraph wire could carry, and inventing, almost as a side effect, the mathematics that now measures whether a language model is doing well.
This is not history for its own sake. The story makes the formula memorable.
What Chapter 1 sets up
By the end of Chapter 1, the toolkit is in place: a vocabulary of symbols you can read without flinching, a clear picture of probability as "what comes next," and entropy as the way we measure how peaked or how flat the model's belief is. That toolkit will carry you through everything that follows.
Next — Chapter 2: LLMs in Context. A compact tour of what an LLM is, how pretraining and parameter scale come together, what makes language unusual as data, and why the transformer architecture rewrote the field in 2017. The bridge between Book I's plain-language story and Book II's math.