Chapter 4 — Attention: The Core Mechanism

Fourth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. We arrive at the chapter the rest of the book leans on.

The chapter the book is built on

Almost every later chapter in Book II — position encodings, transformer blocks, efficiency, training, scaling — depends on the attention derivation in Chapter 4. If you only had time to read one chapter, this is the one I'd hand you.

The chapter has a goal more specific than "explain attention." Plenty of articles already do that. The goal is to derive attention from intuition, in the order someone might actually have invented it, so that by the end you don't just know the formula — you know why the formula has the shape it does.

4.1 Self-attention: equations, geometry, and intuition

Section 4.1 starts with a need, not a formula. Suppose you want each token in a sequence to be able to look at every other token and decide how much each one matters for understanding it. What's the simplest possible mathematical machine that does that?

The chapter builds it step by step. You start with a single vector per token. You let each token produce a "query" — a question it's asking. Every token also produces a "key" — a label advertising what it has to offer. You compute the similarity of each query against each key with the dot product you saw in Chapter 3. Higher similarity, more relevant. You normalize those similarities into a probability distribution with softmax. And you use that distribution to take a weighted average of "value" vectors — the actual content each token contributes.

That's the entire mechanism. Queries, keys, values, dot product, softmax, weighted sum. The famous formula — softmax(QK^T/√d) V — is just the matrix-form bookkeeping for doing it all in parallel.
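To make the bookkeeping concrete, here is a minimal NumPy sketch of that formula. It is an illustration under stated assumptions, not the book's own code: the names `attention`, `W_q`, `W_k`, `W_v` and the toy sizes are invented for this example, and the projection matrices are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first so exp() cannot overflow.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): every query against every key
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights          # weighted average of the value vectors

# Toy setup (hypothetical sizes): 4 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
n, d_model = 4, 8
X = rng.normal(size=(n, d_model))        # one vector per token
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Each row of `weights` is one token's distribution over all tokens, and each row of `out` is that token's weighted average of the value vectors.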

The chapter is careful about geometry. Queries and keys live in the same space because we're comparing them. Values can live in a separate space because they carry the content being averaged, not the question being asked. The √d in the denominator is a scale correction: when the vector components have roughly unit variance, the spread of the dot products grows with d, and large scores push softmax toward a near one-hot distribution. Dividing by √d keeps the scores in a workable range, and it's a fix you can see once you've watched what happens without it.
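One way to watch what happens without the correction is a small experiment. This is a sketch under an assumption the post does not state: queries and keys are drawn as independent standard-normal vectors. The helper `compare_scaling` is a hypothetical name for this illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # stability shift; does not change the result
    e = np.exp(x)
    return e / e.sum()

def compare_scaling(d, n_keys=16, seed=0):
    """Return the largest attention weight without and with the 1/sqrt(d) scale."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=d)               # one query
    K = rng.normal(size=(n_keys, d))     # n_keys keys
    raw = K @ q                          # unscaled dot products
    scaled = raw / np.sqrt(d)            # scaled dot products
    return softmax(raw).max(), softmax(scaled).max()

for d in (4, 64, 1024):
    raw_peak, scaled_peak = compare_scaling(d)
    print(f"d={d:5d}  peak weight unscaled={raw_peak:.3f}  scaled={scaled_peak:.3f}")
```

As d grows, the unscaled peak weight tends toward 1, meaning one key takes nearly all the attention, while the scaled version stays spread out.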

One line: attention is "for each token, ask every other token a question, score the answers, and take a weighted average of what the answers say." The formula is the bookkeeping.

4.2 Multi-head attention, normalization, and residual paths

One attention head is one way of asking questions. Section 4.2 explains why real transformers use many heads in parallel, and why the chapter calls this a structural choice rather than a trick.

Multi-head attention runs several attention computations side by side, each in a smaller subspace, and concatenates their outputs. The intuition: different heads can learn to look for different kinds of relationships, such as syntactic agreement here, coreference there, topic shift in a third. The math: with h heads, each head typically works in about d/h dimensions, so the total cost stays close to that of one full-width head while the model gets h independent attention patterns.
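The split-and-concatenate step can be sketched in a few lines of NumPy. This follows the common textbook layout rather than anything the post spells out. In particular, the final output projection `W_o` is an assumption borrowed from the standard transformer design, and all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                    # one pattern per head
    heads = weights @ V                                   # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # heads side by side
    return concat @ W_o                                   # assumed output projection

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 8)
```

The projection matrices are the same size as in the single-head case. Only the reshape changes, which is why adding heads does not multiply the cost.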

The section then introduces the residual stream, the idea that each layer adds its contribution to a running representation rather than replacing it, and layer normalization, the small stabilizing operation that makes it practical to train stacks 64 or 128 layers deep.
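Both ideas fit in a few lines. The sketch below makes two assumptions the post does not confirm: it uses the "pre-norm" arrangement (normalize, run the sublayer, then add), and it leaves out the learned gain and bias that layer normalization usually carries. `residual_step` is a hypothetical helper name.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_step(x, sublayer):
    # The sublayer's output is added to the running representation,
    # so the input x is carried forward rather than replaced.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))             # 4 tokens, 16 dimensions
W = rng.normal(size=(16, 16)) * 0.1      # stand-in for an attention or MLP sublayer
y = residual_step(x, lambda h: h @ W)
print(y.shape)  # (4, 16)
```

If the sublayer outputs zeros, `residual_step` returns its input unchanged, which is the sense in which each layer only adds to the stream.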

4.3 Softmax: stability, temperature, interpretation

Softmax is simpler than it looks. Section 4.3 spends real time on it anyway, because it shows up in three different places (inside attention, at the output for next-token prediction, and behind the temperature knob you set when sampling), and getting it wrong in implementation is a well-known source of numerical overflow and training instability.

The section walks through the numerical stability trick — subtracting the maximum before exponentiating — and shows why temperature, the same knob you met in Book I, is just dividing the logits by a constant before softmax. Low temperature: peaked distribution, decisive sampling. High temperature: flatter distribution, more variety. One operation; many uses.
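Both the stability trick and the temperature knob can be shown in one small function. This is a minimal sketch, and the example logits are made up for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature  # temperature: divide the logits
    z = z - z.max()                                    # stability: subtract the maximum
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])

# Naive version: exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.exp(big).sum()

stable = softmax(big)                        # finite, same result as softmax([0, 1, 2])

logits = np.array([2.0, 1.0, 0.5])
sharp = softmax(logits, temperature=0.5)     # low temperature: more peaked
flat = softmax(logits, temperature=2.0)      # high temperature: flatter
print(naive, stable, sharp, flat)
```

Subtracting the maximum leaves the result unchanged because softmax only depends on differences between logits. That is why `stable` matches the softmax of `[0, 1, 2]`.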

4.4 Attention as a kernel method

The chapter ends with a perspective that ties attention to a much older line of mathematics. Section 4.4 shows that attention is a particular instance of a kernel smoother — a way of estimating a function by averaging the values of nearby points, weighted by similarity.
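The correspondence is easy to check numerically. The sketch below assumes the smoother takes the classic Nadaraya-Watson form (a similarity-weighted average) with the exponential kernel exp(q·k/√d). The book may present it differently, and the function names here are hypothetical.

```python
import numpy as np

def exp_kernel(q, k):
    # Similarity between a query and a key: exp of the scaled dot product.
    return np.exp(q @ k / np.sqrt(q.shape[-1]))

def kernel_smoother(query, keys, values, kernel):
    # Weighted average of the values, weighted by kernel similarity to the query.
    w = np.array([kernel(query, k) for k in keys])
    return (w[:, None] * values).sum(axis=0) / w.sum()

def attention_row(query, keys, values):
    # One row of softmax(q K^T / sqrt(d)) V.
    scores = keys @ query / np.sqrt(query.shape[-1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ values

rng = np.random.default_rng(0)
d, n = 8, 6
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

smoothed = kernel_smoother(q, K, V, exp_kernel)
attended = attention_row(q, K, V)
print(np.allclose(smoothed, attended))  # True
```

Swapping `exp_kernel` for a different similarity function changes the smoother but keeps the same averaging structure.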

This isn't a side dish. It explains why attention generalizes so well to new sequences (the kernel structure is doing real work), and it opens the door to the "linear attention" and "kernelized attention" variants that show up in Chapter 7 when efficiency matters more than the original formulation.

Worth holding onto: attention is not unique to transformers as an idea — it's a fresh application of a much older one. The transformer's contribution was to use this idea everywhere, in parallel, instead of as a small subroutine attached to a sequential model.

What Chapter 4 sets up

You finish Chapter 4 with a mental picture of attention that is precise enough to compute, but vivid enough to remember. From here, every later chapter assumes you can read the formula. Position encodings will fit alongside it. The feed-forward layer will sit on top of it. The efficiency chapter will rewrite it. The training chapter will say "this is what we're optimizing."

Next — Chapter 5: Position, Order, and Sequence Structure. Attention by itself doesn't know that "the cat sat" is different from "sat the cat." This chapter installs the missing sense of order — through sinusoidal encodings, relative position, RoPE, and a Fourier-shaped view of the whole apparatus.

Want the full picture? The book runs a complete worked example of one attention layer end-to-end, with the matrices small enough to follow by hand, and connects each step to a diagram of the geometry.