Chapter 5 — Position, Order, and Sequence Structure

Fifth post in the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The chapter that fixes a problem you didn't notice attention had.

The problem attention forgot

Chapter 4 gave us attention. Chapter 5 catches the bug. If you read the attention formula carefully, you'll notice something strange: it treats the input as a set of tokens, not a sequence. Reorder the tokens, and the output reorders along with them — but no token "knows" which position it's in.

For language, that's a disaster. "The cat sat on the mat" and "the mat sat on the cat" share every token. Attention as written can't tell them apart. Chapter 5 is about how transformers fix this — and about why the fix turned out to be more interesting than anyone expected.

5.1 Why position matters in language

The opening section makes the bug concrete. It runs the same attention computation on two orderings of the same set of tokens and shows that permuting the inputs simply permutes the outputs in the same way: attention is permutation-equivariant. It then demonstrates by example that you cannot solve this problem from inside the attention formula. You have to add information from outside.

That information is positional encoding — a vector you add (or otherwise inject) into each token's embedding before attention sees it, carrying the message "I am the third token."

One line: attention by itself sees a bag of tokens, not a sentence. Positional encoding is how it learns to read in order.
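
A minimal NumPy sketch of that bug and its fix, under some assumptions of mine rather than the book's: single-head, unmasked self-attention, random weights, and a random table `P` standing in for whatever positional encoding is actually used. The names `self_attention`, `P`, and `perm` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head, unmasked scaled dot-product attention.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 8
X = rng.normal(size=(n_tokens, d_model))            # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.arange(n_tokens)[::-1]                    # reverse the token order

# Without positions: shuffling the inputs just shuffles the outputs.
out_plain = self_attention(X, Wq, Wk, Wv)
out_plain_shuffled = self_attention(X[perm], Wq, Wk, Wv)
equivariant_without_positions = np.allclose(out_plain_shuffled, out_plain[perm])

# With positions: P[i] is attached to slot i, so the same token gets a
# different input vector (and a different output) when it moves.
P = rng.normal(size=(n_tokens, d_model))            # stand-in position table
out_pos = self_attention(X + P, Wq, Wk, Wv)
out_pos_shuffled = self_attention(X[perm] + P, Wq, Wk, Wv)
order_visible_with_positions = not np.allclose(out_pos_shuffled, out_pos[perm])

print(equivariant_without_positions, order_visible_with_positions)
```

The first check is the bag-of-tokens behavior; the second shows that adding anything position-dependent to the embeddings is enough to break it. A causal mask would also break the symmetry, which is why the sketch leaves the mask out.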

5.2 Sinusoidal encodings and periodic structure

The original Transformer paper used sinusoidal positional encodings. Section 5.2 derives them, and shows why a particular family of sines and cosines, with frequencies spaced geometrically across the embedding dimensions, has the property the architecture needs: the encoding at position p + k is a fixed linear transformation (a rotation) of the encoding at position p, so the model can learn to read off relative position without ever being told the formula.

The chapter unpacks the math behind that property — and why "use sines and cosines" is not as arbitrary as it looks. The frequencies cover scales from one-token-apart to thousands-apart in a single embedding vector. The same construction lets a model trained on short sequences extrapolate, partially, to longer ones.
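
Here is a rough sketch of the construction, assuming the layout from the original Transformer paper (sine in even dimensions, cosine in odd ones, base 10000); the post itself does not fix these details, and `shift_matrix` is a helper name I made up. It checks two things: shifting by k positions is a fixed linear map, and the dot product of two encodings depends only on their offset.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model, base=10000.0):
    # Geometrically spaced frequencies: 1, base^(-2/d), base^(-4/d), ...
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    angles = np.arange(n_positions)[:, None] * freqs        # (n, d_model/2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe, freqs

def shift_matrix(k, freqs):
    # Block-diagonal matrix M_k with PE(p + k) = M_k @ PE(p) for every p.
    # Each 2x2 block is the angle-addition formula for one frequency.
    d_model = 2 * len(freqs)
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

pe, freqs = sinusoidal_encoding(n_positions=128, d_model=16)

k = 5
M = shift_matrix(k, freqs)
# One matrix, independent of p, maps every encoding to the one k steps later.
shift_is_linear = np.allclose(pe[:-k] @ M.T, pe[k:])

# Positions (3, 10) and (50, 57) are both 7 apart, so the dot products match.
offset_only = np.isclose(pe[3] @ pe[10], pe[50] @ pe[57])

# Wavelengths run from about 2*pi tokens up to thousands of tokens.
wavelengths = 2 * np.pi / freqs
print(shift_is_linear, offset_only)
```

The matrix `M` depends on k but not on p, which is the sense in which a linear layer could, in principle, learn to attend by relative offset.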

5.3 Relative positional encodings

Section 5.3 makes a small but important step. What attention really needs, often, is not the absolute position of each token but the relative distance between pairs. "The verb is three tokens after the subject" is more useful than "the subject is at position 7."

The section introduces relative position encodings — variants where the position information is folded directly into the attention scores rather than added to the token embeddings. The math is small. The effect is significant: better generalization to sequence lengths the model didn't see in training.
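
To make "folded directly into the attention scores" concrete, here is a sketch of one simple variant: a scalar bias per clipped relative offset, added to the scores before the softmax. This is my illustration, not necessarily the scheme the book presents; `bias_table` stands in for learned parameters and `max_distance` is a hypothetical clipping hyperparameter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_bias(n_tokens, bias_table, max_distance):
    # offsets[i, j] = j - i, clipped so distant pairs share one table entry.
    idx = np.arange(n_tokens)
    offsets = idx[None, :] - idx[:, None]
    clipped = np.clip(offsets, -max_distance, max_distance)
    return bias_table[clipped + max_distance]       # (n_tokens, n_tokens)

def attention_with_relative_bias(Q, K, V, bias_table, max_distance):
    # Position enters only through the score bias; embeddings are untouched.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores + relative_bias(Q.shape[0], bias_table, max_distance)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_head, max_distance = 8, 4
bias_table = rng.normal(size=2 * max_distance + 1)  # stand-in for learned biases

# The bias depends only on j - i: entries (0, 2) and (3, 5) coincide.
bias = relative_bias(6, bias_table, max_distance)
same_offset_same_bias = bias[0, 2] == bias[3, 5]

# The same table applies unchanged to a sequence of any length.
Q, K, V = (rng.normal(size=(50, d_head)) for _ in range(3))
out_long = attention_with_relative_bias(Q, K, V, bias_table, max_distance)
print(same_offset_same_bias, out_long.shape)
```

Because the table is indexed by offset rather than by absolute position, nothing in it is tied to a particular sequence length, which is one intuition for the generalization claim above.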

5.4 Rotary positional embeddings (RoPE) and geometric rotation

Section 5.4 is, for many readers, the highlight of the chapter. Rotary positional embeddings — RoPE — are the position encoding that has quietly become standard in modern open-source models.

The chapter derives RoPE from scratch. The intuition is geometric: instead of adding a position vector to a token embedding, RoPE rotates the query and key vectors by an angle proportional to their position. The dot product of a rotated query and a rotated key then depends on their positions only through the offset between them, so the same pair of vectors scores the same at any two positions the same distance apart. The model sees relative distance for free, baked directly into the attention computation.
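
For a single two-dimensional slice, the core identity can be written out as follows. This is a sketch in my own notation, not the book's: the slice of the query and of the key are treated as complex numbers q and k, θ is the rotation rate assumed for that slice, and m and n are the two token positions.

```latex
% Rotating by position is multiplication by a unit complex number.
% The real inner product of two complex numbers a, b is Re[a \bar{b}].
\begin{aligned}
f(q, m) &= q\, e^{i m \theta}, \qquad f(k, n) = k\, e^{i n \theta}, \\
\langle f(q, m),\, f(k, n) \rangle
  &= \operatorname{Re}\!\left[ q\, e^{i m \theta}\, \overline{k\, e^{i n \theta}} \right] \\
  &= \operatorname{Re}\!\left[ q\, \bar{k}\, e^{i (m - n) \theta} \right].
\end{aligned}
```

The absolute positions m and n cancel, and only the offset m - n survives in the score.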

The math is striking once you see it: a few complex-number identities and the whole construction falls out. The chapter walks through the derivation gently, with a small numerical example, and shows why RoPE has come to dominate the field.
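
A small numerical check in the same spirit, written as a sketch under assumptions: adjacent dimensions are paired (0 with 1, 2 with 3, and so on), the frequencies use base 10000, and `rope` is an illustrative function name. Real implementations differ in how they pair dimensions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    # Rotate each adjacent pair (x[2i], x[2i+1]) by the angle position * freq_i.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (4 tokens apart) at two very different places in the sequence.
score_near_start = rope(q, 3) @ rope(k, 7)
score_far_along = rope(q, 103) @ rope(k, 107)
same_offset_same_score = np.isclose(score_near_start, score_far_along)

# A different offset generally gives a different score.
score_other_offset = rope(q, 3) @ rope(k, 8)
different_offset_differs = not np.isclose(score_near_start, score_other_offset)

# Rotations preserve length, so RoPE changes direction, not magnitude.
norm_preserved = np.isclose(np.linalg.norm(rope(q, 42)), np.linalg.norm(q))
print(same_offset_same_score, different_offset_differs, norm_preserved)
```

Note that nothing is added to the embeddings here: position enters only when the query and key are rotated, just before the dot product.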

5.5 Positional encoding through the lens of Fourier analysis

The final section ties the whole chapter together with a step back. Sinusoidal encodings, RoPE, even relative position — they are all, in different language, doing Fourier analysis. They are decomposing position into a sum of oscillating components at different frequencies, and letting the model learn how to weight them.

This is not a coincidence. Position is, in a real mathematical sense, the dual of frequency. Once you see that, a lot of the literature snaps into place — why some position schemes generalize to longer sequences, why others don't, why the frequency spacing matters, and why future innovations in this area will likely look familiar to anyone who has worked with signal processing.
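
One way to see the connection on paper, sketched in my own notation rather than the book's: write ω_j for the sinusoidal frequencies and θ_j for the RoPE rotation rates, with j indexing the sine/cosine pair (or the 2-D slice), and q_j, k_j for the complex-valued slices of the query and key.

```latex
\begin{aligned}
\mathrm{PE}(p) \cdot \mathrm{PE}(p')
  &= \sum_{j} \bigl[ \sin(\omega_j p)\sin(\omega_j p') + \cos(\omega_j p)\cos(\omega_j p') \bigr]
   = \sum_{j} \cos\bigl(\omega_j (p - p')\bigr), \\
\langle R_m \mathbf{q},\, R_n \mathbf{k} \rangle
  &= \sum_{j} \operatorname{Re}\!\left[ q_j\, \bar{k}_j\, e^{i (m - n) \theta_j} \right].
\end{aligned}
```

Both right-hand sides are trigonometric sums in the offset alone. In the sinusoidal case every frequency carries weight one; in the RoPE case the coefficients q_j k̄_j come from the token content, which is the sense in which the model learns how to weight the frequencies.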

Worth holding onto: positional encoding is the place where the geometry of attention meets the geometry of waves. The choice of position scheme is a choice about which frequencies the model can see.

What Chapter 5 sets up

You finish Chapter 5 with attention now able to read sequences, not just sets — and with enough Fourier intuition to evaluate the next position-encoding paper that crosses your feed. From here, we put the attention layer and the position machinery together with the other half of a transformer block: the feed-forward network.

Next — Chapter 6: Transformer Blocks and Representation Power. Attention plus feed-forward plus residuals — the recipe of the modern transformer block. Why this particular combination is more expressive than either piece alone, and what depth and width actually buy you mathematically.

Want the full picture? The book derives RoPE in full, with the complex-number identities laid out, and includes diagrams of the position-frequency decomposition that make the Fourier connection concrete.