Chapter 7 — Efficiency and Transformer Variants
Seventh post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The closing chapter of Part II — the one where the math meets the hardware.
The wall attention runs into
The attention formula we derived in Chapter 4 has a property that didn't matter on small sequences and matters enormously on large ones: its compute and memory grow with the square of the sequence length. Double the context, and the attention layer needs four times as much of everything.
For a chat with 200 tokens, you don't notice. For a context window of 200,000 tokens, you notice immediately — and quite possibly run out of GPU memory before you finish a single forward pass.
Chapter 7 is about that wall. What the math actually says about the cost. What modern systems do to push the wall further out. And the family of "transformer variants" that change the attention formula itself in clever ways.
7.1 Computational complexity of attention
Section 7.1 derives the cost in detail. For a sequence of length n with embedding dimension d, computing the attention matrix QKT takes O(n²·d) operations. The softmax adds another O(n²) on top. The weighted sum with V costs another O(n²·d). The total — O(n²·d) — is fine when n is in the hundreds, painful in the thousands, and prohibitive past that.
The memory cost is even worse in practice. The intermediate attention matrix itself is n × n, and on modern GPUs that matrix has to live in memory until softmax completes. At long context lengths, this matrix becomes the bottleneck before the compute does.
7.2 GPU memory and throughput mathematics
Section 7.2 is where the book sets aside attention for a moment and unpacks the hardware. GPUs have a fast, small on-chip memory (SRAM), a slower but much larger high-bandwidth memory (HBM), and a third tier of even slower CPU/disk memory. Performance is dominated by how often you have to move data between these tiers.
The section introduces the concept of arithmetic intensity — the ratio of compute to memory access — and the roofline model that predicts, for a given algorithm and a given GPU, whether you'll be limited by compute or by memory bandwidth. The conclusion that matters here: for naive attention at modern context lengths, you are memory-bound long before you are compute-bound.
This is the mathematical setup that makes the next section possible.
7.3 FlashAttention and memory-aware computation
FlashAttention is one of the most important practical innovations in recent transformer history. Section 7.3 derives it carefully.
The insight is small and beautiful. Standard attention materializes the full n × n attention matrix in HBM before doing the softmax. FlashAttention reorganizes the computation so that softmax is done block by block, in small tiles that fit entirely in SRAM, with a clever running-statistics trick that lets you keep the result mathematically equivalent to standard softmax without ever holding the whole matrix at once.
The result is the same numbers — bit-for-bit, modulo floating point — but with dramatically less HBM access. On real hardware, that often means 2–4× speedups for long-context attention with no model changes. The chapter walks through the recursion that makes this possible, and shows why it works.
7.4 Multi-query attention and gated architectures
Section 7.4 moves from changing how attention is computed to changing what attention is.
Multi-query attention (MQA) keeps the multi-head structure for queries but shares a single key and value across all heads. This reduces the size of the KV cache — the working memory of inference — by a factor equal to the number of heads, which translates directly to lower latency and lower memory cost during generation. Group-query attention (GQA) sits between standard multi-head and MQA, sharing keys and values among groups of heads.
The section also covers gated architectures, where the residual contributions of attention and feed-forward layers are modulated by learned gates — letting the model decide, dynamically, how much to use each layer.
7.5 Low-rank and approximate attention methods
Section 7.5 closes the chapter with the wider universe of approximation. If attention is O(n²) and that's the problem, then any factorization that approximates it well in O(n·log n) or O(n) is, in principle, useful.
The chapter surveys the main families. Linear attention rewrites softmax attention as a kernel computation that can be reorganized to be linear in n. Performer uses random feature maps to approximate the softmax kernel. Sparse attention restricts each token to attending only to a small fixed pattern of others. Sliding-window attention restricts to local context. Each one trades exactness for cost. The chapter shows what the math costs you, and where each approximation is — and isn't — a good idea.
What Chapter 7 sets up
You finish Chapter 7 with a clear sense of the costs the rest of the field has been fighting against, and the toolkit of standard answers. That closes Part II. From here, the book turns to a different kind of mathematics — the optimization theory that explains how models learn, and the scaling laws that predict what they will be capable of.
Next — Chapter 8: How Models Learn. The first chapter of Part III. Why overparameterized models generalize at all (which they were not "supposed" to), the implicit bias of gradient-based optimization, the empirical scaling laws that predict capability before training even starts, and the open mathematical questions that remain.