Chapter 6 — Transformer Blocks and Representation Power
Sixth post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. We zoom out from a single attention layer to the full transformer block — and ask, mathematically, what kinds of functions it can compute.
Why this chapter exists
Chapters 4 and 5 built attention and positional encoding. That is half a transformer block. The other half, the feed-forward network, gets less attention in popular writing, and that's a mistake. A transformer built from attention alone would be a remarkably weak computer, because each attention layer mostly re-mixes existing token representations through weighted averages. The feed-forward layer is where most of the per-token nonlinear computation happens.
Chapter 6 derives that pair from first principles, and then asks the deeper question: what kinds of functions can a stack of these blocks represent? The answer turns out to be both reassuring and surprising.
6.1 Feed-forward networks in transformers
The feed-forward sublayer inside a transformer block is a simple two-layer neural network, applied to each token independently with the same weights at every position. One linear transformation up to a wider dimension, a nonlinearity, and one linear transformation back down. That's it.
Section 6.1 walks through the dimensions carefully. The inner dimension is typically 4× the model dimension. That ratio is not magic: it is a tradeoff between expressivity and compute that has held up empirically across many models. The chapter shows what each matrix is doing, why per-token application (rather than per-sequence) is the right design, and how the feed-forward layer interacts with the residual stream we met in Chapter 4.
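To make the shapes concrete, here is a minimal NumPy sketch of a position-wise feed-forward sublayer with a residual connection. It is an illustration, not the book's code: the toy sizes, the random initialization, the choice of the tanh-approximated GELU, and the function names are all assumptions made for this example, and normalization is left out.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (an assumed choice for this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN. x has shape (seq_len, d_model)."""
    h = gelu(x @ W1 + b1)   # up-projection: (seq_len, d_ff)
    return h @ W2 + b2      # down-projection: (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5     # toy sizes, chosen only for illustration
d_ff = 4 * d_model          # the usual 4x inner dimension

W1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))

# The FFN output is added back onto the residual stream.
y = x + feed_forward(x, W1, b1, W2, b2)

# Per-token application: running token 2 on its own gives the same
# result as picking row 2 out of the full-sequence computation.
row_alone = feed_forward(x[2:3], W1, b1, W2, b2)
row_in_batch = feed_forward(x, W1, b1, W2, b2)[2:3]
per_token = bool(np.allclose(row_alone, row_in_batch))
print(y.shape, per_token)
```

Because the same two matrices are reused at every position, the sublayer's parameter count does not depend on sequence length, and no token can see any other token inside it.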
6.2 Activation functions and nonlinearity
Section 6.2 is short but important. The nonlinearity inside the feed-forward layer — historically ReLU, more recently GELU and SwiGLU — is what allows a stack of linear operations to compute something other than another linear operation.
The section derives why: without a nonlinearity, the composition of two linear layers is a single linear layer, so the feed-forward sublayer would be no more expressive than one matrix multiplication, however wide its hidden layer. With a nonlinearity, a wide enough two-layer network can approximate any continuous function on a bounded domain to arbitrary accuracy, in principle. The choice of nonlinearity affects training dynamics, gradient flow, and final quality, and the chapter walks through why each successive popular choice replaced the last.
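The collapse argument is easy to check numerically. The sketch below, with arbitrary toy sizes and random weights chosen for this post, shows that two stacked linear layers equal a single matrix, and that inserting a ReLU breaks a property every linear map must have, namely f(-a) = -f(a).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 16, 4   # toy sizes, chosen only for illustration

W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, d_out))
x = rng.normal(size=(10, d_in))

# Without a nonlinearity, two layers are exactly one layer
# whose weight matrix is the product W1 @ W2.
two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
collapses = bool(np.allclose(two_layers, one_layer))

def relu(z):
    return np.maximum(z, 0.0)

def f(v):
    # Same two matrices, now with a ReLU in between.
    return relu(v @ W1) @ W2

# Any linear map satisfies f(-a) == -f(a), i.e. f(a) + f(-a) == 0.
a = x[0]
odd_symmetry_holds = bool(np.allclose(f(a) + f(-a), 0.0))

print(collapses, odd_symmetry_holds)
```

With the ReLU in place, f(a) + f(-a) works out to |a @ W1| @ W2, which is nonzero for generic weights, so no single matrix can reproduce f.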
6.3 Why "attention + FFN" works
Section 6.3 is where the chapter gets ambitious. It argues that attention and feed-forward layers are not just two useful modules placed next to each other — they are complementary in a deep mathematical sense.
Attention computes a structured weighted average of token representations. By itself, it cannot perform arbitrary nonlinear transformations. The feed-forward layer can approximate a wide range of nonlinear transformations of a single token, but it cannot route information between tokens. Alternate the two, layer after layer, and each covers what the other lacks: the resulting block is dramatically more expressive than either piece alone.
The chapter sketches this with a clean small example: a task that pure attention provably cannot solve, a task that pure feed-forward provably cannot solve, and the combined block that handles both.
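The book's own example is not reproduced here. As a stand-in, the sketch below checks the two structural facts the argument rests on, using a single unmasked attention head and a ReLU feed-forward layer with random toy weights (all of which are assumptions for this illustration). First, every attention output is a weighted average of the value vectors. Second, changing one token leaves the feed-forward output at other positions untouched, while the attention output does change.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single head, no mask, no output projection (kept minimal on purpose).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights, v

def token_wise_ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
seq_len, d = 4, 6                            # toy sizes
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
W1 = rng.normal(scale=d ** -0.5, size=(d, 4 * d))
W2 = rng.normal(scale=(4 * d) ** -0.5, size=(4 * d, d))

out, weights, v = self_attention(x, Wq, Wk, Wv)

# Fact 1: attention weights are non-negative and sum to 1 per row, so each
# output coordinate lies between the smallest and largest value coordinate.
rows_sum_to_one = bool(np.allclose(weights.sum(axis=-1), 1.0))
within_value_range = bool(
    np.all(out <= v.max(axis=0) + 1e-9) and np.all(out >= v.min(axis=0) - 1e-9)
)

# Fact 2: perturb the last token and look at position 0.
x_perturbed = x.copy()
x_perturbed[3] += 1.0
ffn_token0_unchanged = bool(np.allclose(
    token_wise_ffn(x, W1, W2)[0], token_wise_ffn(x_perturbed, W1, W2)[0]
))
attn_token0_changed = not np.allclose(
    out[0], self_attention(x_perturbed, Wq, Wk, Wv)[0][0]
)

print(rows_sum_to_one, within_value_range, ffn_token0_unchanged, attn_token0_changed)
```

Attention moves information between positions but only by averaging; the feed-forward layer transforms freely but only within a position. That is the division of labor the section describes.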
6.4 Expressivity of transformer architectures
Section 6.4 takes the question to its mathematical conclusion. What is the class of functions a stack of transformer blocks can approximate?
The section walks through the main results from the literature. Under reasonable conditions, a sufficiently deep transformer is a universal sequence-to-sequence function approximator: informally, for any continuous function from input sequences to output sequences on a bounded domain, some transformer approximates it as closely as you like. The proofs are technical, but the chapter unpacks the intuition: attention provides the routing, feed-forward provides the computation, and a deep stack of both can implement a very broad range of algorithms.
This is a result that should be held lightly, by the way. Universal approximation tells you what is possible in principle. It says nothing about what is learnable in practice, what is efficient, or what generalizes. The book is careful about that distinction.
6.5 Depth, width, and universal approximation
The chapter ends with the practical question every engineer asks. Given a compute budget, should I build a deep narrow model or a shallow wide one?
Section 6.5 lays out the math. Width buys expressivity within a layer — more dimensions, more directions in which tokens can store information. Depth buys composition — each layer can refine what previous layers produced. There is a tradeoff, and the literature has converged on shapes that balance them. The chapter shows what the scaling laws (which we'll meet in Chapter 8) imply about the optimal shape, and why the answer has changed as compute has gotten cheaper.
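A quick back-of-the-envelope count shows why the tradeoff is real. The sketch below assumes standard multi-head attention (four d_model × d_model projections) and a 4× feed-forward layer, and it ignores biases, normalization, and embeddings. The two configurations are hypothetical, picked only because they land on the same weight budget.

```python
def block_params(d_model, ffn_ratio=4):
    """Approximate weight count for one transformer block."""
    attention = 4 * d_model * d_model                  # W_Q, W_K, W_V, W_O
    feed_forward = 2 * ffn_ratio * d_model * d_model   # up and down projections
    return attention + feed_forward

def stack_params(n_layers, d_model, ffn_ratio=4):
    return n_layers * block_params(d_model, ffn_ratio)

# Two hypothetical shapes with the same weight budget.
deep_narrow = stack_params(n_layers=48, d_model=1024)
shallow_wide = stack_params(n_layers=12, d_model=2048)

print(deep_narrow, shallow_wide)   # both 603979776
```

Under these assumptions a block costs about 12 × d_model² weights, so doubling the width costs as much as quadrupling the depth. The parameter count alone cannot say which shape is better; that is the question the scaling-law results address.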
What Chapter 6 sets up
You finish Chapter 6 with a complete picture of one transformer block — attention plus feed-forward plus residuals plus normalization — and a sense of what stacking many of them buys you mathematically. From here, Part II ends with one chapter on what happens when these blocks meet real hardware.
Next — Chapter 7: Efficiency and Transformer Variants. The closing chapter of Part II. Attention is O(n²) in sequence length, and that's a problem at modern context lengths. We'll work through the math of GPU memory and throughput, derive FlashAttention from first principles, and survey the family of clever variants — multi-query, low-rank, gated — that keep big models running.