Chapter 4 — The Transformer Architecture

This is Part 4 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we saw why self-attention replaced recurrence as the dominant neural architecture for language. Today we open up the Transformer itself — the specific design that took attention from a clever idea to the foundation of every modern LLM.

A Transformer is a stack

The first thing to know about the Transformer is that it's modular. The architecture consists of a single building block, called a Transformer layer or Transformer block, repeated many times in a stack. Modern LLMs typically stack dozens of these layers, and the largest have more than 100. Every layer has exactly the same internal structure; what changes is what each one has learned to do as the input passes through.

You can think of the stack as a refinement pipeline. The first few layers tend to handle low-level patterns — token identity, basic syntactic relationships. Middle layers handle more abstract structures — phrase-level meaning, references, basic inference. Higher layers handle very abstract relationships — overall topic, tone, task framing. By the time the text has passed through the entire stack, each token has been enriched with context drawn from across the whole input.

Key idea: A Transformer is one block repeated dozens of times. The architecture is much simpler than the model's outputs would suggest. The depth, and what each layer learns through training, is what produces the capability.
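To make the "one block, repeated" idea concrete, here is a minimal sketch in Python with NumPy. The function toy_block is a hypothetical stand-in for a real Transformer block (it is only a small residual update), and the sizes are made up for illustration; the point is just the loop.

```python
import numpy as np

def toy_block(x, weight):
    """Hypothetical stand-in for one block: same structure, its own weights."""
    return x + np.tanh(x @ weight)  # the update keeps the shape unchanged

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 6, 4, 8  # illustrative sizes, not from any real model

# One weight matrix per layer: identical structure, different values.
layer_weights = [
    rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)
]

x = rng.normal(size=(n_tokens, d_model))  # one vector per token
for weight in layer_weights:
    x = toy_block(x, weight)  # the output of one layer feeds the next

print(x.shape)  # (4, 8)
```

Because every layer maps a (tokens, dimensions) array to an array of the same shape, layers can be stacked as deep as the training budget allows.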

Inside the block: attention and a feedforward network

Each Transformer block has two main pieces. The first is multi-head self-attention — multiple attention computations running in parallel, each one learning to attend to a different kind of relationship. One head might learn to track subject-verb agreement; another might track which pronoun refers to which noun; a third might track topical coherence. None of these are programmed; they emerge as side effects of training.

The second piece is a feedforward network — a small standard neural network that operates on each token independently. After attention has mixed information across tokens, the feedforward step lets the model do per-token processing, applying whatever transformation it has learned to each enriched token representation.
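"Operates on each token independently" can be checked directly. The sketch below assumes a two-layer network with a ReLU activation and a 4x wider hidden layer; those choices are illustrative assumptions (real models vary in activation and width), and the weights are random rather than learned.

```python
import numpy as np

def feedforward(x, w1, b1, w2, b2):
    """Two-layer network applied to the last axis of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU, assumed here for simplicity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden = 5, 8, 32  # illustrative sizes
w1 = rng.normal(size=(d_model, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=(d_hidden, d_model))
b2 = rng.normal(size=d_model)

x = rng.normal(size=(n_tokens, d_model))

# Run the whole sequence at once, then one token at a time.
whole = feedforward(x, w1, b1, w2, b2)
one_by_one = np.stack([feedforward(row, w1, b1, w2, b2) for row in x])

print(np.allclose(whole, one_by_one))  # True
```

The two results match because no step in the feedforward network looks across tokens; all cross-token mixing happens in attention.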

Both pieces are wrapped in two mechanisms that matter for stability: residual connections (which add each piece's input back to its output, so information can skip past the piece unchanged) and layer normalization (which keeps the numbers in a stable range across the depth of the stack). Without them, a stack as deep as a modern LLM is very hard to train.
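Here is a rough sketch of how that wrapping can look in code. It assumes the "pre-norm" arrangement (normalize, apply the piece, add the result back), which is one common choice and not the only one, and it leaves out the learned scale and shift that real layer normalization usually carries. The names transformer_block and do_nothing are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to roughly zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, feedforward):
    """Pre-norm wiring (an assumption): normalize, apply, add back."""
    x = x + attention(layer_norm(x))    # residual connection around attention
    x = x + feedforward(layer_norm(x))  # residual connection around the FFN
    return x

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(4, 8))  # deliberately badly scaled

normalized = layer_norm(x)

def do_nothing(h):
    """A sublayer that contributes nothing, to expose the residual path."""
    return np.zeros_like(h)

passed_through = transformer_block(x, do_nothing, do_nothing)
print(np.allclose(passed_through, x))  # True
```

With sublayers that output zeros, the block returns its input untouched. That is the residual path at work: each piece only has to learn a correction to add on top of what is already there.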

Self-attention, with a little more precision

Chapter 4 gives self-attention the careful treatment it deserves, including the math, but the mechanism can be described intuitively. Every token produces three vectors — called the query, the key, and the value. The query says "this is what I'm looking for." The key says "this is what I represent." The value says "this is what I'll contribute if you find me useful."

Attention works by comparing every token's query against every token's key (its own included), producing a matrix of similarity scores. Each row of scores is normalized with softmax into weights that sum to one, and each token's new representation becomes the weighted sum of all the tokens' values. The whole operation is a few lines of matrix algebra.

The book includes a six-line code sketch of this computation, because seeing it compactly in code makes it click for many readers in a way that the equations alone don't. The book also explains why each piece is there — why scaling by the square root of the dimension matters, why softmax, why three separate vectors instead of one.
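The book's own sketch isn't reproduced here. What follows is our own minimal version in NumPy, with a single attention head and random matrices standing in for learned weights; the names w_q, w_k, and w_v are our choices for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # every query against every key
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ v, weights                  # weighted sum of the values

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 16  # illustrative sizes
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (5, 16) (5, 5)
```

The division by the square root of the key dimension keeps the dot products from growing with vector size; without it, softmax tends to put nearly all its weight on one token, which makes learning harder.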

How the model knows word order

Self-attention has a property that sounds harmless but isn't: it doesn't naturally encode order. To the math, a sentence is an unordered set of tokens. Without intervention, "dog bites man" and "man bites dog" would look identical.
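You can see the order-blindness in a few lines. This sketch uses random vectors as made-up word embeddings and random projection matrices, with no positional information and no masking; under those assumptions each word ends up with the same output vector in both sentences.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d_model = 4  # illustrative size
embeddings = {word: rng.normal(size=d_model) for word in ("dog", "bites", "man")}
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def encode(sentence):
    """Return each word's output vector after attention."""
    words = sentence.split()
    x = np.stack([embeddings[word] for word in words])
    return dict(zip(words, self_attention(x, w_q, w_k, w_v)))

a = encode("dog bites man")
b = encode("man bites dog")

# Each word gets the same vector in both sentences.
same = all(np.allclose(a[word], b[word]) for word in a)
print(same)  # True
```

Shuffling the input only shuffles the output rows. Nothing in the result records who bit whom.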

Positional encoding fixes this by tagging each token with information about where it sits in the sequence. The original Transformer used a clever trick with sine and cosine waves at different frequencies. Modern variants use learned positional embeddings or rotary position encodings (RoPE) that handle long context lengths more gracefully. The details vary; the principle doesn't.
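As an illustration of the sine-and-cosine scheme, here is a short NumPy sketch. It follows the commonly cited formulation with a base of 10000 and assumes an even d_model; the function name sinusoidal_positions is ours.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Sine/cosine position vectors; assumes d_model is even."""
    positions = np.arange(n_positions)[:, None]     # shape (n_positions, 1)
    pair_index = np.arange(0, d_model, 2)[None, :]  # 0, 2, 4, ...
    angles = positions / (10000.0 ** (pair_index / d_model))
    encoding = np.zeros((n_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

pe = sinusoidal_positions(n_positions=50, d_model=16)
print(pe.shape)  # (50, 16)

# Typical use: add the position vector to each token's embedding.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(50, 16))  # made-up embeddings
model_input = token_embeddings + pe
```

Each position gets a distinct pattern of values between -1 and 1, so two occurrences of the same word at different positions no longer look identical to the attention step.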

Important: The choice of positional encoding is one of the main factors limiting how far a model can attend reliably. Stretching a model to handle longer contexts than it was trained on is non-trivial, which is why every model has a stated context window, and why some are 4,000 tokens while others are over a million.

Encoder, decoder, or just decoder?

Early Transformer research produced three flavors. Encoder-only models like BERT are designed to read text and produce a deep representation; they're great for classification, embedding generation, and search. Decoder-only models like GPT are designed to generate text one token at a time; they're what powers most chat-style LLMs. Encoder-decoder models combine the two, with the encoder digesting input and the decoder generating output; they're useful for translation and structured tasks.
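One concrete difference between reading and generating shows up in the attention weights. In the usual setup, which we assume here and which the post above doesn't spell out, a decoder applies a causal mask so each token can only attend to itself and earlier tokens, while an encoder lets every token see the whole input. The function name attention_weights is hypothetical.

```python
import numpy as np

def attention_weights(scores, causal):
    """Softmax over scores, optionally hiding future positions."""
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # above the diagonal
        scores = np.where(future, -np.inf, scores)          # block future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))  # equal similarity everywhere, for illustration

encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)

print(encoder_style)  # every row spreads weight evenly over all 3 tokens
print(decoder_style)  # row i spreads weight only over tokens 0..i
```

With the mask in place, the first token attends only to itself, the second to the first two, and so on. That is what lets a decoder be trained to predict each next token without peeking at it.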

Today, decoder-only models dominate the consumer-facing AI market because the same machinery handles reading the prompt and writing the response. The distinction still matters when you're choosing a model for a specific job, and the book walks through when each type is the right tool.

The scaling story, and why it works

Chapter 4 closes by explaining how Transformers scale. As you increase parameters, training data, and compute together, in coordinated ratios, model performance improves in a remarkably predictable way. This empirical finding, known as scaling laws, is what justified the massive investments of the last several years. Within certain ranges, loss falls as a power law in model size: each doubling of parameters reduces the loss by a small, roughly constant fraction, far less than half. The relationship is so consistent that researchers can estimate a model's loss before it's trained.
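A little arithmetic shows what a power law implies. The sketch below assumes loss is proportional to N raised to the power of minus alpha, where N is the parameter count, and uses an exponent of 0.076, a value reported in one well-known published fit. Treat that number as illustrative: other studies fit different forms and exponents.

```python
ALPHA_N = 0.076  # illustrative exponent (an assumption, not a universal constant)

def loss_ratio(param_multiplier, alpha=ALPHA_N):
    """Factor by which loss changes when parameters are multiplied,
    assuming loss is proportional to N ** -alpha."""
    return param_multiplier ** -alpha

doubling = loss_ratio(2)

# Solve 2 ** (-alpha * k) = 0.5 for k, the number of doublings to halve loss.
doublings_to_halve = 1 / ALPHA_N

print(f"loss after doubling parameters: {doubling:.3f} of the original")
print(f"doublings needed to halve the loss: {doublings_to_halve:.1f}")
```

Under this assumed exponent, one doubling trims the loss by about 5 percent, and halving it would take roughly 13 doublings, which is thousands of times more parameters. Predictable is not the same as cheap.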

The book is careful to explain what scaling laws don't tell you — about emergent capabilities, about the marginal value of additional scale, and about the ways in which the simple "bigger is better" narrative breaks down. Modern frontier development is much less about brute scale and much more about quality of data, architectural tricks like mixture-of-experts, and clever training methods. That story continues in later chapters.

What Chapter 4 sets up

By the end of Chapter 4, you can read any modern LLM paper or technical announcement and place its claims correctly. You know what a Transformer block contains, why those components are there, and how the design trades off expressiveness against efficiency. The rest of the book builds on this without re-explaining it.

Next up — Chapter 5: Training Large Models. Tomorrow we look at how these architectures are actually trained: where the data comes from, what hardware does the work, what the optimization process looks like in practice, and why training a frontier model now takes months and costs hundreds of millions of dollars.

Want the full picture? The book treats the Transformer with the visual detail it deserves: block diagrams, attention flow charts, encoder/decoder topology comparisons, and the math explained in plain English alongside the equations. Grab LLM Primer I on Amazon →
