Chapter 3 — Neural Networks for Language
This is Part 3 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we framed language modeling as a probability problem and saw why old counting approaches couldn't scale. Today we look at the computational machinery that replaced them, and how it evolved into the design behind nearly every modern LLM.
What a neural network actually is
Strip away the imagery of brains and synapses for a moment. A neural network is a long mathematical recipe with millions or billions of internal knobs, each knob a number. You feed something in (a list of numbers representing your input), the recipe transforms it through a series of steps, and a list of numbers comes out the other end.
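To make the "recipe with knobs" picture concrete, here is a minimal sketch in Python with NumPy. The layer sizes, the random knob values, and the `forward` function name are all assumptions chosen for illustration, not anything taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "knobs": two weight matrices and two bias vectors.
# Sizes (4 -> 8 -> 3) are arbitrary, picked only for illustration.
W1 = rng.normal(size=(4, 8))   # input (4 numbers) -> hidden layer (8 numbers)
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 3))   # hidden layer (8 numbers) -> output (3 numbers)
b2 = np.zeros(3)

def forward(x):
    """Run the recipe: linear step, nonlinearity, linear step."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU activation
    return hidden @ W2 + b2

x = np.array([0.5, -1.0, 2.0, 0.1])  # a list of numbers goes in
y = forward(x)                        # a list of numbers comes out

n_knobs = W1.size + b1.size + W2.size + b2.size
print(y.shape, n_knobs)
```

This toy has 67 knobs. A production LLM follows the same pattern with billions of them.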
Training the network means showing it many examples and gently adjusting all the knobs — automatically, using a process called gradient descent — so that the output for each example gets a little closer to the answer you wanted. Repeat that process across billions of examples and you eventually have a network whose knob settings encode a remarkable amount of structure about whatever you trained it on.
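Here is a rough sketch of what "gently adjusting the knobs" looks like with a single knob. The toy data (answers that follow y = 3x), the learning rate, and the step count are assumptions made up for this example. Real training applies the same idea to billions of knobs at once.

```python
import numpy as np

# Toy data: the "right answers" follow y = 3x, so the ideal knob value is 3.0.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs

w = 0.0               # the single knob, starting from a poor setting
learning_rate = 0.01  # how gently each adjustment is made

for step in range(200):
    preds = w * xs
    # Gradient of the mean squared error with respect to w.
    grad = np.mean(2.0 * (preds - ys) * xs)
    w -= learning_rate * grad  # nudge the knob in the direction that lowers the error

print(round(float(w), 3))
```

Each pass moves `w` a little closer to 3.0. No single step is dramatic, but the error shrinks steadily.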
Chapter 3 spends time on the mechanics — embeddings, hidden layers, nonlinear activation functions, and the optimization process that updates the knobs. The book doesn't shy away from the ideas, but it explains every step so a reader without a math background can follow what's happening. If you can read a recipe, you can read this chapter.
Three shapes, and only one of them won
The history of neural networks applied to language is, broadly, the story of three architectural ideas. The first two were real advances, and each ran into a fatal limitation. The third one, self-attention, finally cracked the problem at scale.
The first shape is the feedforward network. You hand it a fixed-size chunk of input, it transforms that chunk, and it produces an output. Feedforward networks are excellent at many tasks, but they have a structural problem for language: language doesn't come in fixed-size chunks. A sentence might be three words or three hundred. A feedforward network has no graceful way to handle that variation.
The second shape is the recurrent neural network, or RNN. RNNs read text one token at a time, carrying a small summary, called a hidden state, forward from each step to the next. This loosely mirrors how a person reads, and it solved the variable-length problem. But RNNs had two new problems. The summary they carry forward gradually loses detail across long passages, so the model "forgets" things from earlier in the text. And because each step has to wait for the previous one, the work within a sequence can't be spread across the parallel cores of modern hardware, which made training at large scale painfully slow.
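The recurrent idea can be sketched in a few lines. This is a plain tanh RNN cell, one common textbook formulation. The weights are random stand-ins and the names (`W_xh`, `W_hh`, `run_rnn`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 6

# Hypothetical weights; a trained model would have learned these.
W_xh = rng.normal(scale=0.5, size=(embed_dim, hidden_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def run_rnn(token_vectors):
    """Read tokens one at a time, carrying a hidden state forward."""
    h = np.zeros(hidden_dim)
    for x in token_vectors:  # step t cannot start until step t-1 has produced h
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    return h

short_text = rng.normal(size=(3, embed_dim))    # a three-token input
long_text = rng.normal(size=(300, embed_dim))   # a three-hundred-token input

h_short = run_rnn(short_text)
h_long = run_rnn(long_text)
print(h_short.shape, h_long.shape)
```

Both inputs end up as a summary of the same size, which is how the RNN handles variable length. It is also why detail gets squeezed out: three hundred tokens have to fit into the same six numbers as three. The `for` loop is the sequential bottleneck.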
The third shape is self-attention, which abandoned the sequential approach entirely. Instead of carrying a summary forward, every token in the sequence directly looks at every other token in the sequence — all at once — and decides which ones matter. This solved the forgetting problem (every token has direct access to every other token) and the parallelization problem (the entire sequence can be processed simultaneously on a GPU). And it's the foundation of every Transformer-based LLM.
Why attention "changed everything"
That line gets thrown around a lot, and the famous 2017 paper that introduced the Transformer architecture set the tone with an equally bold title: "Attention Is All You Need." Chapter 3 takes care to explain what specifically changed.
Attention is, at heart, a routing mechanism. Each token broadcasts what it's looking for ("which other token has information about my subject?") and what it offers ("here's what I represent"). The math computes, for each token, a weighted average over the tokens in the sequence, with the weights determined by how well each one matches the asking token's query. The result is that every token, after passing through an attention layer, has been enriched with relevant information from everywhere else in the sequence.
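The routing described above can be sketched as scaled dot-product attention, the formulation used in the original Transformer paper. The random token vectors, the projection matrices, and the sizes are assumptions for illustration. The query/key/value naming is the standard vocabulary: a query is what a token is looking for, a key is what it is matched on, and a value is what it contributes.

```python
import numpy as np

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1 per row."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
tokens = rng.normal(size=(seq_len, d_model))  # stand-in token vectors

# Hypothetical projection matrices; in a real model these are learned.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = tokens @ W_q  # what each token is looking for
K = tokens @ W_k  # what each token is matched on
V = tokens @ W_v  # what each token contributes

scores = Q @ K.T / np.sqrt(d_model)  # how well each key matches each query
weights = softmax(scores)            # row i: how much token i draws from each token
enriched = weights @ V               # weighted average of values, one row per token

print(weights.shape, enriched.shape)
```

Row `i` of `weights` is token `i`'s routing table. It sums to 1 and says how much of every token's value flows into token `i`'s new representation.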
The deep reason this works is that it's both expressive and parallelizable. Expressive because it can model long-range dependencies — a token at position 1 can directly inform a token at position 1000. Parallelizable because all the weighted averages can be computed at once, as a matrix operation that modern hardware excels at. The combination is what unlocked the scaling era.
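Both properties can be checked in a small experiment. The sketch below uses random stand-in queries, keys, and values (an assumption for illustration) and compares a token-by-token loop with a single matrix operation. It then reads off the weight that the token at index 999 places on the token at index 0.

```python
import numpy as np

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d = 1000, 16
Q = rng.normal(size=(seq_len, d))  # stand-in queries
K = rng.normal(size=(seq_len, d))  # stand-in keys
V = rng.normal(size=(seq_len, d))  # stand-in values

# One token at a time: each token's weighted average computed separately.
looped = np.stack(
    [softmax(Q[i] @ K.T / np.sqrt(d)) @ V for i in range(seq_len)]
)

# All tokens at once: the same computation as two matrix multiplications.
weights = softmax(Q @ K.T / np.sqrt(d))
batched = weights @ V

same = np.allclose(looped, batched)
direct_weight = weights[999, 0]  # how much the last token draws from the first
print(same, direct_weight > 0)
```

The loop and the matrix form give the same numbers, but only the matrix form maps onto hardware built for large matrix multiplications. The nonzero `direct_weight` shows the long-range link: no intermediate summary sits between the first and last positions.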
What Chapter 3 sets up
By the end of Chapter 3, you have a working understanding of why earlier neural network designs hit a wall for language, and why attention broke through. You know what training a network actually means, mechanically. And you have the conceptual scaffolding to understand why the architecture in the next chapter — the Transformer — is built the way it is.
This is the chapter where most readers stop thinking of LLMs as a mysterious black box and start thinking of them as a specific kind of engineering. That shift is the whole point of the book.
Next up — Chapter 4: The Transformer Architecture. Tomorrow we open up the box. Self-attention, multi-head attention, positional encoding, layer stacks, and the design choices that determine whether you're looking at GPT, BERT, or something in between.