Chapter 2 — LLMs in Context: Concepts and Background

Second post of the chapter-by-chapter walkthrough of LLM Primer II: Language Models Through Mathematics. The bridge chapter — from Book I's plain-language story to Book II's math.

What this chapter is doing

Chapter 1 set up the symbols. Chapter 2 sets up the object. Before we start deriving attention or chasing softmax, we agree on what an LLM is, what it isn't, and what the core ideas mean — pretraining, parameters, scale, the special character of language as data.

If you read Book I, much of this will feel familiar. The chapter revisits the same destinations, this time with one eye on the math the rest of the book will use. It's the chapter where Book II earns the right to be called a sequel.

2.1 What is a large language model

The opening definition is the one we will lean on for the next three hundred pages: an LLM is a function that takes a sequence of tokens and returns a probability distribution over the next token. Everything else — generation, conversation, reasoning, code, translation — is what happens when you compose that function with itself, one token at a time.

The chapter is careful here. It distinguishes the model (the function) from the system (the model plus a sampler, a tokenizer, a context window, a tool layer). Confusing the two has been the source of more bad takes about LLMs than almost anything else.
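To make the model/system split concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the four-word vocabulary, the hand-written probability table, and the names `model` and `generate` are hypothetical, and the toy model looks only at the last token where a real LLM conditions on the whole context. The shape is the point: `model` maps tokens to a distribution, and `generate` is the system-side loop that samples from it one token at a time.

```python
import random

# Hypothetical toy vocabulary and hand-written probabilities.
# A real LLM computes these with a neural network over the full context.
VOCAB = ["the", "cat", "sat", "."]
NEXT_PROBS = {
    "the": [0.0, 0.9, 0.1, 0.0],
    "cat": [0.0, 0.0, 0.9, 0.1],
    "sat": [0.1, 0.0, 0.0, 0.9],
    ".":   [0.7, 0.1, 0.1, 0.1],
}

def model(tokens):
    """The model: a token sequence in, a next-token distribution out."""
    return dict(zip(VOCAB, NEXT_PROBS[tokens[-1]]))

def generate(prompt, steps, seed=0):
    """Part of the system: a sampler that applies the model repeatedly."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(steps):
        dist = model(tokens)
        words = list(dist.keys())
        probs = list(dist.values())
        # Draw one token according to the model's distribution.
        tokens.append(rng.choices(words, weights=probs, k=1)[0])
    return tokens

print(model(["the"]))
print(generate(["the"], steps=5))
```

Swapping the sampler (a different seed, or always taking the most likely token) changes the output without touching `model`, which is roughly the distinction the chapter draws.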

One line: an LLM is a conditional probability distribution over what comes next, given what came before. Everything else is engineering around that one fact.

2.2 Pretraining, parameters, and scale

Three words that get thrown around — and that mean specific things.

Pretraining is the long phase that consumes most of the compute budget. The model is shown vast amounts of text and asked, again and again, to predict the next token. No human labels anything in this loop. The signal comes from the data itself: the token that actually comes next in the text is the target, and the model is rewarded when its distribution puts probability mass on it.
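The usual way to score "puts mass on the right place" is the cross-entropy loss: the negative log of the probability the model assigned to the token that actually came next. The sketch below assumes that loss; the two example distributions and the name `next_token_loss` are made up for illustration.

```python
import math

def next_token_loss(predicted, target):
    """Cross-entropy at one position: -log(probability of the true next token)."""
    return -math.log(predicted[target])

# Two hypothetical predictions for the same position, where the text says "cat".
confident = {"cat": 0.90, "dog": 0.05, "sat": 0.05}
unsure = {"cat": 0.34, "dog": 0.33, "sat": 0.33}

print(next_token_loss(confident, "cat"))  # smaller loss: more mass on the target
print(next_token_loss(unsure, "cat"))     # larger loss: mass spread around
```

Lower loss means more probability on the observed token, so minimizing the average of this quantity over the corpus is what "rewarded" amounts to in practice.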

Parameters are the numbers inside the function. Modern frontier models can have hundreds of billions of them. Training is the act of nudging each one, very slightly, over an enormous number of update steps, until the overall function behaves the way you want.
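A rough sketch of what one "nudge" looks like, assuming plain gradient descent on a toy model whose only parameters are three logits. Real training involves far larger networks and more elaborate optimizers, and the names `loss` and `sgd_step` are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(params, target):
    """Cross-entropy of the softmax distribution against the target index."""
    return -math.log(softmax(params)[target])

def sgd_step(params, target, lr=0.1):
    """Move each parameter a small step against its gradient.

    For softmax plus cross-entropy, the gradient with respect to
    logit i is p_i - 1 if i is the target, and p_i otherwise.
    """
    probs = softmax(params)
    return [
        w - lr * (p - (1.0 if i == target else 0.0))
        for i, (w, p) in enumerate(zip(params, probs))
    ]

params = [0.0, 0.0, 0.0]  # the toy model starts out uniform
target = 2                # index of the token that actually came next
before = loss(params, target)
for _ in range(100):
    params = sgd_step(params, target)
after = loss(params, target)
print(before, after)  # the loss goes down as the parameters are nudged
```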

Scale is what happens when you grow all three things at once — parameters, data, compute. The book will spend an entire later chapter on the empirical laws that link them. Here, scale is introduced as the single most surprising fact of the modern era: making this same simple recipe larger has, repeatedly, produced qualitatively different behavior.
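To get a feel for how the three quantities move together, here is a back-of-the-envelope sketch. It relies on a widely used rule of thumb that is not taken from this chapter: training a dense transformer costs roughly 6 floating-point operations per parameter per training token. Treat it as an approximation only; `approx_training_flops` is a hypothetical helper and the model sizes are invented.

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# Invented sizes, purely for illustration.
small = approx_training_flops(n_params=10**9, n_tokens=2 * 10**10)
large = approx_training_flops(n_params=10**10, n_tokens=2 * 10**11)

print(small)
print(large)
print(large // small)  # 10x the parameters and 10x the data costs 100x the compute
```

Under this estimate, growing parameters and data together makes compute grow with their product, which is one reason the three are usually discussed as a package.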

2.3 Language as data

Section 2.3 makes a point that is easy to miss: language is not generic data. It carries structure that the rest of the book will exploit — sequentiality, long-range dependence, compositionality, an enormous combinatorial space of valid sentences over a small alphabet.
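The combinatorial point is easy to check with arithmetic. Over an alphabet of k symbols there are k to the power n possible strings of length n. The sketch below counts all strings, which is only an upper bound on the valid ones; the function name `count_sequences` and the 26-letter alphabet are assumptions for the example.

```python
def count_sequences(alphabet_size, length):
    """Number of distinct strings of a given length over an alphabet."""
    return alphabet_size ** length

# A lowercase 26-letter alphabet, as a stand-in for "a small alphabet".
for n in (1, 5, 10, 20):
    print(n, count_sequences(26, n))
```

Only a sliver of those strings are meaningful sentences, and that gap between what is possible and what is valid is the structure a language model has to learn.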

This is where the chapter sneaks in a quiet bridge to Book II's math. Vector representations of words (the embeddings of Chapter 3) are the answer to a real engineering problem: how do you let a neural network see two synonyms as close, two opposites as far, and a sentence as the geometric arrangement of its parts? The chapter introduces the question. The math comes next.
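A small sketch of what "close" and "far" can mean once words are vectors, using cosine similarity as the measure. The two-dimensional vectors below are hand-picked to make the point and are not learned embeddings; the choice of cosine similarity is also an assumption here, since this chapter only poses the question.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors, NOT taken from any trained model.
emb = {
    "happy": [0.9, 0.8],
    "glad":  [0.8, 0.9],
    "sad":   [-0.9, -0.7],
}

print(cosine(emb["happy"], emb["glad"]))  # near 1: the synonyms point the same way
print(cosine(emb["happy"], emb["sad"]))   # negative: the opposites point apart
```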

2.4 Why transformers changed everything

The chapter closes with the 2017 paper "Attention Is All You Need," treated not as a piece of history but as a piece of architecture. Before the transformer, the dominant sequence models were recurrent and read text one token at a time, like a person reading aloud. With the transformer, every token can attend to every other token in parallel. That single change made it practical to train at the scale the rest of the field now depends on.

The chapter sketches what attention is doing in pictures (the equations come in Chapter 4). The promise: in twenty more pages, the same picture will be a system of equations you can derive, debug, and reason about.
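For readers who want a preview in code rather than pictures, here is a minimal sketch of scaled dot-product attention, the core operation from the 2017 paper. It is deliberately stripped down: a single head, no learned projections, no masking, and toy input vectors chosen by hand. The function name `attention` and its return shape are choices made for this example, not the book's notation.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: single head, no masking."""
    d = len(keys[0])
    outputs, all_weights = [], []
    # Each query is handled independently of the others,
    # which is what makes the computation easy to parallelize.
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors, used as queries, keys, and values at once.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token draws from every other token. The loop over queries has no dependence between iterations, so on real hardware it collapses into a few matrix multiplications.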

Worth holding onto: the transformer did not become dominant because it was cleverer. It became dominant because it was more parallelizable. Hardware shaped which architecture won. That fact will keep coming back through this book.

What Chapter 2 sets up

At the end of Chapter 2, you know exactly what object the next ten chapters will be doing math to. You know what its inputs and outputs are. You know what pretraining is for, what parameters are, what scale buys, and why transformers won. You are ready to open the toolbox.

Next — Chapter 3: Mathematical Tools for Language Models. The probability you need, the linear algebra you need, and embeddings as the first place those two meet inside an LLM. Short, dense, and the last preparation chapter before the math turns on in Part II.

Want the full picture? The book unpacks each of the four ideas in this chapter with worked examples and small diagrams, and ties each back to the math that arrives in later chapters.