Chapter 2 — Probability, Tokens, and Text
This is Part 2 of a series walking through LLM Primer I: How Generative AI Works. In yesterday's post on Chapter 1, we established what an LLM actually is: a guess-maker for text. Today we get specific about what that means.
Before the model sees anything, it sees numbers
Here's something most introductions to LLMs glide over: the model never sees your words. By the time your prompt reaches the model's first layer, it has been chopped into small pieces called tokens, and each token has been replaced by a number.
A token is often a whole word, but it can also be just a fragment of one. Common words like "the" or "language" are often a single token. Longer or rarer words get split into pieces: "tokenization" might become "token" + "ization", for example. This is why pricing for LLM APIs is measured in tokens rather than words, and why the same sentence in a different language can cost two or three times more to process.
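To make the chopping concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary and its ID numbers are invented for illustration. Real tokenizers learn vocabularies with tens of thousands of pieces from data, and they do not necessarily match greedily like this.

```python
# Hypothetical toy vocabulary: piece -> integer ID (invented for illustration).
TOY_VOCAB = {"the": 0, "language": 1, "token": 2, "ization": 3, " ": 4}


def tokenize(text, vocab):
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            # No piece matched; a real tokenizer would fall back to bytes.
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return ids


print(tokenize("the tokenization", TOY_VOCAB))  # [0, 4, 2, 3]
```

The list of integers is all the model ever receives. The word "tokenization" costs two tokens here because it is not in the toy vocabulary as a single piece.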
How the chopping happens — methods called Byte Pair Encoding, WordPiece, and a few others — gets a careful treatment in the book. Different LLM families use different schemes, which is one reason model outputs sometimes break in surprising places when you're working with code, math symbols, or non-Latin scripts.
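As a rough illustration of the idea behind Byte Pair Encoding, the sketch below repeatedly merges the most frequent adjacent pair of symbols in a tiny corpus. The corpus, its word counts, and the number of merges are all made up. Production implementations typically work on bytes, pre-split the text, and run many thousands of merges, so treat this as a sketch of the core loop only.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (invented): each word is a tuple of characters with a count.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # the two learned merge rules
print(words)   # after two merges, "low" has become a single symbol
```

The frequent stem "low" ends up as one symbol while the rarer endings stay split into characters, which is the same pattern as "token" + "ization" above.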
The whole thing is a guessing game
Once the prompt is tokenized, the model's task is shockingly simple to describe: produce a probability distribution over every possible next token. Not "the answer," not "the right token" — a distribution that says, in effect, "given everything I've seen so far, here's how likely each possible next token is."
If you give the model the prompt "the capital of France is", the probability of the next token being "Paris" will be very high, with smaller amounts of probability assigned to "the", "located", "currently", and so on. The model then samples one of those candidates, with the choice influenced by a setting called temperature: low values make it stick to the likeliest tokens, and high values let less likely ones through more often. The chosen token is added to the sequence. Then it does the whole thing again. Then again. One token at a time.
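Here is a minimal sketch of how temperature can reshape the distribution before a token is sampled. It assumes the model produces a raw score (often called a logit) for each candidate, which is then turned into probabilities with a softmax. The candidates and scores are invented and do not come from any real model.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates and scores for the prompt "the capital of France is".
candidates = [" Paris", " the", " located", " currently"]
logits = [6.0, 2.0, 1.5, 1.0]  # invented numbers

rng = random.Random(0)  # seeded so runs are repeatable
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    print(temperature, [round(p, 3) for p in probs], choice)
```

Lower temperatures pile even more probability onto " Paris", while higher temperatures flatten the distribution so the other candidates get picked more often.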
That's it. Every essay, every translation, every code snippet, every poem ever produced by an LLM is the result of this loop running over and over. The loop itself holds no plan, no outline, and no goal beyond producing the next plausible token.
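The loop can be sketched in a few lines. In this toy version the "model" is a hand-written lookup table that only looks at the previous token, which is a big simplification: a real LLM conditions on the whole sequence so far. The table, the token names, and the `<start>` and `<end>` markers are all invented for illustration.

```python
import random

# Hypothetical next-token probabilities keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.5, "<end>": 0.5},
    "sat": {"<end>": 1.0},
}


def generate(max_tokens=10, seed=0):
    """Sample one token at a time until <end> or the length limit is reached."""
    rng = random.Random(seed)
    sequence = ["<start>"]
    for _ in range(max_tokens):
        distribution = NEXT_TOKEN_PROBS[sequence[-1]]  # 1. get the distribution
        tokens = list(distribution)
        weights = list(distribution.values())
        next_token = rng.choices(tokens, weights=weights, k=1)[0]  # 2. sample
        if next_token == "<end>":
            break
        sequence.append(next_token)  # 3. append, then go around again
    return sequence[1:]


print(generate())
```

Nothing in `generate` knows where the sentence is going. Each pass only asks what comes next given what is already there.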
Chapter 2 spends real time on why this works at all. The fact that pure next-token prediction, given enough scale, produces something that looks like reasoning is not obvious. It is one of the most interesting empirical discoveries in modern AI, and the book takes care to explain why.
The old way versus the new way
Before neural networks dominated, language models worked by counting. If you wanted to predict the next word, you looked at the previous two or three words, found everywhere they appeared in your training corpus, and asked: how often did each possible word come next? This worked, sort of. It produced grammatical text, sometimes. But it had two crippling problems.
The first was sparsity. Most possible three-word combinations never appear in any training set, no matter how large. So the model had no opinion at all on most sequences. The second was generalization. The phrases "the dog chased the cat" and "the wolf chased the rabbit" share structure that humans see instantly, but a counting model treats them as completely unrelated. It learns nothing from one that applies to the other.
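A counting model of this kind fits in a few lines, and both problems show up immediately. This is a minimal sketch over a made-up one-sentence corpus, using the previous two words as context. Real n-gram systems added smoothing and backoff tricks that are left out here.

```python
from collections import Counter, defaultdict


def train_trigram_counts(tokens):
    """Map each pair of consecutive words to a Counter of the words that followed."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts


def next_word_probs(counts, context):
    """Relative frequencies of next words for a two-word context, or {} if unseen."""
    followers = counts.get(context)
    if not followers:
        return {}
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


# Toy corpus, invented for illustration.
corpus = "the dog chased the cat and the dog chased the ball".split()
counts = train_trigram_counts(corpus)

print(next_word_probs(counts, ("chased", "the")))   # {'cat': 0.5, 'ball': 0.5}
print(next_word_probs(counts, ("wolf", "chased")))  # {} because the context was never seen
```

The second lookup comes back empty even though "wolf chased" behaves just like "dog chased". The counts for one context tell the model nothing about the other.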
Neural language models fix both problems by learning patterns rather than memorizing combinations. They map every token to a list of numbers — an embedding — and then learn how those numbers transform across sequences. Two sentences with similar structure end up with similar internal representations, even if the model has never seen either specific sentence.
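To give a feel for what "similar internal representations" means, the sketch below compares a few hand-picked embedding vectors using cosine similarity, one common way to measure how closely two vectors point in the same direction. The vectors are invented and only three numbers long. Real models learn their embeddings during training and use hundreds or thousands of dimensions.

```python
import math

# Hypothetical embeddings, hand-picked for illustration (not learned).
EMBEDDINGS = {
    "dog": [0.9, 0.8, 0.1],
    "wolf": [0.8, 0.9, 0.2],
    "cat": [0.7, 0.6, 0.3],
    "tuesday": [0.1, 0.0, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["wolf"]))     # close to 1
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["tuesday"]))  # much lower
```

Because "dog" and "wolf" sit close together, whatever the network learns about one carries over to the other to some degree, which is exactly what the counting model could not do.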
Measuring how good the guesses are
Chapter 2 closes with two metrics you'll hear about constantly: entropy and perplexity. The book takes its time with these because they're easy to misunderstand. The short version, with apologies to anyone who's seen the equations:
Entropy is uncertainty. If the model is very confident about what comes next, entropy is low. If the model is genuinely unsure, entropy is high. Perplexity expresses that same uncertainty, measured against real text, as a single number: roughly, the number of equally likely options the model behaves as if it were choosing among. A lower perplexity means a model that's consistently less surprised by the text it's seeing. Comparisons are cleanest when the models share a tokenizer and are scored on the same text, since perplexity is computed per token.
You don't need to know the formulas to use these intuitions. When you read that "Model A has a perplexity of 4.2 on this benchmark," you can mentally translate: "At each position, Model A is on average about as uncertain as if it were choosing evenly among roughly 4 equally likely next tokens." When perplexity is 50, the model is much less sure, as if it were choosing among 50. That's enough to make sense of most research papers.
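For readers who do want to see the arithmetic, here is a minimal sketch using the standard definitions: Shannon entropy of a single distribution, and perplexity as the exponential of the average negative log-probability the model gave to the tokens that actually occurred. The probability values are invented.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of one next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def perplexity(true_token_probs):
    """exp of the average negative log-probability assigned to the actual tokens."""
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_nll)


confident = [0.97, 0.01, 0.01, 0.01]  # invented: one clear favorite
unsure = [0.25, 0.25, 0.25, 0.25]     # invented: four-way coin flip

print(entropy_bits(confident))  # low, well under 1 bit
print(entropy_bits(unsure))     # 2.0 bits

# If the model gives every actual next token a probability of 0.25,
# its perplexity is 4 (up to floating-point rounding).
print(perplexity([0.25, 0.25, 0.25]))
```

This is where the "choosing among about 4 tokens" reading comes from: a uniform guess over four options gives a perplexity of exactly 4.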
What Chapter 2 sets up
By the end of Chapter 2, you have a working mental model of the input-output loop that defines every LLM: text in, text chopped into tokens, probabilities computed, next token sampled, repeat. You know why this loop is mathematically tractable and what its limits are. And you have the vocabulary to read the rest of the book, and most LLM research, without getting tripped up.
This sets up the central question of the next few chapters: how does the model produce those probabilities? What's actually going on inside? That story starts tomorrow.
Next up — Chapter 3: Neural Networks for Language. We zoom into the computational machinery that does the actual work. How is a neural network put together? Why did earlier designs fail at language? And what does it mean to "train" billions of parameters?