1.2 Basics of Probability for Language Generation

Language is full of uncertainty. When you read the beginning of a sentence, you do not know exactly how it will end—but you have expectations. If a sentence begins with "Once upon a", you probably expect "time" more than "telescope". Large Language Models work in exactly the same way: they measure uncertainty and make predictions using probability.

In this section, we build a grounded understanding of probability as it applies to language modeling. You will learn how LLMs turn text into numbers, why probabilities naturally arise from patterns in language, and how the model uses these numbers to choose the next word. By the end, probability will not feel abstract or distant—it will feel like the most natural way to describe language generation.

Why Probability Is the Natural Tool for Language

Every word you say is chosen from a sea of alternatives. The word "mathematics" could follow "I love", but so could "music", "coffee", or "sleep". Language is rich, flexible, and rarely deterministic.

Probability gives us a way to quantify this uncertainty. Instead of guessing the next word outright, we assign a likelihood to every possible next token:

P(xₙ₊₁ | x₁, x₂, …, xₙ)

This expression reads:

“The probability of the next token given all previous tokens.”

It is the single most important expression in all of language modeling.
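
To see where such probabilities come from, here is a minimal Python sketch that estimates a conditional next-word distribution from counts in a made-up toy corpus. The same basic idea, applied at vastly larger scale with a neural network rather than a counting table, is what an LLM learns from its training data.

from collections import Counter

# A made-up toy corpus: the words that followed "Once upon a" in our imaginary texts.
continuations = ["time", "time", "time", "time", "midnight", "time", "time", "dream"]

counts = Counter(continuations)
total = sum(counts.values())

# Estimate P(next word | "Once upon a") from relative frequencies.
for word, count in counts.most_common():
    print(f'P("{word}" | "Once upon a") = {count / total:.3f}')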

Random Variables: Modeling Uncertain Outcomes

A random variable is a quantity whose value is not known in advance. In language modeling, the next token is always a random variable: the model does not know which word will appear, but it knows how likely each option is.

If we let the random variable X represent the next token, then:

X ∈ V

means X must be one element of the vocabulary V.

A probability distribution then assigns each token a value between 0 and 1, with all probabilities adding up to 1. For example:


P(X = "world" | "hello") = 0.62
P(X = "everyone" | "hello") = 0.18
P(X = "potato" | "hello") = 0.004
  

These numbers reflect the model’s learned expectations from data. The model is not “choosing” the next word—it is sampling from a distribution.
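
As a minimal sketch, such a distribution can be written down directly, using made-up numbers over a toy four-token vocabulary:

# A toy conditional distribution over a four-token vocabulary, given the context "hello".
# The numbers are made up for illustration, not taken from a real model.
next_token_probs = {
    "world": 0.62,
    "everyone": 0.18,
    "there": 0.196,
    "potato": 0.004,
}

# Every probability lies between 0 and 1, and together they sum to 1.
assert all(0.0 <= p <= 1.0 for p in next_token_probs.values())
print(f"total probability = {sum(next_token_probs.values()):.3f}")  # 1.000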

The Probability Distribution Over Tokens

To describe all possibilities at once, we use the notation:

P(X | context)

which means “the entire probability distribution over the vocabulary.” Each token has its own probability, and the model uses this distribution to generate text. This distribution is produced by the LLM’s internal computations—the attention layers, linear transformations, and neural activations all contribute to shaping it.
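
Concretely, the model's final layer assigns a raw score (a "logit") to every token in the vocabulary, and a softmax turns those scores into a probability distribution. The Python sketch below shows this step with made-up scores over a tiny five-token vocabulary:

import math

# Raw scores ("logits") that a model's final layer might produce for a tiny,
# made-up five-token vocabulary.
logits = {"world": 4.1, "everyone": 2.9, "there": 2.6, "potato": -1.3, "cheese": -2.0}

# Softmax: exponentiate each score, then divide by the sum of all exponentials.
# Subtracting the maximum first is a standard trick for numerical stability.
max_logit = max(logits.values())
exps = {t: math.exp(v - max_logit) for t, v in logits.items()}
norm = sum(exps.values())
probs = {t: e / norm for t, e in exps.items()}

print({t: round(p, 4) for t, p in probs.items()})
print(sum(probs.values()))  # 1.0, up to floating-point rounding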

A Concrete Example

Suppose the context is:

"The cat sat on the"

The model might assign:


P("mat" | context) = 0.42
P("floor" | context) = 0.19
P("ceiling" | context) = 0.0002
  

These probabilities reflect how often these phrases appeared in the training data and how strongly the model has internalized their patterns.
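
If you would like to see real numbers, the sketch below inspects GPT-2's next-token distribution for this context. It assumes the Hugging Face transformers library and PyTorch are installed; the exact probabilities depend on the model you load, but the shape of the result is always the same: one probability for every token in the vocabulary.

# Assumes: pip install torch transformers  (model weights download on first run)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The cat sat on the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # raw scores for the *next* token

probs = torch.softmax(logits, dim=-1)        # convert scores to a probability distribution
top = torch.topk(probs, k=5)                 # look at the five most likely tokens

for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r:>10}  {p.item():.4f}")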

The Chain Rule: Building Sentences Token by Token

A full sentence is generated by predicting each token, one at a time, conditioned on everything before it. Probability provides a natural mathematical tool for this through the chain rule.

For a sequence of tokens x₁ … xₙ:

P(x₁, x₂, …, xₙ) =
P(x₁) × P(x₂ | x₁) × P(x₃ | x₁, x₂) × ... × P(xₙ | x₁, …, xₙ₋₁)

This expression says:

“The probability of a full sentence is the product of the probabilities of each token, conditioned on those before it.”

This is exactly how LLMs generate text: one token at a time, updating their belief about the next token as the sentence unfolds.
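
A short Python sketch makes the chain rule tangible. The per-token conditional probabilities below are made up for illustration; a real model would compute each one from the preceding context.

import math

# Made-up per-token conditional probabilities for "The cat sat on the mat";
# a real model would compute each of these from the preceding context.
step_probs = [
    0.05,  # P("The")
    0.12,  # P("cat" | "The")
    0.30,  # P("sat" | "The cat")
    0.60,  # P("on" | "The cat sat")
    0.70,  # P("the" | "The cat sat on")
    0.42,  # P("mat" | "The cat sat on the")
]

# Chain rule: the probability of the whole sequence is the product of the conditionals.
sequence_prob = math.prod(step_probs)
print(f"P(sentence) = {sequence_prob:.6f}")  # roughly 0.000318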

Why Multiplying Probabilities Is So Powerful

Multiplying probabilities creates a structure that rewards coherent sequences and penalizes unlikely ones. Sentences like:

"The cat sat on the mat."

tend to have higher overall probability than:

"Mat the on sat cat the."

The chain rule encodes grammar, meaning, and natural structure into the mathematics of prediction, which is why LLMs can produce coherent language after being trained on massive amounts of text.

Sampling: Turning Probabilities Into Words

Once the model computes a probability distribution, it needs to choose the next token. This is done through a process called sampling.

The most common decoding strategies are:

  • Greedy decoding: choose the token with the highest probability.
  • Sampling: randomly choose a token, weighted by probability.
  • Top-k sampling: consider only the k most likely tokens.
  • Temperature scaling: adjust the “sharpness” of the distribution.

Each method balances creativity, coherence, and diversity differently. These choices are built on probability, so understanding probability is essential to understanding generation strategies.
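
The sketch below applies each of these strategies to a made-up next-token distribution, so you can see how they pick differently from the same probabilities:

import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# A made-up next-token distribution for illustration (the remaining probability
# mass would belong to the rest of the vocabulary in a real model).
probs = {"mat": 0.42, "floor": 0.19, "sofa": 0.15, "roof": 0.05, "ceiling": 0.0002}

# Greedy decoding: always pick the single most probable token.
greedy = max(probs, key=probs.get)

# Plain sampling: draw a token at random, weighted by its probability.
tokens, weights = zip(*probs.items())
sampled = random.choices(tokens, weights=weights, k=1)[0]

# Top-k sampling (k = 3): keep only the 3 most likely tokens, renormalize, then sample.
k = 3
top_k = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
norm = sum(top_k.values())
top_k = {t: p / norm for t, p in top_k.items()}
sampled_top_k = random.choices(list(top_k), weights=list(top_k.values()), k=1)[0]

# Temperature scaling: divide log-probabilities by T, then renormalize.
# T < 1 sharpens the distribution; T > 1 flattens it.
T = 0.7
scaled = {t: math.exp(math.log(p) / T) for t, p in probs.items()}
norm = sum(scaled.values())
scaled = {t: p / norm for t, p in scaled.items()}

print("greedy choice:   ", greedy)
print("random sample:   ", sampled)
print("top-k sample:    ", sampled_top_k)
print("temperature 0.7: ", {t: round(p, 3) for t, p in scaled.items()})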

Log-Probabilities: Working With Tiny Numbers

Multiplying many probabilities can produce extremely small numbers. For example:

0.4 × 0.2 × 0.1 × ...

quickly becomes tiny. Computers handle this by working with log-probabilities. The log of a small number is easier to represent and manipulate:

log(a × b) = log(a) + log(b)

which turns multiplication into addition—a far more stable operation. Don’t worry about the exact math yet; the key idea is that logs make computations more manageable.
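
A tiny experiment in Python shows why this matters: multiplying a few hundred small probabilities underflows to zero in ordinary floating point, while the equivalent sum of log-probabilities stays well within range.

import math

# Multiplying many small probabilities underflows to zero in floating point...
probs = [0.1] * 400           # 400 tokens, each with probability 0.1 (made up)
print(math.prod(probs))       # 0.0 -- the true value, 1e-400, is too small to represent

# ...while summing log-probabilities stays comfortably in range.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)               # about -921.03, i.e. 400 * log(0.1)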

Wrapping Up 1.2

In this section, we built a basic but powerful understanding of probability as it applies to language generation. You learned about random variables, probability distributions, conditional probability, the chain rule, and how sampling converts mathematical predictions into text.

We now understand how LLMs represent uncertainty. Next, we need a way to quantify that uncertainty—to measure how predictable or unpredictable a piece of text is.

That brings us to one of the most beautiful and important ideas in all of information theory: entropy.

Turn the page to Section 1.3 — Entropy and Information: Quantifying Uncertainty.


SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.