1.1 Getting Comfortable with Mathematical Notation
Mathematics is a language. Just like English or Japanese, it has grammar, structure, and conventions. And just like any language, once you understand how its symbols fit together, the ideas behind them become far easier to grasp.
In the context of Large Language Models (LLMs), mathematical notation is not a barrier—it is a shortcut. Instead of explaining an idea in a paragraph, a single expression can capture it with precision and clarity. As we move deeper into probability, information theory, and model behavior, we will rely on mathematical notation frequently. This section is dedicated to making sure you feel comfortable with it.
You do not need to be a mathematician to understand this material. The goal here is confidence and familiarity, not perfection. We will introduce each symbol carefully, explain why it matters, and show how it is used inside LLMs. By the end of this section, the notation used throughout the rest of this book will feel natural, intuitive, and even elegant.
The Building Blocks: Variables, Sequences, and Functions
Let us start with the simplest question: How do we represent text mathematically? LLMs never process raw text; they operate on numerical representations. To express these mathematically, we need a consistent way to define:
- tokens — the smallest units of text an LLM reads,
- sequences — ordered lists of tokens,
- functions — the model’s transformations from input to output.
Representing Tokens
A single token is often represented using a symbol such as x or t. For example:
x = "hello"
In mathematical notation, we treat this token as an element of a vocabulary set, usually written as:
x ∈ V
The symbol ∈ means “is an element of,” while V represents the vocabulary—the complete list of tokens an LLM knows. This could be 50,000 tokens or even more, depending on the tokenizer.
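To make this concrete, here is a minimal Python sketch of a toy vocabulary and the membership test x ∈ V. The five tokens and their integer IDs are invented for illustration; a real tokenizer's vocabulary is far larger.

```python
# A toy vocabulary V mapping tokens to integer IDs.
# These entries are invented for illustration; a real tokenizer
# has tens of thousands of them.
V = {"hello": 0, "world": 1, "dog": 2, "bites": 3, "man": 4}

x = "hello"

print(x in V)   # True  -> x ∈ V
print(len(V))   # vocabulary size |V| (5 here, ~50,000 in practice)
print(V[x])     # the integer ID the model actually sees
```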
Representing Sequences of Tokens
Tokens only become meaningful when arranged in order. To express sequences mathematically, we use the notation:
x₁, x₂, x₃, …, xₙ
Here:
- x₁ is the first token,
- x₂ is the second,
- and so on up to xₙ, the nth token.
The subscript serves as an index—it tells us the token’s position. Ordering is crucial because meaning changes with order. Compare:
"dog bites man"
vs
"man bites dog"
Mathematically, these are different sequences even though they contain the same tokens. LLMs learn this ordering structure through attention mechanisms, but the notation allows us to express it cleanly.
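The following sketch treats a sequence as an ordered Python list, using the two example sentences above. It shows that the same collection of tokens, arranged differently, yields different sequences; the list indexing mirrors the subscripts x₁ through xₙ.

```python
# Sequences x₁, ..., xₙ as ordered Python lists.
# Python lists are 0-indexed, so seq[0] corresponds to x₁ in the notation.
seq_a = ["dog", "bites", "man"]
seq_b = ["man", "bites", "dog"]

# Same tokens, different order -> different sequences.
print(sorted(seq_a) == sorted(seq_b))  # True: identical tokens overall
print(seq_a == seq_b)                  # False: ordering distinguishes them

x1, xn = seq_a[0], seq_a[-1]           # first and last tokens of the sequence
print(x1, xn)                          # dog man
```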
Functions and Mappings
A core idea of LLMs is that they are functions. They take a sequence of tokens and output a probability distribution over the next token. In notation:
f(x₁, x₂, …, xₙ) = probability distribution over V
The function f is the model. The input is the token sequence. The output is a set of probabilities—how likely each possible next token is. This will become incredibly important when we discuss text generation.
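Here is a minimal sketch of that idea: a toy function f that maps any token sequence to a probability distribution over a small vocabulary. The uniform distribution it returns is a placeholder, not how a real model computes probabilities.

```python
from typing import Dict, List

# A small stand-in vocabulary, invented for illustration.
V = ["hello", "world", "dog", "bites", "man"]

def f(tokens: List[str]) -> Dict[str, float]:
    """Map a token sequence to a probability distribution over V."""
    # Placeholder: assign equal probability to every vocabulary entry.
    # A real LLM would compute these values from learned parameters.
    p = 1.0 / len(V)
    return {token: p for token in V}

dist = f(["hello"])
print(dist)                                   # each token gets probability 0.2
print(abs(sum(dist.values()) - 1.0) < 1e-9)   # probabilities sum to 1
```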
Conditional Expressions: A Core Idea in Language Modeling
One of the most important mathematical ideas in language modeling is the notion of conditionality: the idea that the next token depends on the previous ones. We express this using:
P(xₙ₊₁ | x₁, x₂, …, xₙ)
This reads as:
“The probability of the next token xₙ₊₁, given the tokens x₁ to xₙ.”
The vertical bar | means “given that,” or “conditioned on.” This simple idea—predicting the next token given all previous tokens—is the heart of all LLMs.
Conditional notation lets us write extremely complex ideas succinctly. For example:
P("world" | "hello")
may be high, while:
P("potato" | "hello")
may be much lower. Once you grasp this idea, probability becomes a natural tool for modeling language.
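As a rough illustration, the sketch below estimates conditional probabilities from an invented bigram count table. The counts are made up; the point is only that P("world" | "hello") comes out high while P("potato" | "hello") comes out low.

```python
# Toy bigram counts: how often each token followed "hello" in some
# imagined corpus. The numbers are invented purely for illustration.
counts = {
    "hello": {"world": 80, "there": 15, "potato": 1},
}

def conditional_prob(next_token: str, prev_token: str) -> float:
    """Estimate P(next_token | prev_token) from the toy count table."""
    row = counts.get(prev_token, {})
    total = sum(row.values())
    return row.get(next_token, 0) / total if total else 0.0

print(conditional_prob("world", "hello"))   # ~0.833 (high)
print(conditional_prob("potato", "hello"))  # ~0.010 (low)
```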
Why Notation Matters for LLMs
As we go deeper into LLM mechanics—attention, embeddings, optimization—the notation we use in this section will appear over and over. These symbols will help us define:
- how uncertainty is expressed,
- how likelihoods are computed,
- how models update their internal representations,
- and how predictions are generated.
With this grounding, mathematical notation becomes not an obstacle, but a powerful lens through which to understand how LLMs relate structure, meaning, and uncertainty.
Wrapping Up 1.1
In this section, we established the mathematical language we will use throughout the rest of the book. You learned how tokens, sequences, functions, and conditional notation allow us to express the behavior of LLMs clearly and precisely. With these tools in hand, you are ready to understand one of the most important concepts in language modeling: probability.
Probability is the engine of generation. Every word an LLM outputs is the result of a probability distribution shaped by context, data, and mathematical structure.
Turn the page to begin Section 1.2 — Basics of Probability for Language Generation.