2.1 What Is a Large Language Model?

Large Language Models—or LLMs—have become central to modern artificial intelligence. They write essays, explain code, generate ideas, summarize long documents, translate between languages, reason through problems, and increasingly perform tasks that look remarkably close to human thinking. But beneath this growing list of capabilities, an LLM is ultimately something simple—and something very specific:

An LLM is a mathematical function that maps a sequence of tokens to a probability distribution over the next token.

This definition is compact, but it contains the key idea behind all modern language models. To understand what it means, we must unpack it carefully. What is a token? What is a sequence? What is a probability distribution? And why does predicting the next token turn out to be so powerful that it enables reasoning, creativity, and problem-solving?

This section answers those questions and gives you a clear picture of what an LLM is at its core—without assuming any prior knowledge of deep learning.

LLMs as Functions

When we say an LLM is a function, we mean that it behaves like a machine that takes an input and produces an output. The input is a sequence of tokens (such as words, subwords, or characters), and the output is:

a probability distribution over every possible next token.

Formally, we write:

f(x₁, x₂, …, xₙ) = P(xₙ₊₁ | x₁, x₂, …, xₙ)

Here:

  • x₁ … xₙ is the context (the tokens so far),
  • xₙ₊₁ ranges over every possible next token,
  • f is the model,
  • P is the probability distribution the model outputs.

This is the mathematical essence of an LLM. The “intelligence” we observe emerges entirely from the patterns learned through this mapping.
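
To make this concrete, here is a minimal, purely illustrative Python sketch of the function's shape. Everything in it is invented for the example: the vocabulary size, the random scoring, and the name next_token_distribution. A real LLM replaces the scoring step with a trained neural network, but the input and output have exactly this form.

import numpy as np

VOCAB_SIZE = 50_000  # hypothetical vocabulary size

def next_token_distribution(token_ids):
    # Toy stand-in for f(x1, ..., xn): returns P(next token | context).
    # A real model computes these scores (logits) with a trained network;
    # here we fake them with numbers derived from the context length.
    rng = np.random.default_rng(seed=len(token_ids))
    logits = rng.normal(size=VOCAB_SIZE)
    # Softmax turns arbitrary scores into a valid probability distribution.
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

probs = next_token_distribution([15496, 11, 995])  # some arbitrary token IDs
print(probs.shape, probs.sum())  # (50000,) and 1.0 (up to floating-point error)

The point of the sketch is the signature: a sequence of token IDs goes in, and a vector of probabilities (one entry per token in the vocabulary, summing to 1) comes out.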

What Exactly Is a Token?

LLMs do not operate directly on whole words or sentences. They use smaller units called tokens. A token might be:

  • a full word (“apple”),
  • a subword (“inter-” or “-tion”),
  • a punctuation mark (“.”),
  • a character (rare in modern models).

Tokens are chosen by a tokenizer—a preprocessing algorithm that splits text into pieces that are both compact and expressive. When we say an LLM sees text, we really mean:

It sees text as a sequence of numbers, each representing a token.
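
As a concrete illustration, the snippet below uses the open-source tiktoken library, one of many available tokenizers, to show how a short string becomes a sequence of integer IDs. It assumes tiktoken is installed; any other tokenizer would do.

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a byte-pair-encoding tokenizer used by several OpenAI models

text = "internationalization"
token_ids = enc.encode(text)
print(token_ids)                             # a short list of integer IDs
print([enc.decode([t]) for t in token_ids])  # the subword pieces those IDs stand for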

LLMs and Probability Distributions

Unlike deterministic programs, LLMs never “decide” the next token outright. Given a prompt such as “Hello”, they produce a probability distribution over possible next tokens like:


"world": 0.62
"everyone": 0.17
"there": 0.10
"potato": 0.0004
...
  

The model ranks all possible next tokens and then typically selects one through sampling or greedy decoding. This probabilistic nature is what gives LLMs flexibility and creativity. The same prompt can yield different responses because the model operates on uncertainty, not fixed rules.
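
The difference between greedy decoding and sampling is easy to see in code. The sketch below reuses the illustrative probabilities from the listing above (renormalized, since the full distribution is truncated); the numbers are not real model output.

import numpy as np

tokens = ["world", "everyone", "there", "potato"]
probs = np.array([0.62, 0.17, 0.10, 0.0004])
probs = probs / probs.sum()  # renormalize the truncated distribution

# Greedy decoding: always pick the single most likely token.
greedy_choice = tokens[int(np.argmax(probs))]

# Sampling: draw a token at random in proportion to its probability,
# so repeated runs can produce different continuations.
rng = np.random.default_rng()
sampled_choice = rng.choice(tokens, p=probs)

print(greedy_choice)   # always "world"
print(sampled_choice)  # usually "world", occasionally something else

Greedy decoding makes the model deterministic for a given prompt; sampling (often adjusted by parameters such as temperature or top-p) is what lets the same prompt yield different responses.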

LLMs as High-Dimensional Statistical Engines

LLMs are not symbolic systems. They do not follow explicit rules about grammar, logic, or meaning. Instead, they learn statistical regularities from data.

During training, a model processes billions or trillions of tokens and gradually learns:

  • patterns in how words relate to each other,
  • how ideas flow from one sentence to another,
  • how facts, reasoning steps, and explanations appear in text,
  • how structure emerges across paragraphs and documents.

These patterns are stored in the model’s parameters—the numerical values inside the neural network. The number of parameters is what makes a model “large,” but it is the structure of the model and the behavior learned from data that give it power.

What Makes a Model “Large”?

When people describe LLMs, they often refer to their size: 7B, 70B, 175B parameters. But what does “large” really mean?

In the context of LLMs:

  • Large refers to the number of parameters (weights) a model has.
  • This increases the model’s expressive capacity.
  • Larger models can store more patterns, relationships, and structures.

However—this is important—the number of parameters alone does not tell the full story. Training data size, compute, model architecture, and optimization strategy all matter just as much. A poorly trained 100B-parameter model can underperform a well-trained 20B model.
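
As a rough illustration of where such parameter counts come from, the sketch below uses a common back-of-the-envelope estimate: a decoder-only transformer has roughly 12 × n_layers × d_model² weights (about 4·d² for the attention projections plus 8·d² for a feed-forward block with a 4× expansion), ignoring embeddings and biases. The configurations shown are illustrative, not exact published architectures.

def approx_params(n_layers, d_model):
    # Back-of-the-envelope count for a decoder-only transformer:
    # ~4*d^2 per layer for the attention projections (Q, K, V, output)
    # plus ~8*d^2 per layer for a feed-forward block with a 4x expansion.
    # Embeddings and biases are ignored, so treat this as an order-of-magnitude guide.
    return 12 * n_layers * d_model ** 2

# Illustrative configurations, not exact published models.
for name, layers, width in [("~7B-class", 32, 4096), ("~70B-class", 80, 8192)]:
    print(f"{name}: ~{approx_params(layers, width) / 1e9:.1f}B parameters")

Running this gives roughly 6.4B and 64.4B, close to the familiar 7B and 70B labels once embeddings and other components are added back in.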

We will explore this in detail in the next section on scaling laws.

LLMs as Next-Token Predictors—But Much More

At first glance, predicting the next token may seem too simple to produce reasoning, analysis, or creativity. But this simplicity is an illusion. Because language encodes reasoning, structure, facts, and logic, learning to predict text at scale forces the model to learn these deeper patterns implicitly.

The next-token prediction objective is not a limitation—it is a doorway into general intelligence.

This is why LLMs can translate languages, follow instructions, or write code even if they were never explicitly trained for those tasks. The model learned the underlying linguistic and conceptual structures required to perform them.
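
For example, translation can be posed as ordinary next-token prediction simply by writing the task as text for the model to continue. The continuation shown is what a capable multilingual model would typically produce, not a guaranteed output:

Prompt (model input):   English: Hello, how are you?  French:
Likely continuation:    Bonjour, comment allez-vous ?

Nothing about the objective changed; the model is still just predicting the most plausible next tokens, and for this context the most plausible tokens happen to be a French translation.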

Why This Definition Matters

Understanding an LLM as a probabilistic function helps demystify how these systems work. It also clarifies why many common misconceptions about AI are inaccurate.

An LLM is:

  • not a database of facts,
  • not a rules-based system,
  • not a symbolic reasoning engine,
  • not a human-like mind,
  • but a mathematical predictor built on statistical patterns.

Later chapters will explore how attention mechanisms, embeddings, and optimization create these predictions. But for now, the key idea is simple and fundamental:

An LLM predicts tokens so well that the behavior looks like intelligence.

Wrapping Up 2.1

In this section, we defined a Large Language Model with clarity and precision. You now understand what an LLM is, how it represents text, why it uses probabilities, and why “next-token prediction” is a surprisingly powerful objective.

With this foundation, we can now explore the deeper ideas that give LLMs their capabilities: pretraining, parameters, and scaling laws. These concepts explain why models improve as they get larger, how they learn general-purpose knowledge, and how researchers discovered simple rules that govern the growth of AI systems.

Turn the page to begin Section 2.2 — Core Concepts Behind LLMs.

