1.3 Entropy and Information: Quantifying Uncertainty
Probability gives us a way to describe uncertainty. But to truly understand how Large Language Models (LLMs) reason about text, we need a way to measure uncertainty. This is where information theory enters the picture—one of the most elegant mathematical frameworks ever developed.
Originally introduced by Claude Shannon in 1948 as the foundation of digital communication, information theory now sits at the core of modern AI systems. Whenever you hear about “surprise,” “uncertainty,” “likelihood,” or “cross-entropy,” you are seeing Shannon’s ideas at work inside LLMs.
In this section, we explore entropy and information with clarity and intuition. You do not need a deep mathematical background—just a willingness to think about uncertainty in a new way. By the end of this section, you will understand not only what entropy is, but why it is so fundamental to how LLMs learn, predict, and understand language.
What Is Information?
Let’s begin with a simple idea:
Information measures how much uncertainty is reduced when an outcome occurs.
If something is predictable, it doesn’t give you much information. Consider these two statements:
- “The sun rose this morning.”
- “A rare astronomical event occurred today.”
The second conveys far more information because it is unexpected. The same is true in language. Compare:
"Happy birthday to you"
vs.
"Happy birthday to the quantum algorithm"
The second phrase contains more “surprise,” and therefore more information. Mathematically, information is highest when probability is low.
The Information of a Single Token
The information content of an event with probability p is defined as:
I = -log₂(p)
Let’s interpret this intuitively:
- If a token is very probable (p close to 1), I is small — low surprise.
- If a token is improbable (p close to 0), I is large — high surprise.
The negative log flips the probability scale so that “unlikely events give big information.” Using log base 2 expresses information in bits.
Examples
P("the") = 0.07 → Information ≈ 3.8 bits
P("porcupine") = 0.0001 → Information ≈ 13.3 bits
Rare words provide more information—this aligns perfectly with linguistic intuition.
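To make this concrete, here is a minimal Python sketch of the formula above applied to the two example probabilities; the probabilities are illustrative values, not taken from any real model.

```python
import math

def information_bits(p: float) -> float:
    """Information content (surprise) of an event with probability p, in bits."""
    return -math.log2(p)

# Illustrative probabilities from the examples above, not from a real model.
print(information_bits(0.07))    # ~3.84 bits for a very common token like "the"
print(information_bits(0.0001))  # ~13.29 bits for a rare token like "porcupine"
```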
Entropy: The Average Uncertainty
Entropy measures the average amount of uncertainty in a probability distribution.
Entropy is the expected surprise.
Mathematically:
H = - Σ p(x) log₂(p(x))
This says: multiply the probability of each token by its information, and add them all up. High entropy means:
- many plausible next tokens,
- high uncertainty,
- the model doesn’t know what comes next.
Low entropy means the model is confident about the next token.
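As a quick sketch, the entropy formula translates almost directly into code. The two distributions below are invented for illustration: one peaked, one uniform.

```python
import math

def entropy_bits(probs: list[float]) -> float:
    """Entropy in bits: H = -sum(p * log2(p)) over the distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented distributions over four possible next tokens.
print(entropy_bits([0.70, 0.20, 0.05, 0.05]))  # ~1.26 bits: one token dominates
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.00 bits: maximum uncertainty over four tokens
```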
Examples in Language
Low entropy context:
"Good morning, how are"
The next token is likely:
"you"
Few options → low uncertainty → low entropy.
High entropy context:
"The meaning of life is"
Many plausible completions:
- "unknown"
- "42"
- "a mystery"
- "to love"
- "complex"
Lots of options → high uncertainty → high entropy.
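Applying the same formula, we can attach made-up next-token distributions to the two contexts above and compare their entropies; both distributions are purely illustrative.

```python
import math

# Purely illustrative next-token distributions for the two contexts above.
low_entropy = [0.90, 0.05, 0.05]               # "Good morning, how are" → almost always "you"
high_entropy = [0.25, 0.20, 0.20, 0.20, 0.15]  # "The meaning of life is" → many live options

for name, dist in [("low-entropy context", low_entropy), ("high-entropy context", high_entropy)]:
    h = -sum(p * math.log2(p) for p in dist)
    print(f"{name}: {h:.2f} bits")  # ~0.57 bits vs. ~2.30 bits
```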
Why Entropy Matters for LLMs
Entropy is a central quantity in LLMs because it tells us:
- how predictable a context is,
- how “creative” or “open-ended” a prediction might be,
- how much information a model expects from the next token.
More importantly, entropy is directly connected to the loss function used during training: cross-entropy. Minimizing cross-entropy is equivalent to making the model better at predicting the next token—reducing uncertainty in its probability distribution.
Cross-Entropy: The Bridge Between Probability and Learning
When a model predicts a probability distribution over tokens but the true next token is known (during training), cross-entropy measures:
“How far is the model’s prediction from the correct distribution?”
Lower cross-entropy → better predictive performance. Higher cross-entropy → worse predictions.
Although we will explore cross-entropy in detail later, the key idea is simple:
Entropy tells us how uncertain language is. Cross-entropy tells us how well the model handles that uncertainty.
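As a preview, here is a minimal sketch of that idea: when the correct next token is known, the “correct distribution” puts all of its probability on that token, so cross-entropy reduces to the negative log probability the model assigned to it. The prediction below is invented for illustration.

```python
import math

def cross_entropy_bits(predicted: dict[str, float], true_token: str) -> float:
    """Cross-entropy in bits when all the true probability sits on true_token."""
    return -math.log2(predicted[true_token])

# Invented model prediction for the token after "Good morning, how are".
predicted = {"you": 0.85, "things": 0.10, "we": 0.05}

print(cross_entropy_bits(predicted, "you"))     # ~0.23 bits: confident and correct
print(cross_entropy_bits(predicted, "things"))  # ~3.32 bits: the model was surprised
```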
This connection between probability, information, and learning is what makes information theory so foundational to LLMs.
Information and Surprise in LLM Generation
When an LLM generates text one token at a time, it is constantly managing uncertainty:
- high entropy → multiple plausible tokens → diverse outputs,
- low entropy → predictable sequences → stable outputs.
This is why sampling strategies like temperature and top-k sampling affect creativity: they directly modify the entropy of the distribution before sampling.
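To make that concrete, here is a sketch of the standard softmax-with-temperature rescaling applied to a few invented logits, showing how the entropy of the resulting distribution grows as temperature rises.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then normalize with a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs: list[float]) -> float:
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.1]  # invented scores for four candidate tokens

for t in (0.5, 1.0, 2.0):
    print(f"temperature={t}: entropy={entropy_bits(softmax_with_temperature(logits, t)):.2f} bits")
```

Lower temperature sharpens the distribution (lower entropy, more predictable output); higher temperature flattens it (higher entropy, more varied output).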
Relatedly, perplexity—a common evaluation metric—is just another way of expressing the same quantity: it is the exponential of the cross-entropy. A model that achieves lower cross-entropy on a text also achieves lower perplexity, meaning it predicts that text better.
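As a small sketch of that relationship: perplexity is the exponential of the average cross-entropy, using base 2 here because we have been measuring information in bits. The per-token probabilities below are invented.

```python
import math

# Invented probabilities that a model assigned to each correct token in a short text.
token_probs = [0.40, 0.10, 0.65, 0.25]

# Average cross-entropy in bits, then perplexity = 2 ** (average cross-entropy).
avg_cross_entropy = sum(-math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** avg_cross_entropy

print(f"average cross-entropy: {avg_cross_entropy:.2f} bits")  # ~1.82 bits
print(f"perplexity: {perplexity:.2f}")                         # ~3.52
```

A perplexity of roughly 3.5 can be read as the model being, on average, about as uncertain as if it were choosing among 3.5 equally likely tokens.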
Wrapping Up Section 1.3
In this final section of Chapter 1, we explored entropy, information, surprise, and cross-entropy—the core quantities that allow LLMs to measure, evaluate, and learn from uncertainty. These ideas form the theoretical backbone of next-token prediction.
You now understand how LLMs quantify unpredictability, how rare tokens carry more information, why probability alone is not enough, and why information theory is essential for training and evaluation.
With Sections 1.1, 1.2, and 1.3 complete, you now have a full mathematical intuition for language modeling—from notation, to probability, to uncertainty itself.
Wrapping Up Chapter 1
Chapter 1 revealed something fundamental: language is uncertainty, and LLMs are mathematical systems built to manage it.
With this foundation in place, you are ready to move beyond the mechanics of tokens and probabilities and step into a broader perspective: what LLMs are, how they are trained, why they work, and how they fit into the history of language processing.
In the next chapter, we leave pure mathematics for a moment and explore the conceptual and architectural landscape of modern LLMs—how they evolved, why Transformers became dominant, and what makes today’s models so remarkably effective.
Turn the page to begin Chapter 2 — LLMs in Context: Concepts and Background.