Chapter 2 — LLMs in Context: Concepts and Background
Introduction
In Chapter 1, we built the mathematical foundation for understanding language modeling: notation, probability, uncertainty, entropy, and information. With these tools, we now shift our perspective from pure mathematics to the broader conceptual landscape of Large Language Models (LLMs). Before we explore attention mechanisms, optimization, or training dynamics in later chapters, we must first understand what LLMs actually are and why they work the way they do.
Modern LLMs—GPT, Claude, Gemini, Mistral, LLaMA, and others—are more than large neural networks trained on text. They are statistical systems shaped by pretraining, architecture, scaling laws, and decades of progress in natural language processing. This chapter connects those ideas into a coherent narrative that explains not only what LLMs do, but why they have become the cornerstone of today’s AI revolution.
The Purpose of This Chapter
Many introductions to LLMs either jump straight into architecture diagrams or stay at a surface level (“LLMs predict the next word”). This chapter instead aims to give you the context that unlocks deeper understanding. You will learn how LLMs fit into the history of NLP, how key ideas emerged, and what conceptual breakthroughs transformed them from research curiosities into general-purpose reasoning engines.
We will explore:
- what exactly defines a "Large Language Model,"
- why pretraining matters and how models acquire general knowledge,
- what model parameters are and why they matter (but not for the reasons many think),
- how scaling laws revealed predictable patterns of improvement,
- and how Transformers replaced earlier architectures by modeling relationships across an entire sequence in parallel through attention.
These ideas form the conceptual backbone for the rest of the book. Once you see the full landscape, later chapters—on attention, positional encodings, model efficiency, optimization, and large-scale training—will feel far more intuitive and grounded.
What This Chapter Covers
2.1 What Is a Large Language Model?
We begin with a precise but accessible definition of LLMs. Instead of simply saying “LLMs predict the next word,” we explore what that really means: how LLMs operate as high-dimensional statistical functions, how they map sequences to probability distributions, and what makes a model “large.” This section lays the conceptual groundwork for everything that follows.
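To make that picture concrete before we formalize it, here is a minimal sketch of the input-output contract: a sequence of token IDs goes in, and a probability distribution over a small vocabulary comes out. Everything in it is an assumption made for illustration; the toy vocabulary, the random weights, and the averaging step standing in for the real network are not taken from any actual model.

```python
import numpy as np

# Toy illustration (not a real LLM): a language model is a function that
# maps a sequence of token IDs to a probability distribution over the
# next token. Vocabulary, sizes, and weights below are made up.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]
V, d = len(vocab), 8                      # vocabulary size, hidden width

embed = rng.normal(size=(V, d))           # token embedding table
W_out = rng.normal(size=(d, V))           # output projection ("unembedding")

def next_token_distribution(token_ids):
    """Map a token sequence to P(next token | sequence)."""
    h = embed[token_ids].mean(axis=0)     # stand-in for the real network
    logits = h @ W_out                    # one score per vocabulary item
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

p = next_token_distribution([0, 1, 2])    # "the cat sat"
print({w: round(float(pi), 3) for w, pi in zip(vocab, p)})
```

What makes a production LLM different is not this contract but the size and structure of the function that implements it, which is where the rest of the chapter picks up.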
2.2 Core Concepts Behind LLMs
LLMs did not become feasible until several ingredients matured together; this section explains what those ingredients are. We cover the essential ideas that give LLMs their capabilities: pretraining on massive datasets, the role of parameters in capturing knowledge, and the scaling laws showing that larger models trained on more data predictably get better.
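To show what "predictably" means here, one widely cited parametric form of these scaling laws (popularized by the Chinchilla analysis; the constants are fitted empirically and vary between studies) writes expected loss as a sum of power-law terms in model size and data size:

```latex
% Symbolic form of a parametric scaling law; E, A, B, \alpha, \beta are fitted constants.
% N = number of model parameters, D = number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Read loosely: loss falls smoothly as either N or D grows, approaching an irreducible floor E, which is what turns scaling from a gamble into something that can be planned.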
2.3 Fundamentals of Natural Language Processing (NLP)
LLMs did not appear in isolation. They emerged from decades of NLP research: tokenization, embeddings, sequence modeling, and earlier architectures like RNNs and LSTMs. This section connects LLMs to the broader history of NLP and highlights why earlier approaches could not generalize as effectively.
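To give a flavor of why those earlier sequence models hit limits, the sketch below shows a deliberately minimal recurrent step, with made-up shapes and random weights rather than any published architecture. The point is structural: tokens are processed strictly one after another, and everything seen so far must be compressed into a single fixed-size hidden vector.

```python
import numpy as np

# Toy recurrent step (illustrative shapes only): pre-Transformer sequence
# models processed tokens one at a time, folding everything seen so far
# into a single fixed-size hidden state h.
rng = np.random.default_rng(0)
d = 8                                   # hidden / embedding width
W_xh = rng.normal(size=(d, d)) * 0.1    # input-to-hidden weights
W_hh = rng.normal(size=(d, d)) * 0.1    # hidden-to-hidden weights

def rnn_encode(token_vectors):
    """Sequential processing: each step depends on the previous one."""
    h = np.zeros(d)
    for x in token_vectors:             # cannot be parallelized over time
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h                            # entire history squeezed into one vector

sequence = rng.normal(size=(20, d))     # 20 stand-in token embeddings
print(rnn_encode(sequence).shape)       # (8,) - a single bottleneck vector
```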
2.4 Why Transformers Changed Everything
Here we explore the architectural breakthrough that redefined AI. Attention mechanisms allowed models to capture long-range structure and meaning without the bottlenecks of earlier architectures. We explain the conceptual shift from sequential processing to parallel attention—and why this mathematical idea became the foundation for all modern LLMs.
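As a small illustration of that shift, the sketch below implements scaled dot-product attention in its simplest form, a single head with no learned projections, masking, or multi-head machinery, so it is a toy version of the idea rather than a faithful Transformer layer.

```python
import numpy as np

# Minimal scaled dot-product attention: every position attends to every
# other position in one matrix operation, rather than step by step.
def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed over all positions at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # all pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted mix of all positions

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))    # 20 stand-in token embeddings, width 8
out = attention(X, X, X)        # self-attention: Q = K = V = X
print(out.shape)                # (20, 8) - one updated vector per position
```

Contrast the loop in the recurrent sketch above with the single matrix product here: all pairwise interactions are computed at once, and that difference is the conceptual shift this section unpacks.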
Why Chapter 2 Matters
Understanding the math of Chapter 1 is powerful, but without context it can feel abstract. Chapter 2 brings that math into the real world. Here we clarify:
- how probability becomes prediction,
- how parameterized functions become knowledge,
- how scaling laws shape model design,
- and why attention—and not merely size—made LLMs possible.
After this chapter, you will have a clear mental model of what LLMs are and why they work. This context makes every subsequent technical detail far more intuitive.
Where We Go Next
We begin our journey by addressing the most fundamental question in the entire LLM ecosystem:
What exactly is a Large Language Model?
The answer is more subtle—and far more interesting—than simply “a neural network that predicts the next word.” In the next section, we will build a precise, meaningful definition that connects mathematics, architecture, and intuition.
Turn the page to begin Section 2.1 — What Is a Large Language Model?