LLM Primer II — Language Models Through Mathematics: Series Introduction & Index

"What seems like magic is often just math we don't yet understand." Welcome to Book II in the LLM Primer series — and to the walkthrough that goes with it. Over the next fourteen days, one post per chapter, we'll open the machinery of modern language models and look at the math that makes them run.

Why Book II exists

Book I in this series — LLM Primer I: How Generative AI Works — was the plain-language on-ramp. It told the story of LLMs from the outside: what they are, what they do, where they trip, how systems are built around them. No math required. If you read along with that series in February, you already have the map.

Book II is the math. Not the kind that runs you through a wall in the second paragraph, but the kind that finally answers the questions Book I left ringing: what exactly is attention computing? what is softmax actually doing? why do larger models keep getting better in such a predictable way? what is the engineering really up against when it tries to scale?

And here's the promise this book makes — and the spine of every chapter that follows it. Every equation that matters is in the book, fully derived. But each one arrives wrapped in a story, an analogy, and a worked numerical example. You meet the idea first, in plain words. The symbols come after.

Who this is for: engineers, technical leaders, students, and the endlessly curious — anyone who wants to move from using these tools to genuinely understanding them. No PhD. Just curiosity and a little patience.

How LLM Primer II is organized

The book sits in four parts, and the order is doing real work.

Part I — Mathematical Foundations (Chapters 1–3). The vocabulary and the intuition. Probability without the panic. Vectors and embeddings the way they actually show up in language models. Information theory just deep enough to make entropy mean something when you meet it again later.

Part II — The Mathematics of Transformers (Chapters 4–7). The heart of the book. Attention derived from first principles. Position encodings — including a clean look at RoPE and the Fourier idea underneath it. Why the transformer block (attention + feed-forward) has the expressive power it has. And then the engineering side: complexity, memory, FlashAttention, low-rank tricks, the variants that make the math fit on real hardware.

Part III — Optimization, Training, Post-Training, Evaluation, and the Math of Measurement (Chapters 8–11). Where the math gets honest about what's still mysterious. Why overparameterized models generalize at all. Scaling laws. Data preprocessing and the quiet way it shapes everything. Mini-batches, parallelism, numerical precision — the mathematics that actually run inside a training cluster. Then the post-training pipeline that turns a pretrained model into an assistant — supervised fine-tuning, reward modeling, RLHF, DPO — and the math of measurement: evaluation, calibration, and inference-time decoding.

Part IV — Applications, Limitations, and the Road Ahead (Chapters 12–14). The book lands. Real-world use of LLMs. The hard limits — compute, energy, bias, ethics. And a closing chapter that maps the tools, libraries, and habits that turn understanding into practice.

Two appendices follow: a math cheat sheet, and a statistical perspective on LLMs that re-tells the whole story in the language of probability.

How to read these posts

Each daily post is a guided tour of one chapter — what question it's answering, what mathematical moves matter, and what intuition it leaves you with. Five to eight minutes each. Light on equations on the page, but honest about which ideas the book derives in full.

You can read this series cold, with no background. It will give you the shape of the math without putting any on your shoulders. If a chapter grabs you, the book itself goes the distance.

One thing worth saying up front: this is not a substitute for the book. It is a map to the book. The proofs, the worked examples, the diagrams, the exercises — those live in the chapters themselves.

The schedule — what arrives when

March 3 — Chapter 1: Mathematical Intuition for Language Models. Notation without intimidation, probability for generation, and entropy as a measurement of uncertainty.

March 4 — Chapter 2: LLMs in Context: Concepts and Background. What an LLM is, the core ideas behind it (pretraining, parameters, scale), language as data, and why the transformer changed everything.

March 5 — Chapter 3: Mathematical Tools for Language Models. The probability and linear algebra you actually need, with embeddings as the bridge between words and geometry.

March 6 — Chapter 4: Attention — The Core Mechanism. Self-attention from equations, geometry, and intuition; multi-head structure; softmax with its stability and temperature knobs; and attention seen as a kernel method.

March 7 — Chapter 5: Position, Order, and Sequence Structure. Sinusoidal encodings, relative position, RoPE, and a Fourier-eye view of the whole apparatus.

March 8 — Chapter 6: Transformer Blocks and Representation Power. Feed-forward layers, nonlinearity, why "attention + FFN" is the right recipe, and what depth and width actually buy you.

March 9 — Chapter 7: Efficiency and Transformer Variants. The complexity of attention, the GPU memory and throughput math, FlashAttention, multi-query attention, and the low-rank tricks that keep big models running.

March 10 — Chapter 8: How Models Learn. Generalization in overparameterized models, the implicit bias of gradient descent, scaling laws, and the open mathematical questions that remain.

March 11 — Chapter 9: Training at Scale. The mathematical consequences of data preprocessing, mini-batch and parallelism math, and the numerical precision that determines whether a training run survives.

March 12 — Chapter 10: Post-Training and Alignment Mathematics. Supervised fine-tuning, reward modeling, RLHF with a KL leash, and the elegant DPO derivation that collapses the whole pipeline into one supervised loss.

March 13 — Chapter 11: Evaluation, Calibration, and Inference. Perplexity and what it really measures, benchmark math and where benchmarks mislead, calibration, and the decoding strategies that shape the model's outputs at inference time.

March 14 — Chapter 12: Real-World Applications of LLMs. Text generation, summarization, QA, translation, reasoning — what the math implies about each task in practice.

March 15 — Chapter 13: Limitations, Risks, and Open Challenges. Compute and energy ceilings. Bias, ethics, societal impact. The honest list of what isn't solved.

March 16 — Chapter 14: Practical Knowledge for Engineers. How to keep going deeper, and the tools and libraries that turn understanding into shipping work.

Want the whole picture right now? LLM Primer II: Language Models Through Mathematics is the book this series is mapping — richly illustrated, with worked examples, end-of-chapter takeaways, exercises with solutions, a math cheat sheet, and a full glossary. View on Amazon →

See you tomorrow, with Chapter 1.