Chapter 11 — Cutting-Edge Research: MoE, Reasoning Models, and the New Scaling Axis

Published on: 2026-02-28
Last updated on: 2026-06-05
Version: 3

This is Part 11 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we covered safety, ethics, and trust. Today we look forward. Chapter 11 covers the research directions that have most shaped the field over 2024–2026, and one of them in particular, reasoning models, has changed how the field thinks about capability.


Mixture of Experts: production, not research

Until a couple of years ago, most widely used Transformer-based LLMs were dense: they activated every parameter for every input. A dense 70-billion-parameter model uses all 70 billion parameters to predict every next token. This is computationally wasteful, because most of the parameters aren't relevant to most inputs.

Mixture-of-Experts (MoE) architectures fix this. The model contains many specialized sub-networks, called experts, but only a few of them are activated for any given token. A small gating network, also called a router, decides which experts to call. The result is a model with an enormous total parameter count, which makes it capable, but bounded per-token compute, which makes it efficient.
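
To make the routing step concrete, here is a minimal sketch of top-k gating in plain Python. It is an illustration under stated assumptions, not the book's implementation: the experts are toy scalar functions standing in for sub-networks, the gate scores are made-up numbers, and the names route_top_k and moe_layer are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    gate_logits holds one score per expert for a single token.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    chosen = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in chosen)

# Toy experts: each scalar function stands in for a full sub-network.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

chosen = route_top_k(gate_logits, k=2)               # experts 1 and 3 are selected
output = moe_layer(3.0, experts, gate_logits, k=2)   # experts 0 and 2 never run
```

The point of the sketch is the last line: two of the four experts are never evaluated for this token, which is where the compute saving comes from.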

Key idea: MoE decouples capacity from compute. A model can be 600 billion parameters in total while activating only 30 billion per token. This is one of the main reasons frontier models have continued to improve while inference costs haven't grown proportionally.
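
The arithmetic behind that claim is short enough to write down. The sketch below uses the 600-billion and 30-billion figures from the paragraph above, plus one assumption of mine: the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token, which ignores attention and routing overhead.

```python
total_params = 600e9    # parameters stored in the model
active_params = 30e9    # parameters actually used for one token

active_fraction = active_params / total_params   # 0.05, i.e. 5% of the model

# Assumed rule of thumb: about 2 FLOPs per active parameter per token.
flops_per_token_moe = 2 * active_params
flops_per_token_dense = 2 * total_params         # a dense model of the same size

compute_ratio = flops_per_token_dense / flops_per_token_moe   # 20x
```

Note what the sketch does not say: all 600 billion parameters still have to be stored and served, so memory footprint does not shrink the way per-token compute does.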

The 2026 edition treats MoE as production reality rather than research because that's what it is now. Several major frontier model families ship MoE architectures. The book walks through how the routing works, what the load-balancing challenges are, and why this architectural pattern is likely to dominate for the foreseeable future.
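
The load-balancing problem is easier to see with numbers. If the router sends most tokens to one expert, the others sit idle and the model wastes capacity. One commonly used remedy is an auxiliary penalty in the style of the Switch Transformer work: multiply the fraction of tokens each expert receives by the average router probability for that expert, and sum. The sketch below assumes top-1 routing and made-up router outputs; the book's own treatment may differ.

```python
def load_balance_penalty(router_probs):
    """Return num_experts * sum_i(f_i * P_i) for top-1 routing.

    router_probs: one list of expert probabilities per token.
    f_i is the fraction of tokens dispatched to expert i,
    P_i is the mean router probability assigned to expert i.
    The value is 1.0 when load is perfectly even and grows as it skews.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatched = [0] * num_experts
    prob_mass = [0.0] * num_experts
    for probs in router_probs:
        winner = max(range(num_experts), key=lambda i: probs[i])
        dispatched[winner] += 1
        for i, p in enumerate(probs):
            prob_mass[i] += p
    f = [count / num_tokens for count in dispatched]
    P = [mass / num_tokens for mass in prob_mass]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Made-up router outputs for four tokens and two experts.
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]

balanced_penalty = load_balance_penalty(balanced)
collapsed_penalty = load_balance_penalty(collapsed)   # larger: expert 0 gets everything
```

In training, a small multiple of this penalty would be added to the main loss so the router is nudged toward spreading tokens across experts.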

Memory mechanisms

Standard LLMs have one kind of memory: parameters. Once training is done, the model's knowledge is fixed until the next training run. Research on retrieval and memory mechanisms tries to give models a second kind of memory — external, updatable, and queryable at inference time.

RAG, which we covered in Chapter 7, is the most common implementation, but it's part of a larger family. Differentiable memory modules allow gradient flow through retrieval operations, so the model can learn how to retrieve effectively. Long-context memory mechanisms compress earlier portions of the conversation so the model can effectively "remember" more than its context window allows. The book covers each direction and discusses what's mature versus speculative.
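
Here is a small sketch of what "external, updatable, and queryable at inference time" can look like in code. It is a toy: the three-dimensional vectors are hand-picked stand-ins for learned embeddings, the stored texts are placeholders, and the ExternalMemory class is a hypothetical name, not an API from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ExternalMemory:
    """A store that lives outside the model's parameters."""

    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def write(self, vector, text):
        # Updating knowledge is an append, not a training run.
        self.entries.append((vector, text))

    def query(self, vector, top_k=1):
        # Rank stored entries by similarity to the query vector.
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(vector, entry[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]

memory = ExternalMemory()
memory.write([1.0, 0.0, 0.0], "placeholder note A")
memory.write([0.0, 1.0, 0.0], "placeholder note B")
memory.write([0.9, 0.1, 0.0], "placeholder note C, added later")

hits = memory.query([0.9, 0.1, 0.0], top_k=1)
```

The query returns the note that was added last, even though no parameter of any model changed. That is the property parametric memory lacks.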

Native multimodality

The early multimodal models used separate encoders for vision and language, stitched together at a fusion layer. The current generation has moved toward something more elegant: tokenize images, audio, and video directly, and feed them through the same Transformer as text. The architecture doesn't know or care which kind of token it's processing.

This is why modern frontier models can smoothly mix modalities in a single conversation, why a model can look at a photo and describe it while continuing the previous text conversation, and why some models now accept video as a first-class input. The book walks through what this architectural shift implies for context budget, latency, and the kinds of tasks you can throw at these systems.
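
The context-budget implication can be sketched with simple arithmetic. The example below assumes a ViT-style scheme where an image is cut into fixed-size square patches and each patch becomes one token. The patch size of 16 and the 8,192-token window are illustrative assumptions, not the figures of any particular model, and real systems often resize or compress images first.

```python
def image_token_count(height, width, patch=16):
    """Tokens needed for one image if every patch x patch tile is one token."""
    rows = -(-height // patch)   # ceiling division
    cols = -(-width // patch)
    return rows * cols

def remaining_text_budget(context_window, images, patch=16):
    """Tokens left for text after the given (height, width) images are added."""
    used = sum(image_token_count(h, w, patch) for h, w in images)
    return context_window - used

tokens_per_image = image_token_count(512, 512)                      # 32 * 32 patches
budget_left = remaining_text_budget(8192, [(512, 512), (512, 512)])
```

Under these assumptions, two modest images consume a quarter of the window before any text is added, which is why image and video inputs have to be budgeted like any other tokens.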

Continual learning, honestly

Almost every shipped LLM is frozen at training time. Updating its knowledge means a full retraining or fine-tuning cycle. Continual learning is the research direction that tries to let models update their parameters incrementally, in production, without forgetting what they already knew.

This is harder than it sounds. The main obstacle is called catastrophic forgetting: when you train a neural network on new data, it tends to overwrite the patterns it learned from old data. Solving this reliably at scale remains an open problem. The book is honest about what's working and what isn't, and why most production systems still rely on retrieval rather than continual learning when they need up-to-date information.
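
Catastrophic forgetting shows up even in the smallest possible model. The sketch below is a deliberately tiny illustration, not a claim about how it plays out in a large network: a single weight is trained with plain SGD on one task, then on a second, conflicting task, and we measure what happens to the first. The tasks and hyperparameters are made up for the demonstration.

```python
def mse(w, data):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def sgd(w, data, lr=0.05, epochs=200):
    """Plain stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

# Two conflicting tasks: task A wants w = 2, task B wants w = -1.
task_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
task_b = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]

w = sgd(0.0, task_a)
loss_a_before = mse(w, task_a)   # close to zero: task A is learned

w = sgd(w, task_b)               # keep training, but only on task B
loss_a_after = mse(w, task_a)    # large: task A has been overwritten
```

Nothing in the second training run tells the model that task A still matters, so the weight simply moves to wherever task B wants it. Continual-learning methods are attempts to add that missing constraint.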

The new scaling axis: reasoning models

This is the section I'm most excited about in the 2026 edition. Over 2024–2026, a new family of models emerged, variously called reasoning models, chain-of-thought models, or inference-time scaling models. They've changed how the field thinks about capability.

The mechanism is straightforward in outline. A reasoning model is trained, typically through a combination of preference optimization and reinforcement learning on tasks with verifiable outcomes, to generate long internal chains of intermediate tokens before emitting its final answer. These intermediate tokens function as working memory. They let the model decompose problems, explore candidate approaches, check its own arithmetic or logic, and revise where it detects errors. In most products the user sees only the final answer, sometimes with a summary of the reasoning; the model used the intermediate trace to get there.
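
"Tasks with verifiable outcomes" has a concrete meaning: the training signal comes from checking the final answer, not from grading the reasoning. The sketch below shows the shape of such a check. Everything specific in it is an assumption made for illustration: the "FINAL:" delimiter, the function names, and the sample output are hypothetical, and real training pipelines are far more involved.

```python
def split_trace(output, marker="FINAL:"):
    """Split a model output into (reasoning_trace, final_answer).

    The marker is a hypothetical delimiter chosen for this sketch.
    If it is missing, the whole output is treated as the answer.
    """
    trace, _, answer = output.rpartition(marker)
    return trace.strip(), answer.strip()

def verifiable_reward(output, expected):
    """Reward 1.0 if the final answer matches the known result, else 0.0.

    Only the answer is checked; the trace is free to be long, messy,
    or self-correcting as long as it ends in the right place.
    """
    _, answer = split_trace(output)
    return 1.0 if answer == expected else 0.0

# Hypothetical output: the trace contains a mistake and a revision.
sample = "17 * 6 = 112. Check: 10 * 6 + 7 * 6 = 102, so revise. FINAL: 102"

trace, answer = split_trace(sample)
reward = verifiable_reward(sample, expected="102")
shown_to_user = answer   # the trace stays internal
```

Because the reward depends only on the outcome, the model is free to discover whatever intermediate behavior, such as checking and revising, makes correct outcomes more likely.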

What distinguishes this from ordinary "chain-of-thought" prompting is where the capability lives. Chain-of-thought prompting coaxes a general-purpose model into reasoning by shaping its prompt. Reasoning models are trained to reason — the behavior is built into the policy, not the prompt.

Important: Inference-time scaling changes the operational shape of the system. Latency and cost per request are no longer roughly predictable in advance. They can vary by an order of magnitude depending on how much reasoning the model decides to do. Application design must accommodate this variability, with streaming, cancellation, and timeout policies that pre-reasoning models rarely required.
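
A minimal sketch of what a streaming-plus-deadline policy might look like, using Python's standard asyncio module. The model is faked: fake_reasoning_stream is a hypothetical stand-in that yields chunks after a delay, and the step counts and deadlines are arbitrary. A real client library will have its own streaming interface.

```python
import asyncio

async def fake_reasoning_stream(num_steps, step_delay):
    """Hypothetical stand-in for a model that streams output chunks."""
    for i in range(num_steps):
        await asyncio.sleep(step_delay)
        yield f"chunk-{i}"

async def collect(stream, sink):
    """Append chunks to sink as they arrive, so partial output survives."""
    async for chunk in stream:
        sink.append(chunk)

async def answer_with_deadline(num_steps, step_delay, deadline):
    """Stream a response, cancelling it if the deadline passes."""
    received = []
    stream = fake_reasoning_stream(num_steps, step_delay)
    try:
        # wait_for cancels the inner task when the timeout expires.
        await asyncio.wait_for(collect(stream, received), timeout=deadline)
        return "complete", received
    except asyncio.TimeoutError:
        return "timed_out", received

# A short request finishes; a long one is cut off at the deadline.
fast = asyncio.run(answer_with_deadline(3, 0.01, deadline=1.0))
slow = asyncio.run(answer_with_deadline(50, 0.05, deadline=0.12))
```

The design choice worth noticing is that the partial output is kept when the deadline hits. What the application does with it, whether that is to show it, retry with a smaller reasoning budget, or fall back to a faster model, is a product decision.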

Capability can now be increased along two largely independent axes. The training axis determines what the model has learned from data. The inference axis determines how much deliberation the model applies to any particular input. A smaller model allowed to reason extensively can sometimes outperform a larger model answering in a single pass. This reframes the entire scaling-cost trade-off that has governed model selection.
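
One way to build intuition for the inference axis is a much simpler technique than trained reasoning: sample several answers and take a majority vote. The sketch below computes how accuracy changes with the number of samples. It rests on strong assumptions that I am making for illustration: samples are independent, each is correct with the same probability, and a strict majority is required. Real reasoning models do not work this way internally, but the trade of extra inference compute for accuracy is the same in spirit.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent samples are correct.

    p: chance that any single sample is correct (assumed identical per sample)
    n: number of samples; use an odd number to avoid ties
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(need, n + 1))

small_single = majority_vote_accuracy(0.6, 1)    # one pass: 0.6
small_five = majority_vote_accuracy(0.6, 5)      # five votes: about 0.68
small_many = majority_vote_accuracy(0.6, 25)     # twenty-five votes: higher still

large_single = 0.75   # hypothetical larger model answering once
small_model_wins = small_many > large_single
```

Under these assumptions, the hypothetical weaker model with 25 votes beats the stronger one answering once, at 25 times its own single-pass inference cost. That is the trade-off the paragraph above describes, reduced to a toy.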

Future directions

The book closes Chapter 11 with the open research questions. Efficiency — doing more with less compute. Reasoning — making the model more reliable at multi-step thinking. Alignment — keeping behavior good as capability grows. Architecture — whether the Transformer stays dominant or is replaced by something fundamentally different.

No single breakthrough is expected to dominate the next few years. Progress is likely to come from the integration of many techniques, each contributing a piece. That's a less satisfying narrative than "the next big thing," but it's the honest one.

What Chapter 11 sets up

By the end of Chapter 11, you understand the major research directions shaping the field today. You can read announcements about new frontier models and place their architectural claims correctly. You have a framework for thinking about what comes next — both what's likely and what's uncertain.


Next up — Chapter 12: Building Your Own LLM System. The final chapter of the book. Tomorrow we close the series with what it takes to actually construct an end-to-end LLM system — datasets, training pipelines, evaluation frameworks, the integrated stack, and the case-study patterns that successful deployments share.

Want the full picture? Chapter 11 in the book is substantially expanded in the 2026 edition, with dedicated sections on reasoning models and native multimodality that didn't exist in the first edition. Grab LLM Primer I on Amazon →

SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.