Chapter 7 — Beyond Next-Token Prediction

This is Part 7 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we looked at the full adaptation stack, from prompts to alignment. Today we extend the LLM beyond pure generation: embeddings, retrieval, hybrid memory, and the move into multimodal inputs.

Embeddings: meaning as geometry

If a Transformer's strength is that it produces rich internal representations for each token, the natural next question is: what if we used those representations directly, instead of as a step on the way to generating text?

That's the idea behind embeddings. An embedding model takes a piece of text (a word, a sentence, a paragraph, a document) and produces a fixed-length list of numbers, a vector with typically a few hundred to a few thousand entries, that captures its meaning. Two pieces of text with similar meanings produce vectors that lie close together, usually measured by cosine similarity. Two with unrelated meanings produce vectors that lie far apart.
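To make "similar number lists" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are hand-made toy values chosen for illustration, not output from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real model would produce hundreds of dimensions.
cancel_subscription = [0.9, 0.1, 0.0]
end_my_plan = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

similar = cosine_similarity(cancel_subscription, end_my_plan)
different = cosine_similarity(cancel_subscription, pizza_recipe)
print(f"similar meaning:   {similar:.2f}")
print(f"different meaning: {different:.2f}")
```

The first pair points in nearly the same direction and scores close to 1; the second pair is nearly orthogonal and scores close to 0.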

Once you have embeddings, you can do remarkable things with them. You can search for documents by meaning instead of by keyword: ask "how do I cancel my subscription" and find pages that talk about "ending my plan" or "discontinuing service" even though no word matches. You can cluster documents by topic without any labels. You can detect duplicates, find near-misses, and route queries to the right system.
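Search by meaning can be sketched as "rank documents by how close their vectors are to the query's vector." The example below assumes the embeddings have already been computed; the vectors in `DOC_VECTORS` and `query_vector` are made-up stand-ins for what an embedding model would return, and the `search` helper is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical precomputed embeddings, keyed by document title.
DOC_VECTORS = {
    "Ending my plan": [0.85, 0.15, 0.05],
    "Discontinuing service": [0.80, 0.25, 0.10],
    "Updating your billing address": [0.30, 0.90, 0.05],
    "Troubleshooting login errors": [0.05, 0.20, 0.95],
}

def search(query_vector, doc_vectors, top_k=2):
    """Return the titles of the top_k documents closest to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in doc_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]

# Stands in for the embedding of "how do I cancel my subscription".
query_vector = [0.9, 0.1, 0.0]
results = search(query_vector, DOC_VECTORS)
print(results)
```

Note that neither of the top two titles shares a word with the query. The match comes entirely from the geometry, which is the point of the technique.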

Key idea: Embeddings are the bridge between language models and search. They turn meaning into geometry, and once meaning is geometry, every standard search and clustering algorithm becomes available.

Generation versus retrieval

Generation and retrieval are often presented as competing approaches, but they're not. Generation invents text from internalized patterns. Retrieval selects existing text from a stored corpus. Each has its strengths.

Generation is creative, flexible, and capable of producing answers to questions no one has ever asked. It's also capable of confidently producing answers that are wrong — the model has no way to verify what it's saying. Retrieval is the opposite: limited to what's in the library, but grounded in real, verifiable source material.

The interesting move is to combine them. A model that retrieves first and then generates can produce fluent, on-topic, customized text while staying anchored in actual documents. This is the central design pattern that has emerged for production LLM systems.

Hybrid memory: the model plus a library

The book treats this as a major architectural concept rather than a single technique. The idea is to give the model two kinds of memory. Its parametric memory lives in its trained weights — broad, dense, but fixed at training time. Its non-parametric memory lives in an external store — narrow, specific, and updatable in real time.

When a query comes in, the system embeds it, searches the external store for relevant material, and passes both the original query and the retrieved material to the model. The model then composes an answer using both — its broad understanding of language and the specific, current information it just received.

This pattern has practical consequences. Updating what a system can answer no longer requires retraining; you update the external store. Citations become possible because the system knows which documents it drew from. Confidence calibration can improve too, because the system can check whether retrieval returned relevant context and say so when it did not.
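Here is a small sketch of the non-parametric side of that picture: an external store that can be updated at any time and that returns document IDs usable as citations. The `ExternalStore` class and document IDs are hypothetical, and `embed` is a deliberately crude bag-of-words stand-in for a learned embedding model (it only sees word overlap, not meaning).

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ExternalStore:
    """Non-parametric memory: documents can be added or replaced at any time."""

    def __init__(self):
        self.docs = {}  # doc_id -> (text, vector)

    def upsert(self, doc_id, text):
        self.docs[doc_id] = (text, embed(text))

    def search(self, query, top_k=1):
        query_vector = embed(query)
        scored = sorted(
            ((cosine_similarity(query_vector, vector), doc_id, text)
             for doc_id, (text, vector) in self.docs.items()),
            reverse=True,
        )
        # Returning doc_id alongside the text is what makes citations possible.
        return [(doc_id, text) for _, doc_id, text in scored[:top_k]]

store = ExternalStore()
store.upsert("pricing-v1", "The pro plan costs 20 dollars per month")
store.upsert("refunds-v1", "Refunds are issued within 14 days of purchase")
before = store.search("how much does the pro plan cost")

# The knowledge changes: overwrite the document. No model weights are touched.
store.upsert("pricing-v1", "The pro plan costs 25 dollars per month")
after = store.search("how much does the pro plan cost")
print(before, after, sep="\n")
```

The same query returns the old price before the update and the new price after it, with the source ID attached both times.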

RAG, more carefully

The most common implementation of this hybrid pattern is called Retrieval-Augmented Generation, or RAG. It's worth understanding the actual steps, because most production AI assistants you'll work with are RAG systems under the hood.

The flow is straightforward. First, you split your knowledge base (documentation, customer messages, internal wikis) into chunks, embed each chunk, and store the embeddings in a vector database. Second, when a query arrives, you embed it with the same model and find the top-k most similar chunks. Third, you assemble a prompt that includes the user's question and the retrieved chunks, and you send that to the model. Fourth, the model generates an answer using the retrieved material as grounding context.
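The four steps above can be sketched end to end in a few dozen lines. This is a toy under stated assumptions, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, a plain Python list stands in for the vector database, and `generate` is a hypothetical placeholder where a real system would call an LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: embed the knowledge base and store the vectors.
knowledge_base = [
    "To cancel your subscription, open Settings and choose Cancel plan.",
    "Invoices are emailed on the first day of each month.",
    "Password resets expire after 24 hours.",
]
index = [(chunk, embed(chunk)) for chunk in knowledge_base]

def retrieve(query, top_k=2):
    # Step 2: embed the query the same way and take the top-k closest chunks.
    query_vector = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query, chunks):
    # Step 3: assemble a prompt from the question and the retrieved chunks.
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def generate(prompt):
    # Step 4: hypothetical placeholder for the model call.
    return f"(model output for a prompt of {len(prompt)} characters)"

query = "How do I cancel my subscription?"
chunks = retrieve(query)
prompt = build_prompt(query, chunks)
answer = generate(prompt)
print(prompt)
```

Printing the prompt is a useful habit when debugging a system like this: if the right chunk is not in the context, no model can be expected to answer from it.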

Every step has subtleties that determine whether the system works well or poorly. Chunking — how you split your source documents — matters enormously. Reranking — how you choose which retrieved candidates actually go into the prompt — matters more than people realize. The book walks through what works and what doesn't, based on real deployments.
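As a rough illustration of those two subtleties, the sketch below shows one simple chunking strategy (fixed-size word windows with overlap) and a toy reranker. Both are assumptions chosen for brevity rather than the book's recommendations: the window sizes are arbitrary, and the word-overlap score in `rerank` stands in for the learned relevance model a real second-stage reranker would typically use.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word windows that share `overlap` words with their neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def rerank(query, candidates, keep=2):
    """Toy second-stage scorer: fraction of query words found in each candidate."""
    query_words = set(query.lower().split())

    def score(candidate):
        candidate_words = set(candidate.lower().split())
        return len(query_words & candidate_words) / (len(query_words) or 1)

    return sorted(candidates, key=score, reverse=True)[:keep]

# Chunking: 20 placeholder words, windows of 8 with 2 words of overlap.
document = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(document, chunk_size=8, overlap=2)

# Reranking: first-stage retrieval returned three candidates; keep the best two.
candidates = [
    "how to reset a password",
    "annual plans renew every january",
    "refund policy for annual plans",
]
best = rerank("annual plans refund", candidates, keep=2)
print(chunks)
print(best)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.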

Important: Many failed enterprise AI deployments aren't failing at the model layer. They're failing at the retrieval layer. The model produces correct-looking output, but the retrieved context doesn't actually contain the right information, and the model, fluent as ever, confabulates a plausible-sounding answer anyway.

Multimodal extensions

Chapter 7 closes by extending the framework beyond text. Images, audio, and video can all be tokenized, that is, converted into sequences of small pieces that the same Transformer machinery can process. A vision encoder splits an image into patches and turns each patch into a vector. An audio encoder turns sound into a sequence of feature vectors. Both can be aligned with text embeddings so the model can reason across modalities.
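The "image as a sequence of patches" idea is easy to see on a tiny example. The sketch below cuts a 4×4 grid of pixel values into 2×2 patches and flattens each one. The `patchify` function and the grid are illustrative assumptions; in a real vision encoder each flattened patch would then be projected into an embedding vector, a step omitted here.

```python
def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 "image" whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
for patch in patches:
    print(patch)
```

The result is a sequence of four "tokens", each holding four pixel values, which is the shape of input a Transformer expects.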

The original generation of multimodal systems used separate encoders for each modality and stitched the outputs together at a fusion layer. The current generation is more elegant: it treats all modalities as just more kinds of tokens fed into a single shared Transformer. This is why modern frontier models can smoothly mix text, images, and speech in a single conversation.

What Chapter 7 sets up

By the end of Chapter 7, you understand how LLMs become useful in the wild. You can reason about the embeddings → retrieval → generation pipeline that powers most enterprise AI. You can read announcements about multimodal models and place them correctly in the architectural evolution. And you have the conceptual tools to design or evaluate a RAG system for your own work.

Next up — Chapter 8: Using LLMs in Applications. Tomorrow we move into practice. Chatbots, summarization, code generation, knowledge extraction, evaluation, and the rise of agentic systems where the model is the controller, not the controlled.

Want the full picture? The book walks through the embeddings/retrieval/generation pipeline in detail with diagrams of the RAG flow, the trade-offs at each layer, and the multimodal architectural shift visualized clearly.
