Chapter 1 — The Evolution of RAG Architecture
First post of the chapter-by-chapter walkthrough of LLM Primer III: Enhancing Enterprise AI with RAG. In which a base model’s two structural limits — frozen knowledge and missing provenance — turn out to have a single architectural answer, and the answer has grown four faces in three years.
Why this chapter exists
A transformer trained on a fixed corpus has two limits no amount of additional training fully erases. Its knowledge ends on the day the corpus did. And it cannot tell you which document a sentence came from, because the sentence is a statistical average over many, not a quote from any one. The first failure produces confidently wrong answers about anything recent. The second produces confidently wrong citations. Together they produce the now-familiar enterprise pathology: an answer that reads like authority and links to a clause that does not exist.
RAG is the architectural answer to both at once. You stop asking the model to know everything in advance and start handing it the relevant material at inference time, retrieved from a corpus you control. The corpus updates without retraining. The retrieved passages become citations because you fetched them on purpose. The model’s job shrinks from recall to synthesis. The rest of the chapter is the story of how that simple move grew, over three years, into four progressively more capable architectures.
1.1 Naive RAG: embed, retrieve, stuff
The simplest form is what every public tutorial still describes. Offline: split the corpus into chunks, embed each chunk, write the vectors to an index. Online: embed the query, fetch the top-k nearest chunks, concatenate them into a prompt, send to the model, return the answer with the chunks listed as citations. Two function calls and a vector search.
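As a rough sketch of the offline and online halves described above, the following uses a bag-of-words counter as a stand-in for a real embedding model and a plain Python list as the index. Every name here (embed, chunk, build_index, retrieve, build_prompt) is hypothetical and chosen for illustration; a real system would call an embedding model and a vector store instead.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: term counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(text, size=12):
    """Fixed-size word windows: blind to sentence and table boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(docs):
    """Offline: split every document, embed each chunk, store (text, vector) pairs."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index, query, k=2):
    """Online: embed the query and return the k most similar chunk texts."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Stuff the retrieved chunks into a prompt, numbered so they can be cited."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the sources below.\n{context}\n\nQuestion: {query}"
```

The prompt returned by build_prompt is what would be sent to the model; the numbered chunks double as the citation list.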
The demo works. The product rarely does. Nearest-neighbour similarity is a proxy for relevance, not a measurement of it, and embedding models trained on general web text routinely confuse apple orchard yields with Apple’s quarterly earnings. The chunker has no signal about where sentences end or tables begin. A single retrieval pass cannot serve a question whose answer is spread across three documents. And when retrieval fails, the model synthesizes from whatever came back — the citations are real, but they support none of the answer.
1.2 Advanced RAG: heuristics around the same pipeline
The second posture keeps the embed-retrieve-generate backbone and adds processing before and after the retrieval call. Pre-retrieval enhancements aim at the query: rewriting, expansion, decomposition, HyDE (drafting a hypothetical answer and embedding that as the query). Post-retrieval enhancements aim at the candidates: a cross-encoder reranker that scores query and passage together rather than embedding them apart, deduplication, metadata filtering, context compression.
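The post-retrieval stage can be sketched as a small function that deduplicates, filters on metadata, and reranks. This is an illustration under assumptions: toy_pair_score is a placeholder for a real cross-encoder (it only counts query terms found in the passage), and the candidate shape (dicts with "text" and "source" keys) is invented for the example.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_pair_score(query, passage):
    """Placeholder for a cross-encoder: it sees the query and passage together.
    Here it returns the fraction of query terms that appear in the passage."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0

def dedupe(candidates):
    """Drop passages whose normalized text was already seen; keep the first."""
    seen, unique = set(), []
    for c in candidates:
        key = " ".join(c["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique

def post_retrieve(query, candidates, score_pair=toy_pair_score,
                  allowed_sources=None, top_n=3):
    """Deduplicate, optionally filter on metadata, then rerank by pair score."""
    pool = dedupe(candidates)
    if allowed_sources is not None:
        pool = [c for c in pool if c["source"] in allowed_sources]
    pool.sort(key=lambda c: score_pair(query, c["text"]), reverse=True)
    return pool[:top_n]
```

Passing the scorer in as score_pair keeps the pipeline shape fixed while the placeholder is swapped for a trained reranker.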
The gains are not small. A cross-encoder reranker on top of a vector retriever typically moves top-5 relevance from the 50–70% band into the 75–90% band. Query rewriting adds another five to ten points where the original phrasing was ambiguous. Most production systems labelled simply “RAG” today are running this architecture, and for a wide class of enterprise problems — internal documentation Q&A, support deflection, knowledge-base search — it is the right level of investment. What it does not give you is flexibility. Every query still goes through the same pipeline.
1.3 Modular RAG: composable components, explicit routing
By 2024 both research and tooling had converged on Modular RAG. The same techniques are still present, but they are exposed as discrete, swappable components and the pipeline is assembled per request. A router decides which retrievers to call — perhaps a dense vector index, a BM25 index, a SQL store, an external API — and the results are fused, often via reciprocal rank fusion. The reranker is selected by query type. The generator is selected by required quality tier. The architecture has become a graph of components rather than a line of stages.
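Reciprocal rank fusion is simple enough to show in full. In the sketch below each retriever contributes 1 / (k + rank) for every document it returns, and the contributions are summed per document. The constant k = 60 is a commonly used default, not something the chapter specifies, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed default; tune per system).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["a", "b", "c"]   # hypothetical dense-index results
bm25_hits = ["b", "d", "a"]    # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because only ranks are used, the retrievers' raw scores never need to be put on a common scale, which is what makes the method convenient for mixing dense, lexical, and structured sources.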
Two practical consequences. First, the system is testable in a way the earlier postures were not: each component can be evaluated and replaced independently. Second, the system is tunable per query class: a factoid lookup runs through a fast retriever and a small generator, a multi-document synthesis runs through several retrievers and a large generator, and both are served from the same component library. The price is operational. When an answer is wrong, the question “where did this go wrong?” now has many possible answers, and the team needs instrumentation that can localize the failure to a single component. Invest in the observability before the modular architecture, not after.
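One way to picture per-class routing with built-in instrumentation is the sketch below. It is illustrative only: the PLANS table, the keyword-based classify function, and the Trace record are all hypothetical, and a production router would more likely be a trained classifier emitting spans to a tracing backend.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Per-request record of what each component did, for failure localization."""
    steps: list = field(default_factory=list)

    def record(self, component, output, started):
        self.steps.append({
            "component": component,
            "output": output,
            "seconds": time.perf_counter() - started,
        })

# Hypothetical plans: retrievers and generator tier used by each query class.
PLANS = {
    "factoid": {"retrievers": ["dense"], "generator": "small"},
    "synthesis": {"retrievers": ["dense", "bm25", "sql"], "generator": "large"},
}

def classify(query):
    """Toy router: a keyword heuristic standing in for a trained classifier."""
    markers = ("compare", "summarize", "across", "trend")
    return "synthesis" if any(m in query.lower() for m in markers) else "factoid"

def run(query, retrievers, generators):
    """Assemble the pipeline for this request and trace every component call."""
    trace = Trace()
    t = time.perf_counter()
    query_class = classify(query)
    trace.record("router", query_class, t)
    plan = PLANS[query_class]
    passages = []
    for name in plan["retrievers"]:
        t = time.perf_counter()
        hits = retrievers[name](query)
        trace.record(f"retriever:{name}", hits, t)
        passages.extend(hits)
    t = time.perf_counter()
    answer = generators[plan["generator"]](query, passages)
    trace.record(f"generator:{plan['generator']}", answer, t)
    return answer, trace
```

When an answer is wrong, the trace shows which class the router chose, what each retriever returned, and what the generator produced, so the failure can be pinned to one step.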
1.4 Agentic RAG: the LLM runs the pipeline
The fourth posture inverts an assumption the previous three quietly shared — that the LLM is the last step. In Agentic RAG, the LLM runs the pipeline. Given a catalog of tools (vector search, SQL, web fetch, reranker, calculator), the model thinks, picks a tool, observes the result, thinks again, and terminates when it has an answer or hits a step limit. The architecture has stopped being a pipeline and become a small program the model writes anew for each query.
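The think, act, observe loop can be sketched in a few lines. In this illustration the policy argument stands in for the LLM, and scripted_policy, the tool names, and the return shape are all hypothetical; the point is only the control flow, including the step limit.

```python
import operator

def run_agent(question, policy, tools, max_steps=5):
    """Loop until the policy answers or the step limit is reached.

    policy(question, history) returns ("final", answer) or (tool_name, argument).
    history is a list of (tool_name, argument, observation) tuples.
    """
    history = []
    for _ in range(max_steps):
        action, payload = policy(question, history)
        if action == "final":
            return {"answer": payload, "steps": history, "stopped": "answered"}
        if action not in tools:
            observation = f"error: unknown tool {action!r}"
        else:
            observation = tools[action](payload)
        history.append((action, payload, observation))
    return {"answer": None, "steps": history, "stopped": "step_limit"}

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(expr):
    """Tiny calculator tool: evaluates 'number operator number'."""
    left, op, right = expr.split()
    return str(OPS[op](float(left), float(right)))

def scripted_policy(question, history):
    """Hypothetical stand-in for the model: search, then calculate, then answer."""
    if not history:
        return ("search", "unit counts")
    if len(history) == 1:
        return ("calculator", "120 + 30")
    return ("final", f"Total: {history[-1][2]}")

tools = {"search": lambda q: "north: 120, south: 30", "calculator": calculator}
result = run_agent("How many units in total?", scripted_policy, tools)
```

A policy that never returns "final" exits with stopped set to "step_limit", which is the hook a production system would use for budget and timeout policies.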
This buys multi-step planning, dynamic tool selection, and multi-agent coordination patterns like planner/retriever/critic/writer. It costs latency, token spend, and reproducibility — a single query is now a tree of decisions rather than a fixed sequence, and pathological queries can spend many turns flailing before producing an answer. Production agentic systems need budget controls, step limits, and timeout policies that fixed pipelines never had to think about. The right use case is questions whose depth is variable and unpredictable: research synthesis, legal lookup over case law, literature review. The wrong use case is a static support bot, where the agentic loop adds variance the workload did not need.
1.5 RAG versus fine-tuning
The question every team eventually asks. The honest framing is that they solve different problems. RAG addresses knowledge problems: the model doesn’t know X, X changes, and the user needs a citation. Fine-tuning addresses behavior problems: the model knows the answer but presents it in the wrong format, refuses to follow the company template, or rambles where it should be terse. RAG is cheap to set up and expensive per query. Fine-tuning is expensive once and cheap per query. RAG iterates in minutes (change a document); fine-tuning iterates in days. A useful heuristic: if the failure is “the model doesn’t know,” reach for RAG. If the failure is “the model knows but does it wrong,” reach for fine-tuning. Many mature systems eventually do both, but start with RAG, because most enterprise failures are knowledge failures, not behavior failures.
What Chapter 1 sets up
Every RAG architecture — whichever of the four you pick — is downstream of how well it can read its source documents. A state-of-the-art Modular pipeline with an agentic orchestrator is still working from chunks that came out of a parsing step somewhere upstream. If that step lost the table structure, scrambled the multi-column reading order, or replaced figure captions with garbled OCR, every downstream component is reasoning over corrupted input. The architecture sets the system’s ceiling. The parser sets its floor. In most production systems, the floor matters more, because most production systems are nowhere near the ceiling.
Next — Chapter 2: Intelligent Document Parsing. Why a naive PDF-to-text utility silently destroys retrieval quality, what layout-aware parsing actually preserves, and the multimodal alternative that retrieves over page images directly.