Chapter 5 — Architecting the Retrieval Pipeline

Published on: 2026-03-22 Last updated on: 2026-06-09 Version: 2

Fifth post of the chapter-by-chapter walkthrough of LLM Primer III: Enhancing Enterprise AI with RAG. A single vector search is where most prototypes stop, and where most production failures begin. The chapter walks the full path from a half-formed query to the final candidates that reach the generator — and why each stage exists.


Why this chapter exists

Chapters 2 through 4 produced a vector store: parsed documents, chunked carefully, embedded, indexed. The naive next step is to embed the user's query, run a nearest-neighbour lookup, and feed the top-k to the generator. For trivial corpora this works. For production it almost never does. Queries arrive underspecified and full of proper nouns the embedder has never seen; corpora contain near-duplicates that crowd the top of any single ranking; identifiers and codes carry meaning the embedder smooths away.

Chapter 5 is about the architecture that mature systems converge on. It is not a research artefact. It is the shape teams actually run when recall, precision, and latency all have to hold at once.

One line: a production retrieval pipeline is hybrid retrieval merged by reciprocal rank fusion, a cross-encoder reranker over the merged candidates, and a query-understanding stage in front that rewrites what was asked before the system tries to answer it.

5.1 Why a single vector search is not enough

Dense retrieval rewrites the economics of search, but the very tolerance for paraphrase that makes embeddings useful also makes them brittle. Numeric tokens, statute citations, transaction codes, part numbers — anything where the surface form is the meaning — land in arbitrary corners of the vector space. BM25, which simply counts what is there, has no such problem.

The two methods fail on largely different inputs. That single observation is the entire case for hybrid retrieval, and it is the first principle the chapter rests on. The rest follows: if no single retriever is sufficient, the pipeline must combine several, merge their rankings honestly, spend real compute on a final precise pass, and prepare the query before any of it begins.

5.2 Hybrid search: dense vectors and BM25 in parallel

Two indexes over the same chunks: a dense HNSW or IVF index from the embedder, and a sparse inverted index from BM25 or SPLADE. Both are queried, both return ranked lists, and the ingestion pipeline writes to both in lockstep. BM25 is not legacy; it remains one of the most reliable keyword-ranking functions available, parameter-light but not parameter-free (the term-frequency saturation k1 and the length normalisation b still need sensible values), and across the BEIR benchmark suite hybrid retrieval is generally reported to outperform dense-only on the majority of out-of-domain tasks. The gap widens as the domain drifts further from the embedder's training distribution.
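To make the sparse leg concrete, here is a minimal sketch of Okapi BM25 over a toy corpus. It is an illustration under stated assumptions, not the book's implementation: the BM25 class, the tokenize helper, the three example chunks, and the values k1 = 1.5 and b = 0.75 are all invented for this post, and the IDF uses the non-negative variant found in common search engines.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenisation, adequate only for this English toy corpus.
    return text.lower().split()

class BM25:
    """Hypothetical minimal BM25 scorer; k1 and b are illustrative values."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Non-negative IDF: log(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        total = 0.0
        dl = len(self.docs[i])
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue  # terms absent from the chunk contribute nothing
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Return indexes of chunks with a non-zero score, best first.
        scored = [(i, self.score(query, i)) for i in range(len(self.docs))]
        hits = [(i, s) for i, s in scored if s > 0]
        return [i for i, _ in sorted(hits, key=lambda p: p[1], reverse=True)]

docs = [
    "replace gasket on pump model px-4471 before restart",
    "annual leave entitlement for full time staff",
    "pump maintenance schedule and restart checklist",
]
bm25 = BM25(docs)
sparse_ranking = bm25.rank("px-4471 gasket")   # exact part number matches chunk 0 only
length_ranking = bm25.rank("pump restart")     # shorter chunk 2 outranks chunk 0
```

The part number matches on surface form alone, which is exactly the case the dense leg tends to blur. The second query shows the two parameters at work: both chunks contain both terms once, and length normalisation favours the shorter one.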

A specific failure mode worth naming: BM25 has to be tokenised correctly for each language in scope. Shipping a Japanese corpus with the default English analyser silently turns the BM25 leg into pure noise, and the system appears to work only because the dense leg is doing all the lifting. Sparse retrieval has moved on too: SPLADE-style learned sparse models keep BM25's inverted-index serving path, at the cost of a transformer pass at indexing time, while recovering much of the recall behaviour of a dense retriever.
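The tokenisation failure is easy to reproduce. The sketch below is a toy demonstration, assuming a whitespace tokeniser as a stand-in for an English analyser and character bigrams as a crude stand-in for a proper Japanese morphological analyser; the function names and the example strings are invented for illustration.

```python
def whitespace_tokens(text):
    # Stand-in for an English analyser: split on whitespace only.
    return text.lower().split()

def char_bigrams(text):
    # Crude language-agnostic fallback: overlapping two-character windows.
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]

doc = "有給休暇の申請方法について"   # "about how to apply for paid leave"
query = "有給休暇"                   # "paid leave"

# Japanese has no spaces, so each string becomes one giant token and nothing matches.
ws_overlap = set(whitespace_tokens(query)) & set(whitespace_tokens(doc))

# Bigrams recover the shared substrings, so a BM25 leg built on them can match.
bigram_overlap = set(char_bigrams(query)) & set(char_bigrams(doc))
```

With whitespace splitting the overlap is empty, so every BM25 score for this query is zero. A production system would normally use a language-aware analyser rather than bigrams, but the contrast is the point.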

5.3 Reciprocal rank fusion and cross-encoder reranking

The naive way to combine two ranked lists is to add their scores. This does not work: BM25 magnitudes are unbounded and corpus-dependent, while cosine similarities sit in [-1, 1]. They are not commensurable, and any fixed weighting is a perpetual tuning problem. Reciprocal Rank Fusion sidesteps the problem by throwing away the scores. For each candidate, the fused score is the sum of 1/(k + rank) across the retrievers that returned it, with rank counted from 1 and k conventionally set to 60. The curve rewards top ranks most and flattens in the tail, results are largely insensitive to the exact value of k, and the algorithm composes naturally with multi-query expansion: six lists from three paraphrases against two retrievers fuse with the same line of code.
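The fusion rule is short enough to write out. This is a minimal sketch of Reciprocal Rank Fusion as described above; the function name rrf, the chunk ids, and the tie-break by id are assumptions made for the example.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse any number of ranked lists of ids; only ranks are used, never scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["c3", "c1", "c7", "c2"]   # hypothetical dense-leg ranking
sparse = ["c1", "c9", "c3"]        # hypothetical BM25-leg ranking
fused = rrf([dense, sparse])
```

Here c1 and c3 rise to the top because both legs returned them, while the single-leg hits keep their relative order below. Passing six lists instead of two needs no change to the function.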

RRF cannot rescue a document that neither retriever surfaced; that job belongs to query rewriting. What it does, cheaply and with a single constant that rarely needs tuning, is reconcile retrievers that are upgraded and swapped independently. A pipeline whose fusion does not need to be re-tuned when a leg changes is dramatically cheaper to maintain.

The dense retriever behind those rankings is a bi-encoder: it embedded the query and the chunk separately and never saw them together. A cross-encoder reranker does. It concatenates the pair and runs them jointly through a transformer fine-tuned to output a relevance score. Every attention head sees both sides at once, and the model can attend to the specific phrase in the query that should match a specific phrase in the chunk. Its scores cannot be precomputed, so it is too slow as a primary retriever, but it is well suited to the 50 to 200 candidates that hybrid retrieval surfaces. The lift on NDCG@10 is typically 5 to 15 points, which is larger than the gain from switching embedding models inside the bi-encoder family.
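The reranking step itself is a small loop around an expensive scoring call. The sketch below shows that shape only, under loud assumptions: rerank and toy_overlap_scorer are names invented here, and the scorer is a lexical stand-in so the example runs without a model. In a real pipeline the score_pair argument would wrap a cross-encoder forward pass, for example the CrossEncoder.predict method in the sentence-transformers library, which scores (query, text) pairs.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score each (query, chunk) pair jointly and keep the best top_n chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_overlap_scorer(query, chunk):
    # Stand-in for a cross-encoder: fraction of query terms present in the chunk.
    # A real reranker would return a learned relevance score instead.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

candidates = [
    "pump maintenance schedule",
    "authentication service returns http 500 with token validation failure",
    "auth token rotation policy",
]
top = rerank("token validation failure 500", candidates, toy_overlap_scorer, top_n=2)
```

Because the cost is one model call per candidate, the candidate count coming out of fusion is the main latency lever at this stage.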

5.4 Query understanding: rewriting, expansion, HyDE

Everything above assumed the query was well-formed. It rarely is. A help-desk user types "vacation"; the policy says "annual leave entitlement." A developer types "auth fails 500"; the runbook describes "authentication service returns HTTP 500 with token validation failure." The work has to happen on the query side. Three patterns compose: rewriting the query to be standalone (resolving pronouns, expanding acronyms, switching language to match the corpus); expansion into a handful of paraphrases that each fan out to both retrievers; and HyDE, which asks a small model to write a hypothetical answer and embeds that instead of the question. The corpus is full of answers, and answers look more like other answers than they do like questions.

The defensive pattern is to keep the original query alongside the rewritten one and dispatch both. The rewriter's output is a hypothesis, not a replacement. Most "retrieval regressions" in agentic systems are actually rewriting regressions, and they are invisible without per-stage telemetry.
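The defensive pattern can be sketched in a few lines. Everything named here is hypothetical: build_query_variants is an invented helper, and stub_rewrite and stub_expand stand in for calls to a small language model. The one property the sketch is meant to show is that the original query always survives and is dispatched first.

```python
def build_query_variants(original, rewrite, expand, max_variants=4):
    """Return the original query plus de-duplicated rewrites and paraphrases."""
    variants = [original]  # the original is always kept and always first
    for candidate in [rewrite(original), *expand(original)]:
        cleaned = candidate.strip()
        seen = {v.lower() for v in variants}
        if cleaned and cleaned.lower() not in seen:
            variants.append(cleaned)
    return variants[:max_variants]

def stub_rewrite(query):
    # Hypothetical rewriter: maps user vocabulary onto corpus vocabulary.
    return {"vacation": "annual leave entitlement"}.get(query, query)

def stub_expand(query):
    # Hypothetical paraphraser; note it may echo the original, which is dropped.
    return ["paid time off policy", "vacation"]

variants = build_query_variants("vacation", stub_rewrite, stub_expand)
```

Each variant then fans out to both retrievers, and the resulting lists go through the same rank fusion. Logging the variants per request is what makes a rewriting regression visible later.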

Worth holding onto: the production pipeline is six stages (classify, rewrite, retrieve in parallel, fuse by RRF, rerank with a cross-encoder, generate), and the reflex to use the most powerful available model at every stage is the single most common cause of expensive, slow, mediocre RAG systems. A 7B rewriter and a 110M reranker, sized to the job, beat a pipeline that puts a frontier model at every node almost every time.
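One way to keep the six stages honest is to time each one separately. The skeleton below is a sketch, not a production design: run_pipeline is an invented helper, every stage is a stub returning canned data, and the retrieve stage is shown sequentially for brevity even though the two legs would run in parallel.

```python
import time

def run_pipeline(query, stages):
    """Run named stages in order over a shared state dict, timing each stage."""
    state = {"query": query}
    telemetry = []
    for name, stage in stages:
        start = time.perf_counter()
        state = stage(state)
        telemetry.append((name, time.perf_counter() - start))
    return state, telemetry

# Stub stages with canned outputs; each would wrap a real model or index call.
stages = [
    ("classify", lambda s: {**s, "route": "kb"}),
    ("rewrite", lambda s: {**s, "queries": [s["query"], "annual leave entitlement"]}),
    ("retrieve", lambda s: {**s, "rankings": [["c1", "c2"], ["c2", "c3"]]}),
    ("fuse", lambda s: {**s, "candidates": ["c2", "c1", "c3"]}),
    ("rerank", lambda s: {**s, "top": s["candidates"][:2]}),
    ("generate", lambda s: {**s, "answer": "answer grounded in " + ", ".join(s["top"])}),
]

result, telemetry = run_pipeline("vacation", stages)
stage_names = [name for name, _ in telemetry]
```

With per-stage timings in hand, the sizing argument becomes measurable: a stage that dominates the latency budget without moving quality is the first candidate for a smaller model.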

What Chapter 5 sets up

The pipeline as drawn is also the entire surface an attacker needs to subvert. Each stage takes input, produces output, and trusts the data it operates on. The corpus can be poisoned at ingestion; the embedder can be manipulated by adversarial chunks; the reranker can be biased; the generator can be tricked by instructions hidden in retrieved content. From here the book opens Part IV, and the framing shifts from how to retrieve well to what happens when retrieval is attacked.


Next — Chapter 6: RAG Threat Models and Vulnerabilities. The same openness that makes RAG useful is also the surface adversaries exploit — corpus poisoning, adversarial retrieval, indirect prompt injection, embedding inversion, and the confused deputy.

Want the full picture? The book carries the BM25 formula and its derivation in Appendix A, the full RRF mathematics, late-interaction ColBERT as a third architecture between bi- and cross-encoders, and an end-to-end production pipeline diagram with per-stage telemetry and latency budgets. View LLM Primer III on Amazon →

SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.