Chapter 9 — The RAG Evaluation Triad

Published on: 2026-03-26 | Last updated on: 2026-06-09 | Version: 2

Ninth post of the chapter-by-chapter walkthrough of LLM Primer III: Enhancing Enterprise AI with RAG. In which three different failures collapse into one symptom, and the field invents a three-headed metric that finally tells the team which failure is which.


Why this chapter exists

A RAG system can fail in three different places, and from the outside the failures look identical. The retriever fetches the wrong context. The model ignores the right context. The model honors the context but answers a different question than the one that was asked. Every production team has, at one point or another, tried to fix one of those failures while measuring another. This chapter is about the small, stubborn vocabulary that prevents that mistake.

It is also a chapter about a shift. Classical information retrieval was evaluated against labeled ground truth: queries with known correct documents, with precision and recall computed against the labels. RAG usually operates in a world where no such labels exist. Questions are open-ended, answers are generative, and the relevant context is whatever the model needs at that moment. The Triad was designed for this world. It measures consistency between stages, not agreement with a reference.

One line: health is three numbers, not one — Context Relevance for retrieval, Groundedness for generation, Answer Relevance for the fit between question and answer — all three computed reference-free by an LLM judge that the team must keep honest.

9.1 Why three signals, not one

The instinct of a new team is to grade the final answer. The user typed a question, the system produced a response, and either the response is correct or it is not. The instinct fails because the final answer is a composite of every stage, and when it fails the composite tells you nothing about which stage to fix. Was the right document missed? Was it retrieved and ignored? Was it used, but to answer a different question? Three different bugs, three different fixes, one indistinguishable symptom.

The Triad separates the pipeline into the three places where information either survives or is lost. Retrieval, grounding, answering. Each gets its own metric: Context Relevance, Groundedness, Answer Relevance. What makes the structure useful is not that the three are exhaustive — they are not — but that they are independent. A system can score well on one and poorly on another, and when it does, the team knows where to look. When a new embedding model is shipped, Context Relevance should move. When a new prompt is shipped, Groundedness should move. When the metric that ought to move moves, the team knows the change worked. A single end-to-end score collapses all of this into something that cannot be debugged.
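To make the diagnostic idea concrete, here is a minimal sketch of how a team might read the three scores. Everything in it is an assumption for illustration: the `TriadScores` container, the 0.7 threshold, the 0.05 minimum delta, and the stage names are hypothetical and do not come from the book.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TriadScores:
    """One evaluation run; each score is assumed to lie in [0, 1]."""
    context_relevance: float
    groundedness: float
    answer_relevance: float


# Hypothetical mapping from each metric to the pipeline stage it watches.
STAGE_FOR_METRIC = {
    "context_relevance": "retrieval",
    "groundedness": "grounding",
    "answer_relevance": "answering",
}


def stages_to_inspect(scores: TriadScores, threshold: float = 0.7) -> list[str]:
    """Return the stages whose metric falls below an (assumed) threshold."""
    return [
        stage
        for metric, stage in STAGE_FOR_METRIC.items()
        if getattr(scores, metric) < threshold
    ]


def metrics_that_moved(
    before: TriadScores, after: TriadScores, min_delta: float = 0.05
) -> dict[str, float]:
    """Return the metrics whose change between two runs is at least min_delta."""
    b, a = asdict(before), asdict(after)
    return {
        metric: round(a[metric] - b[metric], 3)
        for metric in b
        if abs(a[metric] - b[metric]) >= min_delta
    }
```

After shipping a new embedding model, for example, a team could call `metrics_that_moved(before, after)` and check that `context_relevance` is the key that shows up. If a different metric moved instead, the change did something other than what was intended.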

9.2 Context Relevance — did you retrieve the right context?

Context Relevance asks whether the retrieved chunks are about the question, sentence by sentence, scored by an LLM judge. It captures retrieval precision — the fraction of the context window being spent on relevant material. A high score means the retriever is not wasting tokens. A low score means it is bringing back noise, and the model pays for that noise in both latency and quality, because long irrelevant contexts have been shown repeatedly to degrade generation.
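As a rough sketch of the computation described above, the score can be taken as the fraction of retrieved sentences that a judge marks as relevant to the question. The regex sentence splitter and the `keyword_judge` function are toy stand-ins, assumed here only so the example runs without a model; in practice the judge would be an LLM call with a documented prompt.

```python
import re
from typing import Callable, Sequence

# A judge takes (question, sentence) and returns True if the sentence is relevant.
Judge = Callable[[str, str], bool]


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., ! and ? (an assumption; real text needs more care)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_relevance(question: str, chunks: Sequence[str], judge: Judge) -> float:
    """Fraction of retrieved sentences the judge considers relevant."""
    sentences = [s for chunk in chunks for s in split_sentences(chunk)]
    if not sentences:
        return 0.0  # nothing retrieved: treated as zero relevance in this sketch
    relevant = sum(1 for s in sentences if judge(question, s))
    return relevant / len(sentences)


STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "to"}


def keyword_judge(question: str, sentence: str) -> bool:
    """Toy stand-in for an LLM judge: any shared non-stopword counts as relevant."""
    q_words = set(re.findall(r"[a-z]+", question.lower())) - STOPWORDS
    s_words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(q_words & s_words)


chunks = [
    "A refund is accepted within 30 days. Our office is in Tokyo.",
    "The refund window starts at delivery.",
]
score = context_relevance("What is the refund window?", chunks, keyword_judge)
```

Here two of the three retrieved sentences touch the question, so the score is two thirds; the sentence about the office is the noise the metric is meant to expose.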

What Context Relevance does not capture is recall — whether all the chunks the model would have needed were actually retrieved. A retriever that brings back one perfect chunk and nothing else scores perfectly, even if the answer required two and the second was missed. Recall is its own problem, measured against curated golden sets where the answer-bearing documents are known. The chapter also names two artifacts worth knowing: aggressive chunking inflates Context Relevance without necessarily improving the answer, and unweighted averaging over a fixed top-k can make a retriever look bad when the irrelevant chunks at positions four through ten are barely affecting the model anyway.
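Both caveats can be sketched in a few lines. The first function is ordinary recall at k against a golden set of answer-bearing document IDs. The second is one possible way to soften the fixed top-k artifact by discounting lower ranks; the logarithmic discount is an assumption borrowed from DCG-style ranking metrics, not something the chapter prescribes, and the document IDs are made up.

```python
import math
from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[str], gold_ids: Sequence[str], k: int) -> float:
    """Share of the known answer-bearing documents found in the top k results."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("a golden set needs at least one answer-bearing document")
    return len(gold & set(retrieved_ids[:k])) / len(gold)


def rank_weighted_relevance(relevance_by_rank: Sequence[float]) -> float:
    """Average per-chunk relevance with a 1 / log2(rank + 1) discount.

    relevance_by_rank[0] is the top-ranked chunk. The discount is an assumed
    choice; any decreasing weight would express the same idea.
    """
    weights = [1.0 / math.log2(rank + 2) for rank in range(len(relevance_by_rank))]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * r for w, r in zip(weights, relevance_by_rank)) / total


# One perfect chunk, but the answer needed two documents: precision looks
# flawless while recall is only half.
recall = recall_at_k(["d1"], ["d1", "d2"], k=5)

# Top three chunks relevant, positions four through ten irrelevant.
per_chunk = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
unweighted = sum(per_chunk) / len(per_chunk)
weighted = rank_weighted_relevance(per_chunk)
```

With the discount applied, the retriever that puts its three good chunks first scores noticeably higher than the unweighted 0.3, which is closer to how much those trailing chunks are likely to matter.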

9.3 Groundedness — did the model honor the context?

Groundedness, sometimes called Faithfulness, looks in the other direction, from the answer back to the context: of the claims the model produced, what fraction can be supported by the retrieved context? The standard computation decomposes the answer into atomic claims and asks the judge, for each one, whether the context supports it. Decomposition is the part that matters. A long answer evaluated as a single block tends to score either fully grounded or fully ungrounded, with the judge resolving toward whichever direction the overall gist leans. Atomic claims force the judge to evaluate each assertion independently, which catches the common failure where a mostly-correct answer contains one sentence the context never supported.
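The claim-by-claim scoring can be sketched as follows, assuming the answer has already been decomposed into atomic claims (in practice an LLM does that step). The `substring_supports` check is a toy stand-in for the judge so the example runs on its own, and returning 1.0 for an answer with no claims is a convention assumed for this sketch.

```python
from typing import Callable, Sequence

# A support check takes (claim, context) and returns True if the context backs the claim.
SupportCheck = Callable[[str, str], bool]


def groundedness(
    claims: Sequence[str], context: str, supports: SupportCheck
) -> tuple[float, list[str]]:
    """Return (fraction of supported claims, list of unsupported claims)."""
    if not claims:
        # No claims means nothing was invented; scored as fully grounded here.
        return 1.0, []
    unsupported = [c for c in claims if not supports(c, context)]
    return (len(claims) - len(unsupported)) / len(claims), unsupported


def substring_supports(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: the claim must appear verbatim in the context."""
    return claim.lower() in context.lower()


context = "The warranty lasts two years. It covers manufacturing defects."
claims = [
    "the warranty lasts two years",
    "it covers manufacturing defects",
    "it covers accidental damage",  # never stated in the context
]
score, unsupported = groundedness(claims, context, substring_supports)
```

Returning the unsupported claims alongside the score is a design choice worth keeping in a real implementation: the number says how bad the answer is, and the list says which sentence to look at.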

The chapter is honest about Groundedness's asymmetry: it penalizes invention but not omission. A model that refuses to answer scores perfectly. A model that gives a correct, well-grounded answer but drops a crucial caveat from the context also scores well. It is also the metric most likely to surface a prompt problem rather than a model problem. When Context Relevance is high and Groundedness is low, the cause is almost always the system prompt, not the model: the instructions are too soft to keep the model inside the context. Tighten the prompt before you blame the model.

9.4 Answer Relevance and the reference-free shift

Answer Relevance is the easiest one to misunderstand. It does not measure correctness, and it does not measure grounding. It measures whether the answer addresses the question that was asked. A factually correct response that answers a slightly different question scores poorly. A polite refusal scores poorly. The standard computation is a clever inversion: given the answer, generate the questions it could plausibly be a response to, then compare those generated questions to the original. If they are close, the answer is on-target. If they drift, the model has wandered.
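The inversion can be sketched like this: generate candidate questions from the answer, embed them, and average their similarity to the original question. Both the question generator (a fixed lambda here) and the bag-of-words `bow` embedding are stand-ins assumed for illustration; a real setup would use an LLM for generation and a proper embedding model for comparison. Scoring an answer that yields no candidate questions as 0.0 is also an assumption of this sketch.

```python
import math
import re
from collections import Counter
from typing import Callable


def bow(text: str) -> Counter:
    """Toy embedding: lowercase bag of words (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def answer_relevance(
    question: str,
    answer: str,
    generate_questions: Callable[[str], list[str]],
    embed: Callable[[str], Counter] = bow,
) -> float:
    """Mean similarity between the original question and questions
    reverse-generated from the answer."""
    candidates = generate_questions(answer)
    if not candidates:
        return 0.0  # e.g. a refusal that implies no question at all
    q_vec = embed(question)
    return sum(cosine(q_vec, embed(c)) for c in candidates) / len(candidates)


question = "How long is the warranty?"
on_target = answer_relevance(
    question,
    "The warranty lasts two years.",
    lambda _answer: ["How long is the warranty?", "What is the warranty period?"],
)
drifted = answer_relevance(
    question,
    "Our office is located in Tokyo.",
    lambda _answer: ["Where is the office located?"],
)
```

The on-target answer produces questions close to the original, so it scores higher than the drifted one. Note that neither score says anything about whether the two-year figure is correct; that is not what this metric measures.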

Answer Relevance is also where the reference-free shift bites hardest. None of these metrics can be computed by comparing against a labeled correct answer — the space of acceptable answers is infinite and not enumerable. So the field has converged on LLM-as-a-Judge: a frontier model grades each metric using a documented prompt. The technique scales. It is cheap. It correlates roughly with human judgment. It also has well-documented failure modes — position bias in pairwise comparisons, length bias, model-family bias, calibration drift across silent model updates, and the deeper problem that judges and generators share training corpora and therefore fail in correlated ways. The defense is not technical but operational: pin the judge model and prompt, calibrate against a small hand-labeled set, route a small fraction of judged outputs to human review, and treat any judge change as a re-baselining event that invalidates historical comparisons.
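Two pieces of that operational discipline are easy to sketch. The first pins the judge by fingerprinting its model name and prompt, so that scores from different judges are never compared by accident. The second measures agreement with a small hand-labeled set using Cohen's kappa, which corrects raw agreement for chance. The `JudgeConfig` class, the 12-character fingerprint, and the model names are hypothetical choices made for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JudgeConfig:
    """Pinned judge identity: model version string plus the exact grading prompt."""
    model: str
    prompt: str

    def fingerprint(self) -> str:
        """Short stable hash; any model or prompt change yields a new value."""
        payload = f"{self.model}\n{self.prompt}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def comparable(a: JudgeConfig, b: JudgeConfig) -> bool:
    """Scores are only comparable when they came from the same pinned judge."""
    return a.fingerprint() == b.fingerprint()


def cohen_kappa(judge_labels: Sequence[int], human_labels: Sequence[int]) -> float:
    """Chance-corrected agreement for binary labels (1 = pass, 0 = fail)."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    p_judge = sum(judge_labels) / n
    p_human = sum(human_labels) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label
    return (observed - expected) / (1 - expected)


# Hypothetical calibration set: eight outputs graded by the judge and by a human.
judge = [1, 1, 0, 0, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohen_kappa(judge, human)

old = JudgeConfig(model="example-judge-2026-01", prompt="Is the claim supported?")
new = JudgeConfig(model="example-judge-2026-04", prompt="Is the claim supported?")
needs_rebaseline = not comparable(old, new)
```

In this made-up sample the judge and the human agree on six of eight items, yet kappa is well below 0.75 because some of that agreement would happen by chance. What kappa is acceptable is a team decision; the point is to recompute it whenever the fingerprint changes.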

Worth holding onto: the value of the Triad is not the absolute scores, which are noisy. It is the structure of the relationships between the scores. When all three move together, the system is healthy or sick as a whole. When they move apart, the team learns where to look. That diagnostic power is what no single end-to-end number can provide.

What Chapter 9 sets up

The Triad gives a vocabulary for what to measure. It does not say how to actually run the measurements — the prompts for the judge, the decomposition logic, the embedding choice, the sampling rate, the dashboards, the alerting. None of that gets built from scratch. Over the past two years a small number of frameworks have emerged to make the Triad measurable in practice, each with its own opinions about what production evaluation should feel like. Chapter 10 walks through them side by side.


Next — Chapter 10: Leading Evaluation Frameworks. RAGAS, TruLens, DeepEval, and the observability platforms — what each one is for, where the metric-first libraries end and the production platforms begin, and the Evaluation Gap none of them has yet closed.

Want the full picture? The book walks through each metric's exact computation, the documented LLM-as-Judge failure modes with citations, the calibration discipline that keeps judges honest, and the chunk-attribution methods on the frontier.

SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.