Chapter 10 — Leading Evaluation Frameworks

Published on: 2026-03-27. Last updated on: 2026-06-09. Version: 2.

Tenth post in the chapter-by-chapter walkthrough of LLM Primer III: Enhancing Enterprise AI with RAG. In which the Triad gets a toolkit (eight frameworks in two flavors) and one honest admission about the part of evaluation none of them yet solves.


Why this chapter exists

The Triad of Chapter 9 said what to measure. It said nothing about how to actually run those measurements in production. The metrics are concepts. Their implementations are prompts for the judge, decomposition logic for claims, embedding choices for similarity, sampling strategies for cost, dashboards, alerts, and human review loops. No team builds all of this from scratch.

The frameworks have split along a recognizable axis. On one side are the metric-first libraries (RAGAS, TruLens, DeepEval), which compute the Triad with documented, reproducible methodology. On the other are the observability platforms (Braintrust, LangSmith, Phoenix, Galileo, Opik), which start from production traces and integrate evaluation as one feature in a larger workflow. Choosing between them is less a feature comparison than a question of what the team needs the system to do on the day after it ships.

One line: pick a metric-first library for defensible numbers, pick an observability platform for the workflow around them, and accept that retrieval-layer evaluation is still the team's responsibility because no framework has yet closed the Evaluation Gap.

10.1 The metric-first libraries — RAGAS, TruLens, DeepEval

RAGAS is the closest thing the field has to a reference implementation of the Triad. Conservative metric set, documented prompts, every LLM call exposed. The Faithfulness pipeline is a two-stage chain — decompose the answer into claims, check each claim against the context — and the intermediate claim list comes back in the result, so an engineer can point at the specific claim the framework decided was unsupported. For research, regulated industries, or any team that needs to defend its methodology, RAGAS is the default. The cost is that it feels academic. It computes metrics; it does not ship a dashboard.
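The two-stage shape described above can be sketched in a few lines. This is a minimal illustration, not the RAGAS API: the names `faithfulness`, `ClaimVerdict`, and `FaithfulnessResult` are hypothetical, and the two toy functions stand in for what would be LLM calls in a real pipeline. The point is the structure, where the per-claim verdicts come back alongside the score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool


@dataclass
class FaithfulnessResult:
    score: float
    verdicts: List[ClaimVerdict]  # intermediate claims stay inspectable


def faithfulness(
    answer: str,
    context: str,
    decompose: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> FaithfulnessResult:
    # Stage 1: break the answer into individual claims.
    claims = decompose(answer)
    # Stage 2: check each claim against the retrieved context.
    verdicts = [ClaimVerdict(c, is_supported(c, context)) for c in claims]
    if not verdicts:
        # Assumption of this sketch: an answer with no claims scores 0.0.
        return FaithfulnessResult(0.0, [])
    score = sum(v.supported for v in verdicts) / len(verdicts)
    return FaithfulnessResult(score, verdicts)


# Toy stand-ins for the two LLM calls (hypothetical, for illustration only).
def split_sentences(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def substring_support(claim: str, context: str) -> bool:
    return claim.lower() in context.lower()


result = faithfulness(
    answer="The invoice is due in 30 days. Late fees are 5 percent.",
    context="Policy: the invoice is due in 30 days. No late fees are charged.",
    decompose=split_sentences,
    is_supported=substring_support,
)
unsupported = [v.claim for v in result.verdicts if not v.supported]
print(result.score, unsupported)
```

Because the verdict list is part of the result, a low score can be traced to the specific claim that failed rather than argued about in aggregate.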

TruLens sits between metric library and observability platform. Its emphasis is instrumentation: the framework wraps the application at the function level, capturing every retrieval call and every model response into a structured Trace, then runs the Triad against the traces. The Triad metrics are exposed as feedback functions — small composable units, easy to write your own. TruLens is the right pick when the team's evaluation needs go beyond the standard set, or when the workflow centers on inspecting individual failures rather than aggregate dashboards.
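To make the instrumentation idea concrete, here is a rough sketch of function-level tracing plus one custom feedback function. It does not use the TruLens API; `traced`, `TRACE`, and `context_overlap` are hypothetical names, the retriever and generator return canned strings, and the word-overlap score is a deliberately crude stand-in for a real relevance judge.

```python
import functools
import re
from typing import Any, Callable, Dict, List, Set

Trace = List[Dict[str, Any]]
TRACE: Trace = []  # hypothetical in-memory trace store


def traced(step: str) -> Callable:
    """Wrap a function so every call is recorded as one trace record."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            output = fn(*args, **kwargs)
            TRACE.append({"step": step, "args": args, "output": output})
            return output
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query: str) -> List[str]:
    return ["Refunds are issued within 14 days."]  # canned chunk for the demo


@traced("generate")
def generate(query: str, chunks: List[str]) -> str:
    return "Refunds take up to 14 days."  # canned answer for the demo


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_overlap(trace: Trace) -> float:
    """Feedback function: share of query words found in the retrieved chunks."""
    record = next(r for r in trace if r["step"] == "retrieve")
    query_words = tokens(record["args"][0])
    chunk_words = tokens(" ".join(record["output"]))
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


query = "How long do refunds take?"
answer = generate(query, retrieve(query))
steps = [r["step"] for r in TRACE]
score = context_overlap(TRACE)
print(steps, score)
```

The feedback function reads only the trace, so a new one can be added without touching the application code it scores.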

DeepEval takes a third stance: evaluation as pytest. Test cases are written as code and run through a pytest-like CLI; failures block the PR; the metric set is the broadest of the three, including bias, toxicity, and hallucination checks alongside the Triad. The tradeoff is that the breadth comes with uneven rigor. Pick a metric off the menu without reading its implementation and you can end up reporting numbers that do not mean what you think they mean. The right discipline is to pick a small set, read the prompts, calibrate against hand-labels, and treat the rest as inspiration.
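One way to make "calibrate against hand-labels" enforceable is to write the calibration itself as a test. The sketch below does not use the DeepEval API; it is a plain pytest-style function, with made-up verdict lists and an assumed threshold of 0.6, that computes Cohen's kappa between the judge's verdicts and human labels on the same examples.

```python
from typing import List


def cohens_kappa(judge: List[int], human: List[int]) -> float:
    """Agreement between two label lists, corrected for chance agreement."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    expected = sum(
        (judge.count(label) / n) * (human.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single, identical label
    return (observed - expected) / (1.0 - expected)


# Illustrative data: 1 = "faithful", 0 = "not faithful" on the same 8 answers.
JUDGE_VERDICTS = [1, 1, 0, 1, 0, 0, 1, 1]
HAND_LABELS = [1, 1, 0, 0, 0, 0, 1, 1]
MIN_KAPPA = 0.6  # assumed threshold; pick one that suits the team's risk


def test_judge_agrees_with_hand_labels() -> None:
    kappa = cohens_kappa(JUDGE_VERDICTS, HAND_LABELS)
    assert kappa >= MIN_KAPPA, f"judge drifted from hand labels: kappa={kappa:.2f}"
```

Run under pytest, a judge that drifts away from the hand labels fails the build the same way a broken unit test would.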

10.2 The observability platforms — Braintrust, LangSmith, Phoenix, Galileo, Opik

The metric-first libraries answer "how do I compute the Triad." The observability platforms answer "how do I run a RAG system in production." They start from the assumption that the team will be writing prompts, comparing model versions, ingesting traces, and watching dashboards for the foreseeable future. Evaluation is one feature; prompt versioning, trace exploration, A/B testing, and alerting are equally part of the value.

Braintrust leads with developer experience. It is built around the experiment, a versioned record of model behavior on a dataset, with side-by-side score diffs in a UI that is genuinely pleasant to use. LangSmith is the natural choice for teams already deep in LangChain; it expects to be at the center of the application's instrumentation and rewards that commitment with depth. Phoenix, from Arize, is the open-source option, distinguished by embedding-drift and cluster analysis that the others underplay, and viable for teams that cannot send traces to a SaaS endpoint. Galileo is the enterprise-focused platform, with a proprietary Correctness score and on-prem deployment for regulated industries. Opik, from Comet, is the most recent entrant: open-source-first like Phoenix, polished like Braintrust, with the additional advantage of unifying LLM and classical-ML observability under one platform.

The choice among the five is less about features than organizational fit. A LangChain shop reaches for LangSmith. A greenfield product-engineering team reaches for Braintrust. An open-source-first team reaches for Phoenix. A regulated enterprise reaches for Galileo. A Comet shop reaches for Opik. Metric quality across all five is broadly comparable: they all implement the Triad, they all use LLM-as-Judge, and they all carry the same fundamental limitations Chapter 9 named. The differences are workflow, not measurement.

10.3 The Evaluation Gap and the three-loop pattern

Here is the awkward fact the framework tour just glossed over. Almost every tool above evaluates at the inference layer: given that retrieval has already happened, did the model honor the chunks, did the answer fit the question, were the chunks on-topic. None of them, in any deep sense, tells you whether retrieval found the right chunks in the first place. The retriever's output is the inference layer's input, and the inference-layer metrics judge what the model did with that input, not whether it was the right input to begin with. If retrieval consistently misses an important document, Faithfulness stays high (the model honored what it was given), Answer Relevance stays high (the answer fit the shape of the question), and users get the wrong answer anyway.

This is the Evaluation Gap. The structural reason is that inference-layer evaluation is reference-free, while rigorous context-layer evaluation needs to know what the right chunks were — which requires either a labeled set or a synthetic one. The workarounds — synthetic question generation, needle-in-a-haystack probes, downstream-impact proxies, query-conditioned retrieval auditing, retrieval-against-self — are all useful and all incomplete. The honest summary is that context-layer evaluation is the open frontier and teams should expect to invest some of their own engineering directly in retrieval quality. The frameworks help with the inference loop; they leave the retrieval loop mostly to the team.
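What a labeled set buys is easiest to see in code. The sketch below assumes a small hand-labeled mapping from queries to the chunk IDs that should be retrieved; the queries, the IDs, and the canned rankings are all invented for illustration, and `recall_at_k` and `evaluate_retriever` are hypothetical helper names rather than any framework's API.

```python
from typing import Callable, Dict, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled-relevant chunks that appear in the top k."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate_retriever(
    retriever: Callable[[str], List[str]],
    labeled_set: Dict[str, Set[str]],
    k: int,
) -> Tuple[float, List[str]]:
    """Return mean recall@k and the queries where something was missed."""
    if not labeled_set:
        raise ValueError("labeled_set is empty")
    recalls: List[float] = []
    missed: List[str] = []
    for query, relevant in labeled_set.items():
        recall = recall_at_k(retriever(query), relevant, k)
        recalls.append(recall)
        if recall < 1.0:
            missed.append(query)
    return sum(recalls) / len(recalls), missed


# Illustrative labeled set and a canned "retriever" (both made up).
LABELED = {
    "refund window": {"d1", "d2"},
    "late fee policy": {"d4"},
}
RANKINGS = {
    "refund window": ["d1", "d3", "d2"],
    "late fee policy": ["d4", "d5"],
}

mean_recall, missed_queries = evaluate_retriever(
    lambda query: RANKINGS[query], LABELED, k=2
)
print(mean_recall, missed_queries)
```

The list of missed queries is the part an inference-layer metric cannot produce, because it requires knowing which chunks should have been there.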

The pattern that mature teams converge on is three loops, not one. Inner loop: a metric-first library (typically RAGAS or DeepEval) runs the Triad on a fixed regression set on every meaningful change, fast and deterministic, oriented at catching regressions. Outer loop: an observability platform handles production trace storage, online metric computation against sampled traffic, dashboards, and alerts — oriented at drift the regression set will miss. Slow loop: a small human review function calibrates the LLM judges, audits flagged traces, and maintains the regression set as the product evolves. A team with only the inner loop ships drifting production. A team with only the outer loop sees the drift but cannot debug it. A team with both but no human review trusts judges that are quietly going wrong. All three are necessary, and the value of the frameworks is how much of each loop they make easy.
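Two small pieces of glue show up in most versions of this pattern: a regression gate for the inner loop and a sampler for the outer loop. The sketch below is one possible shape under stated assumptions. The metric names, the baseline numbers, the 0.02 tolerance, and the function names `find_regressions` and `should_sample` are all hypothetical, and the scores would come from whichever metric library the team runs.

```python
import hashlib
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Inner loop: report metrics that dropped by more than the tolerance."""
    regressions: Dict[str, float] = {}
    for metric, base_score in baseline.items():
        drop = base_score - current[metric]  # KeyError if a metric is missing
        if drop > tolerance:
            regressions[metric] = drop
    return regressions


def should_sample(trace_id: str, rate: float) -> bool:
    """Outer loop: deterministically pick a fixed share of production traces."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


# Illustrative scores on a fixed regression set (numbers are made up).
baseline = {"faithfulness": 0.90, "answer_relevance": 0.85, "context_relevance": 0.80}
current = {"faithfulness": 0.91, "answer_relevance": 0.78, "context_relevance": 0.80}

regressions = find_regressions(baseline, current)
print(sorted(regressions))
print(should_sample("trace-42", rate=0.1))
```

Hashing the trace ID rather than calling a random generator means the same trace is always in or out of the sample, which keeps online scores reproducible when a trace is re-examined in human review.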

Worth holding onto: the discipline that distinguishes teams that ship reliable RAG is treating evaluation as engineering. Metrics are code. Test sets are data assets. Judges are dependencies. Calibration is recurring maintenance. The frameworks make this discipline cheaper to practice; none of them substitute for it.

What Chapter 10 sets up

Between the Triad of Chapter 9 and the frameworks of Chapter 10, a team has what it needs to measure a RAG system: a vocabulary, an honest accounting of what the vocabulary misses, and a small set of tools that turn the measurements into dashboards. The system will be legible. But measurement is only half the production story. A system that is measured but not maintained will degrade anyway, because the documents change, the users change, and the underlying models change. Knowing quality has dropped is not the same as being able to restore it.


Next — Chapter 11: Continuous Updates and Pipeline Optimization. CDC and incremental indexing, semantic caching and model tiering, and the four-stage feedback loop that turns production telemetry into the kind of system that actually gets better the longer it runs.

Want the full picture? The book carries the framework-by-framework comparison further (synthetic data generation, cost structure, prompt portability, the lock-in implications of each platform's data model) and walks through a concrete three-loop deployment used by teams the author has worked with.

SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.