How I Evaluate RAG Systems


Evaluating RAG systems is harder than evaluating traditional ML models. You’re not just measuring accuracy — you’re measuring whether the system found the right information and used it correctly.

Here’s the framework I use.

The Three Layers of RAG Evaluation

  1. Retrieval Quality — Did we find the right documents?
  2. Answer Quality — Did we generate a good response from those documents?
  3. End-to-End Quality — Does the user get what they need?

Each layer requires different metrics and different evaluation approaches.


Layer 1: Retrieval Metrics

Before worrying about generation, make sure your retrieval is working.

Key Metrics

  • Recall@K — Of all relevant documents, how many did we retrieve in the top K?
  • MRR (Mean Reciprocal Rank) — How high did the first relevant document rank?
  • Precision@K — Of the K documents we retrieved, how many were relevant?
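To make those definitions concrete, here's a minimal sketch in Python. It assumes each evaluation query comes with a ranked list of retrieved document IDs and a labeled set of relevant IDs; the function names and inputs are illustrative, not tied to any particular library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of all relevant documents, how many showed up in the top k?"""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top k retrieved documents, how many were relevant?"""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document; 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# MRR is just the mean of reciprocal_rank across your evaluation queries:
# mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
```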

The Catch

You need labeled data — queries paired with their relevant documents. For a production system, this usually means:

  1. Start with synthetic queries generated from your documents
  2. Supplement with real user queries, manually labeled
  3. Continuously improve as you gather more data
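As a rough sketch of step 1: point an LLM at each document chunk and ask for questions it answers. Because each synthetic query is tied to the chunk it came from, you get (query, relevant document) pairs for free. The client, model name, and prompt below are assumptions for illustration, not a specific recommendation.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; swap in your own LLM client

def synthetic_queries(chunk_id: str, chunk_text: str, n: int = 3) -> list[dict]:
    """Generate n questions this chunk answers, labeled with the chunk as relevant."""
    prompt = (
        f"Write {n} short questions that the following passage answers.\n"
        "One question per line, no numbering.\n\n"
        f"Passage:\n{chunk_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [{"query": q.strip(), "relevant_ids": {chunk_id}} for q in lines if q.strip()]
```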

Layer 2: Answer Quality

Even with perfect retrieval, generation can fail. The model might:

  • Ignore the context and hallucinate
  • Misinterpret the retrieved information
  • Generate a technically correct but unhelpful response

What to Measure

  • Faithfulness — Is the answer grounded in the retrieved context?
  • Relevance — Does the answer address the question?
  • Completeness — Does the answer cover all important points?

Evaluation Approaches

LLM-as-Judge: Use a separate LLM to score responses. It's faster and cheaper than human evaluation, but calibrate its scores against human judgments before you trust them.
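Here's a minimal sketch of that idea for the faithfulness metric: a separate model grades whether the answer is supported by the retrieved context. The rubric, model name, and 1 to 5 scale are assumptions you'd tune, and the scores only mean something once you've checked them against human labels.

```python
from openai import OpenAI

client = OpenAI()  # any LLM client works; this one is just for illustration

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Question: {question}
Answer: {answer}

Give a score from 1 to 5, where 5 means every claim in the answer is
supported by the context and 1 means the answer is largely unsupported.
Reply with the score only."""

def faithfulness_score(question: str, context: str, answer: str) -> int:
    """Ask a judge model how well the answer is grounded in the context."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```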

Human Evaluation: The gold standard, but expensive. Use it to validate your automated metrics.


Layer 3: End-to-End Quality

Ultimately, what matters is whether users get what they need.

Proxy Metrics

  • User satisfaction ratings (if you can collect them)
  • Follow-up question rate (lower is usually better)
  • Time to resolution (for support use cases)
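None of these need fancy tooling. Follow-up question rate, for example, is just a count over logged sessions; the sketch below assumes each session is the list of queries a user asked in one conversation (a hypothetical shape, adjust to however you group your logs).

```python
def follow_up_rate(sessions: list[list[str]]) -> float:
    """Fraction of sessions where the user had to ask more than one question."""
    if not sessions:
        return 0.0
    return sum(1 for queries in sessions if len(queries) > 1) / len(sessions)
```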

The Feedback Loop

Build evaluation into your pipeline:

  1. Log every query, retrieval, and response
  2. Sample for human evaluation regularly
  3. Track metrics over time to catch regressions
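A flat, append-only trace log is enough to start. Here's a sketch of the kind of record worth capturing per interaction; the field names are illustrative, and you'd swap the JSONL file for whatever logging or observability stack you already run.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class RAGTrace:
    query: str
    retrieved_ids: list[str]
    answer: str
    model: str
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)
    user_rating: int | None = None         # filled in if the user leaves feedback
    judge_faithfulness: int | None = None  # filled in by offline evaluation

def log_trace(trace: RAGTrace, path: str = "rag_traces.jsonl") -> None:
    """Append one interaction as a JSON line for later sampling and metric tracking."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(trace)) + "\n")
```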

Practical Tips

  1. Start simple — Recall@10 and a basic faithfulness check get you surprisingly far
  2. Invest in labeled data — It’s the bottleneck for everything else
  3. Monitor in production — Offline eval isn’t enough; real queries are different
  4. Iterate on retrieval first — Bad retrieval dooms good generation

RAG evaluation isn’t a one-time task — it’s an ongoing practice. Build the infrastructure to measure continuously, and you’ll catch problems before your users do.