How I Evaluate RAG Systems
Evaluating RAG systems is harder than evaluating traditional ML models. You’re not just measuring accuracy — you’re measuring whether the system found the right information and used it correctly.
Here’s the framework I use.
The Three Layers of RAG Evaluation
- Retrieval Quality — Did we find the right documents?
- Answer Quality — Did we generate a good response from those documents?
- End-to-End Quality — Does the user get what they need?
Each layer requires different metrics and different evaluation approaches.
Layer 1: Retrieval Metrics
Before worrying about generation, make sure your retrieval is working.
Key Metrics
- Recall@K — Of all relevant documents, how many did we retrieve in the top K?
- MRR (Mean Reciprocal Rank) — How high did the first relevant document rank?
- Precision@K — Of the K documents we retrieved, how many were relevant? (All three metrics are sketched in code below.)
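Here's a minimal sketch of these three metrics in Python, assuming you have a ranked list of retrieved document IDs and a labeled set of relevant IDs for each query. MRR is just the per-query reciprocal rank averaged over all queries.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top K results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top K results that are relevant."""
    if k == 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# One query: three documents are labeled relevant, two show up in the top 5.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_8"}
print(recall_at_k(retrieved, relevant, k=5))     # 2/3, about 0.67
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(reciprocal_rank(retrieved, relevant))      # 0.5 (first hit at rank 2)
```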
The Catch
You need labeled data — queries paired with their relevant documents (one way to store these pairs is sketched after the list below). For a production system, this usually means:
- Start with synthetic queries generated from your documents
- Supplement with real user queries, manually labeled
- Continuously improve as you gather more data
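One lightweight way to store these labels, with each query's origin recorded so you know how much of the set is still synthetic. The field names here are illustrative, not a standard schema:

```python
# One record per labeled query. The essentials are the query text, the document
# IDs judged relevant, and where the label came from.
eval_set = [
    {
        "query": "How do I rotate an API key?",
        "relevant_doc_ids": ["kb_auth_04"],
        "source": "synthetic",      # generated from kb_auth_04 itself
    },
    {
        "query": "my key stopped working after the billing change",
        "relevant_doc_ids": ["kb_auth_04", "kb_billing_11"],
        "source": "human_labeled",  # real user query, labeled by a reviewer
    },
]
```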
Layer 2: Answer Quality
Even with perfect retrieval, generation can fail. The model might:
- Ignore the context and hallucinate
- Misinterpret the retrieved information
- Generate a technically correct but unhelpful response
What to Measure
- Faithfulness — Is the answer grounded in the retrieved context?
- Relevance — Does the answer address the question?
- Completeness — Does the answer cover all important points?
Evaluation Approaches
LLM-as-Judge: Use a separate LLM to score responses. It's faster and cheaper than human evaluation, but calibrate its scores against a sample of human judgments before trusting them.
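A minimal sketch of such a check for faithfulness, assuming a call_judge_model function that wraps whatever LLM client you use (a placeholder, not a real API). The same pattern works for relevance and completeness by swapping the rubric in the prompt.

```python
# Placeholder: call_judge_model(prompt) should call your judge LLM and return
# its text reply. It is not a real library function.

FAITHFULNESS_RUBRIC = """You are grading a RAG answer for faithfulness.

Context:
{context}

Question:
{question}

Answer:
{answer}

Rate from 1 to 5 how well the answer is supported by the context alone
(1 = largely unsupported, 5 = fully supported). Reply with only the number."""


def faithfulness_score(question: str, context: str, answer: str, call_judge_model) -> int:
    """Ask a separate judge LLM to rate how grounded the answer is, on a 1-5 scale."""
    prompt = FAITHFULNESS_RUBRIC.format(context=context, question=question, answer=answer)
    reply = call_judge_model(prompt)
    return int(reply.strip())  # assumes the judge complied; add parsing and retries in practice
```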
Human Evaluation: The gold standard, but expensive. Use it to validate your automated metrics.
Layer 3: End-to-End Quality
Ultimately, what matters is whether users get what they need.
Proxy Metrics
- User satisfaction ratings (if you can collect them)
- Follow-up question rate (lower is usually better; a computation sketch follows this list)
- Time to resolution (for support use cases)
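The follow-up question rate is usually the cheapest of these to compute. A minimal sketch, assuming an interaction log where each record carries a session_id and a timestamp (the schema is an assumption, not a standard):

```python
from collections import defaultdict
from datetime import timedelta


def follow_up_rate(log: list[dict], window: timedelta = timedelta(minutes=10)) -> float:
    """Fraction of queries followed by another query in the same session within
    `window`, a rough proxy for 'the first answer was not enough'."""
    by_session = defaultdict(list)
    for record in log:
        by_session[record["session_id"]].append(record["timestamp"])

    followed_up, total = 0, 0
    for timestamps in by_session.values():
        timestamps.sort()
        total += len(timestamps)
        for current, nxt in zip(timestamps, timestamps[1:]):
            if nxt - current <= window:
                followed_up += 1
    return followed_up / total if total else 0.0
```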
The Feedback Loop
Build evaluation into your pipeline:
- Log every query, retrieval, and response (a sample record layout is sketched after this list)
- Sample for human evaluation regularly
- Track metrics over time to catch regressions
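One way to structure that per-interaction record; the fields are a suggestion, not a required schema. The goal is to capture enough to replay and re-score any interaction later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RAGLogRecord:
    """One logged interaction: enough to replay and re-score it later."""
    query: str
    retrieved_doc_ids: list[str]
    retrieval_scores: list[float]
    answer: str
    model_version: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    human_label: str | None = None  # filled in when the record is sampled for review
```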
Practical Tips
- Start simple — Recall@10 and a basic faithfulness check get you surprisingly far
- Invest in labeled data — It’s the bottleneck for everything else
- Monitor in production — Offline eval isn’t enough; real queries are different
- Iterate on retrieval first — Bad retrieval dooms good generation
RAG evaluation isn’t a one-time task — it’s an ongoing practice. Build the infrastructure to measure continuously, and you’ll catch problems before your users do.