How I Evaluate RAG Systems
Evaluating RAG systems is harder than evaluating traditional ML models. You’re not just measuring accuracy — you’re measuring whether the system found the right information and used it correctly.
Here’s the framework I use.
The Three Layers of RAG Evaluation
- Retrieval Quality — Did we find the right documents?
- Answer Quality — Did we generate a good response from those documents?
- End-to-End Quality — Does the user get what they need?
Each layer requires different metrics and different evaluation approaches.
Layer 1: Retrieval Metrics
Before worrying about generation, make sure your retrieval is working.
Key Metrics
- Recall@K — Of all relevant documents, how many did we retrieve in the top K?
- MRR (Mean Reciprocal Rank) — How high did the first relevant document rank?
- Precision@K — Of the K documents we retrieved, how many were relevant? (All three metrics are sketched in code below.)
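Here's a minimal sketch of these three metrics in Python, assuming you have a ranked list of retrieved document IDs and a labeled set of relevant IDs for each query. MRR is just the per-query reciprocal rank averaged over all queries.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top K results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top K results that are relevant."""
    if k == 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# One query: three documents are labeled relevant, two show up in the top 5.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_8"}
print(recall_at_k(retrieved, relevant, k=5))     # 2/3, about 0.67
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(reciprocal_rank(retrieved, relevant))      # 0.5 (first hit at rank 2)
```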
The Catch
You need labeled data — queries paired with their relevant documents (one way to store these pairs is sketched after the list below). For a production system, this usually means:
- Start with synthetic queries generated from your documents
- Supplement with real user queries, manually labeled
- Continuously improve as you gather more data
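One lightweight way to store these labels, with each query's origin recorded so you know how much of the set is still synthetic. The field names here are illustrative, not a standard schema:

```python
# One record per labeled query. The essentials are the query text, the document
# IDs judged relevant, and where the label came from.
eval_set = [
    {
        "query": "How do I rotate an API key?",
        "relevant_doc_ids": ["kb_auth_04"],
        "source": "synthetic",      # generated from kb_auth_04 itself
    },
    {
        "query": "my key stopped working after the billing change",
        "relevant_doc_ids": ["kb_auth_04", "kb_billing_11"],
        "source": "human_labeled",  # real user query, labeled by a reviewer
    },
]
```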
Layer 2: Answer Quality
Even with perfect retrieval, generation can fail. The model might:
- Ignore the context and hallucinate
- Misinterpret the retrieved information
- Generate a technically correct but unhelpful response
What to Measure
- Faithfulness — Is the answer grounded in the retrieved context?
- Relevance — Does the answer address the question?
- Completeness — Does the answer cover all important points?
Evaluation Approaches
LLM-as-Judge: Use a separate LLM to score responses. It's faster and cheaper than human evaluation, but calibrate its scores against a sample of human judgments before trusting them.
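A minimal sketch of such a check for faithfulness, assuming a call_judge_model function that wraps whatever LLM client you use (a placeholder, not a real API). The same pattern works for relevance and completeness by swapping the rubric in the prompt.

```python
# Placeholder: call_judge_model(prompt) should call your judge LLM and return
# its text reply. It is not a real library function.

FAITHFULNESS_RUBRIC = """You are grading a RAG answer for faithfulness.

Context:
{context}

Question:
{question}

Answer:
{answer}

Rate from 1 to 5 how well the answer is supported by the context alone
(1 = largely unsupported, 5 = fully supported). Reply with only the number."""


def faithfulness_score(question: str, context: str, answer: str, call_judge_model) -> int:
    """Ask a separate judge LLM to rate how grounded the answer is, on a 1-5 scale."""
    prompt = FAITHFULNESS_RUBRIC.format(context=context, question=question, answer=answer)
    reply = call_judge_model(prompt)
    return int(reply.strip())  # assumes the judge complied; add parsing and retries in practice
```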
Human Evaluation: The gold standard, but expensive. Use it to validate your automated metrics.
Layer 3: End-to-End Quality
Ultimately, what matters is whether users get what they need.
Proxy Metrics
- User satisfaction ratings (if you can collect them)
- Follow-up question rate (lower is usually better; a computation sketch follows this list)
- Time to resolution (for support use cases)
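The follow-up question rate is usually the cheapest of these to compute. A minimal sketch, assuming an interaction log where each record carries a session_id and a timestamp (the schema is an assumption, not a standard):

```python
from collections import defaultdict
from datetime import timedelta


def follow_up_rate(log: list[dict], window: timedelta = timedelta(minutes=10)) -> float:
    """Fraction of queries followed by another query in the same session within
    `window`, a rough proxy for 'the first answer was not enough'."""
    by_session = defaultdict(list)
    for record in log:
        by_session[record["session_id"]].append(record["timestamp"])

    followed_up, total = 0, 0
    for timestamps in by_session.values():
        timestamps.sort()
        total += len(timestamps)
        for current, nxt in zip(timestamps, timestamps[1:]):
            if nxt - current <= window:
                followed_up += 1
    return followed_up / total if total else 0.0
```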
The Feedback Loop
Build evaluation into your pipeline:
- Log every query, retrieval, and response (a sample record layout is sketched after this list)
- Sample for human evaluation regularly
- Track metrics over time to catch regressions
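One way to structure that per-interaction record; the fields are a suggestion, not a required schema. The goal is to capture enough to replay and re-score any interaction later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RAGLogRecord:
    """One logged interaction: enough to replay and re-score it later."""
    query: str
    retrieved_doc_ids: list[str]
    retrieval_scores: list[float]
    answer: str
    model_version: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    human_label: str | None = None  # filled in when the record is sampled for review
```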
Practical Tips
- Start simple — Recall@10 and a basic faithfulness check get you surprisingly far
- Invest in labeled data — It’s the bottleneck for everything else
- Monitor in production — Offline eval isn’t enough; real queries are different
- Iterate on retrieval first — Bad retrieval dooms good generation
RAG evaluation isn’t a one-time task — it’s an ongoing practice. Build the infrastructure to measure continuously, and you’ll catch problems before your users do.