# rag-evaluation


Covers the non-obvious measurement traps, metric choices, and eval infrastructure decisions for RAG pipelines. Assumes you have a working RAG pipeline and want to measure it.


## 1. The Three Failure Modes and Which Metrics Catch Them

RAG fails in three distinct ways. Each requires different metrics — conflating them produces misleading aggregate scores.

| Failure mode | Example | Metric |
| --- | --- | --- |
| Retrieval misses relevant doc | Right answer exists, never retrieved | Recall@k |
| Retrieval returns irrelevant docs | Retrieved docs don't support the answer | Precision@k, Context Relevance |
| Generation hallucinates | Docs retrieved correctly, answer fabricated | Faithfulness, Answer Grounding |
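The two retrieval metrics in the table can be computed directly from ranked retrieval results and a gold set of relevant doc IDs. A minimal sketch (function names and the example doc IDs are illustrative, not from the original):

```python
def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are actually relevant."""
    relevant_set = set(relevant)
    return sum(1 for d in retrieved[:k] if d in relevant_set) / k


# Hypothetical example: the gold answer lives in docs d3 and d5.
retrieved = ["d1", "d3", "d7", "d9"]
relevant = ["d3", "d5"]
print(recall_at_k(retrieved, relevant, k=4))     # 0.5  (d3 found, d5 missed)
print(precision_at_k(retrieved, relevant, k=4))  # 0.25 (1 of 4 docs relevant)
```

Note the denominators differ: recall divides by the number of relevant docs, precision by k, which is why the same retrieval run scores differently on each.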

**Non-obvious:** A high faithfulness score with low retrieval recall is a trap — the model is faithfully generating from the wrong context. Always report retrieval and generation metrics together; never report faithfulness alone.

