# rag-evaluation


Covers the non-obvious measurement traps, metric choices, and eval infrastructure decisions for RAG pipelines. Assumes you have a working RAG pipeline and want to measure it.


## 1. The Three Failure Modes and Which Metrics Catch Them

RAG fails in three distinct ways. Each requires different metrics — conflating them produces misleading aggregate scores.

| Failure mode | Example | Metric |
| --- | --- | --- |
| Retrieval misses relevant doc | Right answer exists, never retrieved | Recall@k |
| Retrieval returns irrelevant docs | Retrieved docs don't support the answer | Precision@k, Context Relevance |
| Generation hallucinates | Docs retrieved correctly, answer fabricated | Faithfulness, Answer Grounding |
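The two retrieval metrics in the table can be computed directly from ranked retrieval results and a gold set of relevant doc IDs. A minimal sketch (function names and the example doc IDs are illustrative, not from the original):

```python
def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are actually relevant."""
    relevant_set = set(relevant)
    return sum(1 for d in retrieved[:k] if d in relevant_set) / k


# Hypothetical example: the gold answer lives in docs d3 and d5.
retrieved = ["d1", "d3", "d7", "d9"]
relevant = ["d3", "d5"]
print(recall_at_k(retrieved, relevant, k=4))     # 0.5  (d3 found, d5 missed)
print(precision_at_k(retrieved, relevant, k=4))  # 0.25 (1 of 4 docs relevant)
```

Note the denominators differ: recall divides by the number of relevant docs, precision by k, which is why the same retrieval run scores differently on each.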

**Non-obvious:** A high faithfulness score with low retrieval recall is a trap — the model is faithfully generating from the wrong context. Always report retrieval and generation metrics together; never report faithfulness alone.

