rag-auditor
RAG Auditor
Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.
Reference Files
| File | Contents | Load When |
|---|---|---|
references/retrieval-metrics.md |
Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
references/generation-metrics.md |
Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
references/failure-taxonomy.md |
RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
references/diagnostic-queries.md |
Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |
Prerequisites
- Access to the RAG pipeline (or its outputs for post-hoc evaluation)
- A set of test queries with known-correct answers
- Understanding of the pipeline components (embedding model, retriever, generator)
Workflow
Phase 1: Pipeline Inventory
Document the RAG pipeline configuration:
- Document source — What documents are indexed? Format, count, size.
- Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
- Embedding — Model name and version, dimensionality.
- Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
- Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
- Generation — Model, prompt template, context window usage.
Phase 2: Design Evaluation Queries
Create a diverse set of test queries:
| Query Type | Purpose | Count |
|---|---|---|
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |
For each query, document the expected answer and the source chunk(s).
Phase 3: Evaluate Retrieval
For each test query, measure:
- Precision@K — Of the K retrieved chunks, how many are relevant?
- Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
- MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
- Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
Phase 4: Evaluate Generation
For each test query with retrieved context:
- Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
- Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
- Hallucination detection — Identify specific claims not supported by context.
- Abstention — For unanswerable queries, does the model correctly say "I don't know"?
Phase 5: Diagnose Failures
For every incorrect or low-quality response, classify the root cause:
| Failure Type | Diagnosis | Indicator |
|---|---|---|
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
Phase 6: Recommendations
Based on failure analysis, recommend specific improvements:
| Failure Pattern | Recommendation |
|---|---|
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
Output Format
## RAG Audit Report
### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |
### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}
### Retrieval Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |
### Generation Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |
### Failure Analysis
| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |
### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}
### Sample Failures
#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}
Calibration Rules
- Component isolation. Evaluate retrieval and generation independently. A great retriever with a bad generator looks like retrieval failure if you only check end output.
- Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
- Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
- Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.
Error Handling
| Problem | Resolution |
|---|---|
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |
When NOT to Audit
Push back if:
- The pipeline hasn't been built yet — design it first, audit after
- The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
- The user wants to compare embedding models — that's a benchmark task, not an audit
More from mathews-tom/praxis-skills
manuscript-review
Pre-publication manuscript audit producing a section-level refactoring report with citation hygiene and submission-readiness checks. Triggers on: "review my paper", "check before submission", "is this ready to submit", "pre-pub checklist", "refactor my paper", "check my references", "does the abstract work".
64html-presentation
Converts documents, outlines, or notes into self-contained HTML slide decks with horizontal (Reveal.js) or vertical scroll navigation and multiple themes. Triggers on: "create a presentation", "slide deck", "pitch deck", "HTML presentation", "web-based slides", "reveal.js deck", "convert document into slides".
59filesystem
File and directory operations via Claude Code built-in tools, replacing the Filesystem MCP server. Triggers on: "read this file", "write to file", "edit file", "find files matching", "search for text in files", "list directory", "show directory tree", "rename file".
39md-to-pdf
Convert Markdown to styled PDFs with Mermaid diagrams, LaTeX/KaTeX math, tables, and code highlighting. Triggers on: "convert markdown to pdf", "make a pdf from this md", "export markdown as pdf", "pdf from markdown with equations".
36concept-to-video
Turn concepts into animated explainer videos using Manim (Python) with MP4/GIF output, audio overlay, multi-scene composition. Triggers on: "create a video", "animate this", "make an explainer", "manim animation", "motion graphic". NOT for React video, use remotion-video.
32concept-to-image
Turn concepts into static HTML visuals exported as PNG or SVG files via HTML/CSS/SVG. Triggers on: "create an image of", "export as PNG", "save as SVG", "concept to image", "screenshot this HTML". NOT for interactive HTML, use static-web-artifacts-builder.
31