# RAG Observability and Evaluations

Run retrieval-augmented generation like a measurable production system, not a black box.
## What to Measure

### Retrieval Quality
- Recall@k and MRR for top-k chunks
- Citation coverage and source freshness
- Embedding drift and index staleness
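Recall@k and MRR can be computed directly from retrieved document ids and a gold relevance set; a minimal sketch (function names are illustrative, not a specific library's API):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)

def mrr(retrieved_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit across queries."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved_lists)
```

Tracking these per retriever configuration over time is what surfaces embedding drift and index staleness as metric regressions rather than anecdotes.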
### Generation Quality
- Groundedness score (answer supported by retrieved context)
- Hallucination rate by route/use case
- Instruction adherence and format validity
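Groundedness is usually scored with an NLI model or LLM judge; as a hedged illustration of the shape of the metric, here is a crude lexical proxy (the overlap threshold is an assumption, not a recommended value):

```python
def groundedness_score(answer_sentences, context, min_overlap=0.5):
    """Crude lexical proxy for groundedness: the fraction of answer
    sentences whose words mostly appear in the retrieved context.
    Production systems typically use an NLI model or LLM judge instead."""
    context_words = set(context.lower().split())
    if not answer_sentences:
        return 0.0
    supported = 0
    for sentence in answer_sentences:
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap >= min_overlap:
            supported += 1
    return supported / len(answer_sentences)
```

Hallucination rate per route then falls out as `1 - groundedness` aggregated over sampled traffic, whichever judge you use.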
### Reliability and Cost
- p50/p95 latency split by retrieval vs generation
- Token usage per stage
- Cache hit rate and cost per successful answer
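Splitting p50/p95 by stage only requires tagging each latency sample with its stage; a minimal sketch using the standard library:

```python
import statistics

def stage_percentiles(samples_ms):
    """Compute p50/p95 latency per pipeline stage from raw ms samples.
    samples_ms maps stage name -> list of latency samples."""
    out = {}
    for stage, values in samples_ms.items():
        # quantiles(n=20) yields 19 cut points: index 9 is p50, index 18 is p95
        q = statistics.quantiles(sorted(values), n=20)
        out[stage] = {"p50": q[9], "p95": q[18]}
    return out
```

Reporting retrieval and generation separately matters because a p95 regression in one stage is often masked when only end-to-end latency is tracked.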
## Evaluation Pipeline
- Curate a benchmark set with gold answers and source docs.
- Run nightly offline evals for every retriever/model configuration.
- Execute online shadow evals on sampled production traffic.
- Gate releases on minimum quality + safety + latency thresholds.
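The release gate in the last step can be a plain threshold table checked in CI; a minimal sketch (metric names and threshold values are hypothetical, tune per product):

```python
# Hypothetical gate table: metric -> (threshold, "min" or "max").
GATES = {
    "groundedness": (0.90, "min"),
    "hallucination_rate": (0.02, "max"),
    "p95_latency_ms": (2500, "max"),
}

def release_gate(metrics):
    """Return (passed, failures) for a candidate config's eval metrics."""
    failures = []
    for name, (threshold, kind) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if kind == "min" else value <= threshold
        if not ok:
            failures.append(f"{name}={value} violates {kind} {threshold}")
    return (not failures, failures)
```

Running the same gate against nightly offline evals and sampled shadow traffic keeps the two pipelines comparable.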
## Alerting Strategy
Page on:
- sharp decline in groundedness
- spike in unanswered or fallback responses
- index freshness SLA breach
- cost-per-answer anomaly
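The page conditions above reduce to relative-change rules against a rolling baseline; a minimal sketch (the rule shapes and thresholds are assumptions, not recommended values):

```python
def should_page(baseline, current, rules):
    """Return the metrics that cross their relative-change rule.
    rules maps metric -> ("drop" or "spike", max relative change)."""
    alerts = []
    for metric, (direction, limit) in rules.items():
        base, cur = baseline[metric], current[metric]
        if base == 0:
            continue  # avoid division by zero; handle separately
        change = (cur - base) / base
        if direction == "drop" and change <= -limit:
            alerts.append(metric)
        elif direction == "spike" and change >= limit:
            alerts.append(metric)
    return alerts
```

In practice these checks live in your alerting system (e.g. recorded metrics plus threshold rules), not application code; the sketch just makes the logic explicit.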
## Practical Guardrails
- Force citations for high-risk domains.
- Return abstain/fallback when confidence is below threshold.
- Re-rank retrieved chunks before final generation.
- Use query rewriting only with strict regression tests.
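The abstain/fallback guardrail can be sketched as a wrapper around retrieval and generation; `retrieve` and `generate` here are injected callables standing in for your actual stack, and the confidence threshold is a placeholder:

```python
def answer_or_abstain(question, retrieve, generate, min_score=0.35):
    """Abstain instead of generating when retrieval confidence is low.
    retrieve(question) -> list of (chunk_text, score) pairs;
    generate(question, chunks) -> answer string. Both are hypothetical."""
    chunks = retrieve(question)
    if not chunks or max(score for _, score in chunks) < min_score:
        return {"answer": None, "abstained": True,
                "reason": "low retrieval confidence"}
    answer = generate(question, [text for text, _ in chunks])
    return {"answer": answer, "abstained": False, "reason": None}
```

Logging the abstain rate closes the loop with the alerting section: a spike in fallbacks is often the first visible symptom of an index or ingestion failure.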
## Incident Triage Checklist
- Did embedding model change?
- Did chunking/indexing logic change?
- Did source corpus ingestion fail?
- Did gateway route to unintended model tier?
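Most of the triage questions above are answerable mechanically if each deploy records a config snapshot; a minimal diff helper (the snapshot keys are illustrative):

```python
def config_diff(before, after):
    """Return the pipeline settings that changed between two snapshots,
    e.g. embedding model, chunk size, or gateway model tier."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}
```

Diffing the snapshot from the last known-good eval run against the current one turns the checklist into a single lookup.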
## Related Skills
- `rag-infrastructure`: Deploy robust RAG backends
- `agent-observability`: Instrument requests, traces, and costs
- `agent-evals`: Build repeatable eval suites