RAG Architect
The agent designs, implements, and optimizes production-grade Retrieval-Augmented Generation pipelines, covering the full lifecycle from document chunking through evaluation.
Workflow
- Analyse corpus -- Profile the document collection: count, average length, format mix (PDF, HTML, Markdown), language(s), and domain. Validate that sample documents are accessible before proceeding.
- Select chunking strategy -- Choose from the Chunking Strategy Matrix based on corpus characteristics. Set chunk size, overlap, and boundary rules. Run a test split on 100 sample documents.
- Choose embedding model -- Select an embedding model from the Embedding Model table based on domain, latency budget, and cost constraints. Verify dimension compatibility with the target vector database.
- Select vector database -- Pick a vector store from the Vector Database Comparison based on scale, query patterns, and operational requirements.
- Design retrieval pipeline -- Configure retrieval strategy (dense, sparse, or hybrid). Add reranking if precision requirements exceed 0.85. Set the top-K parameter and similarity threshold.
- Implement query transformations -- If query-document style mismatch exists, enable HyDE. If queries are ambiguous, enable multi-query generation. Validate each transformation improves retrieval metrics on a held-out set.
- Configure guardrails -- Enable PII detection, toxicity filtering, hallucination detection, and source attribution. Set confidence score thresholds.
- Evaluate end-to-end -- Run the RAGAS evaluation framework. Verify faithfulness > 0.90, context relevance > 0.80, answer relevance > 0.85. Iterate on weak components.
Chunking Strategy Matrix
| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed-size (token) | Uniform docs, consistent sizing | 512-2048 tokens | 10-20% | Predictable, simple | Breaks semantic units |
| Sentence-based | Narrative text, articles | 3-8 sentences | 1 sentence | Preserves natural sentence boundaries | Variable sizes |
| Paragraph-based | Structured docs, technical manuals | 1-3 paragraphs | 0-1 paragraph | Preserves topic coherence | Highly variable sizes |
| Semantic | Long-form, research papers | Dynamic | Topic-shift detection | Best coherence | Computationally expensive |
| Recursive | Mixed content types | Dynamic, multi-level | Per-level | Optimal utilization | Complex implementation |
| Document-aware | Multi-format collections | Format-specific | Section-level | Preserves metadata | Format-specific code required |
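The fixed-size row above takes only a few lines to implement. A minimal sketch, using whitespace tokens as a stand-in for a real tokenizer (swap in tiktoken or the embedding model's tokenizer in production):

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64):
    """Split text into fixed-size token windows with overlap.

    Whitespace "tokens" stand in for a real tokenizer here;
    use the embedding model's own tokenizer in production.
    """
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With a 512-token window and 64-token overlap, consecutive chunks share their boundary tokens, which is what keeps sentences split at a chunk edge recoverable from the neighbor.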
Embedding Model Comparison
| Model | Dimensions | Speed | Quality | Cost | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~14K tok/s | Good | Free (local) | Prototyping, low-latency |
| all-mpnet-base-v2 | 768 | ~2.8K tok/s | Better | Free (local) | Balanced production use |
| text-embedding-3-small | 1536 | API | High | $0.02/1M tokens | Cost-effective production |
| text-embedding-3-large | 3072 | API | Highest | $0.13/1M tokens | Maximum quality |
| Domain fine-tuned | Varies | Varies | Domain-best | Training cost | Specialized domains (legal, medical) |
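At per-token prices, ingestion cost is simple arithmetic. A sketch for the API-priced rows, using the prices listed in the table (not guaranteed current):

```python
PRICE_PER_1M = {  # USD per 1M tokens, from the table above
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def ingestion_cost(num_docs: int, avg_tokens: int, model: str) -> float:
    """Estimated one-off cost to embed a whole corpus."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

# 15,000 docs averaging 2,400 tokens (the example corpus later in this doc)
print(round(ingestion_cost(15_000, 2_400, "text-embedding-3-small"), 2))  # → 0.72
```

The same corpus on `text-embedding-3-large` costs about 6.5x more, which is why the small model is the default for cost-effective production.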
Vector Database Comparison
| Database | Type | Scaling | Key Feature | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Auto-scaling | Metadata filtering, hybrid search | Production, managed preference |
| Weaviate | Open source | Horizontal | GraphQL API, multi-modal | Complex data types |
| Qdrant | Open source | Distributed | High perf, low memory (Rust) | Performance-critical |
| Chroma | Embedded | Limited | Simple API, SQLite-backed | Prototyping, small-scale |
| pgvector | PostgreSQL ext | PostgreSQL scaling | ACID, SQL joins | Existing PostgreSQL infra |
Retrieval Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Dense (vector similarity) | Default for semantic search | Cosine similarity with k-NN/ANN |
| Sparse (BM25/TF-IDF) | Exact keyword matching needed | Elasticsearch or inverted index |
| Hybrid (dense + sparse) | Best of both needed | Reciprocal Rank Fusion (RRF) with tuned weights |
| + Reranking | Precision must exceed 0.85 | Cross-encoder reranker after initial retrieval |
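Hybrid fusion via RRF is compact enough to sketch directly, with the conventional k=60 constant and optional per-list weights (a dense/sparse weighting like 0.7/0.3 plugs in here):

```python
def rrf(rankings, weights=None, k=60):
    """Reciprocal Rank Fusion: score(d) = sum_i w_i / (k + rank_i(d)).

    `rankings` is a list of ranked doc-id lists, e.g. [dense, sparse];
    k=60 is the constant from the original RRF formulation.
    """
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for w, ranking in zip(weights, rankings):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists accumulate score from both terms, so agreement between dense and sparse retrieval is rewarded without having to normalize their incomparable raw scores.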
Query Transformation Techniques
| Technique | When to Use | How It Works |
|---|---|---|
| HyDE | Query/document style mismatch | LLM generates hypothetical answer; embed that instead of query |
| Multi-query | Ambiguous queries | Generate 3-5 query variations; retrieve for each; deduplicate |
| Step-back | Specific questions needing general context | Transform to broader query; retrieve general + specific |
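Multi-query retrieval reduces to fan-out plus order-preserving deduplication. A sketch with `generate_variations` and `retrieve` left as stubs for your LLM and vector store (both names are placeholders, not a real API):

```python
def multi_query_retrieve(query, generate_variations, retrieve, n_variations=3):
    """Fan a query out to several LLM-generated phrasings, retrieve
    for each, and merge results in first-seen order, deduplicated."""
    queries = [query] + list(generate_variations(query, n_variations))
    seen, merged = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Keeping the original query first in the fan-out means its hits win ties, so the transformation can only add candidates, never displace the baseline retrieval.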
Context Window Optimization
- Relevance ordering: Most relevant chunks first in the context window
- Diversity: Deduplicate semantically similar chunks
- Token budget: Fit within model context limit; reserve tokens for system prompt and answer
- Hierarchical inclusion: Include section summary before detailed chunks when available
- Compression: Summarize low-relevance chunks; extract key facts from verbose passages
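The relevance-ordering and token-budget rules above combine into a simple greedy packer. A sketch using a crude `len(text.split())` token count (swap in a real tokenizer), with headroom reserved for the system prompt and answer:

```python
def pack_context(chunks, budget=8192, reserved=1500):
    """Greedily fill the context window with the most relevant chunks.

    `chunks` is a list of (score, text) pairs; `reserved` holds back
    tokens for the system prompt and the model's answer.
    """
    available = budget - reserved
    packed, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())  # crude count; use a tokenizer in production
        if used + cost <= available:
            packed.append(text)
            used += cost
    return packed
```

Because chunks are sorted by score first, the packer drops the least relevant material when the budget runs out, which is exactly the ordering rule in the first bullet.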
Evaluation Metrics (RAGAS Framework)
| Metric | Target | What It Measures |
|---|---|---|
| Faithfulness | > 0.90 | Answers grounded in retrieved context |
| Context Relevance | > 0.80 | Retrieved chunks relevant to query |
| Answer Relevance | > 0.85 | Answer addresses the original question |
| Precision@K | > 0.70 | % of top-K results that are relevant |
| Recall@K | > 0.80 | % of relevant docs found in top-K |
| MRR | > 0.75 | Reciprocal rank of first relevant result |
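The ranking metrics in the table have one-line definitions. A sketch computing Precision@K, Recall@K, and MRR from ranked result lists and sets of known-relevant ids:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-K results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs found in the top-K."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                total += 1 / i
                break
    return total / len(ranked_lists)
```

RAGAS covers the generation-side metrics (faithfulness, answer relevance); these three are the retrieval-side checks you can run against any labeled query set without an LLM judge.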
Guardrails
- PII detection: Scan retrieved chunks and generated responses for PII; redact or block
- Hallucination detection: Compare generated claims against source documents via NLI
- Source attribution: Every factual claim must cite a retrieved chunk
- Confidence scoring: Return confidence level; if below threshold, return "I don't have enough information"
- Injection prevention: Sanitize user queries; reject prompt injection attempts
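The confidence-scoring guardrail is a thin wrapper around the retriever's top score. A sketch, where the 0.6 threshold, the refusal wording, and both callables are assumptions, not fixed APIs:

```python
REFUSAL = "I don't have enough information to answer that."

def answer_with_guardrail(query, retrieve, generate, min_confidence=0.6):
    """Refuse rather than guess when the best retrieval score is weak.

    `retrieve` returns (score, chunk) pairs sorted best-first;
    `generate` is the LLM call that answers from the context.
    """
    hits = retrieve(query)
    if not hits or hits[0][0] < min_confidence:
        return REFUSAL
    context = [chunk for _, chunk in hits]
    return generate(query, context)
```

Refusing before generation also saves the LLM call entirely on out-of-scope queries, which doubles as a cost control.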
Example: Internal Knowledge Base RAG Pipeline
```yaml
corpus:
  documents: 12,000 Confluence pages + 3,000 PDFs
  avg_length: 2,400 tokens
  languages: [English]
  domain: internal engineering docs
pipeline:
  chunking:
    strategy: recursive
    max_tokens: 512
    overlap: 50 tokens
    boundary: paragraph
  embedding:
    model: text-embedding-3-small
    dimensions: 1536
    batch_size: 100
  vector_db:
    engine: pgvector
    index: HNSW (ef_construction=128, m=16)
    reason: "Existing PostgreSQL infra; ACID compliance for audit"
  retrieval:
    strategy: hybrid
    dense_weight: 0.7
    sparse_weight: 0.3
    top_k: 10
    reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
    final_k: 5
evaluation_results:
  faithfulness: 0.93
  context_relevance: 0.84
  answer_relevance: 0.88
  precision_at_5: 0.76
  recall_at_10: 0.85
```
Production Patterns
- Caching: Query-level (exact match), semantic (similar queries via embedding distance < 0.05), chunk-level (embedding cache)
- Streaming: Stream generation tokens while retrieval completes; show sources after generation
- Fallbacks: If primary vector DB is unavailable, serve from read-replica; if retrieval returns no results above threshold, say so explicitly
- Document refresh: Incremental re-embedding on change detection; full re-index weekly
- Cost control: Batch embeddings, cache aggressively, route simple queries to BM25 only
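The semantic-cache pattern above (serve a cached answer when a new query embeds within cosine distance 0.05 of an answered one) can be sketched with plain vector math; a linear scan stands in for the ANN lookup you would use at scale:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a new query's embedding falls
    within `max_distance` of a previously answered query's."""

    def __init__(self, max_distance=0.05):
        self.max_distance = max_distance
        self.entries = []  # (embedding, answer) pairs

    def get(self, embedding):
        for cached_emb, answer in self.entries:
            if cosine_distance(embedding, cached_emb) <= self.max_distance:
                return answer
        return None  # cache miss: run the full pipeline, then put()

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

Layer it in front of exact-match query caching: exact match catches repeats for free, and the semantic tier catches paraphrases at the cost of one embedding call.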
Common Pitfalls
| Problem | Solution |
|---|---|
| Chunks break mid-sentence | Use boundary-aware chunking with sentence/paragraph overlap |
| Low retrieval precision | Add cross-encoder reranker; tune similarity threshold |
| High latency (> 2s) | Cache embeddings; use faster model; reduce top-K |
| Inconsistent quality | Implement RAGAS evaluation in CI; add quality scoring |
| Scalability bottleneck | Shard vector DB; implement auto-scaling; add caching layer |
Scripts
- Chunking Optimizer -- Analyses the corpus and recommends an optimal chunking strategy with parameters.
- Retrieval Evaluator -- Runs the evaluation suite (precision, recall, MRR, NDCG) against a test query set.
- Pipeline Benchmarker -- Measures end-to-end latency, throughput, and cost per query across configurations.
Repository: borghei/claude-skills (first seen Feb 28, 2026)