RAG Architecture
When to Use This Skill
Use this skill when:
- Designing RAG pipelines for LLM applications
- Choosing chunking and embedding strategies
- Optimizing retrieval quality and relevance
- Building knowledge-grounded AI systems
- Implementing hybrid search (dense + sparse)
- Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
RAG Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Ingestion │ │ Indexing │ │ Vector Store │ │
│ │ Pipeline │───▶│ Pipeline │───▶│ (Embeddings) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │ │ │ │
│ Documents Chunks + Indexed │
│ Embeddings Vectors │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Query │ │ Retrieval │ │ Context Assembly │ │
│ │ Processing │───▶│ Engine │───▶│ + Generation │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │ │ │ │
│ User Query Top-K Chunks LLM Response │
│ │
└─────────────────────────────────────────────────────────────────────┘
Document Ingestion Pipeline
Document Processing Steps
Raw Documents
│
▼
┌─────────────┐
│ Extract │ ← PDF, HTML, DOCX, Markdown
│ Content │
└─────────────┘
│
▼
┌─────────────┐
│ Clean & │ ← Remove boilerplate, normalize
│ Normalize │
└─────────────┘
│
▼
┌─────────────┐
│ Chunk │ ← Split into retrievable units
│ Documents │
└─────────────┘
│
▼
┌─────────────┐
│ Generate │ ← Create vector representations
│ Embeddings │
└─────────────┘
│
▼
┌─────────────┐
│ Store │ ← Persist vectors + metadata
│ in Index │
└─────────────┘
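The following is a minimal end-to-end sketch of this ingestion flow. The interfaces are assumptions, not a specific SDK: `embed_fn` stands in for any text-to-vector callable, and `store` for any vector-store client with an upsert-style method.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    doc_id: str
    chunk_index: int
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)

def ingest_document(doc_id: str, raw_text: str, embed_fn, store,
                    size: int = 200, overlap: int = 40) -> list[ChunkRecord]:
    """Clean -> chunk -> embed -> store, mirroring the diagram above.
    embed_fn(text) -> list[float] and store.upsert(records) are assumed interfaces."""
    # Clean & normalize (placeholder: collapse whitespace)
    words = " ".join(raw_text.split()).split()
    # Chunk into overlapping word windows (token-based chunking is shown in the next section)
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]
    # Embed each chunk and persist vectors + metadata in the index
    records = [
        ChunkRecord(doc_id, i, text, embed_fn(text), {"source": doc_id})
        for i, text in enumerate(chunks) if text
    ]
    store.upsert(records)  # assumed vector-store interface
    return records
```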
Chunking Strategies
Strategy Comparison
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
Chunking Decision Tree
What type of content?
├── Code
│ └── AST-based or function-level chunking
├── Tables/Structured
│ └── Keep tables intact, chunk surrounding text
├── Long narrative
│ └── Semantic or recursive chunking
├── Short documents (<1 page)
│ └── Whole document as chunk
└── Mixed content
└── Recursive with type-specific handlers
Chunk Overlap
Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
↑
Information lost at boundary
With Overlap (20%):
[Chunk 1: "The quick brown fox"]
[Chunk 2: "brown fox jumps over"]
↑
Context preserved across boundaries
Recommended overlap: 10-20% of chunk size
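A minimal sketch of fixed-size chunking with roughly 20% overlap, counting tokens with `tiktoken` (assumed available; any tokenizer with encode/decode works):

```python
import tiktoken  # assumed available; any tokenizer with encode/decode works

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 50) -> list[str]:
    """Split text into token windows of `chunk_size`, with `overlap` tokens
    shared between consecutive chunks (~20% of the chunk size)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # avoid a tiny trailing chunk that only repeats the end
    return chunks
```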
Chunk Size Trade-offs
Smaller Chunks (128-256 tokens) Larger Chunks (512-1024 tokens)
├── More precise retrieval ├── More context per chunk
├── Less context per chunk ├── May include irrelevant content
├── More chunks to search ├── Fewer chunks to search
├── Better for factoid Q&A ├── Better for summarization
└── Higher retrieval precision       └── Higher retrieval recall
Embedding Models
Model Comparison
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Embedding Selection
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
└── Need self-hosted/open source?
├── Yes → BGE-large or E5-large-v2
└── No
└── Need multilingual?
├── Yes → Cohere embed-v3
└── No → OpenAI text-embedding-3-small
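A hedged sketch of generating embeddings with the open-source BGE model via `sentence-transformers` (package and model assumed installed; swap in an API client for hosted models):

```python
from sentence_transformers import SentenceTransformer  # assumed installed

# BGE models work best with normalized embeddings and, for queries,
# an instruction prefix (per the model card).
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

passages = [
    "Kubernetes schedules pods onto nodes.",
    "Photosynthesis converts light into chemical energy.",
]
passage_vecs = model.encode(passages, normalize_embeddings=True)

query = "How are pods scheduled?"
query_vec = model.encode(
    "Represent this sentence for searching relevant passages: " + query,
    normalize_embeddings=True,
)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = passage_vecs @ query_vec
print(scores)
```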
Embedding Optimization
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
Retrieval Strategies
Dense Retrieval (Semantic Search)
Query: "How to deploy containers"
│
▼
┌─────────┐
│ Embed │
│ Query │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ Vector Similarity Search │
│ (Cosine, Dot Product, L2) │
└─────────────────────────────────┘
│
▼
Top-K semantically similar chunks
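A minimal brute-force dense retrieval sketch using NumPy cosine similarity; a production system would use an ANN index (HNSW, IVF) instead of scanning every vector:

```python
import numpy as np

def dense_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (doc index, cosine score) for the k most similar document vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity against every chunk
    top = np.argsort(-scores)[:k]        # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]
```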
Sparse Retrieval (BM25/TF-IDF)
Query: "Kubernetes pod deployment YAML"
│
▼
┌─────────┐
│Tokenize │
│ + Score │
└─────────┘
│
▼
┌─────────────────────────────────┐
│ BM25 Ranking │
│ (Term frequency × IDF) │
└─────────────────────────────────┘
│
▼
Top-K lexically matching chunks
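A sparse retrieval sketch using the `rank_bm25` package (assumed installed). Whitespace tokenization keeps it minimal; a real pipeline would also strip punctuation and possibly stem:

```python
from rank_bm25 import BM25Okapi  # assumed installed: pip install rank-bm25

corpus = [
    "Kubernetes pod deployment YAML example",
    "Docker container networking basics",
    "Deploying pods with a Kubernetes Deployment manifest",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "kubernetes pod deployment yaml".split()
scores = bm25.get_scores(query_tokens)  # one BM25 score per document
top = sorted(range(len(corpus)), key=lambda i: -scores[i])[:2]
print([(corpus[i], round(float(scores[i]), 2)) for i in top])
```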
Hybrid Search (Best of Both)
Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
│ │ │
└──▶ Sparse Search ─┘ │
│
Fusion Methods: ▼
• RRF (Reciprocal Rank Fusion)
• Linear combination
• Learned reranking
Reciprocal Rank Fusion (RRF)
RRF Score = Σ 1 / (k + rank_i)
Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result
Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
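The RRF formula in code: a sketch that fuses any number of ranked lists of document IDs with k = 60, reproducing the worked example above:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

dense = ["A", "C", "B", "D"]        # A: dense rank 1, B: dense rank 3
sparse = ["B", "E", "C", "D", "A"]  # B: sparse rank 1, A: sparse rank 5
print(rrf_fuse([dense, sparse]))    # B (0.0323) edges out A (0.0318), as in the example
```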
Multi-Stage Retrieval
Two-Stage Pipeline
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall) │
│ • ANN search (HNSW, IVF) │
│ • Retrieve top-100 candidates │
│ • Latency: 10-50ms │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision) │
│ • Cross-encoder or LLM reranking │
│ • Score top-100 → return top-10 │
│ • Latency: 100-500ms │
└─────────────────────────────────────────────────────────┘
Reranking Options
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
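A second-stage reranking sketch with a local cross-encoder from `sentence-transformers` (assumed installed); BGE-reranker and the ms-marco MiniLM models both follow this pattern:

```python
from sentence_transformers import CrossEncoder  # assumed installed

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to deploy containers"
candidates = [
    "Use kubectl apply -f deployment.yaml to deploy pods.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Docker images can be deployed with docker run.",
]
# The cross-encoder scores each (query, passage) pair jointly: slower than
# bi-encoder search, but much more precise on the candidate set.
scores = reranker.predict([(query, c) for c in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
top_k = [text for text, _ in reranked[:2]]
```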
Context Assembly
Context Window Management
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
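A sketch of packing retrieved chunks into the fixed retrieved-context budget, most relevant first, using `tiktoken` for counting (assumed; any tokenizer works):

```python
import tiktoken  # assumed available

def pack_context(chunks_by_relevance: list[str], budget_tokens: int = 8000) -> str:
    """Greedily add chunks in relevance order until the retrieved-context budget is spent."""
    enc = tiktoken.get_encoding("cl100k_base")
    used, selected = 0, []
    for chunk in chunks_by_relevance:
        n = len(enc.encode(chunk))
        if used + n > budget_tokens:
            break
        selected.append(chunk)
        used += n
    return "\n\n---\n\n".join(selected)
```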
Context Assembly Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
Lost-in-the-Middle Problem
LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│ Beginning Middle End │
│ ████ ░░░░ ████ │
│ High attention Low attention High attention │
└─────────────────────────────────────────────────────────┘
Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
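A sketch of mitigation 1: interleave the ranked chunks so the strongest evidence lands at both the beginning and the end of the context, with weaker chunks in the middle. The specific interleaving is one reasonable choice, not a prescribed algorithm:

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Place chunks ranked 1, 3, 5, ... at the front and 2, 4, 6, ... (reversed)
    at the back, so the most relevant chunks sit at both ends of the context."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Relevance ranks: 1, 2, 3, 4, 5  ->  output order: 1, 3, 5, 4, 2
print(reorder_for_attention(["c1", "c2", "c3", "c4", "c5"]))
```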
Advanced RAG Patterns
Query Transformation
Original Query: "Tell me about the project"
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│ HyDE │ │ Query │ │ Sub-query│
│ (Hypo │ │ Expansion│ │ Decomp. │
│ Doc) │ │ │ │ │
└─────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
Hypothetical "project, "What is the
answer to goals, project scope?"
embed timeline, "What are the
deliverables" deliverables?"
HyDE (Hypothetical Document Embeddings)
Query: "How does photosynthesis work?"
│
▼
┌───────────────┐
│ LLM generates │
│ hypothetical │
│ answer │
└───────────────┘
│
▼
"Photosynthesis is the process by which
plants convert sunlight into energy..."
│
▼
┌───────────────┐
│ Embed hypo │
│ document │
└───────────────┘
│
▼
Search with hypothetical embedding
(Better matches actual documents)
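A HyDE sketch with dependency injection: `llm`, `embed`, and `search_fn` are assumed callables standing in for your generation model, embedding model, and vector search, not a specific SDK:

```python
def hyde_search(query: str, llm, embed, search_fn, k: int = 5):
    """Hypothetical Document Embeddings: embed an LLM-written hypothetical answer
    instead of the raw query, then search with that vector.

    Assumed interfaces (not a specific library):
      llm(prompt) -> str, embed(text) -> vector, search_fn(vector, k) -> chunks.
    """
    hypothetical = llm(
        f"Write a short passage that answers the question:\n{query}"
    )
    # The hypothetical answer lives in "document space", so its embedding tends
    # to sit closer to real answer passages than the bare query does.
    return search_fn(embed(hypothetical), k)
```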
Self-RAG (Retrieval-Augmented LM with Self-Reflection)
┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response │
│ 2. Decide: Need more retrieval? (critique token) │
│ ├── Yes → Retrieve more, regenerate │
│ └── No → Check factuality (isRel, isSup tokens) │
│ 3. Verify claims against sources │
│ 4. Regenerate if needed │
│ 5. Return verified response │
└─────────────────────────────────────────────────────────┘
Agentic RAG
Query: "Compare Q3 revenue across regions"
│
▼
┌───────────────┐
│ Query Agent │
│ (Plan steps) │
└───────────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA │ │ APAC │ │ AMER │
│ docs │ │ docs │ │ docs │
└───────┘ └───────┘ └───────┘
│ │ │
└───────────┼───────────┘
▼
┌───────────────┐
│ Synthesize │
│ Comparison │
└───────────────┘
Evaluation Metrics
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
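A sketch of computing Recall@K, Precision@K, and MRR over a batch of queries, where each query contributes its retrieved IDs in rank order plus the set of relevant IDs:

```python
def retrieval_metrics(results: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """results: one (retrieved_ids_in_rank_order, relevant_ids) pair per query."""
    recalls, precisions, reciprocal_ranks = [], [], []
    for retrieved, relevant in results:
        top_k = retrieved[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        recalls.append(hits / max(len(relevant), 1))
        precisions.append(hits / k)
        first_hit = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    n = len(results)
    return {
        f"recall@{k}": sum(recalls) / n,
        f"precision@{k}": sum(precisions) / n,
        "mrr": sum(reciprocal_ranks) / n,
    }
```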
End-to-End Metrics
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
Evaluation Framework
┌─────────────────────────────────────────────────────────┐
│ RAG Evaluation Pipeline │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions │
│ 2. Ground Truth: Expected answers + source docs │
│ 3. Metrics: │
│ • Retrieval: Recall@K, MRR, NDCG │
│ • Generation: Correctness, Faithfulness │
│ 4. A/B Testing: Compare configurations │
│ 5. Error Analysis: Identify failure patterns │
└─────────────────────────────────────────────────────────┘
Common Failure Modes
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
Scaling Considerations
Index Scaling
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
Latency Budget
Typical RAG Pipeline Latency:
Query embedding: 10-50ms
Vector search: 20-100ms
Reranking: 100-300ms
LLM generation: 500-2000ms
────────────────────────────
Total: 630-2450ms
Target p95: <3 seconds for interactive use
Related Skills
- llm-serving-patterns - LLM inference infrastructure
- vector-databases - Vector store selection and optimization
- ml-system-design - End-to-end ML pipeline design
- estimation-techniques - Capacity planning for RAG systems
Version History
- v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design
Last Updated
Date: 2025-12-26