RAG Architecture

When to Use This Skill

Use this skill when:

  • Designing RAG pipelines for LLM applications
  • Choosing chunking and embedding strategies
  • Optimizing retrieval quality and relevance
  • Building knowledge-grounded AI systems
  • Implementing hybrid search (dense + sparse)
  • Designing multi-stage retrieval pipelines

Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval

RAG Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                       RAG Pipeline                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │   Ingestion  │    │   Indexing   │    │    Vector Store      │  │
│  │   Pipeline   │───▶│   Pipeline   │───▶│    (Embeddings)      │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │               │
│    Documents           Chunks +                 Indexed             │
│                       Embeddings               Vectors              │
│                                                     │               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │    Query     │    │  Retrieval   │    │   Context Assembly   │  │
│  │  Processing  │───▶│   Engine     │───▶│   + Generation       │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │               │
│    User Query          Top-K Chunks            LLM Response         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Document Ingestion Pipeline

Document Processing Steps

Raw Documents
       │
       ▼
┌─────────────┐
│   Extract   │ ← PDF, HTML, DOCX, Markdown
│   Content   │
└─────────────┘
       │
       ▼
┌─────────────┐
│   Clean &   │ ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
       │
       ▼
┌─────────────┐
│   Chunk     │ ← Split into retrievable units
│  Documents  │
└─────────────┘
       │
       ▼
┌─────────────┐
│  Generate   │ ← Create vector representations
│ Embeddings  │
└─────────────┘
       │
       ▼
┌─────────────┐
│   Store     │ ← Persist vectors + metadata
│  in Index   │
└─────────────┘
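
A minimal Python sketch of these five steps; `extract_text`, `embed`, and `store` are hypothetical stand-ins for your document parser, embedding model, and vector store client, not a specific library:

```python
# Minimal ingestion sketch. extract_text, embed, and store stand in for
# your document parser, embedding model, and vector store client.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    vector: list[float]

def clean(text: str) -> str:
    # Normalize whitespace and drop empty lines (stand-in for real cleanup).
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

def chunk(text: str, size: int = 1600, overlap: int = 200) -> list[str]:
    # Fixed-size character chunking with overlap; see Chunking Strategies below.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(doc_id: str, path: str, extract_text, embed, store) -> None:
    text = clean(extract_text(path))                      # Extract + Clean & Normalize
    for piece in chunk(text):                             # Chunk Documents
        store.upsert(Chunk(doc_id, piece, embed(piece)))  # Generate Embeddings + Store
```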

Chunking Strategies

Strategy Comparison

| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |

Chunking Decision Tree

What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers

Chunk Overlap

Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
               Information lost at boundary

With Overlap (20%):
[Chunk 1: "The quick brown fox"]
                    [Chunk 2: "brown fox jumps over"]
              Context preserved across boundaries

Recommended overlap: 10-20% of chunk size
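
A sketch of sentence-based chunking with overlap; whitespace word counts are assumed as a rough proxy for token counts:

```python
import re

def sentence_chunks(text: str, max_tokens: int = 256, overlap_sents: int = 1) -> list[str]:
    # Split at sentence boundaries, then pack sentences until the budget is hit.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, fresh = [], [], 0
    for sent in sents:
        current.append(sent)
        fresh += 1
        if sum(len(s.split()) for s in current) >= max_tokens:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) into the next chunk as overlap.
            current, fresh = current[-overlap_sents:], 0
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```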

Chunk Size Trade-offs

Smaller Chunks (128-256 tokens)        Larger Chunks (512-1024 tokens)
├── More precise retrieval             ├── More context per chunk
├── Less context per chunk             ├── May include irrelevant content
├── More chunks to search              ├── Fewer chunks to search
├── Better for factoid Q&A             ├── Better for summarization
└── Answers may span chunk boundaries  └── Higher recall, lower precision

Embedding Models

Model Comparison

| Model | Dimensions | Context (tokens) | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |

Embedding Selection

Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small

Embedding Optimization

| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dimensions | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix inputs with a task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
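
A sketch of two of these techniques, assuming a Matryoshka-trained model whose vectors can be truncated and an E5-style model that expects "query: " / "passage: " prefixes:

```python
import numpy as np

def truncate_and_renormalize(vectors: np.ndarray, dims: int = 256) -> np.ndarray:
    # Matryoshka embeddings remain usable when truncated, but must be
    # re-normalized so cosine similarity stays meaningful.
    v = vectors[:, :dims]
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def with_instruction(texts: list[str], is_query: bool) -> list[str]:
    # Instruction-tuned models like E5 expect different prefixes for
    # queries and passages.
    prefix = "query: " if is_query else "passage: "
    return [prefix + t for t in texts]
```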

Retrieval Strategies

Dense Retrieval (Semantic Search)

Query: "How to deploy containers"
    ┌─────────┐
    │ Embed   │
    │ Query   │
    └─────────┘
    ┌─────────────────────────────────┐
    │ Vector Similarity Search        │
    │ (Cosine, Dot Product, L2)       │
    └─────────────────────────────────┘
    Top-K semantically similar chunks
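
A brute-force sketch of dense retrieval with cosine similarity; it assumes unit-normalized embeddings, so the dot product equals cosine similarity:

```python
import numpy as np

def top_k_dense(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    # doc_matrix: (num_chunks, dim) array of unit-normalized chunk embeddings.
    scores = doc_matrix @ query_vec          # dot product == cosine similarity
    top = np.argsort(-scores)[:k]            # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]
```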

Sparse Retrieval (BM25/TF-IDF)

Query: "Kubernetes pod deployment YAML"
    ┌─────────┐
    │Tokenize │
    │ + Score │
    └─────────┘
    ┌─────────────────────────────────┐
    │ BM25 Ranking                    │
    │ (Term frequency × IDF)          │
    └─────────────────────────────────┘
    Top-K lexically matching chunks
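
A minimal in-memory BM25 scorer implementing the term frequency × IDF formula above; production systems typically use a search engine or a library such as rank_bm25:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    corpus = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    terms = query.lower().split()
    # Inverse document frequency per query term (Lucene-style, always positive).
    idf = {}
    for t in terms:
        df = sum(1 for d in corpus if t in d)
        idf[t] = math.log((n - df + 0.5) / (df + 0.5) + 1)
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in terms:
            f = tf[t]
            s += idf[t] * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```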

Hybrid Search (Best of Both)

Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │      │
        └──▶ Sparse Search ─┘      │
        Fusion Methods:            ▼
        • RRF (Reciprocal Rank Fusion)
        • Linear combination
        • Learned reranking

Reciprocal Rank Fusion (RRF)

RRF Score = Σ 1 / (k + rank_i)

Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result

Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323

Result: Doc B ranks higher (better combined relevance)
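
A small RRF implementation that reproduces the worked example; the doc IDs and rankings are illustrative:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    # rankings: one ranked list of doc ids per retriever (best first).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Doc A: dense rank 1, sparse rank 5; Doc B: dense rank 3, sparse rank 1.
dense  = ["A", "X", "B", "Y", "Z"]
sparse = ["B", "X", "Y", "Z", "A"]
print(rrf([dense, sparse]))  # B ≈ 0.0323 ranks above A ≈ 0.0318
```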

Multi-Stage Retrieval

Two-Stage Pipeline

┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall)                     │
│ • ANN search (HNSW, IVF)                                │
│ • Retrieve top-100 candidates                           │
│ • Latency: 10-50ms                                      │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision)                  │
│ • Cross-encoder or LLM reranking                        │
│ • Score top-100 → return top-10                         │
│ • Latency: 100-500ms                                    │
└─────────────────────────────────────────────────────────┘
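
A sketch of the two-stage pipeline, assuming the sentence-transformers CrossEncoder API (one common public checkpoint shown) and a hypothetical `ann_search` function for stage 1:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, query_vec, ann_search, top_n: int = 100, top_k: int = 10):
    # Stage 1: fast, high-recall ANN search over the vector index.
    candidates = ann_search(query_vec, top_n)        # list of (chunk_text, chunk_id)
    # Stage 2: slow, high-precision cross-encoder scoring of (query, chunk) pairs.
    pairs = [(query, text) for text, _id in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [cand for cand, _score in ranked[:top_k]]
```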

Reranking Options

| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |

Context Assembly

Context Window Management

Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)

Strategy: Maximize retrieved context quality within budget
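
A sketch of budget-aware packing, assuming chunks arrive sorted by relevance and using whitespace word counts in place of the model tokenizer:

```python
def pack_context(chunks: list[str], budget_tokens: int = 8000) -> str:
    # Greedily add the most relevant chunks until the retrieval budget is spent.
    packed, used = [], 0
    for chunk in chunks:                      # assumed sorted by relevance
        cost = len(chunk.split())             # crude stand-in for a tokenizer
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```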

Context Assembly Strategies

| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |

Lost-in-the-Middle Problem

LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│ Beginning           Middle            End               │
│    ████              ░░░░             ████              │
│  High attention   Low attention   High attention        │
└─────────────────────────────────────────────────────────┘

Mitigation:
1. Put the most relevant chunks at the beginning AND end (see the sketch after this list)
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
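
A sketch of mitigation 1: reorder relevance-ranked chunks so the strongest sit at the beginning and end, pushing the weakest toward the middle:

```python
def order_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    # Alternate chunks between the front and the back, best first, so the
    # least relevant content ends up in the low-attention middle.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["c1", "c2", "c3", "c4", "c5"] -> ["c1", "c3", "c5", "c4", "c2"]
```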

Advanced RAG Patterns

Query Transformation

Original Query: "Tell me about the project"
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
    ┌─────────┐      ┌──────────┐     ┌──────────┐
    │ HyDE    │      │ Query    │     │ Sub-query│
    │ (Hypo   │      │ Expansion│     │ Decomp.  │
    │ Doc)    │      │          │     │          │
    └─────────┘      └──────────┘     └──────────┘
         │                 │                 │
         ▼                 ▼                 ▼
    Hypothetical      "project,        "What is the
    answer to         goals,           project scope?"
    embed             timeline,        "What are the
                      deliverables"    deliverables?"

HyDE (Hypothetical Document Embeddings)

Query: "How does photosynthesis work?"
        ┌───────────────┐
        │ LLM generates │
        │ hypothetical  │
        │ answer        │
        └───────────────┘
"Photosynthesis is the process by which
plants convert sunlight into energy..."
        ┌───────────────┐
        │ Embed hypo    │
        │ document      │
        └───────────────┘
    Search with hypothetical embedding
    (Better matches actual documents)
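
A HyDE sketch; `llm`, `embed`, and `vector_index` are hypothetical callables for the generation model, embedding model, and vector store:

```python
def hyde_search(query: str, llm, embed, vector_index, k: int = 10):
    # Generate a hypothetical answer, then search with its embedding
    # instead of the raw query embedding.
    hypothetical = llm(
        f"Write a short passage that answers the question:\n{query}"
    )
    return vector_index.search(embed(hypothetical), top_k=k)
```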

Self-RAG (Retrieval-Augmented LM with Self-Reflection)

┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response                            │
│ 2. Decide: Need more retrieval? (critique token)        │
│    ├── Yes → Retrieve more, regenerate                  │
│    └── No → Check factuality (isRel, isSup tokens)      │
│ 3. Verify claims against sources                        │
│ 4. Regenerate if needed                                 │
│ 5. Return verified response                             │
└─────────────────────────────────────────────────────────┘

Agentic RAG

Query: "Compare Q3 revenue across regions"
        ┌───────────────┐
        │ Query Agent   │
        │ (Plan steps)  │
        └───────────────┘
    ┌───────────┼───────────┐
    ▼           ▼           ▼
┌───────┐   ┌───────┐   ┌───────┐
│Search │   │Search │   │Search │
│ EMEA  │   │ APAC  │   │ AMER  │
│ docs  │   │ docs  │   │ docs  │
└───────┘   └───────┘   └───────┘
    │           │           │
    └───────────┼───────────┘
        ┌───────────────┐
        │  Synthesize   │
        │  Comparison   │
        └───────────────┘
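
A sketch of this fan-out, assuming hypothetical `plan`, `search`, and `llm` helpers for planning, per-region retrieval, and synthesis:

```python
def agentic_compare(query: str, plan, search, llm) -> str:
    # Plan sub-queries (e.g. one per region), retrieve evidence for each,
    # then synthesize a single comparison grounded in that evidence.
    sub_queries = plan(query)
    evidence = {sq: search(sq) for sq in sub_queries}   # search() returns a text blob
    context = "\n\n".join(f"{sq}\n{docs}" for sq, docs in evidence.items())
    return llm(f"Using only the evidence below, answer: {query}\n\n{context}")
```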

Evaluation Metrics

Retrieval Metrics

| Metric | Description | Target |
|---|---|---|
| Recall@K | % of relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant result | >0.5 |
| NDCG | Graded relevance ranking quality | >0.7 |
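
Recall@K and MRR can be computed per query against labeled relevant document IDs, as in this sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the labeled relevant docs that appear in the top-K results.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant result (0 if none retrieved).
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```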

End-to-End Metrics

| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in the retrieved context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is the retrieved context relevant? | >80% |

Evaluation Framework

┌─────────────────────────────────────────────────────────┐
│                RAG Evaluation Pipeline                  │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions                  │
│ 2. Ground Truth: Expected answers + source docs         │
│ 3. Metrics:                                             │
│    • Retrieval: Recall@K, MRR, NDCG                     │
│    • Generation: Correctness, Faithfulness              │
│ 4. A/B Testing: Compare configurations                  │
│ 5. Error Analysis: Identify failure patterns            │
└─────────────────────────────────────────────────────────┘

Common Failure Modes

| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-document mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context attention issues | Hierarchical assembly, summarization |
| Stale data | Outdated index | Incremental updates, TTL |

Scaling Considerations

Index Scaling

| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded index |
| >100M docs | Distributed + aggressive filtering |

Latency Budget

Typical RAG Pipeline Latency:

Query embedding:     10-50ms
Vector search:       20-100ms
Reranking:          100-300ms
LLM generation:     500-2000ms
────────────────────────────
Total:              630-2450ms

Target p95: <3 seconds for interactive use

Related Skills

  • llm-serving-patterns - LLM inference infrastructure
  • vector-databases - Vector store selection and optimization
  • ml-system-design - End-to-end ML pipeline design
  • estimation-techniques - Capacity planning for RAG systems

Version History

  • v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design

Last Updated

Date: 2025-12-26
