RAG Implementer

Build production-ready retrieval-augmented generation systems. RAG = Retrieval + Context Assembly + Generation. Use RAG when LLMs need access to fresh, domain-specific, or proprietary knowledge not in their training data. Do not use RAG when simpler alternatives (FAQ pages, keyword search, semantic search) suffice. For KB architecture selection and governance, use the knowledge-base-manager skill. For knowledge graph implementation, use the knowledge-graph-builder skill.

Overview

Before building RAG, validate the need: try FAQ pages, keyword search, concierge MVP, or simple semantic search first. Only proceed with RAG for 50k+ documents with validated user demand and $200-500/month budget. RAG systems range from Naive (prototype) through Advanced (production) to Modular (enterprise), each tier adding complexity and cost.

The RAG pipeline has three core stages. First, retrieval finds relevant documents using hybrid search (semantic + keyword). Second, context assembly ranks, deduplicates, and compresses retrieved chunks into an optimal prompt. Third, generation produces a grounded response with source attribution. Each stage has distinct failure modes: retrieval can miss relevant documents (low recall), context assembly can overwhelm the model (lost in the middle), and generation can hallucinate despite good context (low faithfulness).

Modern RAG extends beyond basic vector similarity. Hybrid search combining dense embeddings with sparse BM25 is now the baseline. Re-ranking with cross-encoders improves precision after initial retrieval. Contextual chunking and late chunking preserve document-level semantics that fixed-size chunking loses. GraphRAG enables multi-hop reasoning over entity relationships by building knowledge graphs from documents. Proposition chunking breaks documents into atomic facts for precise retrieval of individual claims.

Choose techniques based on your query complexity and document structure. Start with hybrid search and re-ranking as the foundation, then layer contextual chunking, GraphRAG, or query expansion as needed. Measure everything: Precision@K, Recall@K, faithfulness, and end-to-end latency. The difference between a good and bad chunking strategy alone can create a 9% gap in recall performance.

Quick Reference

Phase	Goal	Key Actions
1. Knowledge Base Design	Structured knowledge foundation	Map sources, define chunking, add metadata
2. Embedding Strategy	Semantic understanding	Select model, benchmark on domain data
3. Vector Store	Scalable storage	Choose DB, configure index, plan scaling
4. Retrieval Pipeline	Beyond simple similarity	Hybrid retrieval, query enhancement, re-ranking
5. Context Assembly	Optimal LLM context	Rank, synthesize, compress, mitigate "lost in the middle"
6. Evaluation	Measure performance	Precision@K, Recall@K, faithfulness, latency
7. Production Deploy	Enterprise reliability	Containerize, cache, graceful degradation, security
8. Continuous Improvement	Ongoing enhancement	Auto-updates, fine-tuning, optimization

Decision	Options
Vector DB (managed)	Pinecone
Vector DB (self-hosted)	Weaviate, Qdrant
Vector DB (lightweight)	Chroma
Vector DB (existing Postgres)	pgvector
Vector DB (billion-scale)	Milvus / Zilliz
Embedding (general)	`text-embedding-3-large` (3072 dim)
Embedding (cost-optimized)	`text-embedding-3-small` (1536 dim)
Embedding (code)	Voyage Code 3
Embedding (multilingual)	`multilingual-e5-large`, Cohere embed-v4
Chunking (fixed)	500-1000 tokens, 50-100 overlap
Chunking (semantic)	Paragraph/section/topic boundaries
Chunking (recursive)	Markdown headers, code blocks
Chunking (contextual)	LLM-generated summaries prepended to each chunk
Chunking (late)	Full-document embedding, then pool by chunk boundaries

Cost Tier	Time	Monthly Cost	Scale
Naive RAG (prototype)	1-2 weeks	$50-150	<10k documents
Advanced RAG (production)	3-4 weeks	$200-500	10k-1M documents
Modular RAG (enterprise)	6-8 weeks	$500-2000+	1M+ documents

Advanced Technique	When to Use
Hybrid search	Always -- combine semantic + keyword (BM25) for better recall
Re-ranking	When initial retrieval returns noisy results
Contextual retrieval	Documents with ambiguous references or pronouns
Late chunking	Efficiency-focused pipelines with anaphoric references
GraphRAG	Multi-hop reasoning over structured knowledge relationships
Proposition chunking	Fact-dense documents requiring atomic retrieval units
Query expansion / HyDE	Queries that are short, ambiguous, or under-specified

Common Mistakes

Mistake	Correct Pattern
Building RAG before validating user need	Try simpler alternatives first (FAQ, keyword search, concierge MVP); only build RAG with validated demand
Using a single retrieval method (semantic only)	Implement hybrid retrieval combining semantic search with keyword (BM25) for better recall
Dumping all available data into the knowledge base	Curate data sources carefully; filter noise, select authoritative content, and maintain quality
Ignoring the "lost in the middle" problem	Place critical information at the start and end of context; compress mid-section
Skipping evaluation metrics before production	Establish baselines for Precision@K, Recall@K, faithfulness, and hallucination rate before deploying
Using `text-embedding-3-large` at full 3072 dimensions without benchmarking	Test at reduced dimensions (1024 or 1536) first -- often comparable accuracy at lower cost
Fixed-size chunking for all document types	Match chunking strategy to document structure; use semantic or recursive chunking for structured content
Ignoring metadata filtering	Attach rich metadata (source, date, category) and filter before or during vector search

Embedding Model Notes

text-embedding-3-large (3072 dimensions) remains OpenAI's most capable embedding model. It supports Matryoshka dimensionality reduction via the dimensions API parameter -- 1024 dimensions often delivers near-full accuracy at one-third storage cost. text-embedding-3-small (1536 dimensions) is a cost-effective alternative at $0.02 per million tokens. For code search, Voyage Code 3 outperforms general-purpose models. For multilingual workloads, consider multilingual-e5-large or Cohere embed-v4. Always benchmark on your domain data; general benchmarks do not predict domain-specific performance.

Vector Store Notes

Pinecone for managed simplicity, Weaviate or Qdrant for self-hosted with hybrid search, Chroma for prototyping, pgvector for teams already on PostgreSQL (practical limit around 10-100M vectors), and Milvus/Zilliz for billion-scale deployments. Choose index type based on tradeoffs: HNSW for speed (higher memory), IVF for scale (requires training), flat for exact search on small datasets only.

Most vector databases now achieve 10-100ms query latency on 1-10M vector datasets. Start with the simplest option that fits your scale requirements and migrate only when you hit concrete performance limits.

Delegation

Discover data sources and assess knowledge base quality: Use Explore agent to catalog documents, evaluate data freshness, and identify authoritative content
Implement retrieval pipeline with hybrid search and re-ranking: Use Task agent to build embedding, indexing, retrieval, and evaluation components
Design RAG architecture and vector store topology: Use Plan agent to select embedding models, vector databases, chunking strategies, and deployment architecture

For KB architecture selection, curation workflows, and governance, use the knowledge-base-manager skill. For knowledge graph implementation (ontology, entity extraction, graph databases), use the knowledge-graph-builder skill.

rag-implementer