RAG Architecture Skill

Build retrieval-augmented generation systems that ground LLMs in your data.

Core Principle

RAG = Retrieval + Generation. Instead of relying solely on the model's training data, retrieve relevant context at query time and include it in the prompt. This reduces hallucination and enables access to private/current data.

RAG Pipeline

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Query     │────▶│   Embed     │────▶│  Retrieve   │────▶│   Augment   │
│  "How do I  │     │  Query to   │     │  Top-K      │     │  Add to     │
│   deploy?"  │     │  Vector     │     │  Documents  │     │  Prompt     │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                   │
                                                                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Answer    │◀────│  Generate   │◀────│   Format    │◀────│  Context    │
│  Grounded   │     │  With LLM   │     │   Prompt    │     │  + Query    │
│  Response   │     │             │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
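
The stages in the diagram can be sketched end to end with stand-ins for each component. This is a toy illustration under stated assumptions, not a production design: `embed` is a hypothetical bag-of-words embedder standing in for a real embedding model, the sample documents are invented, and the final LLM call is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Embed the query, score every document, return the top-k texts."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Augment: place the retrieved context ahead of the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# Invented sample corpus
documents = [
    "To deploy the service run the deploy script from the release branch",
    "The office cafeteria opens at nine",
    "Rollbacks use the same deploy script with the previous tag",
]
top_docs = retrieve("how do I deploy the service", documents, k=2)
prompt = build_prompt("how do I deploy the service", top_docs)
```

In a real system the prompt would then be sent to the LLM, and `embed` would call the embedding model used at indexing time.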

Indexing Pipeline

Document Processing

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Load      │────▶│    Clean     │────▶│    Chunk     │────▶│    Embed     │
│  Documents   │     │  & Parse     │     │   Content    │     │   Chunks     │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                                                                      │
                                                                      ▼
                                                               ┌──────────────┐
                                                               │    Store     │
                                                               │  in Vector   │
                                                               │     DB       │
                                                               └──────────────┘

Chunking Strategies

Strategy	Description	Best For
Fixed Size	Split every N tokens/chars	Simple, predictable
Sentence	Split on sentence boundaries	Natural breaks
Paragraph	Split on paragraph breaks	Coherent units
Semantic	Split on topic changes	Meaningful segments
Recursive	Try large, fall back to smaller	Mixed content
Document	Keep whole documents	Short docs

# Recursive chunking example
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # Overlap prevents losing context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)

Chunk Size Tradeoffs

Size	Pros	Cons
Small (100-500)	Precise retrieval	May lose context
Medium (500-1500)	Balanced; good default	Compromise on both precision and context
Large (1500-3000)	Full context	Less precise, costly

Rule of thumb: A chunk should contain enough context to be useful on its own.
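
To make the size and overlap tradeoff concrete, here is a minimal sketch of the Fixed Size strategy with overlapping windows. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption; `chunk_words` and its defaults are illustrative names, not part of any library.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
# Each chunk repeats the last word of the previous one:
# "w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"
```

A larger overlap lowers the chance of cutting a thought in half, at the cost of storing and embedding more duplicated text.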

Embedding Models

Model Comparison

Model	Dimensions	Speed	Quality	Cost
text-embedding-3-small	1536	Fast	Good	Low
text-embedding-3-large	3072	Medium	Best	Medium
text-embedding-ada-002	1536	Fast	Good	Low
Cohere embed-v3	1024	Fast	Good	Low
BGE-large	1024	Local	Good	Free
E5-large	1024	Local	Good	Free

Embedding Best Practices

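# Normalize embeddings so a dot product equals cosine similarity
import numpy as np

def normalize(embedding):
    embedding = np.asarray(embedding, dtype=float)
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding  # avoid dividing by zero

# Batch embeddings for efficiency
# (embed_model is assumed to be an already-configured LangChain-style embeddings client)
texts = [chunk.page_content for chunk in chunks]  # embed_documents expects strings
embeddings = embed_model.embed_documents(texts)  # Not one at a time

# Cache embeddings - don't re-embed unchanged content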
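
One way to avoid re-embedding unchanged content is to key a cache on a hash of the text and send only the misses to the model in a single batch. The sketch below assumes a batch embedding callable; `EmbeddingCache` and `fake_embed_batch` are hypothetical names used for illustration.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch  # callable: list[str] -> list[vector]
        self._store = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        # Collect only texts that are not cached yet, preserving first-seen order.
        missing = {}
        for text in texts:
            key = self._key(text)
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            # One batched call covers every uncached text.
            vectors = self._embed_batch(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._store[key] = vector
        return [self._store[self._key(t)] for t in texts]

calls = []

def fake_embed_batch(texts):
    """Hypothetical embedder that records each batch it receives."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed_batch)
first = cache.embed_documents(["alpha", "beta", "alpha"])
second = cache.embed_documents(["beta", "gamma"])  # only "gamma" is embedded
```

For a persistent index the in-memory dict would typically be replaced by a database or key-value store.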

Vector Databases

Options

Database	Type	Strengths	Use Case
Pinecone	Managed	Easy, scalable	Production
Weaviate	Managed/Self	Hybrid search	Enterprise
Qdrant	Self-hosted/Managed	Performance	Privacy-sensitive
Chroma	Embedded	Simple, local	Prototyping
pgvector	PostgreSQL ext	SQL + vectors	Existing Postgres
Azure AI Search	Managed	M365 integration	Azure ecosystem
FAISS	Library	Fast, offline	Local/research

Index Types

Index	Speed	Accuracy	Memory
Flat (exact)	Slow	100%	High
IVF	Fast	~95%	Medium
HNSW	Very fast	~98%	High
PQ	Very fast	~90%	Low
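
The Memory column can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares raw float32 storage with product-quantized codes; the corpus size, dimensionality, and PQ settings are illustrative assumptions, and index overhead such as HNSW graph links or IVF lists is ignored.

```python
def flat_index_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for uncompressed float32 vectors."""
    return num_vectors * dims * bytes_per_value

def pq_index_bytes(num_vectors, dims, num_subvectors, bits_per_code=8, bytes_per_value=4):
    """Storage for PQ codes plus the codebooks used to decode them."""
    if dims % num_subvectors:
        raise ValueError("dims must be divisible by num_subvectors")
    code_bytes = num_vectors * num_subvectors * bits_per_code // 8
    centroids_per_codebook = 2 ** bits_per_code
    sub_dims = dims // num_subvectors
    codebook_bytes = num_subvectors * centroids_per_codebook * sub_dims * bytes_per_value
    return code_bytes + codebook_bytes

# Assumed example: 1M vectors, 1536 dimensions, 96 sub-quantizers with 8-bit codes
flat = flat_index_bytes(1_000_000, 1536)
pq = pq_index_bytes(1_000_000, 1536, num_subvectors=96)
ratio = flat / pq
```

Under these assumptions the flat index needs about 6.1 GB and the PQ codes under 100 MB, which is why PQ trades accuracy for a large memory saving.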

Retrieval Strategies

Basic Retrieval

# Simple top-k retrieval
results = vector_store.similarity_search(query, k=5)

Hybrid Search

Combine semantic (vector) with keyword (BM25) search:

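# Weighted Reciprocal Rank Fusion (RRF)
RRF_K = 60  # smoothing constant; 60 is the commonly used default

def hybrid_search(query, k=5, alpha=0.5):
    # Over-fetch from both retrievers so fusion has enough candidates
    semantic_results = vector_search(query, k=k*2)
    keyword_results = bm25_search(query, k=k*2)

    # Fuse rankings: alpha weights semantic hits, (1 - alpha) weights keyword hits
    scores = {}
    for rank, doc in enumerate(semantic_results, start=1):  # ranks are 1-based
        scores[doc.id] = scores.get(doc.id, 0) + alpha / (RRF_K + rank)
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) / (RRF_K + rank)

    # Returns (doc_id, fused_score) pairs, best first
    return sorted(scores.items(), key=lambda x: -x[1])[:k]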
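
To see how fusion behaves on concrete rankings, here is a self-contained sketch that works on plain lists of document ids. `rrf_fuse` is a hypothetical helper and the two rankings are invented; ranks are counted from 1 and the constant 60 is the usual default.

```python
def rrf_fuse(rankings, weights=None, k=5, rrf_k=60):
    """Fuse ranked lists of doc ids with weighted Reciprocal Rank Fusion."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)[:k]

semantic = ["a", "b", "c"]  # invented vector-search ranking
keyword = ["c", "a", "d"]   # invented BM25 ranking
fused = rrf_fuse([semantic, keyword], k=3)
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the main reason to fuse by rank rather than by raw score.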

Reranking

Two-stage retrieval for better precision:

from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (get candidates)
candidates = vector_store.similarity_search(query, k=20)

# Stage 2: Rerank with cross-encoder (more accurate, but slower per pair)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# LangChain Document objects expose their text as page_content
scores = reranker.predict([(query, doc.page_content) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]
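
The two-stage shape can be tried without downloading a model by making the scorer pluggable. In this sketch `overlap_score` is a deliberately crude stand-in for a cross-encoder, and the candidate texts are invented; only the control flow is the point.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Score each (query, candidate) pair and keep the top_n best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query, doc):
    """Toy scorer: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Invented stage-1 candidates
candidates = [
    "billing faq for enterprise plans",
    "how to rotate api keys safely",
    "api keys and rate limits overview",
]
top = rerank("rotate api keys", candidates, overlap_score, top_n=2)
```

Swapping `overlap_score` for a function that calls a real cross-encoder keeps the rest of the code unchanged.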

Query Transformation

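# Hypothetical Document Embedding (HyDE)
# (llm.generate is a placeholder assumed to return the completion as a string)
def hyde_search(query):
    # Generate hypothetical answer
    hypothetical = llm.generate(f"Write a passage that answers: {query}")
    # Search using the hypothetical (often a closer match to stored passages)
    return vector_store.similarity_search(hypothetical, k=5)

# Multi-query retrieval
def multi_query_search(query):
    # Generate query variations, one per line
    response = llm.generate(f"Generate 3 different ways to ask, one per line: {query}")
    # Split the reply into separate queries; iterating the raw string would yield characters
    variations = [line.strip() for line in response.splitlines() if line.strip()]
    # Search with each, combine results
    all_results = []
    for q in variations:
        all_results.extend(vector_store.similarity_search(q, k=3))
    return deduplicate(all_results)  # deduplicate: helper that drops repeated documents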
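
Multi-query retrieval needs two small helpers: one to turn the model's reply into separate queries and one to drop repeated results. The sketch below assumes the model answers with one variation per line, possibly numbered or bulleted; `parse_variations` is a hypothetical name, and this `deduplicate` is one possible implementation.

```python
import re

def parse_variations(llm_output, limit=3):
    """Turn a numbered or bulleted LLM reply into a clean list of queries."""
    queries = []
    for line in llm_output.splitlines():
        # Strip leading list markers such as "1.", "2)", "-" or "*"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries[:limit]

def deduplicate(docs, key=lambda d: d):
    """Drop repeated documents while preserving first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        k = key(doc)
        if k not in seen:
            seen.add(k)
            unique.append(doc)
    return unique

reply = "1. How do I deploy?\n2) What is the deploy process?\n\n- Steps to ship a release"
variations = parse_variations(reply)
merged = deduplicate(["doc-a", "doc-b", "doc-a", "doc-c", "doc-b"])
```

For document objects, pass a `key` such as the document id or its text so that equal content is recognized even when the objects differ.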

Prompt Augmentation

Basic RAG Prompt

Use the following context to answer the question. If the context doesn't
contain the answer, say "I don't have information about that."

Context:
{retrieved_documents}

Question: {user_query}

Answer:

Structured RAG Prompt

You are answering questions based on the provided documentation.

RULES:
1. Only use information from the provided context
2. Quote relevant passages when possible
3. If the context doesn't contain the answer, say so
4. If information is partial, acknowledge limitations

CONTEXT:
---
Source: {doc1.source}
{doc1.content}
---
Source: {doc2.source}
{doc2.content}
---

QUESTION: {query}

Provide your answer with citations to the source documents.
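
The template above shows two documents, but the number retrieved usually varies. A minimal sketch of rendering it for any number of documents follows; the dict keys `source` and `content` and the function name are assumptions made for illustration.

```python
RULES = [
    "Only use information from the provided context",
    "Quote relevant passages when possible",
    "If the context doesn't contain the answer, say so",
    "If information is partial, acknowledge limitations",
]

def build_structured_prompt(query, docs):
    """Render the structured RAG prompt for a list of {'source', 'content'} dicts."""
    lines = ["You are answering questions based on the provided documentation.", "", "RULES:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(RULES, start=1)]
    lines += ["", "CONTEXT:", "---"]
    for doc in docs:
        # Each document is followed by a separator line
        lines += [f"Source: {doc['source']}", doc["content"], "---"]
    lines += ["", f"QUESTION: {query}", "",
              "Provide your answer with citations to the source documents."]
    return "\n".join(lines)

# Invented sample documents
docs = [
    {"source": "deploy.md", "content": "Run the release script to deploy."},
    {"source": "rollback.md", "content": "Re-run the script with the previous tag."},
]
prompt = build_structured_prompt("How do I deploy?", docs)
```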

Citation Handling

# Include source metadata
for i, doc in enumerate(retrieved_docs):
    context += f"[{i+1}] Source: {doc.metadata['source']}\n{doc.content}\n\n"

# Prompt for citations
prompt += "\nCite sources using [1], [2], etc."
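
Once the model answers with bracketed markers, it helps to check that every marker points at a source that was actually supplied. A minimal sketch, assuming citations appear as `[n]` with 1-based numbering; `extract_citations` is a hypothetical helper and the answer text is invented.

```python
import re

def extract_citations(answer, num_sources):
    """Return (valid, invalid) citation numbers found in the answer text."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid = [n for n in cited if 1 <= n <= num_sources]
    invalid = [n for n in cited if n not in valid]
    return valid, invalid

answer = "Deploy with the release script [1]. Rollbacks reuse it [2][4]."
valid, invalid = extract_citations(answer, num_sources=3)
# [4] refers to a source that was never provided, so it is flagged as invalid
```

A non-empty `invalid` list is a cheap signal that the answer may contain an unsupported claim.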

Advanced Patterns

Parent Document Retrieval

Store small chunks for retrieval, but return larger parent context:

# Index small chunks (e.g., 200 tokens)
# But store mapping to parent (e.g., full section)

def retrieve_with_parent(query):
    small_chunks = vector_store.similarity_search(query, k=3)
    # dict.fromkeys removes duplicate parents while keeping retrieval order
    parent_ids = dict.fromkeys(chunk.metadata['parent_id'] for chunk in small_chunks)
    return [doc_store.get(pid) for pid in parent_ids]
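
The indexing side of this pattern can be sketched as follows: each parent is split into small children that carry a `parent_id`, and hits are mapped back to unique parents. The word-window splitting, the dict layout, and the sample sections are assumptions made for illustration.

```python
def build_parent_child_index(parents, child_size=5):
    """Split each parent into small word windows that remember their parent_id.

    `parents` maps parent_id -> full text.
    """
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "text": " ".join(words[start:start + child_size]),
                "parent_id": parent_id,
            })
    return children

def parents_for_hits(hits, parents):
    """Map child hits back to unique parents, keeping the hit order."""
    ordered_ids = list(dict.fromkeys(hit["parent_id"] for hit in hits))
    return [parents[pid] for pid in ordered_ids]

# Invented sample sections
parents = {
    "sec-1": "Deploy the service with the release script after tests pass",
    "sec-2": "Rotate API keys every ninety days",
}
children = build_parent_child_index(parents, child_size=5)
hits = [children[2], children[0], children[1]]  # pretend these were retrieved
context = parents_for_hits(hits, parents)
```

Two hits from the same section return that section once, so the prompt is not padded with duplicate context.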

Self-Query Retrieval

Let the LLM write the metadata filter:

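# User: "What did we decide about authentication in 2024?"
# LLM generates: {"filter": {"year": 2024, "topic": "authentication"}}

# LangChain import paths; they may differ between versions
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    "Meeting notes and decisions",  # description of the document contents
    [
        AttributeInfo(name="year", description="Year the decision was made", type="integer"),
        AttributeInfo(name="topic", description="Topic of the discussion", type="string"),
    ],
)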
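
What happens after the LLM has produced a filter can be shown without any framework. The sketch below applies an equality-only filter to chunk metadata before similarity search; real vector stores usually support richer operators, and the chunk records here are invented.

```python
def matches_filter(metadata, filters):
    """True when every filter field equals the chunk's metadata value."""
    return all(metadata.get(field) == value for field, value in filters.items())

def filtered_candidates(chunks, filters):
    """Keep only chunks whose metadata satisfies the filter."""
    return [c for c in chunks if matches_filter(c["metadata"], filters)]

# Invented chunk records
chunks = [
    {"id": "n1", "metadata": {"year": 2024, "topic": "authentication"}},
    {"id": "n2", "metadata": {"year": 2023, "topic": "authentication"}},
    {"id": "n3", "metadata": {"year": 2024, "topic": "billing"}},
]
structured_query = {"filter": {"year": 2024, "topic": "authentication"}}
kept = filtered_candidates(chunks, structured_query["filter"])
```

Only the surviving chunks are then ranked by similarity, which keeps results from the wrong year or topic out of the prompt.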

Agentic RAG

Let an agent decide when/what to retrieve:

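# Tool and Agent stand in for your agent framework's classes;
# argument names and order vary by framework
tools = [
    Tool(name="search_docs", description="Search internal documentation", func=search_function),
    Tool(name="search_web", description="Search the web for current info", func=web_search),
    Tool(name="search_code", description="Search codebase", func=code_search),
]

agent = Agent(
    llm=llm,
    tools=tools,
    system_prompt="Decide which sources to search based on the question."
)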
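
The core decision step can be sketched without an agent framework. Here `keyword_router` is a crude stand-in for the LLM's choice, and the tools are stubs that echo the question; all names are hypothetical and the routing rules are invented.

```python
def answer_with_tools(question, tools, choose_tool):
    """Ask `choose_tool` which tool to use, then run it.

    tools: dict mapping tool name -> callable(question) -> list of results
    choose_tool: callable(question, tool_names) -> tool name, or None to skip retrieval
    """
    choice = choose_tool(question, sorted(tools))
    if choice is None:
        return {"tool": None, "results": []}
    if choice not in tools:
        raise ValueError(f"unknown tool: {choice}")
    return {"tool": choice, "results": tools[choice](question)}

def keyword_router(question, tool_names):
    """Stand-in for the LLM decision: route on simple keywords."""
    text = question.lower().strip()
    if text in {"hi", "hello", "thanks"}:
        return None  # small talk needs no retrieval
    if "function" in text or "class" in text:
        return "search_code"
    if "latest" in text or "today" in text:
        return "search_web"
    return "search_docs"

# Stub tools that echo the question
tools = {
    "search_docs": lambda q: [f"docs:{q}"],
    "search_web": lambda q: [f"web:{q}"],
    "search_code": lambda q: [f"code:{q}"],
}
routed = answer_with_tools("Which function parses config?", tools, keyword_router)
```

In a real agent the router would be an LLM call that sees the tool descriptions, and the loop could run several retrieval steps before answering.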

Evaluation Metrics

Retrieval Quality

Metric	Measures	Formula
Recall@K	Found relevant docs	Relevant in top-K / Total relevant
Precision@K	Top-K accuracy	Relevant in top-K / K
MRR	Rank of first relevant	Mean over queries of 1 / rank of first relevant
NDCG	Ranking quality	DCG@K / ideal DCG@K (normalized discounted cumulative gain)
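
The formulas in the table translate directly into code. This sketch assumes binary relevance (a document is either relevant or not) and unique document ids in each ranking; the sample ranking is invented.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Relevant docs in the top-k divided by all relevant docs."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Relevant docs in the top-k divided by k."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]  # invented ranking
relevant = {"d1", "d2", "d9"}
recall = recall_at_k(retrieved, relevant, k=3)        # 1 of 3 relevant docs found
precision = precision_at_k(retrieved, relevant, k=3)  # 1 of 3 retrieved docs relevant
rr = reciprocal_rank(retrieved, relevant)             # first relevant doc at rank 2
ndcg = ndcg_at_k(retrieved, relevant, k=4)
```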

Generation Quality

Metric	Measures	How
Faithfulness	Grounded in context	Check claims against sources
Relevance	Answers the question	Human evaluation
Completeness	Covers all aspects	Human evaluation
Hallucination rate	Made-up facts	Compare to source docs

RAG Evaluation Tools

# Using ragas library
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

Common Pitfalls

Pitfall	Symptom	Solution
Chunks too small	Answers lack context	Increase chunk size or use parent retrieval
Chunks too large	Irrelevant content included	Decrease size, improve chunking
Wrong K value	Too much/little context	Tune K based on evaluation
No metadata	Can't filter results	Add source, date, topic metadata
Stale index	Outdated answers	Implement refresh pipeline
Ignoring retrieved context	Hallucinations	Improve prompt, lower temperature

Production Considerations

Caching

# Cache embeddings
embedding_cache = {}
def get_embedding(text):
    if text not in embedding_cache:
        embedding_cache[text] = embed_model.embed_query(text)
    return embedding_cache[text]

# Cache frequent queries
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_search(query):
    # lru_cache keys on the query string itself; return a tuple so callers
    # cannot mutate the cached result
    return tuple(vector_store.similarity_search(query, k=5))

Monitoring

# Log retrieval quality
logger.info({
    "query": query,
    "retrieved_docs": [d.id for d in results],
    "retrieval_time_ms": elapsed,
    "rerank_time_ms": rerank_elapsed,
    "total_time_ms": total_elapsed
})
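
The timing fields in the log record can be collected with a small context manager. The sketch below uses `time.perf_counter`; `timed`, `fake_search`, and `fake_rerank` are hypothetical names, with the two fakes standing in for real retrieval and reranking calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, name):
    """Record the elapsed wall-clock milliseconds of a block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

def fake_search(query):
    """Hypothetical retrieval step."""
    return ["doc-1", "doc-2"]

def fake_rerank(query, docs):
    """Hypothetical rerank step that simply reverses the order."""
    return list(reversed(docs))

query = "how do I deploy?"
timings = {}
with timed(timings, "total_time_ms"):
    with timed(timings, "retrieval_time_ms"):
        results = fake_search(query)
    with timed(timings, "rerank_time_ms"):
        results = fake_rerank(query, results)

# Same shape as the log record above, ready to pass to a logger
log_record = {"query": query, "retrieved_docs": results, **timings}
```

Because the timer writes in a `finally` block, a stage that raises still records how long it ran before failing.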

Cost Optimization

Optimization	Savings
Batch embeddings	API calls
Cache frequent queries	Compute + API
Use smaller embedding model	API cost
Compress vectors (PQ)	Storage
Filter before semantic search	Compute

Synapses

See synapses.json for connections.
