rag_implementation
RAG Implementation
Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.
When to Use This Skill
- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling LLMs to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation
Core Components
1. Vector Databases
Purpose: Store and retrieve document embeddings efficiently
Options:
- Pinecone: Managed, scalable, fast queries
- Weaviate: Open-source, hybrid search
- Milvus: High performance, on-premise
- Chroma: Lightweight, easy to use
- Qdrant: Fast, filtered search
- FAISS: Meta's library, local deployment
2. Embeddings
Purpose: Convert text to numerical vectors for similarity search
Models:
- text-embedding-ada-002 (OpenAI): General purpose, 1536 dims
- all-MiniLM-L6-v2 (Sentence Transformers): Fast, lightweight
- e5-large-v2: High quality, multilingual
- Instructor: Task-specific instructions
- bge-large-en-v1.5: SOTA performance
3. Retrieval Strategies
Approaches:
- Dense Retrieval: Semantic similarity via embeddings
- Sparse Retrieval: Keyword matching (BM25, TF-IDF)
- Hybrid Search: Combine dense + sparse
- Multi-Query: Generate multiple query variations
- HyDE: Generate hypothetical documents
4. Reranking
Purpose: Improve retrieval quality by reordering results
Methods:
- Cross-Encoders: BERT-based reranking
- Cohere Rerank: API-based reranking
- Maximal Marginal Relevance (MMR): Diversity + relevance
- LLM-based: Use LLM to score relevance
Quick Start
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitters import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()
# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len
)
chunks = text_splitter.split_documents(documents)
# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(),
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True
)
# 5. Query
result = qa_chain({"query": "What are the main features?"})
print(result['result'])
print(result['source_documents'])
Advanced RAG Patterns
Pattern 1: Hybrid Search
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with weights
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, embedding_retriever],
weights=[0.3, 0.7]
)
Pattern 2: Multi-Query Retrieval
from langchain.retrievers.multi_query import MultiQueryRetriever
# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=OpenAI()
)
# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
Pattern 3: Contextual Compression
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever()
)
# Returns only relevant parts of documents
compressed_docs = compression_retriever.get_relevant_documents("query")
Pattern 4: Parent Document Retriever
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# Store for parent documents
store = InMemoryStore()
# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter
)
Document Chunking Strategies
Recursive Character Text Splitter
from langchain.text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""] # Try these in order
)
Token-Based Splitting
from langchain.text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(
chunk_size=512,
chunk_overlap=50
)
Semantic Chunking
from langchain.text_splitters import SemanticChunker
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile"
)
Markdown Header Splitter
from langchain.text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
Vector Store Configurations
Pinecone
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("your-index-name")
vectorstore = Pinecone(index, embeddings.embed_query, "text")
Weaviate
import weaviate
from langchain.vectorstores import Weaviate
client = weaviate.Client("http://localhost:8080")
vectorstore = Weaviate(client, "Document", "content", embeddings)
Chroma (Local)
from langchain.vectorstores import Chroma
vectorstore = Chroma(
collection_name="my_collection",
embedding_function=embeddings,
persist_directory="./chroma_db"
)
Retrieval Optimization
1. Metadata Filtering
# Add metadata during indexing
chunks_with_metadata = []
for i, chunk in enumerate(chunks):
chunk.metadata = {
"source": chunk.metadata.get("source"),
"page": i,
"category": determine_category(chunk.page_content)
}
chunks_with_metadata.append(chunk)
# Filter during retrieval
results = vectorstore.similarity_search(
"query",
filter={"category": "technical"},
k=5
)
2. Maximal Marginal Relevance
# Balance relevance with diversity
results = vectorstore.max_marginal_relevance_search(
"query",
k=5,
fetch_k=20, # Fetch 20, return top 5 diverse
lambda_mult=0.5 # 0=max diversity, 1=max relevance
)
3. Reranking with Cross-Encoder
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Get initial results
candidates = vectorstore.similarity_search("query", k=20)
# Rerank
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)
# Sort by score and take top k
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
Prompt Engineering for RAG
Contextual Prompt
prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
With Citations
prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc.
Context:
{context}
Question: {question}
Answer (with citations):"""
With Confidence
prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer.
Context:
{context}
Question: {question}
Answer:
Confidence:"""
Evaluation Metrics
def evaluate_rag_system(qa_chain, test_cases):
metrics = {
'accuracy': [],
'retrieval_quality': [],
'groundedness': []
}
for test in test_cases:
result = qa_chain({"query": test['question']})
# Check if answer matches expected
accuracy = calculate_accuracy(result['result'], test['expected'])
metrics['accuracy'].append(accuracy)
# Check if relevant docs were retrieved
retrieval_quality = evaluate_retrieved_docs(
result['source_documents'],
test['relevant_docs']
)
metrics['retrieval_quality'].append(retrieval_quality)
# Check if answer is grounded in context
groundedness = check_groundedness(
result['result'],
result['source_documents']
)
metrics['groundedness'].append(groundedness)
return {k: sum(v)/len(v) for k, v in metrics.items()}
Resources
- references/vector-databases.md: Detailed comparison of vector DBs
- references/embeddings.md: Embedding model selection guide
- references/retrieval-strategies.md: Advanced retrieval techniques
- references/reranking.md: Reranking methods and when to use them
- references/context-window.md: Managing context limits
- assets/vector-store-config.yaml: Configuration templates
- assets/retriever-pipeline.py: Complete RAG pipeline
- assets/embedding-models.md: Model comparison and benchmarks
Best Practices
- Chunk Size: Balance between context and specificity (500-1000 tokens)
- Overlap: Use 10-20% overlap to preserve context at boundaries
- Metadata: Include source, page, timestamp for filtering and debugging
- Hybrid Search: Combine semantic and keyword search for best results
- Reranking: Improve top results with cross-encoder
- Citations: Always return source documents for transparency
- Evaluation: Continuously test retrieval quality and answer accuracy
- Monitoring: Track retrieval metrics in production
Common Issues
- Poor Retrieval: Check embedding quality, chunk size, query formulation
- Irrelevant Results: Add metadata filtering, use hybrid search, rerank
- Missing Information: Ensure documents are properly indexed
- Slow Queries: Optimize vector store, use caching, reduce k RAG Implementation v1.1 - Enhanced
🔄 Workflow
Kaynak: Pinecone Learning Center & LlamaIndex Docs
Aşama 1: Ingestion Pipeline (ETL for AI)
- Loaders:
UnstructuredveyaLlamaParsekullanarak PDF, HTML, MD dosyalarından temiz metin çıkar (Tabloları koru). - Metadata Extraction: Her chunk'a başlık, tarih, yazar gibi metadata ekle (Filtering için kritik).
- Embedding:
text-embedding-3-small(OpenAI) veyabge-m3(Open Source) gibi modern, çok dilli modeller kullan.
Aşama 2: Advanced Retrieval Implementation
- Vector Store: Üretim için Pinecone/Weaviate, lokal test için Chroma/FAISS kullan. Namespace ayrımı yap (Multi-tenancy).
- Reranking: Vektör aramasından gelen ilk 50 sonucu,
Cohere Rerankveyabge-rerankerile yeniden sırala ve ilk 5'i al (Precision artışı). - GraphRAG: İlişkisel bilgi önemliyse (Kim kimi tanıyor?), Vektör DB yanında Knowledge Graph (Neo4j) kullan.
Aşama 3: Traceability & Production
- Observability: LangSmith veya Arize Phoenix ile her adımı (Retrieve -> Rerank -> Generate) logla.
- Caching: Sık sorulan sorular için Semantic Cache (GPTCache) kullan (Maliyet ve hız optimizasyonu).
- Streaming: Cevabı kelime kelime (Streaming) döndürerek algılanan gecikmeyi düşür.
Kontrol Noktaları
| Aşama | Doğrulama |
|---|---|
| 1 | Metadata filtreleri çalışıyor mu? (Örn: Sadece 2024 yılına ait dokümanlardan ara). |
| 2 | Reranker entegrasyonu MRR (Mean Reciprocal Rank) skorunu artırdı mı? |
| 3 | PII (Kişisel Veri) sızdırma riski var mı? (Redaction mekanizması). |
More from vuralserhat86/antigravity-agentic-skills
skill_creator
Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.
37huggingface_transformers
Hugging Face Transformers best practices including model loading, tokenization, fine-tuning workflows, and inference optimization. Use when working with transformer models, fine-tuning LLMs, implementing NLP tasks, or optimizing transformer inference.
22responsive_design
Build responsive, mobile-first layouts using fluid containers, flexible units, media queries, and touch-friendly design that works across all screen sizes. Use this skill when creating or modifying UI layouts, responsive grids, breakpoint styles, mobile navigation, or any interface that needs to adapt to different screen sizes. Apply when working with responsive CSS, media queries, viewport settings, flexbox/grid layouts, mobile-first styling, breakpoint definitions (mobile, tablet, desktop), touch target sizing, relative units (rem, em, %), image optimization for different screens, or testing layouts across multiple devices. Use for any task involving multi-device support, responsive design patterns, or adaptive layouts.
20cache_patterns
Instruction set for enabling and operating the Spring Cache abstraction in Spring Boot when implementing application-level caching for performance-sensitive workloads.
16zustand_state
Production-tested setup for Zustand state management in React. Includes patterns for persistence, devtools, and TypeScript patterns. Prevents hydration mismatches and render loops.
14vitest_runner
Modern JavaScript/TypeScript testing with Vitest including mocking and coverage.
13