# RAG Pipeline Builder
Design end-to-end RAG pipelines for accurate document retrieval and generation.
## Pipeline Architecture

Documents → Chunking → Embedding → Vector Store → Retrieval → Reranking → Generation
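These stages compose into a single query path. A minimal sketch of that composition, assuming the `hybrid_retrieval`, `rerank_results`, and `fit_context_window` helpers defined in the sections below, plus a hypothetical `generate_answer` LLM call:

```python
def answer_query(query: str, k: int = 5) -> str:
    """End-to-end query path: retrieve, rerank, fit context, generate."""
    # Hybrid dense + sparse retrieval (defined under Retrieval Strategies)
    candidates = hybrid_retrieval(query, k=k * 2)
    # Cross-encoder reranking (defined under Reranking)
    top_chunks = rerank_results(query, candidates, top_k=k)
    # Trim to the model's context budget (defined under Context Window Management)
    context = fit_context_window(top_chunks, max_tokens=4000)
    # generate_answer is a hypothetical LLM call over the query plus context
    return generate_answer(query, context)
```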
## Chunking Strategy

```python
# Semantic chunking (recommended)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Characters per chunk
    chunk_overlap=200,    # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# document and calculate_page come from your own ingestion layer
chunks = splitter.split_text(document.text)

# Add metadata to each chunk
for i, chunk in enumerate(chunks):
    chunks[i] = {
        "text": chunk,
        "metadata": {
            "source": document.filename,
            "page": calculate_page(i),
            "chunk_id": f"{document.id}_chunk_{i}",
        },
    }
```
## Metadata Schema

```typescript
interface ChunkMetadata {
  // Source information
  document_id: string;
  source: string;
  url?: string;

  // Location
  page?: number;
  section?: string;
  chunk_index: number;

  // Content classification
  content_type: "text" | "code" | "table" | "list";
  language?: string;

  // Timestamps
  created_at: Date;
  updated_at: Date;

  // Retrieval optimization
  keywords: string[];
  summary?: string;
  importance_score?: number;
}
```
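Since the pipeline code is Python, a `TypedDict` mirror of the same schema keeps metadata construction checkable on that side as well. A sketch; field names follow the interface above, and `total=False` relaxes the required/optional distinction:

```python
from datetime import datetime
from typing import List, Literal, Optional, TypedDict

class ChunkMetadata(TypedDict, total=False):
    # Source information
    document_id: str
    source: str
    url: Optional[str]
    # Location
    page: Optional[int]
    section: Optional[str]
    chunk_index: int
    # Content classification
    content_type: Literal["text", "code", "table", "list"]
    language: Optional[str]
    # Timestamps
    created_at: datetime
    updated_at: datetime
    # Retrieval optimization
    keywords: List[str]
    summary: Optional[str]
    importance_score: Optional[float]
```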
## Vector Store Setup

```python
# Pinecone example (legacy pinecone-client < 3.x API)
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Pinecone

pinecone.init(api_key="...", environment="...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Wrap the chunk dicts from the previous section as LangChain Documents
docs = [Document(page_content=c["text"], metadata=c["metadata"]) for c in chunks]

vectorstore = Pinecone.from_documents(
    documents=docs,
    embedding=embeddings,
    index_name="knowledge-base",
    namespace="production",
)
```
## Retrieval Strategies

```python
# Hybrid search (dense + sparse)
def hybrid_retrieval(query: str, k: int = 5):
    # Dense retrieval (semantic)
    dense_results = vectorstore.similarity_search(query, k=k * 2)
    # Sparse retrieval (keyword); bm25_search is an assumed helper
    # over your own keyword index
    sparse_results = bm25_search(query, k=k * 2)
    # Combine the ranked lists (reciprocal_rank_fusion is sketched below)
    combined = reciprocal_rank_fusion(dense_results, sparse_results)
    return combined[:k]
```
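Reciprocal rank fusion merges ranked lists by scoring each document 1/(c + rank) in every list it appears in and summing. A minimal sketch, assuming documents carry the `chunk_id` metadata defined above:

```python
from collections import defaultdict
from typing import List

def reciprocal_rank_fusion(*ranked_lists: List, c: int = 60) -> List:
    """Merge ranked result lists: each doc scores sum(1 / (c + rank))."""
    scores = defaultdict(float)
    docs_by_id = {}
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            doc_id = doc.metadata["chunk_id"]  # assumes chunk_id metadata
            scores[doc_id] += 1.0 / (c + rank)
            docs_by_id[doc_id] = doc
    # Highest fused score first
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_id[doc_id] for doc_id in ordered]
```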
```python
# Metadata filtering
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={
        "content_type": "code",
        "language": "python",
    },
)
```
## Reranking

```python
from typing import List

from langchain.schema import Document
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, results: List[Document], top_k: int = 3):
    # Score each result against the query
    pairs = [(query, doc.page_content) for doc in results]
    scores = reranker.predict(pairs)
    # Sort by score, highest first
    scored_results = list(zip(results, scores))
    scored_results.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored_results[:top_k]]
```
## Query Enhancement

```python
import json
from typing import List

# Query expansion
def expand_query(query: str) -> List[str]:
    expansion_prompt = f"""
    Generate 3 alternative phrasings of this query:
    "{query}"
    Return as JSON array of strings.
    """
    # llm is an assumed completion call that returns the raw model text
    alternatives = json.loads(llm(expansion_prompt))
    return [query] + alternatives

# Multi-query retrieval
def multi_query_retrieval(query: str, k: int = 5):
    queries = expand_query(query)
    all_results = []
    for q in queries:
        results = vectorstore.similarity_search(q, k=k)
        all_results.extend(results)
    # Deduplicate (sketched below) and rerank against the original query
    unique_results = deduplicate(all_results)
    return rerank_results(query, unique_results, k)
```
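Multi-query retrieval returns the same chunk several times when the expanded queries overlap. A minimal `deduplicate` sketch, keyed on the `chunk_id` metadata and preserving first-seen order:

```python
from typing import List

def deduplicate(results: List) -> List:
    """Drop repeated chunks, keeping the first occurrence of each chunk_id."""
    seen = set()
    unique = []
    for doc in results:
        doc_id = doc.metadata["chunk_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc)
    return unique
```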
## Evaluation Plan

```python
# Define a golden dataset of queries with known-relevant documents and chunks
golden_dataset = [
    {
        "query": "How do I authenticate users?",
        "expected_docs": ["auth_guide.md", "user_management.md"],
        "relevant_chunks": ["chunk_123", "chunk_456"],
    },
]

# Metrics
def evaluate_retrieval(dataset):
    results = {
        "precision": [],
        "recall": [],
        "mrr": [],   # Mean Reciprocal Rank (helper sketched below)
        "ndcg": [],  # Normalized Discounted Cumulative Gain
    }
    for item in dataset:
        retrieved = retrieval_fn(item["query"])  # retrieval_fn: the pipeline under test
        retrieved_ids = [doc.metadata["chunk_id"] for doc in retrieved]
        # Calculate set-based metrics
        relevant = set(item["relevant_chunks"])
        retrieved_set = set(retrieved_ids)
        precision = len(relevant & retrieved_set) / len(retrieved_set)
        recall = len(relevant & retrieved_set) / len(relevant)
        results["precision"].append(precision)
        results["recall"].append(recall)
        results["mrr"].append(mean_reciprocal_rank(retrieved_ids, relevant))
    # Average each metric, skipping any list left empty (e.g. ndcg here)
    return {k: sum(v) / len(v) for k, v in results.items() if v}
```
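Reciprocal rank rewards putting the first relevant chunk near the top of the list. A minimal sketch of the per-query helper called above (the evaluation loop averages it across the dataset); NDCG is left to a library such as scikit-learn's `ndcg_score`:

```python
from typing import Iterable, Set

def mean_reciprocal_rank(retrieved_ids: Iterable[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```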
## Context Window Management

```python
from typing import List

from langchain.schema import Document

def fit_context_window(chunks: List[Document], max_tokens: int = 4000):
    """Select chunks, in ranked order, until the context window is full."""
    total_tokens = 0
    selected_chunks = []
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk.page_content)  # helper sketched below
        if total_tokens + chunk_tokens <= max_tokens:
            selected_chunks.append(chunk)
            total_tokens += chunk_tokens
        else:
            break
    return selected_chunks
```
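`count_tokens` above is an assumed helper. A minimal sketch using `tiktoken`, OpenAI's tokenizer library; the encoding name depends on the target model family:

```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with the tokenizer matching the target model."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))
```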
## Best Practices
- Chunk size: 500-1000 characters for general text
- Overlap: 10-20% of chunk size between adjacent chunks
- Metadata: Rich metadata for filtering
- Hybrid search: Combine semantic + keyword
- Reranking: Cross-encoder for final ranking
- Evaluation: Golden dataset with metrics
- Context management: Don't exceed model limits
## Output Checklist
- Chunking strategy defined
- Metadata schema documented
- Vector store configured
- Retrieval algorithm implemented
- Reranking pipeline added
- Query enhancement (optional)
- Context window management
- Evaluation dataset created
- Metrics implementation
- Performance baseline established