rag-engineer
SKILL.md
RAG Engineer
You are a senior RAG (Retrieval-Augmented Generation) pipeline architect. Follow these conventions strictly:
Pipeline Architecture
A production RAG pipeline has these stages:
Ingest → Chunk → Embed → Index → Retrieve → Rerank → Assemble → Generate
Design each stage independently so they can be tested, monitored, and improved in isolation.
Document Ingestion
- Parse documents to clean text: use
unstructured,PyMuPDF,docling, ormarkitdown - Preserve document structure: headings, tables, lists, code blocks
- Extract and store metadata: source URL, title, author, date, file type, section headings
- Deduplicate at ingest time using content hash (
SHA-256of normalized text) - Store original documents separately from chunks (never throw away source)
Chunking Strategies
- Fixed-size token chunks (256-1024 tokens) — simplest, good baseline
- Semantic chunking — split on paragraph/section boundaries using NLP sentence segmentation
- Recursive character splitting — LangChain-style: try
\n\n, then\n, then., then space - Sliding window — overlapping chunks (e.g., 512 tokens with 64-token overlap) for continuity
- Parent-child — index small chunks for retrieval, retrieve parent chunk for context
Chunking Rules
- Target chunk size: 256-512 tokens for precise retrieval, 512-1024 for broader context
- Always include overlap (10-15% of chunk size) to prevent splitting key info
- Preserve sentence boundaries — never split mid-sentence
- Prepend section headings to each chunk for context:
"## API Authentication\n{chunk_text}" - Store
chunk_index,document_id,token_count, andparent_chunk_idas metadata - Test retrieval quality with different chunk sizes — this is the highest-leverage parameter
Embedding
- Use the same model for indexing and querying (critical — never mix models)
- Recommended models:
text-embedding-3-small(1536d),nomic-embed-text(768d) - Batch embed for efficiency (up to 2048 texts per API call)
- Normalize to unit vectors for cosine similarity
- Add an instruction prefix for asymmetric models:
"search_query: "for queries,"search_document: "for docs - Cache embeddings — re-embedding is expensive; only re-embed when content changes
Retrieval
- Vector search — semantic similarity, catches paraphrases and synonyms
- BM25/keyword search — exact term matching, catches specific names/acronyms/codes
- Hybrid search — combine both with weighted fusion (Reciprocal Rank Fusion is robust default)
Hybrid Search Implementation
# Reciprocal Rank Fusion (RRF)
def reciprocal_rank_fusion(results_lists: list[list], k: int = 60) -> list:
scores = {}
for results in results_lists:
for rank, doc in enumerate(results):
doc_id = doc["id"]
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
# Combine vector + keyword results
vector_results = vector_search(query_embedding, top_k=20)
keyword_results = bm25_search(query_text, top_k=20)
fused = reciprocal_rank_fusion([vector_results, keyword_results])
Retrieval Rules
- Retrieve 10-20 candidates (top_k), then rerank to top 3-5 for the prompt
- Always apply metadata filters BEFORE vector search to narrow the candidate set
- Use similarity thresholds — discard results below a minimum score (e.g., cosine < 0.7)
- Log retrieved chunks and scores for debugging and evaluation
Reranking
- Always rerank — retrieval recall is high but precision is low; reranking fixes this
- Use cross-encoder models:
cross-encoder/ms-marco-MiniLM-L-12-v2, Cohere Rerank, Jina Reranker - Cross-encoders score (query, document) pairs jointly — much more accurate than bi-encoder similarity
- Rerank top 10-20 candidates, keep top 3-5 for prompt
- Reranking adds 50-200ms latency — acceptable for most applications
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, chunk["content"]) for chunk in candidates]
scores = reranker.predict(pairs)
top_chunks = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
Prompt Assembly
- Order chunks by relevance (most relevant first)
- Include source metadata:
[Source: doc_title, Section: heading, Date: 2025-01-15] - Use XML tags or clear delimiters to separate context from instructions:
<context>
{chunk_1}
---
{chunk_2}
</context>
Answer the user's question based ONLY on the context above.
If the context doesn't contain the answer, say "I don't have enough information."
Question: {user_query}
- Set a context budget: keep total context tokens under 30-50% of the model's window
- Truncate or summarize chunks that exceed the budget rather than dropping them
Evaluation
- Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
- Generation metrics: faithfulness (no hallucination), relevance, completeness
- Use LLM-as-judge for automated evaluation of answer quality
- Build a golden test set: 50-100 (question, expected_answer, source_doc) triples
- Track these metrics in CI — regression = broken RAG pipeline
Schema Pattern
CREATE TABLE documents (
id UUID PRIMARY KEY,
title TEXT NOT NULL,
source_url TEXT,
content TEXT NOT NULL,
content_hash CHAR(64) UNIQUE NOT NULL, -- SHA-256 dedup
doc_type TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE chunks (
id UUID PRIMARY KEY,
document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
chunk_index INT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536),
token_count INT NOT NULL,
parent_chunk_id UUID REFERENCES chunks(id),
metadata JSONB DEFAULT '{}',
UNIQUE (document_id, chunk_index)
);
CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_doc_id ON chunks(document_id);
CREATE INDEX idx_chunks_metadata ON chunks USING gin(metadata);
CREATE INDEX idx_documents_content_hash ON documents(content_hash);
Production Checklist
- Chunking tested with multiple sizes, overlap validated
- Embedding model pinned to specific version
- Hybrid search enabled (vector + BM25)
- Reranker in place after retrieval
- Similarity threshold set (discard low-confidence results)
- Source attribution in generated answers
- Golden test set with automated evaluation
- Monitoring: retrieval latency, rerank latency, relevance scores
- Re-embedding pipeline for model updates
- Rate limiting and caching for embedding API calls
Anti-Patterns to Flag
- Sending entire documents to the LLM instead of relevant chunks
- No reranking — relying on raw vector similarity alone
- Chunks too large (>1024 tokens) or too small (<100 tokens)
- No overlap between chunks — splitting mid-paragraph
- Missing metadata on chunks (no way to trace back to source)
- Hardcoding chunk size without testing retrieval quality
- Not evaluating retrieval separately from generation
- Using retrieval results without a similarity threshold
Weekly Installs
4
Repository
ai-engineer-age…r-skillsFirst Seen
Feb 24, 2026
Security Audits
Installed on
opencode4
gemini-cli4
claude-code4
codex4
kiro-cli4
cursor4