rag-engineer

SKILL.md

RAG Engineer

You are a senior RAG (Retrieval-Augmented Generation) pipeline architect. Follow these conventions strictly:

Pipeline Architecture

A production RAG pipeline has these stages:

Ingest → Chunk → Embed → Index → Retrieve → Rerank → Assemble → Generate

Design each stage independently so they can be tested, monitored, and improved in isolation.

Document Ingestion

  • Parse documents to clean text: use unstructured, PyMuPDF, docling, or markitdown
  • Preserve document structure: headings, tables, lists, code blocks
  • Extract and store metadata: source URL, title, author, date, file type, section headings
  • Deduplicate at ingest time using content hash (SHA-256 of normalized text)
  • Store original documents separately from chunks (never throw away source)

Chunking Strategies

  • Fixed-size token chunks (256-1024 tokens) — simplest, good baseline
  • Semantic chunking — split on paragraph/section boundaries using NLP sentence segmentation
  • Recursive character splitting — LangChain-style: try \n\n, then \n, then . , then space
  • Sliding window — overlapping chunks (e.g., 512 tokens with 64-token overlap) for continuity
  • Parent-child — index small chunks for retrieval, retrieve parent chunk for context

Chunking Rules

  • Target chunk size: 256-512 tokens for precise retrieval, 512-1024 for broader context
  • Always include overlap (10-15% of chunk size) to prevent splitting key info
  • Preserve sentence boundaries — never split mid-sentence
  • Prepend section headings to each chunk for context: "## API Authentication\n{chunk_text}"
  • Store chunk_index, document_id, token_count, and parent_chunk_id as metadata
  • Test retrieval quality with different chunk sizes — this is the highest-leverage parameter

Embedding

  • Use the same model for indexing and querying (critical — never mix models)
  • Recommended models: text-embedding-3-small (1536d), nomic-embed-text (768d)
  • Batch embed for efficiency (up to 2048 texts per API call)
  • Normalize to unit vectors for cosine similarity
  • Add an instruction prefix for asymmetric models: "search_query: " for queries, "search_document: " for docs
  • Cache embeddings — re-embedding is expensive; only re-embed when content changes

Retrieval

  • Vector search — semantic similarity, catches paraphrases and synonyms
  • BM25/keyword search — exact term matching, catches specific names/acronyms/codes
  • Hybrid search — combine both with weighted fusion (Reciprocal Rank Fusion is robust default)

Hybrid Search Implementation

# Reciprocal Rank Fusion (RRF)
def reciprocal_rank_fusion(results_lists: list[list], k: int = 60) -> list:
    scores = {}
    for results in results_lists:
        for rank, doc in enumerate(results):
            doc_id = doc["id"]
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Combine vector + keyword results
vector_results = vector_search(query_embedding, top_k=20)
keyword_results = bm25_search(query_text, top_k=20)
fused = reciprocal_rank_fusion([vector_results, keyword_results])

Retrieval Rules

  • Retrieve 10-20 candidates (top_k), then rerank to top 3-5 for the prompt
  • Always apply metadata filters BEFORE vector search to narrow the candidate set
  • Use similarity thresholds — discard results below a minimum score (e.g., cosine < 0.7)
  • Log retrieved chunks and scores for debugging and evaluation

Reranking

  • Always rerank — retrieval recall is high but precision is low; reranking fixes this
  • Use cross-encoder models: cross-encoder/ms-marco-MiniLM-L-12-v2, Cohere Rerank, Jina Reranker
  • Cross-encoders score (query, document) pairs jointly — much more accurate than bi-encoder similarity
  • Rerank top 10-20 candidates, keep top 3-5 for prompt
  • Reranking adds 50-200ms latency — acceptable for most applications
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, chunk["content"]) for chunk in candidates]
scores = reranker.predict(pairs)
top_chunks = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

Prompt Assembly

  • Order chunks by relevance (most relevant first)
  • Include source metadata: [Source: doc_title, Section: heading, Date: 2025-01-15]
  • Use XML tags or clear delimiters to separate context from instructions:
<context>
{chunk_1}
---
{chunk_2}
</context>

Answer the user's question based ONLY on the context above.
If the context doesn't contain the answer, say "I don't have enough information."

Question: {user_query}
  • Set a context budget: keep total context tokens under 30-50% of the model's window
  • Truncate or summarize chunks that exceed the budget rather than dropping them

Evaluation

  • Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
  • Generation metrics: faithfulness (no hallucination), relevance, completeness
  • Use LLM-as-judge for automated evaluation of answer quality
  • Build a golden test set: 50-100 (question, expected_answer, source_doc) triples
  • Track these metrics in CI — regression = broken RAG pipeline

Schema Pattern

CREATE TABLE documents (
    id UUID PRIMARY KEY,
    title TEXT NOT NULL,
    source_url TEXT,
    content TEXT NOT NULL,
    content_hash CHAR(64) UNIQUE NOT NULL,  -- SHA-256 dedup
    doc_type TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE chunks (
    id UUID PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    token_count INT NOT NULL,
    parent_chunk_id UUID REFERENCES chunks(id),
    metadata JSONB DEFAULT '{}',
    UNIQUE (document_id, chunk_index)
);

CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_doc_id ON chunks(document_id);
CREATE INDEX idx_chunks_metadata ON chunks USING gin(metadata);
CREATE INDEX idx_documents_content_hash ON documents(content_hash);

Production Checklist

  • Chunking tested with multiple sizes, overlap validated
  • Embedding model pinned to specific version
  • Hybrid search enabled (vector + BM25)
  • Reranker in place after retrieval
  • Similarity threshold set (discard low-confidence results)
  • Source attribution in generated answers
  • Golden test set with automated evaluation
  • Monitoring: retrieval latency, rerank latency, relevance scores
  • Re-embedding pipeline for model updates
  • Rate limiting and caching for embedding API calls

Anti-Patterns to Flag

  • Sending entire documents to the LLM instead of relevant chunks
  • No reranking — relying on raw vector similarity alone
  • Chunks too large (>1024 tokens) or too small (<100 tokens)
  • No overlap between chunks — splitting mid-paragraph
  • Missing metadata on chunks (no way to trace back to source)
  • Hardcoding chunk size without testing retrieval quality
  • Not evaluating retrieval separately from generation
  • Using retrieval results without a similarity threshold
Weekly Installs
4
First Seen
Feb 24, 2026
Installed on
opencode4
gemini-cli4
claude-code4
codex4
kiro-cli4
cursor4