rag-infrastructure
SKILL.md
RAG Infrastructure
Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.
When to Use This Skill
Use this skill when:
- Building a knowledge base Q&A system over internal documents
- Implementing semantic search over large document collections
- Reducing LLM hallucinations with retrieved context
- Setting up embedding pipelines and vector store infrastructure
- Deploying hybrid search (dense + sparse/BM25)
Prerequisites
- Python 3.10+ with
pip - A vector database (Qdrant, Weaviate, Pinecone, or pgvector)
- An embedding model (OpenAI, Cohere, or local via
sentence-transformers) - An LLM endpoint (OpenAI API or self-hosted vLLM)
- Docker for local vector DB deployment
Architecture Overview
Documents → Chunker → Embedder → Vector Store
↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer
Embedding Pipeline
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
# Local embedding model (no API cost)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Connect to Qdrant
client = QdrantClient("http://localhost:6333")
# Create collection
client.create_collection(
collection_name="knowledge-base",
vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
def ingest_documents(docs: list[dict]):
"""Chunk, embed, and upsert documents."""
points = []
for doc in docs:
chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
for chunk, embedding in zip(chunks, embeddings):
points.append(PointStruct(
id=str(uuid.uuid4()),
vector=embedding.tolist(),
payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
))
client.upsert(collection_name="knowledge-base", points=points)
print(f"Ingested {len(points)} chunks")
Chunking Strategies
from langchain.text_splitter import RecursiveCharacterTextSplitter
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
"""Recursive character splitter — best general-purpose strategy."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap,
separators=["\n\n", "\n", ". ", " ", ""],
)
return splitter.split_text(text)
# For code/markdown — use language-aware splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
Hybrid Search (Dense + Sparse)
from qdrant_client.models import SparseVector, SparseVectorParams, NamedSparseVector
from fastembed import SparseTextEmbedding
# Qdrant hybrid collection (dense + BM25 sparse)
client.create_collection(
collection_name="hybrid-kb",
vectors_config={"dense": VectorParams(size=1024, distance=Distance.COSINE)},
sparse_vectors_config={"sparse": SparseVectorParams()},
)
sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")
def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
dense_vec = model.encode(query).tolist()
sparse_vec = list(sparse_model.embed(query))[0]
results = client.query_points(
collection_name="hybrid-kb",
prefetch=[
{"query": dense_vec, "using": "dense", "limit": 20},
{"query": SparseVector(indices=sparse_vec.indices.tolist(),
values=sparse_vec.values.tolist()),
"using": "sparse", "limit": 20},
],
query={"fusion": "rrf"}, # Reciprocal Rank Fusion
limit=top_k,
)
return [{"text": p.payload["text"], "score": p.score} for p in results.points]
Reranking
import cohere
co = cohere.Client("your-api-key")
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
"""Rerank retrieved chunks for relevance (improves RAG quality ~20-30%)."""
response = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=candidates,
top_n=top_n,
)
return [candidates[r.index] for r in response.results]
# Alternative: local reranker (no API cost)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
pairs = [[query, c] for c in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return [text for text, _ in ranked[:top_n]]
RAG Query Pipeline
from openai import OpenAI
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")
def rag_query(user_question: str) -> str:
# 1. Retrieve
candidates = hybrid_search(user_question, top_k=20)
texts = [c["text"] for c in candidates]
# 2. Rerank
top_chunks = local_rerank(user_question, texts, top_n=5)
# 3. Generate
context = "\n\n---\n\n".join(top_chunks)
response = llm.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": (
"Answer the question using only the provided context. "
"If the answer isn't in the context, say so.\n\nContext:\n" + context
)},
{"role": "user", "content": user_question},
],
temperature=0.1,
max_tokens=1024,
)
return response.choices[0].message.content
Docker Compose: Full RAG Stack
services:
qdrant:
image: qdrant/qdrant:latest
volumes:
- qdrant-data:/qdrant/storage
ports:
- "6333:6333"
restart: unless-stopped
redis:
image: redis:7-alpine
volumes:
- redis-data:/data
restart: unless-stopped
ingestion-worker:
build: ./ingestion
environment:
- QDRANT_URL=http://qdrant:6333
- REDIS_URL=redis://redis:6379
depends_on: [qdrant, redis]
restart: unless-stopped
rag-api:
build: ./api
ports:
- "8080:8080"
environment:
- QDRANT_URL=http://qdrant:6333
- LLM_BASE_URL=http://vllm:8000/v1
depends_on: [qdrant]
restart: unless-stopped
volumes:
qdrant-data:
redis-data:
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Poor retrieval quality | Chunk size too large | Try 256–512 tokens; overlap 10–15% |
| LLM ignores retrieved context | Context too long | Rerank and keep top 3–5 chunks |
| Slow ingestion | Sequential embedding | Use batch_size=64 and async upserts |
| Stale documents | No re-ingestion pipeline | Track doc_hash; re-embed on change |
| High embedding costs | All chunks re-embedded | Cache embeddings with hash-based dedup |
Best Practices
- Use
BAAI/bge-large-en-v1.5ornomic-embed-textfor strong free embeddings. - Always rerank before passing to LLM — 5 precise chunks beat 20 noisy ones.
- Store source metadata (URL, page, section) in vector payloads for citations.
- Use namespace/tenant isolation in the vector store for multi-tenant RAG.
- Evaluate with RAGAS metrics: faithfulness, answer relevancy, context precision.
Related Skills
- vector-database-ops - Qdrant/Weaviate management
- vllm-server - Self-hosted LLM endpoint
- ollama-stack - Local LLM for development
- ai-pipeline-orchestration - Ingestion pipelines
Weekly Installs
3
Repository
bagelhole/devop…t-skillsGitHub Stars
13
First Seen
5 days ago
Security Audits
Installed on
opencode3
antigravity3
claude-code3
github-copilot3
codex3
zencoder3