# rag-engineer

**RAG Engineering.** Build Retrieval-Augmented Generation pipelines: chunk, embed, store, retrieve, generate.
## Pipeline Overview

Documents → Chunk → Embed → Store in Vector DB → Query → Retrieve → Generate Answer
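The flow above as a minimal, self-contained sketch. Every name here is invented for illustration, and `embed` is a trivial stub; a real pipeline would call an embedding API and a vector database instead:

```python
from typing import List, Tuple

def embed(text: str) -> List[float]:
    # Stub embedding: first four character codes. Real pipelines call an embedding model.
    return [float(ord(c)) for c in text[:4]]

store: List[Tuple[List[float], str]] = []  # stand-in for a vector DB

def index(chunks: List[str]) -> None:
    # "Chunk → Embed → Store" stages collapsed into one loop.
    for c in chunks:
        store.append((embed(c), c))

def retrieve(query: str, k: int = 2) -> List[str]:
    # "Query → Retrieve": nearest neighbours by squared Euclidean distance.
    # Real systems use cosine similarity plus an ANN index.
    qv = embed(query)
    ranked = sorted(store, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], qv)))
    return [c for _, c in ranked[:k]]

index(["auth docs", "billing docs", "authz notes"])
print(retrieve("auth"))
```

The retrieved chunks would then be pasted into the generation prompt, as shown in the Answer Generation section.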
## Document Chunking

```python
# Simple fixed-size chunking with overlap
def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

# Semantic chunking (by paragraph/section)
def chunk_by_paragraphs(text, max_size=1000):
    paragraphs = text.split('\n\n')
    chunks, current = [], ""
    for p in paragraphs:
        if len(current) + len(p) > max_size and current:
            chunks.append(current.strip())
            current = p
        else:
            current += "\n\n" + p
    if current:
        chunks.append(current.strip())
    return chunks
```
## Generate Embeddings

### OpenAI

```shell
curl -s https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text to embed", "model": "text-embedding-3-small"}' \
  | jq '.data[0].embedding[:5]'
```
### Python with OpenAI

```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    return client.embeddings.create(input=text, model=model).data[0].embedding

# Batch embed
texts = ["chunk 1", "chunk 2", "chunk 3"]
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
embeddings = [d.embedding for d in response.data]
```
## Vector Storage

### Supabase (pgvector)

```shell
# Create table
psql $DATABASE_URL -c "
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536),
  metadata JSONB
);
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"

# Query similar documents
psql $DATABASE_URL -c "
SELECT content, 1 - (embedding <=> '[0.1, 0.2, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 5;
"
```
### ChromaDB (Local)

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

# Add documents
collection.add(
    documents=["chunk 1", "chunk 2"],
    ids=["id1", "id2"],
    metadatas=[{"source": "file1"}, {"source": "file2"}]
)

# Query
results = collection.query(query_texts=["search query"], n_results=5)
print(results["documents"])
```
## Retrieval Strategies

### Basic similarity search

Embed the query, find the top-K nearest chunks.
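The idea in code: a pure-Python cosine-similarity ranking over a toy in-memory index (the chunk names and vectors are made up; a vector DB does this with an ANN index instead of a full sort):

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: List[float], index: List[Tuple[str, List[float]]], k: int = 3):
    # Rank every stored chunk by cosine similarity to the query embedding.
    ranked = sorted(index, key=lambda item: cosine(item[1], query_vec), reverse=True)
    return ranked[:k]

index = [("chunk A", [1.0, 0.0]), ("chunk B", [0.0, 1.0]), ("chunk C", [0.7, 0.7])]
print(top_k([1.0, 0.1], index, k=2))
```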
### Hybrid search (keyword + semantic)

Combine BM25 text search with vector similarity. Weight and merge the results.
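One common way to merge the two ranked lists is reciprocal rank fusion (RRF); the document IDs below are made up. Each list contributes `1 / (k + rank)` per document, so items ranked highly by either retriever float to the top:

```python
from collections import defaultdict
from typing import Dict, List

def rrf_merge(rankings: List[List[str]], k: int = 60) -> List[str]:
    # Reciprocal rank fusion: sum 1/(k + rank) across all input rankings.
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d3", "d2"]     # keyword ranking
vector_hits = ["d2", "d1", "d4"]   # semantic ranking
print(rrf_merge([bm25_hits, vector_hits]))
```

An alternative is a weighted sum of normalized BM25 and cosine scores; RRF is simpler because it needs only ranks, not comparable scores.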
### Re-ranking

Retrieve the top-20, then re-rank with a cross-encoder model to get the top-5.
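A sketch of the retrieve-then-rerank flow. `cross_encoder_score` here is a toy token-overlap stand-in, not a real model; in practice you would score each pair with an actual cross-encoder (e.g. a sentence-transformers `CrossEncoder`):

```python
from typing import Callable, List

def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], top_n: int = 5) -> List[str]:
    # Score every (query, candidate) pair and keep the best top_n.
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_n]

def cross_encoder_score(query: str, doc: str) -> float:
    # Toy stand-in: fraction of query tokens present in the doc.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["api auth tokens", "billing faq", "how to authenticate the api"]
print(rerank("api auth", docs, cross_encoder_score, top_n=2))
```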
### Contextual retrieval

Prepend document-level context to each chunk before embedding:

```
"This chunk is from the API documentation, section: Authentication. "
+ original chunk text
```
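As a tiny helper (the function name and sample strings are invented for illustration), applied before computing embeddings:

```python
def contextualize(chunk: str, doc_title: str, section: str) -> str:
    # Prepend document-level context so the embedding carries provenance.
    return f"This chunk is from {doc_title}, section: {section}. {chunk}"

print(contextualize("Use a Bearer token in the Authorization header.",
                    "the API documentation", "Authentication"))
```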
## Answer Generation

```shell
# Claude with retrieved context
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": "Answer questions using only the provided context. If the context does not contain the answer, say so.",
    "messages": [{"role": "user", "content": "Context:\n[retrieved chunks here]\n\nQuestion: How do I authenticate?"}]
  }' | jq '.content[0].text'
```
## Notes

- Chunk size matters. Too small = lost context; too large = noise in retrieval. 300-800 tokens is typical.
- Overlap prevents splitting key information across chunk boundaries.
- Embedding model choice affects quality significantly: `text-embedding-3-small` is cost-effective; `text-embedding-3-large` is higher quality.
- Always include source metadata with chunks so you can cite references.
- Test retrieval quality before building the full pipeline — if retrieval is bad, generation can't fix it.
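To act on that last note, a simple metric is recall@k over a small set of hand-labeled queries (the query and chunk IDs below are made up):

```python
from typing import Dict, List, Set

def recall_at_k(results: Dict[str, List[str]],
                relevant: Dict[str, Set[str]], k: int = 5) -> float:
    # Average, over queries, of the fraction of known-relevant chunks
    # that appear in the top-k retrieved results.
    total = 0.0
    for query, retrieved in results.items():
        rel = relevant[query]
        total += len(set(retrieved[:k]) & rel) / len(rel)
    return total / len(results)

results = {"how do I authenticate?": ["auth-1", "billing-2", "auth-3"]}
relevant = {"how do I authenticate?": {"auth-1", "auth-3"}}
print(recall_at_k(results, relevant, k=3))
```

If recall@k is low on your labeled set, tune chunking, the embedding model, or the retrieval strategy before touching the generation prompt.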