
Embedding Optimization

Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.

When to Use This Skill

Trigger this skill when:

  • Building RAG (Retrieval Augmented Generation) systems
  • Implementing semantic search or similarity detection
  • Optimizing embedding API costs (reducing by 70-90%)
  • Improving document retrieval quality through better chunking
  • Processing large document corpora (thousands to millions of documents)
  • Selecting between API-based and local embedding models

Model Selection Framework

Choose the optimal embedding model based on requirements:

Quick Recommendations:

  • Startup/MVP: all-MiniLM-L6-v2 (local, 384 dims, zero API costs)
  • Production: text-embedding-3-small (API, 1,536 dims, balanced quality/cost)
  • High Quality: text-embedding-3-large (API, 3,072 dims, premium)
  • Multilingual: multilingual-e5-base (local, 768 dims) or Cohere embed-multilingual-v3.0

For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see references/model-selection-guide.md.

Model Comparison Summary:

| Model                    | Type  | Dimensions | Cost per 1M tokens | Best For                     |
|--------------------------|-------|------------|--------------------|------------------------------|
| all-MiniLM-L6-v2         | Local | 384        | $0 (compute only)  | High volume, tight budgets   |
| BGE-base-en-v1.5         | Local | 768        | $0 (compute only)  | Quality + cost balance       |
| text-embedding-3-small   | API   | 1,536      | $0.02              | General purpose production   |
| text-embedding-3-large   | API   | 3,072      | $0.13              | Premium quality requirements |
| embed-multilingual-v3.0  | API   | 1,024      | $0.10              | 100+ language support        |
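
To get a feel for the local vs. API trade-off, here is a minimal sketch of both paths (a sketch only; model IDs are the published Hugging Face and OpenAI names, and error handling is omitted):

from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["What is a vector database?"]

# Local path: zero API cost, 384 dimensions
local_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# API path: $0.02 per 1M tokens, 1,536 dimensions
client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
api_vecs = [item.embedding for item in resp.data]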

Chunking Strategies

Select chunking strategy based on content type and use case:

Content Type → Strategy Mapping:

  • Documentation: Recursive (heading-aware), 800 chars, 100 overlap
  • Code: Recursive (function-level), 1,000 chars, 100 overlap
  • Q&A/FAQ: Fixed-size, 500 chars, 50 overlap (precise retrieval)
  • Legal/Technical: Semantic (large), 1,500 chars, 200 overlap (context preservation)
  • Blog Posts: Semantic (paragraph), 1,000 chars, 100 overlap
  • Academic Papers: Recursive (section-aware), 1,200 chars, 150 overlap

For detailed chunking patterns, decision trees, and implementation guidance, see references/chunking-strategies.md.

Quick Start with CLI:

python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
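
The same chunking can be done in-process. A minimal sketch assuming LangChain's RecursiveCharacterTextSplitter (the bundled script may implement its own splitter):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # characters, matching the documentation preset above
    chunk_overlap=100,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],  # split at headings first
)
chunks = splitter.split_text(open("document.txt").read())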

Caching Implementation

Achieve 80-90% cost reduction through content-addressable caching.

Caching Architecture by Query Volume:

  • <10K queries/month: In-memory cache (Python lru_cache)
  • 10K-100K queries/month: Redis (fast, TTL-based expiration)
  • 100K-1M queries/month: Redis (hot) + PostgreSQL (warm)
  • >1M queries/month: Multi-tier (Redis + PostgreSQL + S3)

Production Caching with Redis:

# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
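
Content-addressable caching keys each vector by a hash of the model name plus the chunk text, so identical content is only ever embedded once. A minimal Redis sketch (function and key names here are illustrative, not the bundled script's API):

import hashlib
import json

import redis

r = redis.Redis()

def cached_embed(text, model, embed_fn, ttl=2_592_000):  # 30-day TTL
    key = "emb:" + hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)             # cache hit: no API call
    vector = embed_fn(text)                # cache miss: pay once
    r.setex(key, ttl, json.dumps(vector))
    return vector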

Caching ROI Example:

  • 50,000 document chunks (~500 tokens each, so ~25M tokens at $0.02 per 1M tokens)
  • 20% duplicate content, with repeat runs raising the effective hit rate to 60%
  • Without caching: $0.50 API cost
  • With caching (60% hit rate): $0.20 API cost
  • Savings: 60% ($0.30)

Dimensionality Trade-offs

Balance storage, search speed, and quality:

| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality   | Use Case               |
|------------|----------------------|--------------------|-----------|------------------------|
| 384        | 1.5 GB               | 10 ms              | Good      | Large-scale search     |
| 768        | 3 GB                 | 15 ms              | High      | General purpose RAG    |
| 1,536      | 6 GB                 | 25 ms              | Very High | High-quality retrieval |
| 3,072      | 12 GB                | 40 ms              | Highest   | Premium applications   |

Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
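
If you are on OpenAI's text-embedding-3 models, you can also trade quality for storage at request time via the dimensions parameter, which truncates the native vector to the requested size:

from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["example passage"],
    dimensions=768,  # down from the native 1,536
)
vec = resp.data[0].embedding  # len(vec) == 768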

Batch Processing Optimization

Maximize throughput for large-scale ingestion:

OpenAI API:

  • Batch up to 2,048 inputs per request
  • Implement rate limiting (tier-dependent: 500-5,000 RPM)
  • Use parallel requests with exponential backoff on rate limits (see the sketch below)
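
A minimal sketch of batched requests with exponential backoff, assuming the official openai Python client:

import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def embed_in_batches(texts, model="text-embedding-3-small", batch_size=2048):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        delay = 1.0
        while True:
            try:
                resp = client.embeddings.create(model=model, input=batch)
                vectors.extend(item.embedding for item in resp.data)
                break
            except RateLimitError:
                time.sleep(delay)           # back off, then retry the batch
                delay = min(delay * 2, 60)
    return vectors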

Local Models (sentence-transformers):

  • GPU acceleration (CUDA, MPS for Apple Silicon)
  • Batch size tuning (32-128 based on GPU memory; see the sketch after this list)
  • Multi-GPU support for maximum throughput
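
A sentence-transformers sketch with device selection and batch size tuning (single-GPU; a batch_size of 64 is a starting point, not a measured optimum):

import torch
from sentence_transformers import SentenceTransformer

texts = ["example chunk one", "example chunk two"]

# Prefer CUDA, then Apple Silicon (MPS), then CPU
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device=device)
embeddings = model.encode(
    texts,
    batch_size=64,               # tune 32-128 to fit GPU memory
    normalize_embeddings=True,   # unit vectors for cosine similarity
    show_progress_bar=True,
)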

Expected Throughput:

  • OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
  • Local GPU (RTX 3090): 5,000-10,000 texts/minute
  • Local CPU: 100-500 texts/minute

Performance Monitoring

Track key metrics for optimization:

Critical Metrics:

  • Latency: Embedding generation time (p50, p95, p99)
  • Throughput: Embeddings per second/minute
  • Cost: API usage tracking (USD per 1K/1M tokens)
  • Cache Efficiency: Hit rate percentage

For detailed monitoring setup, metric collection patterns, and dashboarding, see references/performance-monitoring.md.

Monitor with Wrapper:

from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002  # text-embedding-3-small: $0.02 per 1M tokens
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")

Working Examples

See examples/ directory for complete implementations:

Python Examples:

  • examples/openai_cached.py - OpenAI embeddings with Redis caching
  • examples/local_embedder.py - sentence-transformers local embedding
  • examples/smart_chunker.py - Content-aware recursive chunking
  • examples/performance_monitor.py - Pipeline performance tracking
  • examples/batch_processor.py - Large-scale document processing

All examples include:

  • Complete, runnable code
  • Dependency installation instructions
  • Error handling and retry logic
  • Configuration options

Integration Points

Downstream (This skill provides to):

  • Vector Databases: Embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
  • RAG Systems: Optimized embeddings for retrieval pipelines
  • Semantic Search: Query and document embeddings for similarity search

Upstream (This skill consumes from):

  • Document Processing: Chunk documents before embedding
  • Data Ingestion: Process documents from various sources

Related Skills:

  • For RAG architecture, see building-ai-chat skill
  • For vector database operations, see databases-vector skill
  • For data ingestion pipelines, see ingesting-data skill

Common Patterns

Pattern 1: RAG Pipeline

Document → Chunk → Embed → Store (vector DB) → Retrieve

Pattern 2: Semantic Search

Query → Embed → Search (vector DB) → Rank → Display

Pattern 3: Multi-Stage Retrieval (Cost Optimization)

Query → Cheap Embedding (384d) → Initial Search →
Expensive Embedding (1,536d) → Rerank Top-K → Return

Cost Savings: 70% reduction vs. single-stage with expensive embeddings
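
A sketch of the rerank stage, assuming you already have the candidate IDs from the cheap 384-dim search and have re-embedded the query and candidates at 1,536 dims (the names here are hypothetical):

import numpy as np

def rerank(query_vec, candidate_vecs, candidate_ids, top_k=5):
    # Cosine similarity in the expensive (1,536-dim) space
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return [(candidate_ids[i], float(scores[i])) for i in order]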

Quick Reference Checklist

Model Selection:

  • Identified data privacy requirements (local vs. API)
  • Calculated expected query volume
  • Determined quality requirements (good/high/highest)
  • Checked multilingual support needs

Chunking:

  • Analyzed content type (code, docs, legal, etc.)
  • Selected appropriate chunk size (500-1,500 chars)
  • Set overlap to prevent context loss (50-200 chars)
  • Validated chunks preserve semantic boundaries

Caching:

  • Implemented content-addressable hashing
  • Selected cache backend (Redis, PostgreSQL)
  • Set TTL based on content volatility
  • Set up cache hit rate monitoring (target: >60%)

Performance:

  • Tracking latency (embedding generation time)
  • Measuring throughput (embeddings/sec)
  • Monitoring costs (USD spent on API calls)
  • Optimizing batch sizes for maximum efficiency