Semantic Caching

Cache LLM responses by semantic similarity.

Redis 8 Note: Redis 8+ includes Search, JSON, TimeSeries, and Bloom modules built-in. No separate Redis Stack installation is required. Use redis:8 in Docker or any Redis 8+ deployment.

Cache Hierarchy

Request → L1 (Exact) → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms         ~10ms           ~2s          ~3s
         100% save    100% save       90% save    Full cost
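A back-of-envelope model shows how per-level hit rates translate into expected latency and cost savings. The latencies and savings fractions come from the diagram above; the hit rates are assumptions for illustration only:

```python
# Illustrative model of the hierarchy above. Hit rates are assumptions
# (conditional on a request reaching that level), not measured values.
LEVELS = [
    # (name, latency in seconds, fraction of LLM cost saved, conditional hit rate)
    ("L1 exact",    0.001, 1.00, 0.30),
    ("L2 semantic", 0.010, 1.00, 0.20),
    ("L3 prompt",   2.000, 0.90, 0.30),
    ("L4 LLM",      3.000, 0.00, 1.00),  # the LLM always answers
]

def expected_latency_and_savings(levels):
    """Each level only sees requests that missed every earlier level."""
    remaining, exp_latency, exp_savings = 1.0, 0.0, 0.0
    for _name, latency, saved, hit_rate in levels:
        p = remaining * hit_rate       # probability this level serves the request
        exp_latency += p * latency
        exp_savings += p * saved
        remaining -= p
    return exp_latency, exp_savings
```

With these assumed hit rates, roughly half the requests never reach the LLM, which is where the cost savings come from.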

Redis Semantic Cache

import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

# Assumed helpers, provided elsewhere:
#   embed_text(str) -> list[float], hash_content(str) -> str

class SemanticCacheService:
    def __init__(self, redis_url: str, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        # Attach to a pre-created index named "llm_cache"
        # (agent_type tag field + embedding vector field).
        self.index = SearchIndex.from_existing("llm_cache", redis_url=redis_url)
        self.threshold = threshold

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])

        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            return_fields=["response", "vector_distance"],
            filter_expression=f"@agent_type:{{{agent_type}}}",
            num_results=1,
        )

        results = self.index.query(query)

        if results:
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):
                return json.loads(results[0]["response"])

        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"

        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Hash fields must be bytes or strings; store the vector as float32 bytes.
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
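The service assumes the index already exists. A minimal schema for it, expressed as a plain dict in the shape redisvl's `IndexSchema.from_dict` / `SearchIndex.from_dict` accepts, might look like this. The 1536 dims match text-embedding-3-small and are an assumption; set them to your embedding model's output size:

```python
# Hypothetical schema dict for the "llm_cache" index; field names match
# the cache service code. Load it with redisvl's SearchIndex.from_dict.
CACHE_SCHEMA = {
    "index": {"name": "llm_cache", "prefix": "cache"},
    "fields": [
        {"name": "agent_type", "type": "tag"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 1536,            # assumption: text-embedding-3-small
                "algorithm": "hnsw",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
}
```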

Similarity Thresholds

Threshold    Distance     Use Case
0.98-1.00    0.00-0.02    Nearly identical
0.95-0.98    0.02-0.05    Very similar
0.92-0.95    0.05-0.08    Similar (default)
0.85-0.92    0.08-0.15    Moderately similar
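With cosine distance, a similarity threshold t maps to a distance cutoff of 1 − t. A tiny helper makes the hit test from the table explicit (it mirrors the check inside the cache service):

```python
def is_cache_hit(distance: float, threshold: float = 0.92) -> bool:
    """A result is a hit when cosine distance <= 1 - similarity threshold."""
    return distance <= (1.0 - threshold)

# Default threshold 0.92 -> distance cutoff 0.08:
#   distance 0.05 is a hit, distance 0.10 is a miss.
```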

Multi-Level Lookup

async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in caches
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response

    return response
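The `lru_cache` above is assumed to be a bounded mapping, not Python's `functools.lru_cache` decorator. A minimal L1 implementation on top of `OrderedDict` might look like:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal bounded L1 cache: evicts the least-recently-used entry."""

    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def __contains__(self, key) -> bool:
        return key in self._data

    def __getitem__(self, key):
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def __setitem__(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least-recently-used
```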

Redis 8.4+ Hybrid Search (FT.HYBRID)

Redis 8.4 introduces native hybrid search combining semantic (vector) and exact (keyword) matching in a single query. This is ideal for caches that need both similarity and metadata filtering.

# Redis 8.4 native hybrid search
result = redis.execute_command(
    "FT.HYBRID", "llm_cache",
    "SEARCH", f"@agent_type:{{{agent_type}}}",
    "VSIM", "@embedding", "$query_vec",
    "KNN", "2", "K", "5",
    "COMBINE", "RRF", "4", "CONSTANT", "60",
    "PARAMS", "2", "query_vec", embedding_bytes
)

Hybrid Search Benefits:

  • Single query for keyword + vector matching
  • RRF (Reciprocal Rank Fusion) combines scores intelligently
  • Better results than sequential filtering
  • BM25STD is now the default scorer for keyword matching

When to Use Hybrid:

  • Filtering by metadata (agent_type, tenant, category) + semantic similarity
  • Multi-tenant caches where exact tenant match is required
  • Combining keyword search with vector similarity
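RRF itself is simple: a document's fused score is the sum of 1/(k + rank) over every result list it appears in, where k = 60 matches the CONSTANT in the command above. A sketch with hypothetical document IDs:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by BOTH keyword and vector search beats a doc
# that places highly in only one list:
keyword = ["doc1", "doc3", "doc2"]
vector = ["doc1", "doc2", "doc4"]
```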

Key Decisions

Decision     Recommendation
Threshold    Start at 0.92, tune based on hit rate
TTL          24h for production
Embedding    text-embedding-3-small (fast)
L1 size      1,000-10,000 entries
Scorer       BM25STD (Redis 8+ default)
Hybrid       Use FT.HYBRID for metadata + vector queries

Common Mistakes

  • Threshold too low (false positives)
  • No cache warming (cold start)
  • Missing metadata filters
  • Not promoting L2 hits to L1
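Cache warming addresses the cold-start mistake above: seed the semantic cache with known high-traffic queries at deploy time so the first real users hit L2 instead of L4. A hypothetical sketch, assuming a cache and an LLM client with the interfaces used earlier, and a `top_queries` list mined from production logs:

```python
# Hypothetical warm-up routine; top_queries would come from production logs.
async def warm_cache(cache, llm, top_queries: list[tuple[str, str]]) -> int:
    """Pre-populate the semantic cache; returns how many entries were seeded."""
    warmed = 0
    for query, agent_type in top_queries:
        if await cache.get(query, agent_type) is None:  # skip already-cached
            response = await llm.generate(query)
            await cache.set(query, response, agent_type)
            warmed += 1
    return warmed
```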

Related Skills

  • prompt-caching - Provider-native caching
  • embeddings - Vector generation
  • cache-cost-tracking - Langfuse integration

Capability Details

redis-vector-cache

Keywords: redis, vector, embedding, similarity, cache
Solves:

  • Cache LLM responses by semantic similarity
  • Reduce API costs with smart caching
  • Implement multi-level cache hierarchy

similarity-threshold

Keywords: threshold, similarity, tuning, cosine
Solves:

  • Set appropriate similarity threshold
  • Balance hit rate vs accuracy
  • Tune cache performance

orchestkit-integration

Keywords: orchestkit, integration, roi, cost-savings
Solves:

  • Integrate caching with OrchestKit
  • Calculate ROI for caching
  • Production implementation guide

cache-service

Keywords: service, implementation, template, production
Solves:

  • Production cache service template
  • Complete implementation example
  • Redis integration code

hybrid-search

Keywords: hybrid, ft.hybrid, bm25, rrf, keyword, metadata, filter
Solves:

  • Combine semantic and keyword search
  • Filter cache by metadata with vector similarity
  • Use Redis 8.4 FT.HYBRID command
  • BM25STD scoring for keyword matching