Semantic Caching

Cache LLM responses by semantic similarity.

Redis 8 Note: Redis 8+ includes Search, JSON, TimeSeries, and Bloom modules built-in. No separate Redis Stack installation is required. Use redis:8 in Docker or any Redis 8+ deployment.

Cache Hierarchy

Request → L1 (Exact) → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms         ~10ms           ~2s          ~3s
         100% save    100% save       90% save    Full cost
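A back-of-envelope model shows how per-level hit rates translate into expected latency and cost savings. The latencies and savings fractions come from the diagram above; the hit rates are assumptions for illustration only:

```python
# Illustrative model of the hierarchy above. Hit rates are assumptions
# (conditional on a request reaching that level), not measured values.
LEVELS = [
    # (name, latency in seconds, fraction of LLM cost saved, conditional hit rate)
    ("L1 exact",    0.001, 1.00, 0.30),
    ("L2 semantic", 0.010, 1.00, 0.20),
    ("L3 prompt",   2.000, 0.90, 0.30),
    ("L4 LLM",      3.000, 0.00, 1.00),  # the LLM always answers
]

def expected_latency_and_savings(levels):
    """Each level only sees requests that missed every earlier level."""
    remaining, exp_latency, exp_savings = 1.0, 0.0, 0.0
    for _name, latency, saved, hit_rate in levels:
        p = remaining * hit_rate       # probability this level serves the request
        exp_latency += p * latency
        exp_savings += p * saved
        remaining -= p
    return exp_latency, exp_savings
```

With these assumed hit rates, roughly half the requests never reach the LLM, which is where the cost savings come from.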

Redis Semantic Cache

import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

# Assumed helpers, provided elsewhere:
#   embed_text(str) -> list[float], hash_content(str) -> str

class SemanticCacheService:
    def __init__(self, redis_url: str, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        # Attach to a pre-created index named "llm_cache"
        # (agent_type tag field + embedding vector field).
        self.index = SearchIndex.from_existing("llm_cache", redis_url=redis_url)
        self.threshold = threshold

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])

        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            return_fields=["response", "vector_distance"],
            filter_expression=f"@agent_type:{{{agent_type}}}",
            num_results=1,
        )

        results = self.index.query(query)

        if results:
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):
                return json.loads(results[0]["response"])

        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"

        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Hash fields must be bytes or strings; store the vector as float32 bytes.
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
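The service assumes the index already exists. A minimal schema for it, expressed as a plain dict in the shape redisvl's `IndexSchema.from_dict` / `SearchIndex.from_dict` accepts, might look like this. The 1536 dims match text-embedding-3-small and are an assumption; set them to your embedding model's output size:

```python
# Hypothetical schema dict for the "llm_cache" index; field names match
# the cache service code. Load it with redisvl's SearchIndex.from_dict.
CACHE_SCHEMA = {
    "index": {"name": "llm_cache", "prefix": "cache"},
    "fields": [
        {"name": "agent_type", "type": "tag"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 1536,            # assumption: text-embedding-3-small
                "algorithm": "hnsw",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
}
```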

Similarity Thresholds

Threshold    Distance     Use Case
0.98-1.00    0.00-0.02    Nearly identical
0.95-0.98    0.02-0.05    Very similar
0.92-0.95    0.05-0.08    Similar (default)
0.85-0.92    0.08-0.15    Moderately similar
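With cosine distance, a similarity threshold t maps to a distance cutoff of 1 − t. A tiny helper makes the hit test from the table explicit (it mirrors the check inside the cache service):

```python
def is_cache_hit(distance: float, threshold: float = 0.92) -> bool:
    """A result is a hit when cosine distance <= 1 - similarity threshold."""
    return distance <= (1.0 - threshold)

# Default threshold 0.92 -> distance cutoff 0.08:
#   distance 0.05 is a hit, distance 0.10 is a miss.
```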

Multi-Level Lookup

async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in caches
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response

    return response
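The `lru_cache` above is assumed to be a bounded mapping, not Python's `functools.lru_cache` decorator. A minimal L1 implementation on top of `OrderedDict` might look like:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal bounded L1 cache: evicts the least-recently-used entry."""

    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def __contains__(self, key) -> bool:
        return key in self._data

    def __getitem__(self, key):
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def __setitem__(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least-recently-used
```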

Redis 8.4+ Hybrid Search (FT.HYBRID)

Redis 8.4 introduces native hybrid search combining semantic (vector) and exact (keyword) matching in a single query. This is ideal for caches that need both similarity and metadata filtering.

# Redis 8.4 native hybrid search
result = redis.execute_command(
    "FT.HYBRID", "llm_cache",
    "SEARCH", f"@agent_type:{{{agent_type}}}",
    "VSIM", "@embedding", "$query_vec",
    "KNN", "2", "K", "5",
    "COMBINE", "RRF", "4", "CONSTANT", "60",
    "PARAMS", "2", "query_vec", embedding_bytes
)

Hybrid Search Benefits:

  • Single query for keyword + vector matching
  • RRF (Reciprocal Rank Fusion) combines scores intelligently
  • Better results than sequential filtering
  • BM25STD is now the default scorer for keyword matching

When to Use Hybrid:

  • Filtering by metadata (agent_type, tenant, category) + semantic similarity
  • Multi-tenant caches where exact tenant match is required
  • Combining keyword search with vector similarity
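RRF itself is simple: a document's fused score is the sum of 1/(k + rank) over every result list it appears in, where k = 60 matches the CONSTANT in the command above. A sketch with hypothetical document IDs:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by BOTH keyword and vector search beats a doc
# that places highly in only one list:
keyword = ["doc1", "doc3", "doc2"]
vector = ["doc1", "doc2", "doc4"]
```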

Key Decisions

Decision     Recommendation
Threshold    Start at 0.92, tune based on hit rate
TTL          24h for production
Embedding    text-embedding-3-small (fast)
L1 size      1,000-10,000 entries
Scorer       BM25STD (Redis 8+ default)
Hybrid       Use FT.HYBRID for metadata + vector queries

Common Mistakes

  • Threshold too low (false positives)
  • No cache warming (cold start)
  • Missing metadata filters
  • Not promoting L2 hits to L1
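Cache warming addresses the cold-start mistake above: seed the semantic cache with known high-traffic queries at deploy time so the first real users hit L2 instead of L4. A hypothetical sketch, assuming a cache and an LLM client with the interfaces used earlier, and a `top_queries` list mined from production logs:

```python
# Hypothetical warm-up routine; top_queries would come from production logs.
async def warm_cache(cache, llm, top_queries: list[tuple[str, str]]) -> int:
    """Pre-populate the semantic cache; returns how many entries were seeded."""
    warmed = 0
    for query, agent_type in top_queries:
        if await cache.get(query, agent_type) is None:  # skip already-cached
            response = await llm.generate(query)
            await cache.set(query, response, agent_type)
            warmed += 1
    return warmed
```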

Related Skills

  • prompt-caching - Provider-native caching
  • embeddings - Vector generation
  • cache-cost-tracking - Langfuse integration

Capability Details

redis-vector-cache

Keywords: redis, vector, embedding, similarity, cache
Solves:

  • Cache LLM responses by semantic similarity
  • Reduce API costs with smart caching
  • Implement multi-level cache hierarchy

similarity-threshold

Keywords: threshold, similarity, tuning, cosine
Solves:

  • Set appropriate similarity threshold
  • Balance hit rate vs accuracy
  • Tune cache performance

orchestkit-integration

Keywords: orchestkit, integration, roi, cost-savings
Solves:

  • Integrate caching with OrchestKit
  • Calculate ROI for caching
  • Production implementation guide

cache-service

Keywords: service, implementation, template, production
Solves:

  • Production cache service template
  • Complete implementation example
  • Redis integration code

hybrid-search

Keywords: hybrid, ft.hybrid, bm25, rrf, keyword, metadata, filter
Solves:

  • Combine semantic and keyword search
  • Filter cache by metadata with vector similarity
  • Use Redis 8.4 FT.HYBRID command
  • BM25STD scoring for keyword matching