Drift Detection

Monitor LLM quality degradation and input/output distribution shifts in production.

Overview

This skill covers:

  • Detecting input distribution drift (data drift)
  • Monitoring output quality degradation (concept drift)
  • Implementing statistical methods (PSI, KS, KL divergence)
  • Setting up dynamic thresholds with moving averages
  • Integrating Langfuse scores with drift analysis

Quick Reference

Population Stability Index (PSI)

import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
    # Use common bin edges derived from the expected (baseline) distribution so
    # the two histograms are comparable; open-ended outer bins catch values that
    # fall outside the baseline range.
    bin_edges = np.histogram_bin_edges(expected, bins=bins)
    bin_edges[0], bin_edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=bin_edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=bin_edges)[0] / len(actual)

    # Avoid log(0) and division by zero in empty bins
    expected_pct = np.clip(expected_pct, 1e-4, None)
    actual_pct = np.clip(actual_pct, 1e-4, None)

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)

# Usage (baseline_scores / current_scores: arrays of historical vs. recent values)
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
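For small samples, the two-sample Kolmogorov-Smirnov test mentioned in the overview avoids binning choices entirely by comparing empirical CDFs. A minimal sketch, assuming SciPy is available; `ks_drift` and its `alpha` default are illustrative, not a library API:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(expected: np.ndarray, actual: np.ndarray, alpha: float = 0.05) -> dict:
    """Two-sample KS test: max distance between empirical CDFs.

    No binning required, which makes it practical for small samples;
    drift is flagged when the "same distribution" hypothesis is rejected.
    """
    statistic, p_value = ks_2samp(expected, actual)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drift_detected": bool(p_value < alpha),
    }

# Usage with synthetic data: a 0.5-sigma mean shift
rng = np.random.default_rng(0)
result = ks_drift(rng.normal(0, 1, 500), rng.normal(0.5, 1, 500))
```

Note that KS p-values become overly sensitive on very large samples, which is one reason PSI is the steadier choice in production.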

EWMA Dynamic Threshold

import numpy as np

class EWMADriftDetector:
    """Exponentially weighted moving average (EWMA) control chart for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma

        # Calculate control limits
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor

        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl
        }
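A standalone usage sketch of the same update rule and control limits (the numbers are illustrative; the limits use the asymptotic EWMA standard deviation, baseline_std * sqrt(lambda / (2 - lambda)), exactly as in the class above):

```python
import numpy as np

lam, L = 0.2, 3.0
baseline_mean, baseline_std = 0.85, 0.05  # e.g. estimated from a 30-day score history
factor = np.sqrt(lam / (2 - lam))
ucl = baseline_mean + L * baseline_std * factor  # upper control limit (0.90 here)
lcl = baseline_mean - L * baseline_std * factor  # lower control limit (0.80 here)

ewma = None
drift_at = None
for day, score in enumerate([0.84, 0.86, 0.83, 0.74, 0.70, 0.66]):  # quality trending down
    # Same recursion as EWMADriftDetector.update()
    ewma = score if ewma is None else lam * score + (1 - lam) * ewma
    if not (lcl <= ewma <= ucl):
        drift_at = day  # smoothing delays the signal until day 4 in this series
        break
```

The smoothing factor trades responsiveness for noise rejection: a single bad day barely moves the EWMA, but a sustained decline walks it through the control limit.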

Langfuse Score Trend Monitoring

from datetime import datetime, timedelta

import numpy as np
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1):
    """Compare recent quality scores against baseline."""

    # Fetch recent scores (last 24h); fetch_scores() returns a response
    # object whose .data attribute holds the scores (Langfuse Python SDK v2)
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=1)
    ).data

    # Fetch baseline scores: the window from `days` ago up to yesterday
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=days),
        to_timestamp=datetime.now() - timedelta(days=1)
    ).data

    current_mean = np.mean([s.value for s in current_scores])
    baseline_mean = np.mean([s.value for s in baseline_scores])

    # Positive drift_pct means quality dropped relative to baseline
    drift_pct = (baseline_mean - current_mean) / baseline_mean

    return {"drift": drift_pct > threshold_drop, "drop_pct": drift_pct}

Key Decisions

  Decision            Recommendation
  ------------------  ---------------------------------------------------------
  Statistical method  PSI for production (stable); KS for small samples
  Threshold strategy  Dynamic (95th percentile of historical) over static
  Baseline window     7-30 day rolling window
  Alert priority      Performance metrics > distribution metrics
  Tool stack          Langfuse (traces) + Evidently/Phoenix (drift analysis)

PSI Threshold Guidelines

  PSI Value    Interpretation         Action
  -----------  ---------------------  ---------------
  < 0.1        No significant drift   Monitor
  0.1 - 0.25   Moderate drift         Investigate
  >= 0.25      Significant drift      Alert + action
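The thresholds above map directly onto a small helper (a convenience sketch, not part of any library):

```python
def classify_psi(psi: float) -> tuple[str, str]:
    """Map a PSI value to (interpretation, action) per the thresholds above."""
    if psi < 0.1:
        return ("no significant drift", "monitor")
    if psi < 0.25:
        return ("moderate drift", "investigate")
    return ("significant drift", "alert + action")

# Usage
classify_psi(0.18)  # → ("moderate drift", "investigate")
```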

Anti-Patterns

# ❌ NEVER use static thresholds without context
if psi > 0.2:  # May cause alert fatigue
    alert()

# ❌ NEVER retrain on time schedule alone
schedule.every(7).days.do(retrain)  # Wasteful if no drift

# ✅ ALWAYS use dynamic thresholds
threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()

# ✅ ALWAYS correlate with performance metrics
if psi > threshold and quality_score < baseline:
    trigger_evaluation()

Detailed Documentation

  Resource                                       Description
  ---------------------------------------------  ---------------------------------------------------
  references/statistical-methods.md              PSI, KS, KL divergence, Wasserstein comparison
  references/embedding-drift.md                  Arize Phoenix, cluster monitoring, semantic drift
  references/ewma-baselines.md                   Moving averages, dynamic thresholds, control charts
  references/langfuse-evidently-integration.md   Combined pipeline pattern
  checklists/drift-detection-setup-checklist.md  Implementation checklist

Related Skills

  • langfuse-observability - Score tracking for drift analysis
  • llm-evaluation - Quality metrics that feed drift detection
  • quality-gates - Threshold enforcement
  • observability-monitoring - General monitoring patterns

Capability Details

psi-drift

Keywords: psi, population stability index, distribution drift, histogram comparison

Solves:

  • Detect distribution shifts in LLM inputs/outputs
  • Production-grade drift monitoring
  • Stable drift metric for large datasets

embedding-drift

Keywords: embedding drift, semantic drift, cluster, centroid, arize phoenix

Solves:

  • Detect semantic changes in text data
  • Monitor RAG retrieval quality
  • Track embedding space shifts
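One of the simplest embedding-drift signals is centroid shift: how far the mean embedding of the current window has rotated away from the baseline. A minimal numpy sketch (illustrative only, not the Phoenix API; production tools also monitor per-cluster movement):

```python
import numpy as np

def centroid_cosine_drift(baseline_emb: np.ndarray, current_emb: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two windows.

    Near 0.0 means the centroid of the embedding cloud has not moved;
    larger values indicate the current traffic points in a new direction.
    """
    b = baseline_emb.mean(axis=0)
    c = current_emb.mean(axis=0)
    cos_sim = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return float(1.0 - cos_sim)

# Usage with synthetic clusters shifted along different axes
rng = np.random.default_rng(0)
base = rng.normal(0, 1, (200, 8)) + np.array([1.0] + [0.0] * 7)
shifted = rng.normal(0, 1, (200, 8)) + np.array([0.0] * 7 + [1.0])
```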

quality-regression

Keywords: quality drift, score degradation, trend, moving average

Solves:

  • Detect LLM quality degradation over time
  • Compare against historical baselines
  • Early warning for model issues

dynamic-thresholds

Keywords: ewma, dynamic threshold, adaptive, control chart

Solves:

  • Reduce alert fatigue with adaptive thresholds
  • Statistical process control for LLMs
  • Context-aware drift alerting

canary-monitoring

Keywords: canary prompt, fixed test, regression test, behavioral drift

Solves:

  • Track consistency with fixed test inputs
  • Detect behavioral changes in LLMs
  • Regression testing for model updates
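A canary check can be sketched without any framework: keep fixed prompts with stored reference answers and flag prompts whose current answer diverges too far. All names here are illustrative; the character-level similarity stands in for the embedding similarity you would use in production:

```python
from difflib import SequenceMatcher

# Fixed prompts with reference answers captured at baseline time
CANARIES = {
    "What is the capital of France?": "The capital of France is Paris.",
}

def check_canaries(generate, min_similarity: float = 0.8) -> list[str]:
    """Run each canary prompt through `generate` (prompt -> answer) and
    return the prompts whose answer drifted below the similarity floor."""
    drifted = []
    for prompt, reference in CANARIES.items():
        answer = generate(prompt)
        similarity = SequenceMatcher(None, reference, answer).ratio()
        if similarity < min_similarity:
            drifted.append(prompt)
    return drifted
```

Running the canaries on a schedule (or after every model/prompt update) gives a cheap early-warning signal that is independent of live traffic volume.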