Drift Detection

Monitor LLM quality degradation and input/output distribution shifts in production.

Overview

This skill covers:

  • Detecting input distribution drift (data drift)
  • Monitoring output quality degradation (concept drift)
  • Implementing statistical methods (PSI, KS, KL divergence)
  • Setting up dynamic thresholds with moving averages
  • Integrating Langfuse scores with drift analysis

Quick Reference

Population Stability Index (PSI)

import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
    # Use common bin edges derived from the expected (baseline) distribution so
    # the two histograms are comparable; open-ended outer bins catch values that
    # fall outside the baseline range.
    bin_edges = np.histogram_bin_edges(expected, bins=bins)
    bin_edges[0], bin_edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=bin_edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=bin_edges)[0] / len(actual)

    # Avoid log(0) and division by zero in empty bins
    expected_pct = np.clip(expected_pct, 1e-4, None)
    actual_pct = np.clip(actual_pct, 1e-4, None)

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)

# Usage (baseline_scores / current_scores: arrays of historical vs. recent values)
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
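For small samples, the two-sample Kolmogorov-Smirnov test mentioned in the overview avoids binning choices entirely by comparing empirical CDFs. A minimal sketch, assuming SciPy is available; `ks_drift` and its `alpha` default are illustrative, not a library API:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(expected: np.ndarray, actual: np.ndarray, alpha: float = 0.05) -> dict:
    """Two-sample KS test: max distance between empirical CDFs.

    No binning required, which makes it practical for small samples;
    drift is flagged when the "same distribution" hypothesis is rejected.
    """
    statistic, p_value = ks_2samp(expected, actual)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drift_detected": bool(p_value < alpha),
    }

# Usage with synthetic data: a 0.5-sigma mean shift
rng = np.random.default_rng(0)
result = ks_drift(rng.normal(0, 1, 500), rng.normal(0.5, 1, 500))
```

Note that KS p-values become overly sensitive on very large samples, which is one reason PSI is the steadier choice in production.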

EWMA Dynamic Threshold

import numpy as np

class EWMADriftDetector:
    """Exponentially weighted moving average (EWMA) control chart for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma

        # Calculate control limits
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor

        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl
        }
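A standalone usage sketch of the same update rule and control limits (the numbers are illustrative; the limits use the asymptotic EWMA standard deviation, baseline_std * sqrt(lambda / (2 - lambda)), exactly as in the class above):

```python
import numpy as np

lam, L = 0.2, 3.0
baseline_mean, baseline_std = 0.85, 0.05  # e.g. estimated from a 30-day score history
factor = np.sqrt(lam / (2 - lam))
ucl = baseline_mean + L * baseline_std * factor  # upper control limit (0.90 here)
lcl = baseline_mean - L * baseline_std * factor  # lower control limit (0.80 here)

ewma = None
drift_at = None
for day, score in enumerate([0.84, 0.86, 0.83, 0.74, 0.70, 0.66]):  # quality trending down
    # Same recursion as EWMADriftDetector.update()
    ewma = score if ewma is None else lam * score + (1 - lam) * ewma
    if not (lcl <= ewma <= ucl):
        drift_at = day  # smoothing delays the signal until day 4 in this series
        break
```

The smoothing factor trades responsiveness for noise rejection: a single bad day barely moves the EWMA, but a sustained decline walks it through the control limit.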

Langfuse Score Trend Monitoring

from datetime import datetime, timedelta

import numpy as np
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1):
    """Compare recent quality scores against baseline."""

    # Fetch recent scores (last 24h); fetch_scores() returns a response
    # object whose .data attribute holds the scores (Langfuse Python SDK v2)
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=1)
    ).data

    # Fetch baseline scores: the window from `days` ago up to yesterday
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=days),
        to_timestamp=datetime.now() - timedelta(days=1)
    ).data

    current_mean = np.mean([s.value for s in current_scores])
    baseline_mean = np.mean([s.value for s in baseline_scores])

    # Positive drift_pct means quality dropped relative to baseline
    drift_pct = (baseline_mean - current_mean) / baseline_mean

    return {"drift": drift_pct > threshold_drop, "drop_pct": drift_pct}

Key Decisions

  Decision            Recommendation
  ------------------  ---------------------------------------------------------
  Statistical method  PSI for production (stable); KS for small samples
  Threshold strategy  Dynamic (95th percentile of historical) over static
  Baseline window     7-30 day rolling window
  Alert priority      Performance metrics > distribution metrics
  Tool stack          Langfuse (traces) + Evidently/Phoenix (drift analysis)

PSI Threshold Guidelines

  PSI Value    Interpretation         Action
  -----------  ---------------------  ---------------
  < 0.1        No significant drift   Monitor
  0.1 - 0.25   Moderate drift         Investigate
  >= 0.25      Significant drift      Alert + action
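The thresholds above map directly onto a small helper (a convenience sketch, not part of any library):

```python
def classify_psi(psi: float) -> tuple[str, str]:
    """Map a PSI value to (interpretation, action) per the thresholds above."""
    if psi < 0.1:
        return ("no significant drift", "monitor")
    if psi < 0.25:
        return ("moderate drift", "investigate")
    return ("significant drift", "alert + action")

# Usage
classify_psi(0.18)  # → ("moderate drift", "investigate")
```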

Anti-Patterns

# ❌ NEVER use static thresholds without context
if psi > 0.2:  # May cause alert fatigue
    alert()

# ❌ NEVER retrain on time schedule alone
schedule.every(7).days.do(retrain)  # Wasteful if no drift

# ✅ ALWAYS use dynamic thresholds
threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()

# ✅ ALWAYS correlate with performance metrics
if psi > threshold and quality_score < baseline:
    trigger_evaluation()

Detailed Documentation

  Resource                                       Description
  ---------------------------------------------  ---------------------------------------------------
  references/statistical-methods.md              PSI, KS, KL divergence, Wasserstein comparison
  references/embedding-drift.md                  Arize Phoenix, cluster monitoring, semantic drift
  references/ewma-baselines.md                   Moving averages, dynamic thresholds, control charts
  references/langfuse-evidently-integration.md   Combined pipeline pattern
  checklists/drift-detection-setup-checklist.md  Implementation checklist

Related Skills

  • langfuse-observability - Score tracking for drift analysis
  • llm-evaluation - Quality metrics that feed drift detection
  • quality-gates - Threshold enforcement
  • observability-monitoring - General monitoring patterns

Capability Details

psi-drift

Keywords: psi, population stability index, distribution drift, histogram comparison

Solves:

  • Detect distribution shifts in LLM inputs/outputs
  • Production-grade drift monitoring
  • Stable drift metric for large datasets

embedding-drift

Keywords: embedding drift, semantic drift, cluster, centroid, arize phoenix

Solves:

  • Detect semantic changes in text data
  • Monitor RAG retrieval quality
  • Track embedding space shifts
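One of the simplest embedding-drift signals is centroid shift: how far the mean embedding of the current window has rotated away from the baseline. A minimal numpy sketch (illustrative only, not the Phoenix API; production tools also monitor per-cluster movement):

```python
import numpy as np

def centroid_cosine_drift(baseline_emb: np.ndarray, current_emb: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two windows.

    Near 0.0 means the centroid of the embedding cloud has not moved;
    larger values indicate the current traffic points in a new direction.
    """
    b = baseline_emb.mean(axis=0)
    c = current_emb.mean(axis=0)
    cos_sim = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return float(1.0 - cos_sim)

# Usage with synthetic clusters shifted along different axes
rng = np.random.default_rng(0)
base = rng.normal(0, 1, (200, 8)) + np.array([1.0] + [0.0] * 7)
shifted = rng.normal(0, 1, (200, 8)) + np.array([0.0] * 7 + [1.0])
```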

quality-regression

Keywords: quality drift, score degradation, trend, moving average

Solves:

  • Detect LLM quality degradation over time
  • Compare against historical baselines
  • Early warning for model issues

dynamic-thresholds

Keywords: ewma, dynamic threshold, adaptive, control chart

Solves:

  • Reduce alert fatigue with adaptive thresholds
  • Statistical process control for LLMs
  • Context-aware drift alerting

canary-monitoring

Keywords: canary prompt, fixed test, regression test, behavioral drift

Solves:

  • Track consistency with fixed test inputs
  • Detect behavioral changes in LLMs
  • Regression testing for model updates
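A canary check can be sketched without any framework: keep fixed prompts with stored reference answers and flag prompts whose current answer diverges too far. All names here are illustrative; the character-level similarity stands in for the embedding similarity you would use in production:

```python
from difflib import SequenceMatcher

# Fixed prompts with reference answers captured at baseline time
CANARIES = {
    "What is the capital of France?": "The capital of France is Paris.",
}

def check_canaries(generate, min_similarity: float = 0.8) -> list[str]:
    """Run each canary prompt through `generate` (prompt -> answer) and
    return the prompts whose answer drifted below the similarity floor."""
    drifted = []
    for prompt, reference in CANARIES.items():
        answer = generate(prompt)
        similarity = SequenceMatcher(None, reference, answer).ratio()
        if similarity < min_similarity:
            drifted.append(prompt)
    return drifted
```

Running the canaries on a schedule (or after every model/prompt update) gives a cheap early-warning signal that is independent of live traffic volume.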