
Silent Failure Detection

Detect when LLM agents fail silently: the agent appears to work while producing incorrect results.

Overview

This skill covers:

  • Detecting when agents skip expected tool calls
  • Identifying gibberish or degraded output quality
  • Monitoring for infinite loops and token consumption spikes
  • Setting up statistical baselines for anomaly detection
  • Alerting on non-error failures (service up but logic broken)

Quick Reference

Tool Skipping Detection

from langfuse import Langfuse

def check_tool_usage(trace_id: str, expected_tools: list[str]) -> dict:
    """
    Detect when agent skips expected tool calls.

    Based on Akamai's middleware bug: agents stopped using tools
    when hidden middleware injected unexpected instructions.
    """
    langfuse = Langfuse()
    # fetch_trace returns a response wrapper in the Langfuse v2 SDK;
    # the trace itself (with its observations) is on .data
    trace = langfuse.fetch_trace(trace_id)

    # Extract tool calls from the trace. Filtering on type == "tool"
    # assumes your instrumentation tags tool spans that way; adjust to
    # match how your framework records tool observations.
    actual_tools = [
        span.name for span in trace.data.observations
        if span.type == "tool"
    ]

    missing_tools = set(expected_tools) - set(actual_tools)

    if missing_tools:
        return {
            "alert": True,
            "type": "tool_skipping",
            "missing": list(missing_tools),
            "message": f"Agent skipped expected tools: {missing_tools}"
        }
    return {"alert": False}

Gibberish/Quality Detection

from langfuse.decorators import observe, langfuse_context

@observe(name="quality_check")
async def detect_gibberish(response: str) -> dict:
    """
    Detect low-quality or gibberish outputs using LLM-as-judge.
    """
    # Quick heuristics first
    if len(response) < 10:
        return {"alert": True, "type": "too_short"}

    words = response.split()
    if not words:
        # Whitespace-only response; avoid dividing by zero below
        return {"alert": True, "type": "too_short"}

    # A low unique-word ratio signals the model repeating itself
    if len(set(words)) / len(words) < 0.3:
        return {"alert": True, "type": "repetitive"}

    # LLM-as-judge for quality
    judge_prompt = f"""
    Rate this response quality (0-1):
    - 0: Gibberish, nonsensical, or completely wrong
    - 0.5: Partially correct but missing key information
    - 1: High quality, accurate, complete

    Response: {response[:1000]}

    Score (just the number):
    """

    # `llm` is a placeholder for your model client (one sketch below)
    score = await llm.generate(judge_prompt)
    try:
        score_value = float(score.strip())
    except ValueError:
        # A judge that cannot emit a number is itself a silent failure
        return {"alert": True, "type": "judge_parse_error", "raw": score}

    langfuse_context.score_current_observation(name="quality_check", value=score_value)

    if score_value < 0.5:
        return {"alert": True, "type": "low_quality", "score": score_value}
    return {"alert": False, "score": score_value}

Loop Detection

class LoopDetector:
    """Detect infinite loops and token consumption spikes."""

    def __init__(
        self,
        max_iterations: int = 10,
        token_spike_multiplier: float = 3.0,
        baseline_tokens: int = 2000
    ):
        self.max_iterations = max_iterations
        self.token_spike_multiplier = token_spike_multiplier
        self.baseline_tokens = baseline_tokens
        self.iteration_count = 0
        self.total_tokens = 0

    def check(self, tokens_used: int) -> dict:
        self.iteration_count += 1
        self.total_tokens += tokens_used

        # Check iteration count
        if self.iteration_count > self.max_iterations:
            return {
                "alert": True,
                "type": "max_iterations",
                "iterations": self.iteration_count,
                "message": f"Agent exceeded {self.max_iterations} iterations"
            }

        # Check token spike
        expected_tokens = self.baseline_tokens * self.iteration_count
        if self.total_tokens > expected_tokens * self.token_spike_multiplier:
            return {
                "alert": True,
                "type": "token_spike",
                "tokens": self.total_tokens,
                "expected": expected_tokens,
                "message": f"Token consumption spike: {self.total_tokens} vs expected {expected_tokens}"
            }

        return {"alert": False}

Statistical Baseline Anomaly Detection

import numpy as np

class BaselineAnomalyDetector:
    """Detect anomalies vs statistical baseline."""

    def __init__(self, window_size: int = 100, z_threshold: float = 3.0):
        self.window_size = window_size
        self.z_threshold = z_threshold
        self.history = []

    def add_observation(self, value: float) -> dict:
        self.history.append(value)
        if len(self.history) > self.window_size:
            self.history = self.history[-self.window_size:]

        if len(self.history) < 10:
            return {"alert": False, "reason": "insufficient_data"}

        mean = np.mean(self.history[:-1])
        std = np.std(self.history[:-1])

        if std == 0:
            return {"alert": False}

        z_score = abs(value - mean) / std

        if z_score > self.z_threshold:
            return {
                "alert": True,
                "type": "statistical_anomaly",
                "z_score": z_score,
                "value": value,
                "mean": mean,
                "std": std
            }
        return {"alert": False, "z_score": z_score}

Key Decisions

| Decision | Recommendation |
|---|---|
| Detection priority | Tool skipping > Gibberish > Loops > Anomalies |
| Quality check | LLM-as-judge with heuristic pre-filter |
| Loop threshold | 10 iterations or 3x baseline tokens |
| Anomaly threshold | Z-score > 3.0 (99.7% confidence) |
| Alert strategy | Alert on silent failure, not just errors |
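
Applied in code, the priority order means running the cheap structural checks first and escalating to the LLM judge only when they pass. A sketch using the helpers defined above; the expected_tools list and the length-as-proxy metric are illustrative. Loop detection runs inside the agent loop itself (see LoopDetector above), so it is not repeated here:

async def run_silent_failure_checks(
    trace_id: str,
    response: str,
    baseline: BaselineAnomalyDetector,
) -> dict:
    """Run detectors in the recommended priority order; return first alert."""
    # 1. Tool skipping: cheap structural check against the trace
    check = check_tool_usage(trace_id, expected_tools=["search", "calculate"])
    if check["alert"]:
        return check

    # 2. Gibberish/quality: heuristics first, then LLM-as-judge
    check = await detect_gibberish(response)
    if check["alert"]:
        return check

    # 3. Statistical anomaly: response length as a cheap proxy metric
    return baseline.add_observation(float(len(response)))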

Silent Failure Types

| Type | Detection Method | Alert Priority |
|---|---|---|
| Tool Skipping | Expected vs actual tool calls | Critical |
| Gibberish Output | LLM-as-judge + heuristics | High |
| Infinite Loop | Iteration count + token spike | Critical |
| Quality Degradation | Score < baseline | Medium |
| Latency Spike | p99 > threshold | Medium |

Anti-Patterns

# ❌ NEVER assume success if no error raised
result = await agent.run()
# Missing: quality check, tool usage check

# ❌ NEVER ignore abnormal patterns
if len(response) > 0:  # "Not empty" is not "correct"
    return response

# ✅ ALWAYS validate tool usage
expected_tools = ["search", "calculate"]
tool_check = check_tool_usage(trace_id, expected_tools)
if tool_check["alert"]:
    alert(tool_check)

# ✅ ALWAYS check output quality
quality = await detect_gibberish(response)
if quality["alert"]:
    fallback_to_human_review()

Detailed Documentation

| Resource | Description |
|---|---|
| references/tool-skipping-detection.md | Agent tool usage monitoring patterns |
| references/gibberish-detection.md | Output quality scoring, LLM-as-judge |
| references/loop-detection.md | Token spikes, retry patterns, circuit breakers |
| references/baseline-comparison.md | Statistical anomaly detection |
| checklists/silent-failure-setup-checklist.md | Implementation checklist |

Related Skills

  • langfuse-observability - Trace analysis for tool usage
  • quality-gates - Quality threshold enforcement
  • observability-monitoring - General alerting patterns
  • advanced-guardrails - LLM output safety checks

Capability Details

tool-skipping

Keywords: tool skip, missing tool, agent tools, expected behavior

Solves:

  • Detect when agents don't use expected tools
  • Monitor agent behavior consistency
  • Debug middleware interference (Akamai scenario)

gibberish-detection

Keywords: gibberish, nonsense, quality check, llm judge

Solves:

  • Detect low-quality LLM outputs
  • Identify repetitive or nonsensical responses
  • Quality gate for production outputs

loop-detection

Keywords: infinite loop, retry loop, token spike, stuck agent

Solves:

  • Detect agents stuck in loops
  • Monitor token consumption anomalies
  • Prevent runaway costs

baseline-anomaly

Keywords: anomaly, baseline, z-score, statistical, deviation

Solves:

  • Detect deviations from normal behavior
  • Statistical anomaly detection
  • Early warning for silent failures

latency-monitoring

Keywords: latency, slow, p99, degraded, performance

Solves:

  • Detect degraded but non-failing service
  • Monitor response time anomalies
  • SLO compliance for LLM calls (see the sketch below)
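
The quick reference has no latency example, so here is a minimal sketch: a rolling p99 check over recent call durations. The threshold and window sizes are assumptions to tune against your SLO:

import numpy as np

class LatencyMonitor:
    """Alert when rolling p99 latency exceeds a threshold."""

    def __init__(self, p99_threshold_s: float = 10.0, window_size: int = 200):
        # Both defaults are illustrative; tune to your SLO
        self.p99_threshold_s = p99_threshold_s
        self.window_size = window_size
        self.durations: list[float] = []

    def record(self, duration_s: float) -> dict:
        self.durations.append(duration_s)
        self.durations = self.durations[-self.window_size:]

        if len(self.durations) < 20:
            return {"alert": False, "reason": "insufficient_data"}

        p99 = float(np.percentile(self.durations, 99))
        if p99 > self.p99_threshold_s:
            return {
                "alert": True,
                "type": "latency_spike",
                "p99_s": p99,
                "threshold_s": self.p99_threshold_s,
            }
        return {"alert": False, "p99_s": p99}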