
Silent Failure Detection

Detect when LLM agents fail silently: the agent appears to work while producing incorrect results.

Overview

This skill covers:

  • Detecting when agents skip expected tool calls
  • Identifying gibberish or degraded output quality
  • Monitoring for infinite loops and token consumption spikes
  • Setting up statistical baselines for anomaly detection
  • Alerting on non-error failures (service up but logic broken)

Quick Reference

Tool Skipping Detection

from langfuse import Langfuse

def check_tool_usage(trace_id: str, expected_tools: list[str]) -> dict:
    """
    Detect when agent skips expected tool calls.

    Based on Akamai's middleware bug: agents stopped using tools
    when hidden middleware injected unexpected instructions.
    """
    langfuse = Langfuse()
    # fetch_trace returns a response wrapper in the Langfuse v2 SDK;
    # the trace itself (with its observations) is on .data
    trace = langfuse.fetch_trace(trace_id)

    # Extract tool calls from the trace. Filtering on type == "tool"
    # assumes your instrumentation tags tool spans that way; adjust to
    # match how your framework records tool observations.
    actual_tools = [
        span.name for span in trace.data.observations
        if span.type == "tool"
    ]

    missing_tools = set(expected_tools) - set(actual_tools)

    if missing_tools:
        return {
            "alert": True,
            "type": "tool_skipping",
            "missing": list(missing_tools),
            "message": f"Agent skipped expected tools: {missing_tools}"
        }
    return {"alert": False}

Gibberish/Quality Detection

from langfuse.decorators import observe, langfuse_context

@observe(name="quality_check")
async def detect_gibberish(response: str) -> dict:
    """
    Detect low-quality or gibberish outputs using LLM-as-judge.
    """
    # Quick heuristics first
    if len(response) < 10:
        return {"alert": True, "type": "too_short"}

    words = response.split()
    if not words:
        # Whitespace-only response; avoid dividing by zero below
        return {"alert": True, "type": "too_short"}

    # A low unique-word ratio signals the model repeating itself
    if len(set(words)) / len(words) < 0.3:
        return {"alert": True, "type": "repetitive"}

    # LLM-as-judge for quality
    judge_prompt = f"""
    Rate this response quality (0-1):
    - 0: Gibberish, nonsensical, or completely wrong
    - 0.5: Partially correct but missing key information
    - 1: High quality, accurate, complete

    Response: {response[:1000]}

    Score (just the number):
    """

    # `llm` is a placeholder for your model client (one sketch below)
    score = await llm.generate(judge_prompt)
    try:
        score_value = float(score.strip())
    except ValueError:
        # A judge that cannot emit a number is itself a silent failure
        return {"alert": True, "type": "judge_parse_error", "raw": score}

    langfuse_context.score_current_observation(name="quality_check", value=score_value)

    if score_value < 0.5:
        return {"alert": True, "type": "low_quality", "score": score_value}
    return {"alert": False, "score": score_value}

Loop Detection

class LoopDetector:
    """Detect infinite loops and token consumption spikes."""

    def __init__(
        self,
        max_iterations: int = 10,
        token_spike_multiplier: float = 3.0,
        baseline_tokens: int = 2000
    ):
        self.max_iterations = max_iterations
        self.token_spike_multiplier = token_spike_multiplier
        self.baseline_tokens = baseline_tokens
        self.iteration_count = 0
        self.total_tokens = 0

    def check(self, tokens_used: int) -> dict:
        self.iteration_count += 1
        self.total_tokens += tokens_used

        # Check iteration count
        if self.iteration_count > self.max_iterations:
            return {
                "alert": True,
                "type": "max_iterations",
                "iterations": self.iteration_count,
                "message": f"Agent exceeded {self.max_iterations} iterations"
            }

        # Check token spike
        expected_tokens = self.baseline_tokens * self.iteration_count
        if self.total_tokens > expected_tokens * self.token_spike_multiplier:
            return {
                "alert": True,
                "type": "token_spike",
                "tokens": self.total_tokens,
                "expected": expected_tokens,
                "message": f"Token consumption spike: {self.total_tokens} vs expected {expected_tokens}"
            }

        return {"alert": False}

Statistical Baseline Anomaly Detection

import numpy as np

class BaselineAnomalyDetector:
    """Detect anomalies vs statistical baseline."""

    def __init__(self, window_size: int = 100, z_threshold: float = 3.0):
        self.window_size = window_size
        self.z_threshold = z_threshold
        self.history = []

    def add_observation(self, value: float) -> dict:
        self.history.append(value)
        if len(self.history) > self.window_size:
            self.history = self.history[-self.window_size:]

        if len(self.history) < 10:
            return {"alert": False, "reason": "insufficient_data"}

        mean = np.mean(self.history[:-1])
        std = np.std(self.history[:-1])

        if std == 0:
            return {"alert": False}

        z_score = abs(value - mean) / std

        if z_score > self.z_threshold:
            return {
                "alert": True,
                "type": "statistical_anomaly",
                "z_score": z_score,
                "value": value,
                "mean": mean,
                "std": std
            }
        return {"alert": False, "z_score": z_score}

Key Decisions

| Decision | Recommendation |
|---|---|
| Detection priority | Tool skipping > Gibberish > Loops > Anomalies |
| Quality check | LLM-as-judge with heuristic pre-filter |
| Loop threshold | 10 iterations or 3x baseline tokens |
| Anomaly threshold | Z-score > 3.0 (99.7% confidence) |
| Alert strategy | Alert on silent failure, not just errors |
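
Applied in code, the priority order means running the cheap structural checks first and escalating to the LLM judge only when they pass. A sketch using the helpers defined above; the expected_tools list and the length-as-proxy metric are illustrative. Loop detection runs inside the agent loop itself (see LoopDetector above), so it is not repeated here:

async def run_silent_failure_checks(
    trace_id: str,
    response: str,
    baseline: BaselineAnomalyDetector,
) -> dict:
    """Run detectors in the recommended priority order; return first alert."""
    # 1. Tool skipping: cheap structural check against the trace
    check = check_tool_usage(trace_id, expected_tools=["search", "calculate"])
    if check["alert"]:
        return check

    # 2. Gibberish/quality: heuristics first, then LLM-as-judge
    check = await detect_gibberish(response)
    if check["alert"]:
        return check

    # 3. Statistical anomaly: response length as a cheap proxy metric
    return baseline.add_observation(float(len(response)))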

Silent Failure Types

| Type | Detection Method | Alert Priority |
|---|---|---|
| Tool Skipping | Expected vs actual tool calls | Critical |
| Gibberish Output | LLM-as-judge + heuristics | High |
| Infinite Loop | Iteration count + token spike | Critical |
| Quality Degradation | Score < baseline | Medium |
| Latency Spike | p99 > threshold | Medium |

Anti-Patterns

# ❌ NEVER assume success if no error raised
result = await agent.run()
# Missing: quality check, tool usage check

# ❌ NEVER ignore abnormal patterns
if len(response) > 0:  # "Not empty" is not "correct"
    return response

# ✅ ALWAYS validate tool usage
expected_tools = ["search", "calculate"]
tool_check = check_tool_usage(trace_id, expected_tools)
if tool_check["alert"]:
    alert(tool_check)

# ✅ ALWAYS check output quality
quality = await detect_gibberish(response)
if quality["alert"]:
    fallback_to_human_review()

Detailed Documentation

| Resource | Description |
|---|---|
| references/tool-skipping-detection.md | Agent tool usage monitoring patterns |
| references/gibberish-detection.md | Output quality scoring, LLM-as-judge |
| references/loop-detection.md | Token spikes, retry patterns, circuit breakers |
| references/baseline-comparison.md | Statistical anomaly detection |
| checklists/silent-failure-setup-checklist.md | Implementation checklist |

Related Skills

  • langfuse-observability - Trace analysis for tool usage
  • quality-gates - Quality threshold enforcement
  • observability-monitoring - General alerting patterns
  • advanced-guardrails - LLM output safety checks

Capability Details

tool-skipping

Keywords: tool skip, missing tool, agent tools, expected behavior

Solves:

  • Detect when agents don't use expected tools
  • Monitor agent behavior consistency
  • Debug middleware interference (Akamai scenario)

gibberish-detection

Keywords: gibberish, nonsense, quality check, llm judge

Solves:

  • Detect low-quality LLM outputs
  • Identify repetitive or nonsensical responses
  • Quality gate for production outputs

loop-detection

Keywords: infinite loop, retry loop, token spike, stuck agent

Solves:

  • Detect agents stuck in loops
  • Monitor token consumption anomalies
  • Prevent runaway costs

baseline-anomaly

Keywords: anomaly, baseline, z-score, statistical, deviation

Solves:

  • Detect deviations from normal behavior
  • Statistical anomaly detection
  • Early warning for silent failures

latency-monitoring

Keywords: latency, slow, p99, degraded, performance

Solves:

  • Detect degraded but non-failing service
  • Monitor response time anomalies
  • SLO compliance for LLM calls (see the sketch below)
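
The quick reference has no latency example, so here is a minimal sketch: a rolling p99 check over recent call durations. The threshold and window sizes are assumptions to tune against your SLO:

import numpy as np

class LatencyMonitor:
    """Alert when rolling p99 latency exceeds a threshold."""

    def __init__(self, p99_threshold_s: float = 10.0, window_size: int = 200):
        # Both defaults are illustrative; tune to your SLO
        self.p99_threshold_s = p99_threshold_s
        self.window_size = window_size
        self.durations: list[float] = []

    def record(self, duration_s: float) -> dict:
        self.durations.append(duration_s)
        self.durations = self.durations[-self.window_size:]

        if len(self.durations) < 20:
            return {"alert": False, "reason": "insufficient_data"}

        p99 = float(np.percentile(self.durations, 99))
        if p99 > self.p99_threshold_s:
            return {
                "alert": True,
                "type": "latency_spike",
                "p99_s": p99,
                "threshold_s": self.p99_threshold_s,
            }
        return {"alert": False, "p99_s": p99}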