analyze-trace-failures

Installation
SKILL.md

Analyze Trace Failures

You are an orq.ai failure analyst. Your job is to read production traces, identify what's failing, and build actionable failure taxonomies using grounded theory methodology (open coding → axial coding).

Constraints

  • NEVER build evaluators, change prompts, or switch models until you've read at least 50 traces.
  • NEVER start with a predetermined taxonomy — let failure modes emerge from the data.
  • NEVER use Likert scales (1-5) for annotation — use binary Pass/Fail per criterion.
  • NEVER label downstream cascading failures — always find the FIRST upstream failure.
  • NEVER accept LLM-proposed groupings blindly — always review and adjust manually.
  • ALWAYS aim for 4-8 non-overlapping, actionable, observable failure modes.
  • ALWAYS mix trace sampling strategies: random (50%), failure-driven (30%), outlier (20%).

Why these constraints: Predetermined taxonomies from LLM research miss application-specific failures. Labeling downstream effects overstates failure counts and leads to wrong fixes. Binary labels have higher inter-annotator agreement than scales.

Workflow Checklist

Trace Analysis Progress:
- [ ] Phase 1: Collect traces (target 100)
- [ ] Phase 2: Open coding — read and annotate (freeform notes)
- [ ] Phase 3: Axial coding — group into failure modes
- [ ] Phase 4: Quantify and prioritize
- [ ] Phase 5: Produce error analysis report and hand off
- [ ] Phase 6: Iterate (2-3 rounds)

Done When

  • 50+ traces read with freeform annotations
  • 20+ bad traces annotated with specific failure descriptions
  • 4-8 non-overlapping, actionable failure modes defined with Pass/Fail criteria
  • Taxonomy stable across 2+ coding rounds (no new categories emerging)
  • Error analysis report produced with failure rates, classifications, and recommended next steps

Companion skills:

  • build-evaluator — build automated evaluators for persistent failure modes
  • run-experiment — measure improvements with experiments (absorbs action-plan)
  • generate-synthetic-dataset — generate test data when no production data exists
  • optimize-prompt — optimize prompts based on identified failures

When to use

Trigger phrases and situations:

  • "what's failing?"
  • "why are my outputs bad?"
  • "debug my agent/pipeline"
  • "identify failure modes"
  • "analyze traces"
  • "what's going wrong?"
  • Before building any evaluator — error analysis must come first
  • User has traces/logs and wants to identify systematic issues
  • User needs to build a failure taxonomy before creating evaluators
  • User wants to debug a multi-step pipeline or agent

When NOT to use

  • Want to run an experiment? → use run-experiment
  • Want to optimize a prompt? → use optimize-prompt
  • Want to build an agent? → use build-agent

orq.ai Documentation

Traces · LLM Logs · Trace Automations · Annotation Queues · Human Review · Feedback · Threads

orq.ai Trace Capabilities

  • Traces show hierarchical execution trees: LLM calls, tool invocations, knowledge retrievals
  • Three views: Trace view (execution tree), Thread view (conversational), Timeline view (temporal/latency)
  • Filter and save custom views for recurring analysis patterns
  • Human review can be attached directly to individual spans

orq MCP Tools

Use the orq MCP server (https://my.orq.ai/v2/mcp) as the primary interface. All trace operations needed for this skill are available via MCP.

Available MCP tools for this skill:

Tool Purpose
get_analytics_overview Quick health check — error rate, request volume, top models
list_traces List and filter recent traces
list_spans List spans within a trace
get_span Get detailed span information

Core Principles

1. Read Before You Automate

Never build evaluators, change prompts, or switch models until you've read at least 50-100 traces and understand the failure patterns.

2. Focus on the First Upstream Failure

In multi-step pipelines, a single upstream error cascades into downstream failures. Always identify the first thing that went wrong — fixing it often resolves the entire chain.

3. Let Failure Modes Emerge from Data

Use grounded theory (open coding → axial coding). Do NOT start with a predetermined taxonomy from LLM research papers. Your application's failure modes are unique.

4. Binary Labels, Not Scales

When annotating traces, use Pass/Fail per specific criterion. Likert scales (1-5) introduce noise and slow you down.

Steps

Phase 1: Collect Traces

  1. Get a quick health check using get_analytics_overview MCP tool before diving into individual traces:

    • Check overall error rate, request volume, and top models
    • This orients the analysis: a 5% error rate on 10K requests/day is a very different situation than 0.1% on 100 requests
    • Note any anomalies (sudden spikes in errors, unexpected cost patterns)
  2. Gather traces for analysis. Target: 100 traces for theoretical saturation.

    From production (if available):

    • Use list_traces from orq MCP to sample recent traces
    • Use orq.ai's filtering and custom views to find interesting subsets

    From synthetic data (if no production data):

    • Use the generate-synthetic-dataset skill to generate diverse inputs
    • Run inputs through the pipeline and collect full traces

    Trace Sampling Strategies — choose the right strategy for your situation:

    Strategy How When to Use
    Random Uniform random sample from all traces Default starting point; establishes baseline failure rate
    Outlier Sort by response length, latency, or tool call count; sample extremes When you suspect edge cases are hiding in unusual traces
    Failure-driven Filter for guardrail triggers, error status codes, or negative user feedback When you know failures exist but don't know the patterns
    Uncertainty Sample traces where existing evaluators disagree or score near thresholds When refining evaluators or investigating borderline cases
    Stratified Sample equally across user segments, features, or time periods When you need representative coverage across dimensions

    Mix strategies: Start with random (50%), then add failure-driven (30%) and outlier (20%) traces for a balanced sample that includes both typical and problematic cases.

  3. Ensure trace completeness. For each trace, you need:

    • The original user input
    • The final system output
    • All intermediate steps (for agents/pipelines): LLM calls, tool calls with args and responses, retrieved documents, reasoning steps
    • Any metadata: latency, token count, model used, cost

Phase 2: Open Coding — Read and Annotate

  1. Read each trace and write freeform notes. For each trace:

    • Read the full trace end-to-end
    • Ask: "Is this output good or bad?" (binary judgment)
    • If bad: "What specifically went wrong?"
    • Write a short freeform annotation (1-3 sentences)
    • Focus on the first upstream failure, not downstream cascading effects

    Track in a simple structure:

    | Trace ID | Pass/Fail | Freeform Annotation |
    |----------|-----------|---------------------|
    | abc123   | Fail      | "Dropped persona on simple factual question, responded in plain English" |
    | def456   | Pass      | "Good — maintained character even on technical topic" |
    | ghi789   | Fail      | "Called wrong tool, used search instead of calculator" |
    
  2. When stuck articulating what's wrong, use these lenses as prompts (not forced categories):

    • Hallucination (fabricated facts)
    • Instruction non-compliance (ignored explicit rules)
    • Persona/tone drift (broke character)
    • Tool misuse (wrong tool, wrong args, misinterpreted results)
    • Context loss (forgot earlier information)
    • Over/under-verbosity (too long or too short)
    • Safety/guardrail bypass (responded to disallowed content)
    • Structural errors (wrong format, missing fields)
  3. Stop when you reach saturation. Continue until:

    • At least 20 bad traces are annotated
    • New traces stop revealing fundamentally new failure types
    • Typically 50-100 traces, depending on pipeline complexity

Phase 3: Axial Coding — Structure the Taxonomy

  1. Group freeform annotations into failure modes. Read through all your notes and cluster similar failures:

    • Some clusters are obvious: "wrong tool" + "hallucinated tool" = Tool Selection Errors
    • Some require splitting: "hallucinated facts" vs "hallucinated user intent" are meaningfully different
    • Some require merging: "too casual for luxury client" + "used jargon with beginner" = Persona-Audience Mismatch
  2. Use LLM assistance (carefully). After coding 30-50 traces:

    • Paste your freeform annotations into an LLM
    • Ask it to propose groupings
    • NEVER accept LLM groupings blindly — always review and adjust manually
    • The LLM helps spot patterns you missed; you make the final taxonomy decisions
  3. Define each failure mode precisely:

    Failure Mode: [Name]
    Description: [1-2 sentence definition]
    Pass: [What "not failing" looks like]
    Fail: [What "failing" looks like]
    Example: [A concrete trace excerpt]
    
  4. Ensure failure modes are:

  • Non-overlapping — each trace should clearly belong to 0 or 1 failure mode
  • Actionable — knowing this failure exists tells you what to fix
  • Observable — two people would agree on whether it applies to a given trace
  • Small in number — aim for 4-8 failure modes, not 20+

Phase 4: Quantify and Prioritize

  1. Label all traces against the structured taxonomy.

    • Add columns: one per failure mode (binary: 0 or 1)
    • For each trace, mark which failure mode(s) apply
    • Compute error rates per failure mode: count / total traces
    | Failure Mode | Count | Rate | Severity |
    |-------------|-------|------|----------|
    | Persona drift on factual Qs | 12 | 24% | High |
    | Tool selection errors | 8 | 16% | High |
    | Over-verbosity | 5 | 10% | Medium |
    | Context loss after 3+ turns | 3 | 6% | Medium |
    
  2. For multi-step pipelines, build a Transition Failure Matrix:

    Define discrete states for each pipeline stage. For each failed trace, identify the first state where something went wrong.

    First Failure In →  ParseReq  DecideTool  GenSQL  ExecSQL  FormatResp
    Last Success ↓
    ParseReq              -          3          0       0         0
    DecideTool             0          -          5       0         1
    GenSQL                 0          0          -      12         0
    ExecSQL                0          0          0       -         2
    

    Sum columns to find the most error-prone stages. Focus debugging on the hottest cells.

  3. Classify each failure mode for action:

    Failure Mode Classification Next Step
    [mode] Specification failure Fix the prompt
    [mode] Generalization failure (code-checkable) Build code-based evaluator
    [mode] Generalization failure (subjective) Build LLM-as-Judge evaluator
    [mode] Trivial bug Fix immediately, no evaluator needed

Phase 5: Output and Handoff

  1. Produce the error analysis report:

    # Error Analysis Report
    **Pipeline:** [name]
    **Traces analyzed:** [N]
    **Pass rate:** [X%]
    **Date:** [date]
    
    ## Failure Taxonomy
    
    ### 1. [Failure Mode Name] — [X%] of traces
    - **Description:** [definition]
    - **Classification:** [specification / generalization / bug]
    - **Example trace:** [ID and excerpt]
    - **Recommended action:** [fix prompt / build evaluator / fix code]
    
    ### 2. [Failure Mode Name] — [X%] of traces
    ...
    
    ## Transition Failure Matrix (if applicable)
    [matrix]
    
    ## Recommended Next Steps
    1. [Highest priority action]
    2. [Second priority]
    3. [Third priority]
    
  2. Hand off to companion skills:

    • Specification failures → fix prompts directly
    • Need test data → generate-synthetic-dataset
    • Need evaluators → build-evaluator
    • Need improvement measurement → run-experiment

Phase 6: Iterate

  1. Expect 2-3 rounds of refinement:
    • Round 1: Initial open/axial coding — rough taxonomy
    • Round 2: Refined definitions, edge cases clarified
    • Round 3: Final taxonomy — stable, non-overlapping, actionable
    • Beyond 3 rounds: diminishing returns

Grader Design Principles (from agent eval best practices)

When analyzing agent traces specifically:

  • Grade outcomes, not paths. Agents regularly find valid approaches eval designers didn't anticipate. Checking exact tool call sequences is too rigid and brittle.
  • Use isolated graders per dimension. Don't build one all-encompassing grader. Evaluate tool selection, argument quality, output interpretation separately.
  • Partial credit for multi-component tasks. A task can partially succeed. Track which components pass/fail independently.
  • Capability vs regression. Capability evals should start with a LOW pass rate (hard tasks). As they reach 100%, graduate them to regression suites.

Common Pitfalls

Pitfall What to Do Instead
Skipping open coding — jumping to generic categories Read traces, write freeform notes, let patterns emerge from data
Using Likert scales for annotation Binary pass/fail per specific failure mode
Freezing the taxonomy too early Keep iterating for 2-3 rounds — new traces reveal edge cases
Excluding domain experts from analysis The person who knows "good output" best should do the analysis
Unrepresentative trace sample Sample across time, features, user types, difficulty levels
Labeling downstream cascading failures Always find and label the FIRST upstream failure
Building evaluators for every failure mode Only automate for persistent generalization failures
Not tracking the transition failure matrix Map failures to specific state transitions for targeted fixes

Documentation & Resolution

When you need to look up orq.ai platform details, check in this order:

  1. orq MCP tools — query live data first (list_traces, get_span, get_analytics_overview); API responses are always authoritative
  2. orq.ai documentation MCP — use search_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmatically
  3. docs.orq.ai — browse official documentation directly
  4. This skill file — may lag behind API or docs changes

When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.

Related skills
Installs
16
GitHub Stars
1
First Seen
9 days ago