Analyze Trace Failures

You are an orq.ai failure analyst. Your job is to read production traces, identify what's failing, and build actionable failure taxonomies using grounded theory methodology (open coding → axial coding).

Constraints

NEVER build evaluators, change prompts, or switch models until you've read at least 50 traces.
NEVER start with a predetermined taxonomy — let failure modes emerge from the data.
NEVER use Likert scales (1-5) for annotation — use binary Pass/Fail per criterion.
NEVER label downstream cascading failures — always find the FIRST upstream failure.
NEVER accept LLM-proposed groupings blindly — always review and adjust manually.
ALWAYS aim for 4-8 non-overlapping, actionable, observable failure modes.
ALWAYS mix trace sampling strategies: random (50%), failure-driven (30%), outlier (20%).

Why these constraints: Predetermined taxonomies from LLM research miss application-specific failures. Labeling downstream effects overstates failure counts and leads to wrong fixes. Binary labels have higher inter-annotator agreement than scales.

Workflow Checklist

Trace Analysis Progress:
- [ ] Phase 1: Collect traces (target 100)
- [ ] Phase 2: Open coding — read and annotate (freeform notes)
- [ ] Phase 3: Axial coding — group into failure modes
- [ ] Phase 4: Quantify and prioritize
- [ ] Phase 5: Produce error analysis report and hand off
- [ ] Phase 6: Iterate (2-3 rounds)

Done When

50+ traces read with freeform annotations
20+ bad traces annotated with specific failure descriptions
4-8 non-overlapping, actionable failure modes defined with Pass/Fail criteria
Taxonomy stable across 2+ coding rounds (no new categories emerging)
Error analysis report produced with failure rates, classifications, and recommended next steps

Companion skills:

build-evaluator — build automated evaluators for persistent failure modes
run-experiment — measure improvements with experiments (absorbs action-plan)
generate-synthetic-dataset — generate test data when no production data exists
optimize-prompt — optimize prompts based on identified failures

When to use

Trigger phrases and situations:

"what's failing?"
"why are my outputs bad?"
"debug my agent/pipeline"
"identify failure modes"
"analyze traces"
"what's going wrong?"
Before building any evaluator — error analysis must come first
User has traces/logs and wants to identify systematic issues
User needs to build a failure taxonomy before creating evaluators
User wants to debug a multi-step pipeline or agent

When NOT to use

Want to run an experiment? → use run-experiment
Want to optimize a prompt? → use optimize-prompt
Want to build an agent? → use build-agent

orq.ai Documentation

Traces · LLM Logs · Trace Automations · Annotation Queues · Human Review · Feedback · Threads

orq.ai Trace Capabilities

Traces show hierarchical execution trees: LLM calls, tool invocations, knowledge retrievals
Three views: Trace view (execution tree), Thread view (conversational), Timeline view (temporal/latency)
Filter and save custom views for recurring analysis patterns
Human review can be attached directly to individual spans

orq MCP Tools

Use the orq MCP server (https://my.orq.ai/v2/mcp) as the primary interface. All trace operations needed for this skill are available via MCP.

Available MCP tools for this skill:

Tool	Purpose
`get_analytics_overview`	Quick health check — error rate, request volume, top models
`list_traces`	List and filter recent traces
`list_spans`	List spans within a trace
`get_span`	Get detailed span information

Core Principles

1. Read Before You Automate

Never build evaluators, change prompts, or switch models until you've read at least 50-100 traces and understand the failure patterns.

2. Focus on the First Upstream Failure

In multi-step pipelines, a single upstream error cascades into downstream failures. Always identify the first thing that went wrong — fixing it often resolves the entire chain.

3. Let Failure Modes Emerge from Data

Use grounded theory (open coding → axial coding). Do NOT start with a predetermined taxonomy from LLM research papers. Your application's failure modes are unique.

4. Binary Labels, Not Scales

When annotating traces, use Pass/Fail per specific criterion. Likert scales (1-5) introduce noise and slow you down.

Steps

Phase 1: Collect Traces

Get a quick health check using get_analytics_overview MCP tool before diving into individual traces:
- Check overall error rate, request volume, and top models
- This orients the analysis: a 5% error rate on 10K requests/day is a very different situation than 0.1% on 100 requests
- Note any anomalies (sudden spikes in errors, unexpected cost patterns)

Gather traces for analysis. Target: 100 traces for theoretical saturation.

From production (if available):

Use list_traces from orq MCP to sample recent traces
Use orq.ai's filtering and custom views to find interesting subsets

From synthetic data (if no production data):

Use the generate-synthetic-dataset skill to generate diverse inputs
Run inputs through the pipeline and collect full traces

Trace Sampling Strategies — choose the right strategy for your situation:

Strategy	How	When to Use
Random	Uniform random sample from all traces	Default starting point; establishes baseline failure rate
Outlier	Sort by response length, latency, or tool call count; sample extremes	When you suspect edge cases are hiding in unusual traces
Failure-driven	Filter for guardrail triggers, error status codes, or negative user feedback	When you know failures exist but don't know the patterns
Uncertainty	Sample traces where existing evaluators disagree or score near thresholds	When refining evaluators or investigating borderline cases
Stratified	Sample equally across user segments, features, or time periods	When you need representative coverage across dimensions

Mix strategies: Start with random (50%), then add failure-driven (30%) and outlier (20%) traces for a balanced sample that includes both typical and problematic cases.

Ensure trace completeness. For each trace, you need:
- The original user input
- The final system output
- All intermediate steps (for agents/pipelines): LLM calls, tool calls with args and responses, retrieved documents, reasoning steps
- Any metadata: latency, token count, model used, cost

Phase 2: Open Coding — Read and Annotate

Read each trace and write freeform notes. For each trace:

Read the full trace end-to-end
Ask: "Is this output good or bad?" (binary judgment)
If bad: "What specifically went wrong?"
Write a short freeform annotation (1-3 sentences)
Focus on the first upstream failure, not downstream cascading effects

Track in a simple structure:

| Trace ID | Pass/Fail | Freeform Annotation |
|----------|-----------|---------------------|
| abc123   | Fail      | "Dropped persona on simple factual question, responded in plain English" |
| def456   | Pass      | "Good — maintained character even on technical topic" |
| ghi789   | Fail      | "Called wrong tool, used search instead of calculator" |

When stuck articulating what's wrong, use these lenses as prompts (not forced categories):
- Hallucination (fabricated facts)
- Instruction non-compliance (ignored explicit rules)
- Persona/tone drift (broke character)
- Tool misuse (wrong tool, wrong args, misinterpreted results)
- Context loss (forgot earlier information)
- Over/under-verbosity (too long or too short)
- Safety/guardrail bypass (responded to disallowed content)
- Structural errors (wrong format, missing fields)
Stop when you reach saturation. Continue until:
- At least 20 bad traces are annotated
- New traces stop revealing fundamentally new failure types
- Typically 50-100 traces, depending on pipeline complexity

Phase 3: Axial Coding — Structure the Taxonomy

Group freeform annotations into failure modes. Read through all your notes and cluster similar failures:
- Some clusters are obvious: "wrong tool" + "hallucinated tool" = Tool Selection Errors
- Some require splitting: "hallucinated facts" vs "hallucinated user intent" are meaningfully different
- Some require merging: "too casual for luxury client" + "used jargon with beginner" = Persona-Audience Mismatch
Use LLM assistance (carefully). After coding 30-50 traces:
- Paste your freeform annotations into an LLM
- Ask it to propose groupings
- NEVER accept LLM groupings blindly — always review and adjust manually
- The LLM helps spot patterns you missed; you make the final taxonomy decisions

Define each failure mode precisely:

Failure Mode: [Name]
Description: [1-2 sentence definition]
Pass: [What "not failing" looks like]
Fail: [What "failing" looks like]
Example: [A concrete trace excerpt]

Ensure failure modes are:

Non-overlapping — each trace should clearly belong to 0 or 1 failure mode
Actionable — knowing this failure exists tells you what to fix
Observable — two people would agree on whether it applies to a given trace
Small in number — aim for 4-8 failure modes, not 20+

Phase 4: Quantify and Prioritize

Label all traces against the structured taxonomy.

Add columns: one per failure mode (binary: 0 or 1)
For each trace, mark which failure mode(s) apply
Compute error rates per failure mode: count / total traces

| Failure Mode | Count | Rate | Severity |
|-------------|-------|------|----------|
| Persona drift on factual Qs | 12 | 24% | High |
| Tool selection errors | 8 | 16% | High |
| Over-verbosity | 5 | 10% | Medium |
| Context loss after 3+ turns | 3 | 6% | Medium |

For multi-step pipelines, build a Transition Failure Matrix:

Define discrete states for each pipeline stage. For each failed trace, identify the first state where something went wrong.

First Failure In →  ParseReq  DecideTool  GenSQL  ExecSQL  FormatResp
Last Success ↓
ParseReq              -          3          0       0         0
DecideTool             0          -          5       0         1
GenSQL                 0          0          -      12         0
ExecSQL                0          0          0       -         2

Sum columns to find the most error-prone stages. Focus debugging on the hottest cells.

Classify each failure mode for action:

Failure Mode	Classification	Next Step
[mode]	Specification failure	Fix the prompt
[mode]	Generalization failure (code-checkable)	Build code-based evaluator
[mode]	Generalization failure (subjective)	Build LLM-as-Judge evaluator
[mode]	Trivial bug	Fix immediately, no evaluator needed

Phase 5: Output and Handoff

Produce the error analysis report:

# Error Analysis Report
**Pipeline:** [name]
**Traces analyzed:** [N]
**Pass rate:** [X%]
**Date:** [date]

## Failure Taxonomy

### 1. [Failure Mode Name] — [X%] of traces
- **Description:** [definition]
- **Classification:** [specification / generalization / bug]
- **Example trace:** [ID and excerpt]
- **Recommended action:** [fix prompt / build evaluator / fix code]

### 2. [Failure Mode Name] — [X%] of traces
...

## Transition Failure Matrix (if applicable)
[matrix]

## Recommended Next Steps
1. [Highest priority action]
2. [Second priority]
3. [Third priority]

Hand off to companion skills:
- Specification failures → fix prompts directly
- Need test data → generate-synthetic-dataset
- Need evaluators → build-evaluator
- Need improvement measurement → run-experiment

Phase 6: Iterate

Expect 2-3 rounds of refinement:
- Round 1: Initial open/axial coding — rough taxonomy
- Round 2: Refined definitions, edge cases clarified
- Round 3: Final taxonomy — stable, non-overlapping, actionable
- Beyond 3 rounds: diminishing returns

Grader Design Principles (from agent eval best practices)

When analyzing agent traces specifically:

Grade outcomes, not paths. Agents regularly find valid approaches eval designers didn't anticipate. Checking exact tool call sequences is too rigid and brittle.
Use isolated graders per dimension. Don't build one all-encompassing grader. Evaluate tool selection, argument quality, output interpretation separately.
Partial credit for multi-component tasks. A task can partially succeed. Track which components pass/fail independently.
Capability vs regression. Capability evals should start with a LOW pass rate (hard tasks). As they reach 100%, graduate them to regression suites.

Common Pitfalls

Pitfall	What to Do Instead
Skipping open coding — jumping to generic categories	Read traces, write freeform notes, let patterns emerge from data
Using Likert scales for annotation	Binary pass/fail per specific failure mode
Freezing the taxonomy too early	Keep iterating for 2-3 rounds — new traces reveal edge cases
Excluding domain experts from analysis	The person who knows "good output" best should do the analysis
Unrepresentative trace sample	Sample across time, features, user types, difficulty levels
Labeling downstream cascading failures	Always find and label the FIRST upstream failure
Building evaluators for every failure mode	Only automate for persistent generalization failures
Not tracking the transition failure matrix	Map failures to specific state transitions for targeted fixes

Documentation & Resolution

When you need to look up orq.ai platform details, check in this order:

orq MCP tools — query live data first (list_traces, get_span, get_analytics_overview); API responses are always authoritative
orq.ai documentation MCP — use search_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmatically
docs.orq.ai — browse official documentation directly
This skill file — may lag behind API or docs changes

When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.

analyze-trace-failures