experiment-analyzer
# Experiment Analyzer
Analyze LLM experiment results to identify performance issues and recommend improvements.
## Usage

`/analyze-experiment <experiment_id>`
The experiment_id is: $ARGUMENTS
## Available Tools
Use these MCP tools for analysis:
| Tool | Purpose |
|---|---|
| `search_llmobs_spans` | Search for LLM Observability spans matching filters (entry point for trace analysis) |
| `search_datadog_llmobs_spans` | Retrieve and analyze LLM Observability spans with custom attributes |
| `get_llmobs_trace` | Get the full structure of a trace as a span hierarchy tree |
| `get_llmobs_span_details` | Get detailed metadata for one or more spans (timing, LLM info, content_info) |
| `get_llmobs_span_content` | Retrieve the actual content of a specific field from a span (input, output, messages, etc.) |
| `expand_llmobs_spans` | Load children of specific spans in a trace for progressive tree exploration |
| `find_llmobs_error_spans` | Find all error spans in a trace with propagation context |
| `get_llmobs_agent_loop` | Get a chronological view of an agent's execution loop (decisions, tool calls, LLM calls) |
| `get_llmobs_experiment_summary` | Get a high-level summary of an experiment with pre-computed metric stats |
| `list_llmobs_experiment_events` | List experiment events with filters, sorting, and pagination |
| `get_llmobs_experiment_event` | Get full details for a single experiment event (input, output, metrics, dimensions) |
| `get_llmobs_experiment_metric_values` | Get statistical analysis for a metric, optionally segmented by dimension |
| `get_llmobs_experiment_dimension_values` | Get unique values for a dimension with counts |
## Analysis Workflow
Before each action, reason about what you've learned and what to investigate next. This makes the analysis adaptive and transparent.
### Reasoning Pattern
For each phase, follow this pattern:
Thought: What do I know? What's surprising? What should I investigate next?
Action: Call the appropriate tool.
Observation: What did it return? What does it mean?
Continue until you have enough evidence to write the final report.
### Phase 1: Orient
Thought: I need to understand the experiment's structure before diving in. What metrics exist? What dimensions can I segment by? How many events are there?
Action: Call `get_llmobs_experiment_summary(experiment_id)`
Observation: From the summary, note:
- Total events and error count
- List of available metrics (classify as exact-match vs rubric/quality)
- List of available dimensions
- Any immediate red flags (high error rate, missing metrics)
Decision point:
- If error_count is high relative to total_events, investigate errors first
- If only 1-2 metrics exist, analysis will be simpler
- If many dimensions exist, prioritize the most meaningful ones
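
As a rough illustration of this triage step, here is a minimal sketch; the field names and the 10% threshold are assumptions about the summary payload, not part of any tool contract:

```python
# Illustrative triage only: the field names and the 10% threshold are
# assumptions about the summary payload, not part of any tool contract.
def triage(summary: dict) -> str:
    total = summary.get("total_events", 0)
    errors = summary.get("error_count", 0)
    if total and errors / total > 0.10:  # "high" error rate -> investigate errors first
        return "investigate_errors_first"
    return "proceed_to_metrics"
```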
### Phase 2: Measure Overall Performance
Thought: Now I need baseline metrics. Which metrics are most important? Are there obvious failures?
Action: For each metric, call `get_llmobs_experiment_metric_values(experiment_id, metric_label)`
Observation: For each metric, record:
- For boolean metrics: True/False/Empty counts -> calculate pass rate
- For numeric metrics: mean, min, max, distribution shape
- Flag any metric with pass rate < 90% or high variance
Decision point:
- If all metrics show >95% pass rate -> analysis may be brief, focus on edge cases
- If a metric has <70% pass rate -> this is a primary investigation target
- If metrics conflict (one good, one bad) -> investigate the relationship
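
For the pass-rate arithmetic above, a minimal sketch (how Empty values are treated, excluded from the denominator here, is a choice you should state in the report):

```python
# Minimal pass-rate sketch. Assumes a boolean metric is summarized as
# True/False/Empty counts; Empty values are excluded from the denominator,
# which is a choice to call out explicitly in the report.
def pass_rate(true_count: int, false_count: int, empty_count: int) -> float | None:
    scored = true_count + false_count
    if scored == 0:
        return None  # nothing was scored for this metric
    return true_count / scored

# Hypothetical usage with placeholder counts:
rate = pass_rate(true_count=118, false_count=22, empty_count=3)
is_primary_target = rate is not None and rate < 0.70  # threshold from the decision points above
```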
### Phase 3: Segment and Discover Patterns
Thought: Overall metrics hide segment-level problems. Which dimensions might explain failures? Let me check each dimension's distribution first, then segment the worst-performing metric by each dimension.
Action: For each dimension:
- `get_llmobs_experiment_dimension_values(experiment_id, dimension_key)` -> see the distribution
- `get_llmobs_experiment_metric_values(experiment_id, metric_label, segment_by_dimension=dimension_key)` -> see performance by segment
Observation: For each dimension, note:
- Number of unique values and their frequencies
- Which segments perform worse than overall average
- Any segments with surprisingly good or bad performance
Decision point:
- If a segment has <50% of overall pass rate -> high priority deep dive
- If segment has high impact (many events) AND low performance -> highest priority
- If dimension has only 1-2 values -> may not be useful for segmentation
- Rank segments by: Priority = (1 - segment_pass_rate) x segment_count
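
A sketch of that ranking, assuming each segment has already been reduced to a pass rate and an event count (this intermediate shape is something you build from the segmented metric results, not a raw tool response; the example values are placeholders):

```python
# Rank segments by Priority = (1 - pass_rate) * count.
# The segment dicts are an assumed intermediate shape, not tool output.
def rank_segments(segments: list[dict]) -> list[dict]:
    for seg in segments:
        seg["priority"] = (1 - seg["pass_rate"]) * seg["count"]
    return sorted(segments, key=lambda s: s["priority"], reverse=True)

segments = [
    {"dimension": "label", "value": "regression", "pass_rate": 0.25, "count": 40},
    {"dimension": "label", "value": "feature", "pass_rate": 0.90, "count": 120},
]
top_segments = rank_segments(segments)[:5]  # deep-dive candidates for Phase 4
```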
### Phase 4: Deep Dive into Problem Segments
Thought: I've identified the worst segments. Now I need to understand WHY they fail. Let me pull specific failing events and examine them.
For each top problem segment (limit to top 3-5):
Action:
- `list_llmobs_experiment_events(experiment_id, filter_dimension_key=X, filter_dimension_value=Y, filter_metric_label=Z, filter_metric_value=false, limit=5)` -> get failing event IDs
- `get_llmobs_experiment_event(experiment_id, event_id)` -> examine 2-3 failures in detail
Observation: For each failing event, note:
- What was the input/context?
- What did the model output?
- What was expected?
- What specifically went wrong? (schema error, wrong value, missing field, logic error)
Thought: Looking across these failures, what's the pattern? Is this:
- Prompt ambiguity: Instructions are unclear or contradictory
- Schema compliance: Output format doesn't match requirements
- Tool issue: Tool calling or parsing problems
- Evaluator mismatch: Gold labels may be wrong or inconsistent
- Data quality: Input data has issues
- Logic error: Model reasoning is systematically flawed
Decision point:
- If pattern is clear -> formulate specific fix recommendation
- If pattern is unclear -> pull more examples or check a different angle
- If failures seem random -> may be noise or evaluator issues
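
To keep the pattern-finding honest, a small tally like the sketch below can make the dominant failure mode explicit. The categories mirror the hypothesis list above; the per-event assignments are manual judgments made while reading each failing event (event IDs here are hypothetical), not tool output:

```python
from collections import Counter

# Per-event category labels are manual judgments, not tool output;
# event IDs below are hypothetical placeholders.
examined = {
    "event_abc": "schema_compliance",
    "event_def": "schema_compliance",
    "event_ghi": "evaluator_mismatch",
}
failure_categories = Counter(examined.values())
dominant, dominant_count = failure_categories.most_common(1)[0]
pattern_is_clear = dominant_count / len(examined) >= 0.6  # arbitrary cutoff for "clear pattern"
```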
### Phase 5: Synthesize and Recommend
Thought: I now have evidence for the main issues. Let me formulate actionable recommendations with specific fixes.
For each issue:
- State the problem clearly with evidence
- Propose a specific fix (with actual prompt/code snippet if applicable)
- Explain why it should help (tied to evidence)
- List validation steps
- Note risks/tradeoffs
Thought: What follow-up experiments would validate these fixes? Prioritize by expected impact.
### Phase 6: Compile Report
Thought: Time to compile findings into the structured report format. Make sure to:
- Include specific numbers and event IDs
- Show the reasoning chain that led to conclusions
- Prioritize issues by severity x impact
Write the report following the Output Format template below.
### Phase 7: Offer to Save
After presenting the report, ask:
"Would you like me to save this report to a markdown file?"
If yes:
- Filename: `experiment-analysis-{experiment_id_first_8_chars}-{YYYY-MM-DD}.md`
- Location: current working directory
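
A sketch of the filename construction (using the UTC date, which is an assumption; adjust if local time is preferred):

```python
from datetime import datetime, timezone

# Builds the save path described above; UTC date is an assumption.
def report_filename(experiment_id: str) -> str:
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"experiment-analysis-{experiment_id[:8]}-{date}.md"
```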
## Output Format
Structure your analysis report as follows:
# Experiment Analysis Report
[2-3 sentence executive summary including: experiment purpose, model used, total events, and key finding with specific numbers. Example: "Overall, **classification is strong** (~84% exact match), but **regression labeling is weak** (~59% exact match overall and only **25%** on cases where the model itself says it's a regression)."]
## Overall Performance Summary
**Link to experiment**: https://app.datadoghq.com/llm/experiments/{experiment_id}
**Events**: [count] (model: `[model_name]`)
### Exact-match metrics (decision-critical)
- **[metric_name]**:
- True: [count]
- False: [count]
- Empty: [count]
-> **Pass rate**: [count]/[total] = **[percentage]%**
[Repeat for each exact-match metric]
### Quality/rubric scores
- **[metric_name]**: [count] unique values / [total] events ([variance note])
- For [specific slice]: mean **[value]**, p50 **[value]**
## Worst Segments
| Segment | Severity (key metric) | Impact | Notes |
|---------|----------------------:|-------:|-------|
| `[dimension=value]` | **[metric] pass rate = [X]% ([n]/[total])** | [count] | [Brief explanation] |
## Issue Deep Dives
### Issue 1 - [Descriptive Title]
**Segment**: `[dimension=value]` (impact: **[X] events**)
**Severity**: **[metric] pass rate = [X]%**
**What's happening**:
[Detailed explanation of the failure pattern - what the model is doing wrong and why it matters. 3-5 sentences with specific observations.]
**Representative examples**:
- [Span X](https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&sp=%5B%7B%22p%22%3A%7B%22experimentId%22%3A%22{experiment_id}%22%2C%22spanId%22%3A%22X%22%7D%2C%22i%22%3A%22experiment-details%22%7D%5D&spanId=X): [input → output → expected, what went wrong]
- [Span Y](https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&sp=%5B%7B%22p%22%3A%7B%22experimentId%22%3A%22{experiment_id}%22%2C%22spanId%22%3A%22Y%22%7D%2C%22i%22%3A%22experiment-details%22%7D%5D&spanId=Y): [input → output → expected, what went wrong]
**IMPORTANT**: Replace `{experiment_id}` with the actual experiment ID from the data, and replace `X` and `Y` with the actual span IDs from your analysis. Each span link should be clickable and point to the specific span in the Datadog UI.
**Root cause hypothesis** (categorized):
- **[Category]**: [Detailed explanation of why this is happening, tied to specific evidence from the examples above.]
**Recommended fix**:
- **What to change**: [Description of the fix]
```text
[Actual prompt snippet, schema definition, or code to add/modify]
```
- **Why it should help**:
  - [Point 1 tied to specific evidence]
  - [Point 2 tied to specific events]
- **Validation**:
  - [Specific test step]
  - [Metric to track]
  - [Events to re-evaluate]
- **Risks/tradeoffs**:
  - [Potential downside 1]
  - [Mitigation approach]
[Repeat for each major issue]
## Next Experiments
- [Experiment name] (highest priority): [What to change]. Track [metric] on [segment/events].
- [Experiment name]: [What to change]. [Expected impact].
- [Experiment name]: [What to change]. [Rationale].
[Optional: Offer to pull specific event sets for further analysis]
## Operating Rules
- Be explicit when data is missing or ambiguous; don't guess
- Ground conclusions in specific event evidence with IDs
- Show your math: include counts and percentages, not just rates
- Describe aggregation logic clearly when computing metrics
- Focus on fixes that generalize, not one-off hacks
- Prioritize issues by severity x impact
- Include actual prompt/code snippets in recommendations
- Categorize root causes to help identify patterns across issues
- Always offer to save the report at the end