experiment-analyzer
Unified Experiment Analyzer
Analyzes one or two LLM experiments. Supports four modes based on inputs:
| Inputs | Mode |
|---|---|
| 2 IDs, no question | Comparative Exploratory |
| 2 IDs + question | Comparative Q&A |
| 1 ID, no question | Single Exploratory |
| 1 ID + question | Single Q&A |
Usage
/experiment-analyzer <experiment_id_1> [experiment_id_2] [question text] [--output agent|file|notebook]
Arguments: $ARGUMENTS
Available Tools
| Tool | Purpose |
|---|---|
mcp__datadog-llmo-mcp__get_llmobs_experiment_summary |
Get total events, error count, metrics stats, available dimensions |
mcp__datadog-llmo-mcp__list_llmobs_experiment_events |
Query events with filters, sorting, pagination |
mcp__datadog-llmo-mcp__get_llmobs_experiment_event |
Get full event details (input, output, expected_output, metrics) |
mcp__datadog-llmo-mcp__get_llmobs_experiment_metric_values |
Get metric stats overall and segmented by dimension |
mcp__datadog-llmo-mcp__get_llmobs_experiment_dimension_values |
List unique values for a dimension with counts |
mcp__datadog-mcp-core__create_datadog_notebook |
Export report as a Datadog notebook |
Phase 0 — Mode & Output Resolution
Parse $ARGUMENTS:
- Extract one or two UUID-format strings as experiment IDs (first = baseline/primary, second = candidate).
- Extract
--output agent|file|notebookflag if present. - The remaining text (after IDs and flags) is the question, if any.
Mode determination:
- 2 IDs + question → Comparative Q&A
- 2 IDs, no question → Comparative Exploratory
- 1 ID + question → Single Q&A
- 1 ID, no question → Single Exploratory
Output mode determination:
If --output was provided in arguments, use that mode and skip asking.
Otherwise, ask one combined clarification message before proceeding. Cover only what is genuinely unclear:
- If mode is ambiguous (e.g., user asked a question but only provided IDs in surrounding context), ask in plain language: "Did you have a specific question in mind, or would you like an exploratory analysis?"
- Always ask about output destination if not specified: "Would you like me to save this to a file, export it to a Datadog notebook, or is displaying it here in chat fine?"
Never ask multiple rounds of clarifications. One message covers everything unresolved.
Output modes:
- Agent (default): Display the full report in the conversation.
- File: Before starting, propose a path:
evals/reports/YYYY-MM-DD-<experiment-slug>-analysis.mdPresent it to the user and let them confirm or adjust. Then proceed. - Notebook: Use
mcp__datadog-mcp-core__create_datadog_notebookat the end. If the tool is unavailable, output these setup instructions instead of failing:
Then ask: "Would you like to fall back to file or agent output instead?"To enable Datadog notebook export, add the MCP server: claude mcp add --transport http datadog-mcp https://mcp.datadoghq.com/api/unstable/mcp-server See: https://docs.datadoghq.com/bits_ai/mcp_server/setup/
After resolving mode and output, proceed fully automatically through Phases 1–5 with no further user interaction.
Phase 1 — Orient
Comparative: Call get_llmobs_experiment_summary for both experiments. Produce a side-by-side comparison:
- Scale: total events and error rate for each
- Metrics: which metrics exist in each; which are shared
- Dimensions: which dimensions exist in each; which are shared
- Immediate red flags (high error rate, missing metrics, sparse data)
- Obvious improvements or regressions visible at the summary level
Single: Call get_llmobs_experiment_summary for the experiment. Determine:
- Total events, error count, error rate
- Available metrics (classify as exact-match vs. rubric/quality)
- Available dimensions for segmentation
- Any immediate red flags
Phase 2 — Signal Discovery + UI Links
Comparative: Using only shared metrics and dimensions, identify:
- Segments where the candidate outperforms the baseline
- Segments where the candidate regresses
- Error types present in one but rare in the other
- Distribution shifts or coverage gaps
- Tradeoffs (e.g., higher recall, lower precision)
Generate Datadog comparison UI links:
- Base URL:
https://app.datadoghq.com/llm/experiment-comparison - Required params:
baselineExperimentId,experimentIds(candidate%2Cbaseline),tableView=all - Optional (include if discoverable):
project,compareDatasetId,selectedEvaluation selectedEvaluationpriority: overall/overall_score/rubric metric → primary metric → first shared metric- Generate 2–4 links: primary comparison, regression view, calibration view (if applicable), worst-segment view (only if supported — never fabricate filters)
Single: Measure per-metric performance across all dimensions. Identify:
- Worst-performing segments (by metric × dimension)
- Any segments with surprising pass rates
- Overall pass rates and variance
Generate Datadog experiment UI link:
https://app.datadoghq.com/llm/experiments/{experiment_id}
Phase 3 — Deep Dives
Run all necessary deep dives automatically. Do not ask for approval or pause.
Q&A modes: Focus deep dives on what is needed to answer the question directly. Pull specific events, segment by relevant dimensions, inspect examples.
Exploratory modes: Investigate the most interesting signals broadly:
- Per-segment and per-class delta analysis (comparative) or pass-rate analysis (single)
- Error overlap vs. unique failure mode analysis
- Sampling and qualitative inspection of representative failures (2–5 per issue)
- Clustered error theme analysis
Rules:
- Prefer cheap, high-signal analyses first; do not stop early.
- Mask or redact PII in all outputs.
- Avoid destructive actions.
For each sampled event, generate a direct span link:
https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&sp=[{"p":{"experimentId":"{experiment_id}","spanId":"{span_id}"},"i":"experiment-details"}]&spanId={span_id}
Phase 4 — Synthesis
Comparative Exploratory:
- Clear wins where the candidate improves on the baseline
- Clear regressions or risks the candidate introduces
- Neutral or unchanged areas
- Root-cause hypotheses (1–4), tied to evidence
- Prioritized recommendations: ship as-is / block / gate by segment / combine behaviors
Comparative Q&A:
- Direct answer to the question with a clear verdict
- Supporting evidence (metrics, percentages, event examples)
- Relevant context (e.g., caveats, data limitations)
Single Exploratory:
- Overall performance assessment
- Worst-performing segments and root causes
- Hypotheses for why failures occur
- Recommended next experiments
Single Q&A:
- Direct answer to the question with a clear verdict
- Supporting evidence from the experiment data
All modes: use quantified deltas/rates wherever possible. Redact PII.
Phase 5 — Output Delivery
Agent: Present the full report in the conversation using the report format below.
File: Write the report to the pre-confirmed path. Confirm with: "Report saved to <path>."
Notebook: Call mcp__datadog-mcp-core__create_datadog_notebook with the report content structured as notebook cells. Return the notebook URL.
Phase 6 — Conversational Follow-up
After delivering the report, append a follow-up section:
---
## Want to explore further?
Here are a few directions based on the findings:
1. [Specific question derived from actual findings — e.g., "Want me to dig deeper into why the SQL scenarios regressed in the candidate?"]
2. [Another specific follow-up — e.g., "Should I compare error patterns between the two failing clusters?"]
3. [A third option if relevant]
Do you have any other questions about this analysis?
Stay active after the report. Answer follow-up questions using the same MCP tools, referencing findings already gathered. Do not re-run analyses you've already performed unless new questions require it.
Report Format
# Experiment Analysis Report
## [Mode: Comparative Exploratory | Comparative Q&A | Single Exploratory | Single Q&A]
[2–3 sentence executive summary: experiment(s) purpose, scale, and key finding with specific numbers.]
## Orientation
[Side-by-side table for comparative; summary table for single. Include: events, error rate, metrics, dimensions.]
## [Signals | Answer to Question]
[For exploratory: ranked table of signals/segments with metric deltas and impact counts.]
[For Q&A: direct answer with verdict, then supporting evidence.]
## Deep Dive Findings
### [Issue/Finding Title]
**Segment**: `[dimension=value]` | **Impact**: N events | **Severity**: metric pass rate = X%
**What's happening**: [3–5 sentences with specific observations]
**Representative examples**:
- [Span link]: [input → output → expected, what went wrong]
**Root cause hypothesis**: [Category]: [Explanation tied to evidence]
**Recommendation**: [Specific, actionable next step]
---
[Repeat for each major issue]
## Summary & Recommendations
[Wins, regressions, neutral areas, prioritized actions. For Q&A: verdict + rationale.]
## UI Links
[All generated Datadog UI links with labels]
Operating Rules
- Do not assume anything about the experiment (model, task, metrics, schema, dimensions). Infer everything by inspecting the data.
- Ground all conclusions in specific evidence: event IDs, counts, percentages.
- Show math: include counts and rates, not just qualitative claims.
- Avoid speculative explanations not supported by observed evidence.
- Mask or redact PII in all user-visible output.
- Never show internal tool calls, schemas, or implementation details to the user.