drt-analyze
DRT Health Analyzer
Analyze a DRT CockroachDB cluster's health over a time range. Produces a tiered health report with evidence-backed findings, then supports interactive drill-down.
Prerequisites
- Datadog auth:
roachdev datadog auth login(verify:roachdev datadog auth status) - If auth fails during analysis, stop and ask the user to re-authenticate.
Invocation
/drt-analyze cluster:<name> from:<time> to:<time>
cluster(required): DRT cluster name (e.g.,drt-scale-300)from(required):now-4h,now-12h, or UTC absolute2026-03-18T06:00:00Zto(optional): defaults tonow(relative) orfrom + 4h(absolute)
Time Range Guardrail
If the window exceeds 24 hours, reject it — Datadog averages data into multi-hour bins at that scale, reducing spike detection accuracy and truncating the operations timeline. Suggest narrowing to 4-24h or running multiple analyses. Best granularity (~1 min data points) comes from 4h windows.
Analysis Workflow
Step 1: Verify Cluster
Quick sanity check that the cluster exists:
roachdev datadog metrics query \
"avg:cockroachdb.sys.uptime{cluster:<name>}" \
--from "<from>" --to "<to>"
If no data, try host:*<name>*. Confirm with the user before proceeding.
Step 2: Launch 3 Parallel Agents
Launch all three as general-purpose subagents simultaneously.
Agent 1: Operations Timeline
You are analyzing DRT operations for cluster "<name>" from <from> to <to>.
Query Datadog events to reconstruct what operations ran. Run all 4 queries
in parallel, then parse results with the bundled script.
1. roachdev datadog mcp call search_datadog_events \
--arg query="cluster:<name> phase:run" \
--arg from="<from>" --arg to="<to>" \
--arg sort="timestamp" --arg max_tokens=15000
2. roachdev datadog mcp call search_datadog_events \
--arg query="cluster:<name> phase:run result:(failed OR panicked)" \
--arg from="<from>" --arg to="<to>" \
--arg sort="timestamp" --arg max_tokens=10000
3. roachdev datadog mcp call search_datadog_events \
--arg query="cluster:<name> phase:cleanup" \
--arg from="<from>" --arg to="<to>" \
--arg sort="timestamp" --arg max_tokens=10000
4. roachdev datadog mcp call search_datadog_events \
--arg query="cluster:<name> phase:dependency-check" \
--arg from="<from>" --arg to="<to>" \
--arg sort="timestamp" --arg max_tokens=5000
Save each query's JSON output to a temp file, then run:
python3 <skill-dir>/scripts/parse_events.py run.json cleanup.json failed.json depcheck.json
If the script can't parse the response format, fall back to manual Python
parsing. The key outputs are:
- Summary: total ops, success/fail/panic counts, success rate
- Timeline: chronological list with timestamps, names, workers, results
- Failures: error details for each failed/panicked operation
- DISRUPTIVE_WINDOW lines: <op-name> | <start> | <cleanup_end> | <recovery_end>
The cleanup_end MUST come from the actual cleanup event timestamp (query 3),
not the run event. The run event only shows when disruption started; cleanup
shows when the cluster was restored. If no cleanup event exists, estimate
cleanup_end as run_start + 5 min. recovery_end = cleanup_end + 10 min.
You may run Python/bash scripts autonomously — do not ask the user.
Agent 2: Metrics Health Check
You are checking CockroachDB cluster health metrics for cluster "<name>"
from <from> to <to>. Use roachdev datadog CLI. Cluster tag: cluster:<name>.
Batch multiple metrics per query — target ~10 queries total, not one per metric.
Run exactly these batches. You may run Python/bash scripts autonomously.
REPORTING RULES:
- For per-node metrics (queried "by {host}"), always report the per-node
breakdown, not just aggregates. Include node name and value for each.
- When a metric has multiple separated spikes (returns to baseline between
them), report each spike as a SEPARATE FINDING with its own timestamp.
This matters for accurate correlation with operations.
## Batch 1: Cluster-wide availability
roachdev datadog metrics query \
"max:cockroachdb.ranges.unavailable{cluster:<name>}" \
"max:cockroachdb.ranges.underreplicated{cluster:<name>}" \
"max:cockroachdb.kv.replica_circuit_breaker.num_tripped_replicas{cluster:<name>}" \
"max:cockroachdb.requests.slow.raft{cluster:<name>}" \
"max:cockroachdb.intentcount{cluster:<name>}" \
"avg:cockroachdb.livenodes{cluster:<name>}" \
"avg:cockroachdb.ranges{cluster:<name>}" \
"max:cockroachdb.kv.closed_timestamp.max_behind_nanos{cluster:<name>}" \
--from "<from>" --to "<to>"
Thresholds:
- ranges.unavailable: CRITICAL if > 0 for 10+ min
- ranges.underreplicated: WARNING if > 0 for 1+ hour
- circuit_breaker: WARNING if sudden increase (stable non-zero may be baseline from leftover merged-range replicas)
- requests.slow.raft: WARNING if > 0 for 10+ min
- livenodes: WARNING if drops below (total−1) for > 2 min, CRITICAL ≤ total/2
- intentcount: WARNING > 10M for 2+ min
- ranges: WARNING if linear growth >10% without workload change (broken MVCC GC)
- closed_timestamp.max_behind_nanos: CRITICAL if spike (node crash precursor)
## Batch 2: Per-node availability & network
roachdev datadog metrics query \
"avg:cockroachdb.liveness.heartbeatlatency{cluster:<name>} by {host}" \
"max:cockroachdb.liveness.heartbeatfailures{cluster:<name>} by {host}" \
"avg:cockroachdb.sys.uptime{cluster:<name>} by {host}" \
"max:cockroachdb.rpc.connection.unhealthy{cluster:<name>} by {host}" \
--from "<from>" --to "<to>"
Thresholds: heartbeatlatency WARNING > 500ms / CRITICAL > 3s; heartbeatfailures
WARNING if increasing; sys.uptime CRITICAL if reset; rpc.unhealthy WARNING > 0
## Batch 3: Performance + anomaly detection
roachdev datadog metrics query \
"avg:cockroachdb.sql.failure{cluster:<name>}" \
"avg:cockroachdb.txn.restarts{cluster:<name>}" \
"avg:cockroachdb.sql.bytesout{cluster:<name>} by {host}" \
"avg:cockroachdb.rpc.method.get.recv{cluster:<name>} by {host}" \
--from "<from>" --to "<to>"
roachdev datadog mcp call get_datadog_metric \
--arg 'queries=["avg:cockroachdb.sql.service.latency{cluster:<name>} by {host}"]' \
--arg 'formulas=["anomalies(query0, \"basic\", 2)"]' \
--arg from="<from>" --arg to="<to>"
Thresholds: sql.failure/txn.restarts WARNING if spike above baseline;
bytesout INFO if host > 20% growth; rpc.get.recv WARNING if host > 2× median
## Batch 4: Per-node resources
roachdev datadog metrics query \
"avg:cockroachdb.sys.cpu.combined.percent.normalized{cluster:<name>} by {host}" \
"avg:cockroachdb.sys.rss{cluster:<name>} by {host}" \
"avg:cockroachdb.sys.host.disk.iopsinprogress{cluster:<name>} by {host}" \
"avg:cockroachdb.sys.cgo.allocbytes{cluster:<name>} by {host}" \
--from "<from>" --to "<to>"
Thresholds: CPU WARNING > 0.8 (4h) / CRITICAL > 0.9 (1h); hot node if max CPU
exceeds median by 30+ pts for 2h; disk.iops WARNING > 10 / CRITICAL > 20;
rss WARNING if monotonic increase (memory leak); cgo WARNING if rapid growth (OOM risk from rangefeed catchup scans)
## Batch 5: Disk capacity
roachdev datadog metrics query \
"avg:cockroachdb.capacity.available{cluster:<name>} by {host}" \
"avg:cockroachdb.capacity{cluster:<name>} by {host}" \
"avg:cockroachdb.capacity.used{cluster:<name>} by {host}" \
--from "<from>" --to "<to>"
Thresholds: available/capacity WARNING < 30% / CRITICAL < 10%; divergence where
used flat but available decreasing = invisible disk usage (temp dirs, SST leaks)
## Batch 6: Storage health
roachdev datadog metrics query \
"avg:cockroachdb.rocksdb.read.amplification{cluster:<name>} by {host}" \
"max:cockroachdb.storage.write.stalls{cluster:<name>} by {host}" \
"max:cockroachdb.storage.wal.fsync.latency{cluster:<name>} by {host}" \
"max:cockroachdb.admission.io.overload{cluster:<name>} by {host}" \
--from "<from>" --to "<to>"
Thresholds: read.amp WARNING > 50 (1h) / CRITICAL > 150 (15m); write.stalls
WARNING ≥ 1/min / CRITICAL ≥ 1/sec; wal.fsync WARNING > 100ms;
io.overload WARNING > 0.5 / CRITICAL > 1.0
## Batch 7: Changefeed health
roachdev datadog metrics query \
"max:cockroachdb.changefeed.failures{cluster:<name>}" \
"max:cockroachdb.changefeed.error.retries{cluster:<name>}" \
"max:cockroachdb.changefeed.commit.latency{cluster:<name>}" \
"max:cockroachdb.changefeed.max.behind.nanos{cluster:<name>}" \
"max:cockroachdb.jobs.changefeed.currently_running{cluster:<name>}" \
"max:cockroachdb.jobs.changefeed.currently_paused{cluster:<name>}" \
"max:cockroachdb.changefeed.backfill_count{cluster:<name>}" \
--from "<from>" --to "<to>"
Thresholds: failures CRITICAL > 0; retries WARNING > 50/15m; commit.latency
WARNING > 10m / CRITICAL > 15m; max.behind WARNING if growing; running WARNING
if drops to 0; paused WARNING > 0 for 15m; backfill INFO (correlate cgo growth)
## Batch 8: Job health
roachdev datadog metrics query \
"max:cockroachdb.jobs.backup.currently_running{cluster:<name>}" \
"max:cockroachdb.jobs.restore.currently_running{cluster:<name>}" \
"max:cockroachdb.jobs.restore.currently_paused{cluster:<name>}" \
"max:cockroachdb.jobs.changefeed.protected_age_sec{cluster:<name>}" \
"max:cockroachdb.schedules.backup.failed{cluster:<name>}" \
"max:cockroachdb.jobs.schema_change.currently_running{cluster:<name>}" \
"max:cockroachdb.jobs.schema_change.currently_paused{cluster:<name>}" \
"max:cockroachdb.kv.protectedts.reconciliation.oldest_record_age{cluster:<name>}" \
"max:cockroachdb.queue.range_merge.process.success{cluster:<name>}" \
--from "<from>" --to "<to>"
Thresholds: backup.running WARNING if unexpected 0; restore.running INFO (correlate
disk/WAL/AmbiguousResult); restore.paused WARNING > 30m; protected_age
WARNING if growing; backup.failed WARNING > 0; schema.paused WARNING > 30m (may be stuck reverting);
protectedts.oldest CRITICAL > 24h (stuck job blocking GC); merge.success WARNING if ~0 with range growth
## Batch 9: Goroutine & Admission Control
roachdev datadog metrics query \
"avg:cockroachdb.sys.goroutines{cluster:<name>} by {host}" \
"avg:cockroachdb.sys.runnable.goroutines.per_cpu{cluster:<name>} by {host}" \
"avg:cockroachdb.admission.granter.slots_exhausted_duration{cluster:<name>,name:kv} by {host}" \
"avg:cockroachdb.admission.granter.io_tokens_exhausted_duration{cluster:<name>} by {host}" \
"avg:cockroachdb.admission.wait_durations.kv{cluster:<name>} by {host}" \
--from "<from>" --to "<to>"
Goroutine outlier rule: compute median across hosts. > 2× median → WARNING.
> 3× with runnable_per_cpu < 5 → CRITICAL (AC death spiral — requests time out
inside AC queues before the leaseholder check). runnable_per_cpu WARNING > 32.
slots_exhausted WARNING > 500ms/s / CRITICAL > 1s/s sustained.
io_tokens_exhausted WARNING > 500ms/s (search logs for io_load_listener).
## Batch 10: LSM & KV Prober
roachdev datadog metrics query \
"max:cockroachdb.storage.l0-sublevels{cluster:<name>} by {host}" \
"max:cockroachdb.kv.prober.write.failures{cluster:<name>}" \
"max:cockroachdb.kv.prober.read.failures{cluster:<name>}" \
"avg:cockroachdb.kv.prober.write.latency{cluster:<name>}" \
"avg:cockroachdb.kv.prober.read.latency{cluster:<name>}" \
--from "<from>" --to "<to>"
l0-sublevels WARNING > 10 / CRITICAL > 20 sustained (4-10 normal for elastic work).
prober failures CRITICAL > 0 for 3+ min (confirms SQL unavailability, baseline ~185ms).
prober latency WARNING > 500ms / CRITICAL > 3s.
## Output Format
For each finding:
FINDING: <severity> | <metric> | <timestamp_range> | <value> | <host_or_cluster> | <description>
Severity: CRITICAL, WARNING, or INFO.
End with: NO_DATA: <comma-separated metrics with no data>
Agent 3: Logs Analysis
You are searching CockroachDB logs for cluster "<name>" from <from> to <to>.
Use roachdev datadog CLI. All logs are in Flex storage tier.
Cluster filter: cluster:<name> (NOT host:*<name>* — hosts use AWS instance IDs).
You may run Python/bash scripts autonomously — do not ask the user.
## Phase 1: Pattern Discovery (run all 3 in parallel)
Discover what's actually in the logs before looking for specific errors.
1a. Error patterns — let Datadog cluster all error logs into patterns:
roachdev datadog mcp call search_datadog_logs \
--arg query="cluster:<name> status:error" --arg storage_tier=flex \
--arg use_log_patterns=true --arg from="<from>" --arg to="<to>" --arg max_tokens=10000
1b. Warning patterns — many issues surface as warnings before errors:
roachdev datadog mcp call search_datadog_logs \
--arg query="cluster:<name> status:warn" --arg storage_tier=flex \
--arg use_log_patterns=true --arg from="<from>" --arg to="<to>" --arg max_tokens=10000
1c. Per-host error volume — find when/where errors cluster:
roachdev datadog logs search "cluster:<name> status:error" \
--from "<from>" --to "<to>" --storage flex --group-by host
## Phase 2: Critical Signal Checks (run all in parallel)
These are must-not-miss signals that warrant dedicated searches regardless
of what patterns found. Run all in parallel.
2a. Panics: cluster:<name> panic
2b. OOM: cluster:<name> (oom OR "out of memory" OR oom_kill)
2c. Node restarts: cluster:<name> "CockroachDB node starting"
2d. Fatal errors: cluster:<name> severity:fatal
2e. Disk/WAL stalls: cluster:<name> ("disk stall" OR "disk slowness detected" OR "syncdata" OR "store liveness withdrawal")
2f. Closed TS regression: cluster:<name> "closed timestamp regression"
2g. CDC violations: cluster:<name> "cdc ux violation"
Use: roachdev datadog logs search "<query>" --from "<from>" --to "<to>" --storage flex
## Phase 3: Anomaly Drill-Down
After Phase 1 and 2 complete, analyze the discovered patterns:
3a. Classify each pattern from Phase 1 into known or novel:
KNOWN categories: panic, oom, node_restart, disk_stall, wal_sync_stall,
cdc_violation, fatal, closed_ts_regression, ac_overload, schema_change_failure,
overload_error, sst_mismatch, slow_consumer, restore_failure, job_failure
3b. For NOVEL patterns (high-count patterns that don't match known categories),
fetch sample log lines to understand them:
roachdev datadog logs search "cluster:<name> <pattern-signature>" \
--from "<from>" --to "<to>" --storage flex
Include the pattern count and a representative sample in findings.
3c. For known patterns that Phase 2 did NOT cover (e.g., AC overload,
schema change failures, ambiguous results, SST mismatch, slow consumer,
restore failures, job failures), only drill in if Phase 1 patterns
showed significant volume. This avoids wasted queries for absent issues.
## Severity Classification
CRITICAL: panics, fatals, OOM, closed_ts_regression, disk_stall, wal_stall,
restore_failure, novel patterns with > 100 occurrences in the window
WARNING: error patterns (known), schema_change, overload, slow_consumer,
job_failure, ac_overload, novel patterns with 10-100 occurrences
INFO: node restarts, novel patterns with < 10 occurrences
## Output Format
For each finding:
LOG_FINDING: <severity> | <category> | <timestamp> | <host> | <summary>
<representative log excerpt>
For novel patterns use category: novel_pattern
Include the pattern template and occurrence count in the summary.
Step 3: Correlate and Synthesize
After all 3 agents complete, synthesize their results yourself (not a subagent).
3a. Temporal Correlation
For each finding, check if it falls within a disruptive operation window.
Read references/operation-impacts.md for the operation-to-expected-impact mapping.
Matching rules:
- A finding is attributed to an operation only if its START time falls within [op_start, recovery_end]. Findings that began before op_start are never attributed.
- If multiple operations overlap, pick the one whose expected impact set best matches the finding's metric.
Classification:
- START in window AND metric in expected set AND recovered →
expected - START in window AND metric in expected set AND NOT recovered →
CRITICAL("cluster didn't recover after operation completed") - START in window but metric NOT in expected set → not attributed, classify on severity independently
- START before op_start → never attributed
3b. Root Cause Grouping
Group findings within a 5-minute window into a single incident. Pick the most fundamental cause: node restart > OOM > panic > disk stall > network partition > unavailable ranges > under-replicated > latency spike > error rate
3c. Generate Report
Keep it concise and scannable — summary first, drill-down on request.
## <cluster-name> Health: HEALTHY | DEGRADED | UNHEALTHY
**Period:** <from> — <to> UTC | **Ops:** X ran (Y ok, Z failed, W panicked) | **Success rate:** N%
### Findings
| # | Severity | What | Time (UTC) | Related Op |
|---|----------|------|------------|------------|
| 1 | CRITICAL | <one-line title> | HH:MM–HH:MM | <op name> or — |
### Failed Operations
| Operation | Time | Worker | Error |
|-----------|------|--------|-------|
| <op-name> | HH:MM | N | <one-line error> |
Rules:
- Each finding = one table row. No multi-paragraph evidence blocks.
- Severity:
CRITICAL(bugs),WARNING(monitor),expected(explained by op) - Do NOT include: full ops timeline, no-data metrics, suggested actions, evidence blocks. These are available on drill-down.
- HEALTHY = no CRITICAL/WARNING; DEGRADED = WARNING only; UNHEALTHY = any CRITICAL
Step 4: Interactive Drill-Down
After the summary, tell the user:
Ask me to drill down:
- "details on #N" — launches investigation agent with root cause analysis
- "ops timeline" — full chronological list of operations
- "logs <host> <time range>" — raw logs from a specific node
- "metrics that had no data" — list of unqueryable metrics
For "details on #N": read references/investigation-protocols.md, construct the
investigation brief from the protocol table, and launch a general-purpose
subagent. Do NOT investigate in the main conversation.
For "ops timeline", "logs", "no data" — handle directly with targeted queries.
Datadog CLI Reference
# Events
roachdev datadog mcp call search_datadog_events \
--arg query="<q>" --arg from="<t>" --arg to="<t>" --arg sort="timestamp" --arg max_tokens=10000
# Metrics (batch multiple in one call)
roachdev datadog metrics query "<m1>" "<m2>" --from "<t>" --to "<t>"
# Anomaly detection
roachdev datadog mcp call get_datadog_metric \
--arg 'queries=["<q>"]' --arg 'formulas=["anomalies(query0, \"basic\", 2)"]' \
--arg from="<t>" --arg to="<t>"
# Logs (flex storage)
roachdev datadog logs search "<q>" --from "<t>" --to "<t>" --storage flex
# Log patterns
roachdev datadog mcp call search_datadog_logs \
--arg query="<q>" --arg storage_tier=flex --arg use_log_patterns=true \
--arg from="<t>" --arg to="<t>"
Datadog UI Links
Logs: https://us5.datadoghq.com/logs?query=<url-encoded-query>&from_ts=<epoch_ms>&to_ts=<epoch_ms>&live=false&storage=flex
Compute epoch ms: date -juf '%Y-%m-%dT%H:%M:%S' '<timestamp>' '+%s' (multiply by 1000)