# Alert Investigate (`signoz-investigating-alerts`)

Diagnose why a SigNoz alert fired. The skill correlates the alert's own signal with neighbor signals around the fire window, and surfaces a ranked list of likely causes with supporting evidence. It is the companion to `signoz-explaining-alerts` — explain decodes the rule statically; investigate diagnoses a specific incident.
## Prerequisites

This skill calls SigNoz MCP server tools heavily (`signoz:signoz_get_alert`, `signoz:signoz_get_alert_history`, `signoz:signoz_execute_builder_query`, `signoz:signoz_query_metrics`, `signoz:signoz_search_traces`, `signoz:signoz_search_logs`, `signoz:signoz_get_trace_details`, etc.). Before running the workflow, confirm the `signoz:signoz_*` tools are available. If they are not, the SigNoz MCP server is not installed or configured — stop and direct the user to set it up: https://signoz.io/docs/ai/signoz-mcp-server/. The investigation depends on correlating multiple MCP queries; without the server there is no way to ground the analysis.
## When to use
Use this skill when the user wants to:
- Understand why a specific alert fired.
- Find the root cause of a recent incident triggered by an alert.
- Correlate the alert's signal with related metrics, traces, and logs.
- Distinguish "real signal" fires from flapping or threshold-mistuning.
Do NOT use when the user wants to:
- Understand what an alert is configured to monitor → `signoz-explaining-alerts`.
- Create a new alert → `signoz-creating-alerts`.
- Modify an alert (raise threshold, add hysteresis) → call `signoz:signoz_update_alert` directly.
- Run a free-form ad-hoc investigation without an alert as the anchor → `signoz-generating-queries` or `signoz-writing-clickhouse-queries`.
## Required inputs

| Input | Required | Source if missing |
|---|---|---|
| Alert identifier (rule ID or name) | yes | `$ARGUMENTS[0]` or recent context |
| Time window | no | default to most recent fire from `signoz:signoz_get_alert_history` |
If the alert name is fuzzy, resolution is best-effort (read-only):
- Call `signoz:signoz_list_alert_rules`, paginate, and fuzzy-match the name (see the sketch after this list).
- State the interpretation: "Investigating fire of 'High Error Rate — Checkout' (id 42) at 14:32 UTC. If you meant a different alert or fire, tell me."
- Proceed.
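A minimal sketch of that fuzzy match, assuming the rules have already been paginated into a flat list; `best_match` and the 0.4 cutoff are illustrative choices, not part of the SigNoz API:

```python
import difflib

def best_match(query: str, rules: list[dict]) -> dict | None:
    """rules: [{"id": ..., "name": ...}, ...] accumulated from
    signoz:signoz_list_alert_rules pages. Returns the closest rule or None."""
    by_lower = {r["name"].lower(): r for r in rules}
    hits = difflib.get_close_matches(query.lower(), list(by_lower), n=1, cutoff=0.4)
    return by_lower[hits[0]] if hits else None

rule = best_match("checkout error rate", [
    {"id": 42, "name": "High Error Rate — Checkout"},
    {"id": 88, "name": "High CPU — prod-api-3"},
])  # -> the id 42 rule
```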
If the alert has never fired in the lookback window, stop: there is nothing to investigate. Respond with:
"Alert '[name]' has not fired in the last 7d, so there is no fire window to investigate. Use `signoz-explaining-alerts` to walk through the rule, or check whether the alert is enabled."
## Workflow
The investigation runs in three tiers with strict early-stop gates. Tier 1 always runs. Tier 2 runs only if tier 1 confirms a real fire. Tier 3 runs only if tier 2 surfaces correlated anomalies. Skipping the gates produces hundreds of unnecessary trace/log queries on quiet alerts.
### Step 1: Resolve alert + fire window (Tier 0)

- Resolve the alert id via `signoz:signoz_list_alert_rules` (paginated) if not given.
- Call `signoz:signoz_get_alert` for the full rule config — needed to know what query, threshold, and resource scope the alert evaluated.
- Call `signoz:signoz_get_alert_history` with a 7d lookback. From the response:
  - Pick the fire window. Default to the most recent fire. If the user passed an explicit time window via `$ARGUMENTS[1]`, honor it.
  - Note the fire pattern (a classification sketch follows this step):
    - `one-off` → single fire with a long quiet period before/after.
    - `sustained` → fires that stayed firing for ≥ 1 evaluation cycle.
    - `flapping` → ≥ 3 fires within a 1h window, alternating fire/resolve.
    - `recurring` → fires at regular intervals (cron-like, e.g., every hour).
  - The pattern tells you what to expect from tiers 2/3.
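A minimal sketch of those pattern heuristics, assuming the history response has already been reduced to `(fire_start, fire_end)` epoch-second pairs; the real `signoz:signoz_get_alert_history` shape may differ, and the alternating fire/resolve check for flapping is simplified to a count:

```python
def classify_fire_pattern(fires: list[tuple[int, int]], eval_window_s: int = 60) -> str:
    """fires: (fire_start, fire_end) epoch seconds, oldest first."""
    if not fires:
        return "never-fired"
    starts = [start for start, _ in fires]
    # flapping: >= 3 fires inside any rolling 1h window
    if any(sum(1 for s in starts if 0 <= s - anchor <= 3600) >= 3 for anchor in starts):
        return "flapping"
    # recurring: gaps between fire starts are near-constant (cron-like)
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    if len(gaps) >= 2 and max(gaps) - min(gaps) <= 0.1 * max(gaps):
        return "recurring"
    # sustained: some fire stayed firing for at least one evaluation cycle
    if any(end - start >= eval_window_s for start, end in fires):
        return "sustained"
    return "one-off"
```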
### Step 2: Tier 1 — what fired and how hard

This tier always runs. It establishes that the fire is real (vs. a transient threshold tickle or flap) and quantifies the magnitude.

- Re-run the alert's primary query over a window centered on the fire start: `[fire_start - 30m, fire_start + 30m]`.
  - Use `signoz:signoz_execute_builder_query` for builder/formula alerts.
  - Use `signoz:signoz_query_metrics` for PromQL alerts.
- Compute (arithmetic sketched after this list):
  - Peak value during the fire window.
  - Threshold breach magnitude: `(peak - threshold) / threshold * 100` for "above" alerts, inverted for "below".
  - Fire duration: how long the breach lasted.
  - Pre-fire baseline: average value in the 30m before fire start.
- Early-stop gate: if the breach magnitude is < 10% over the threshold AND the fire duration is < 1 evaluation window, classify as "marginal fire" — the alert may be too sensitive. Skip tiers 2 and 3 and go to Step 5 with a single hypothesis: "threshold may be too tight, recommend tuning."
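A minimal sketch of that arithmetic and the gate, assuming the re-run query has been reduced to `(timestamp, value)` samples covering the fire window and the 30m before it; the function names and shapes are illustrative, not MCP response fields:

```python
def breach_stats(samples: list[tuple[int, float]], threshold: float,
                 fire_start: int, direction: str = "above") -> dict:
    fire = [v for t, v in samples if t >= fire_start]
    pre = [v for t, v in samples if t < fire_start]
    peak = max(fire) if direction == "above" else min(fire)
    sign = 1 if direction == "above" else -1      # invert for "below" alerts
    breach_pct = sign * (peak - threshold) / threshold * 100
    baseline = sum(pre) / len(pre) if pre else None
    return {"peak": peak, "breach_pct": breach_pct, "pre_fire_baseline": baseline}

def is_marginal(breach_pct: float, fire_duration_s: int, eval_window_s: int) -> bool:
    # Early-stop gate: < 10% over threshold AND shorter than one eval window
    return breach_pct < 10 and fire_duration_s < eval_window_s
```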
### Step 3: Tier 2 — neighbor signals vs baseline

Run only if Tier 1 confirms a real breach. Pull related signals for the same resource scope as the alert and compare the fire window to a baseline window.

- Pick a baseline window. Use the same hour on the previous day (`[fire_start - 24h, fire_start - 24h + fire_duration]`). If the alert fired during a known-anomalous time (deploy, weekly job), note it in the output but still proceed.
- Look up neighbor signals for the alert's resource type. See `references/neighbor-signals.md` for the lookup table. Common cases:
  - Service-level alert (`service.name = X`): pull error rate, p95/p99 latency, request throughput, dependency error rates if trace data is available.
  - Host / VM alert (`host.name = X`): CPU, memory, disk I/O, network I/O.
  - K8s pod / namespace alert: pod restarts, container CPU/memory limits, node pressure, recent rollouts.
- For each neighbor signal (delta ranking sketched after this list):
  - Query both windows (fire + baseline) via `signoz:signoz_execute_builder_query` or `signoz:signoz_query_metrics`.
  - Compute the delta (% change in fire window vs baseline).
  - Rank by absolute delta.
- Early-stop gate: if no neighbor signal shows ≥ 25% deviation from baseline, classify as "isolated fire — the alert's own signal moved but nothing else did." This is unusual and worth surfacing. Skip Tier 3 and go to Step 5 with hypotheses focused on the alert's own query (likely causes: data source change, instrumentation change, downstream silent failure that only shows in this metric).
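A minimal sketch of the delta ranking and its gate, assuming each neighbor signal has already been queried for both windows and reduced to one mean per window; the signal names are illustrative:

```python
def rank_neighbor_deltas(signals: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """signals: name -> (fire_window_mean, baseline_mean)."""
    deltas = {}
    for name, (fire, base) in signals.items():
        if base == 0:
            deltas[name] = float("inf") if fire else 0.0
        else:
            deltas[name] = (fire - base) / base * 100
    # rank by absolute delta, largest first
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranked = rank_neighbor_deltas({
    "p99_latency_ms": (4100.0, 320.0),   # +1181%
    "throughput_rps": (58.0, 100.0),     # -42%
    "cpu_pct": (71.0, 70.0),             # roughly flat
})
isolated_fire = all(abs(delta) < 25 for _, delta in ranked)  # Tier 3 gate
```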
### Step 4: Tier 3 — traces and logs at the fire window

Run only if Tier 2 found correlated neighbor anomalies. Drill into specific failing operations.

- Traces (if the alert is service-scoped and traces are available):
  - Call `signoz:signoz_search_traces` for the fire window with filter: `service.name = <scope>` AND `hasError = true`. Cap at top 20.
  - Group results by `name` (operation) and `error_message`. Surface the top 3 by frequency with a representative trace ID for each (grouping sketched after this step).
  - Optionally call `signoz:signoz_get_trace_details` on one representative to extract specific span attributes (DB statement, downstream URL, status code).
- Logs for the fire window:
  - Call `signoz:signoz_search_logs` with filter: `<scope_filter>` AND `severity_text IN ('ERROR', 'FATAL')`. Cap at top 20 most recent.
  - Group by `body` pattern (or `exception.type` if present). Surface the top 3 distinct messages with counts.
- Cross-reference: do the traces and logs point at the same downstream service, dependency, or code path? If so, that becomes the leading hypothesis.

See `references/baseline-comparison.md` for query templates that pair fire-window and baseline-window calls cleanly.
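A minimal sketch of the grouping, assuming error traces have been fetched and reduced to dicts with `name`, `error_message`, and `trace_id` keys (field names are illustrative; the same shape works for logs grouped by `body`):

```python
from collections import Counter

def top_error_groups(traces: list[dict], k: int = 3) -> list[dict]:
    counts = Counter((t["name"], t["error_message"]) for t in traces)
    example = {}                      # first trace ID seen per group
    for t in traces:
        example.setdefault((t["name"], t["error_message"]), t["trace_id"])
    return [
        {"operation": op, "error": err, "count": n, "example_trace": example[(op, err)]}
        for (op, err), n in counts.most_common(k)
    ]
```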
### Step 5: Build the structured output

Use this exact section order. Keep each section tight — this is a report, not an essay.

1. What fired. One paragraph: the alert (id, name), the fire window (absolute and relative time, e.g., "fired 2h ago at 14:32 UTC"), peak magnitude ("error rate hit 12.4% vs. 5% threshold — 148% over"), fire duration, and the fire pattern (one-off / sustained / flapping / recurring / marginal).
2. Likely causes (ranked, max 3). Each cause has three parts:
   - Hypothesis — one sentence, specific. Bad: "service is unhealthy". Good: "checkout is timing out on calls to payments-api".
   - Evidence — the supporting numbers from tiers 1/2/3, with the underlying query inline so the user can re-run it. State the neighbor signal, the delta vs baseline, and the trace/log pattern that supports it.
   - Confidence — `high` (multi-tier evidence converges), `medium` (one tier's evidence), `low` (only one signal moved).

   If only Tier 1 ran (marginal fire / no neighbor anomalies), output fewer hypotheses with low confidence and explicitly call out the limitation.
3. Suggested next steps. Action items the user can take. Be concrete:
   - Specific dashboard or trace to open (e.g., "open trace 7af3... in the SigNoz UI").
   - Specific query to run with `signoz-generating-queries` or `signoz-writing-clickhouse-queries`.
   - "Tune this alert" if the fire was marginal (link to `signoz:signoz_update_alert`).
   - "Open an incident" or "page the owning team" if the cause is cross-service.
## Out of scope (v1)

- Deployment / config-change correlation — SigNoz MCP does not expose a deployments tool; do not fabricate one. If the user mentions a recent deploy, surface it as context but don't claim it caused the fire without the signal evidence.
- Cross-service blast-radius walking — investigating downstream callers of the alert's service. Out of scope to keep context bounded.
- Long-horizon historical baselines — Tier 2 compares to one prior-day window, not to weekly/monthly seasonality. If the user asks "is this normal for a Friday afternoon", suggest an anomaly alert (`signoz-creating-alerts` with `anomaly_rule`).
## Guardrails

- Three-tier early-stop is mandatory. Skipping the gates pulls hundreds of traces/logs on quiet alerts and explodes context. The gates are not optional optimizations.
- Anchor every claim to an MCP query result. No speculation. If evidence is missing, lower confidence and say so.
- Show the supporting query with each hypothesis so the user can reproduce and dig deeper.
- Prefer resource-attribute filters in every drill-down query. This is the SigNoz MCP guideline and it directly affects query speed at scale.
- Do not modify any alert. Investigate is read-only. If the user says "and tighten this alert", surface that as a next-step recommendation; do not call `signoz:signoz_update_alert`.
- Stay in scope. Static rule explanation belongs to `signoz-explaining-alerts`. Cause analysis without an alert anchor belongs to `signoz-generating-queries`.
- Time zones. Always state fire windows in UTC alongside relative time ("14:32 UTC, 2h ago") so autonomous and interactive consumers agree on the window (a formatting sketch follows this list).
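A minimal sketch of that dual time format, using only the standard library; the helper name is illustrative:

```python
from datetime import datetime, timezone

def fire_window_label(fire_start: datetime, now: datetime | None = None) -> str:
    now = now or datetime.now(timezone.utc)
    hours_ago = round((now - fire_start).total_seconds() / 3600)
    return f"{fire_start:%H:%M} UTC, {hours_ago}h ago"

print(fire_window_label(datetime(2025, 1, 3, 14, 32, tzinfo=timezone.utc),
                        now=datetime(2025, 1, 3, 16, 40, tzinfo=timezone.utc)))
# -> "14:32 UTC, 2h ago"
```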
## Examples

User: "Why did the checkout error rate alert fire?"

Agent:
- Resolves alert: "High Error Rate — Checkout" (id 42). `signoz:signoz_get_alert_history` → most recent fire 2h ago at 14:32 UTC, sustained for 8m, single fire (not flapping).
- Tier 1: re-runs the error-rate formula over `[14:02, 15:02]`. Peak error rate 12.4% (vs 5% threshold — 148% over). Pre-fire baseline 0.3%. Real fire, not marginal.
- Tier 2: pulls neighbor signals for `service.name = checkout`:
  - p99 latency: 4.1s vs 320ms baseline (+1180%).
  - Throughput: -42% (drop).
  - Downstream `payments` error rate: 18% vs 0.2% baseline (+8900%).
  - CPU/memory: flat (no resource pressure).
- Tier 3: traces for `service.name = checkout, hasError = true` in the fire window — top operation `POST /checkout/submit`, top error message "context deadline exceeded calling payments-api". 30 traces, all hitting the same downstream URL. Logs show matching "payments client timeout" lines, 142 occurrences.
- Output:
  - What fired: alert 42 fired 2h ago at 14:32 UTC, sustained 8m. Error rate peaked at 12.4% (148% over threshold).
  - Likely causes (high confidence): payments service errors cascading into checkout. Evidence: payments error rate +8900%, checkout p99 latency +1180%, 142 client-timeout logs and 30 error traces all calling payments-api. Single coherent root cause.
  - Next steps: open trace 7af3...; check the payments service directly with `signoz-explaining-alerts` if a payments alert exists, otherwise `signoz-generating-queries` for payments error rate over the last 4h.
User: "Investigate the flapping CPU alert on prod-api-3"
Agent:
- Resolves alert (id 88, host.name = prod-api-3).
- History: 7 fires in last 1h, alternating fire/resolve every 8-12 minutes — flapping pattern.
- Tier 1: peak 84% (threshold 80%, only 5% over). Each fire
lasted 2-4 minutes. Marginal — the value hovered near threshold.
matchType = at_least_oncemade each blip trigger. - Early-stop kicks in. Skip tiers 2/3.
- Output:
- What fired: alert 88 has been flapping — 7 fires in last 1h, each 2-4m. Peak only 5% over threshold.
- Likely causes (medium): threshold tuned too tight or
matchTypeis too sensitive. Evidence: every fire was within 5% of threshold; baseline already runs at 70-75%. Not a real anomaly, just flapping near the boundary. - Next steps: change
matchTypetoon_average(smooths transient spikes) OR raise threshold to 85% with hysteresis (recoveryTarget: 75). Usesignoz:signoz_update_alertto apply.
User: "RCA for the log volume alert that fired last night"
Agent:
- Resolves alert: "Error Log Volume Spike" (id 14, no service
filter — groupBy
service.name). - History: fired at 03:12 UTC, sustained 22m, broke down by service
in the alert annotations:
service.name = inventorywas the firing series. - Tier 1: re-runs log count for inventory in fire window. Peak 3,400 errors/min vs 1,000/min threshold (240% over). Pre-fire baseline 12/min. Real, large fire.
- Tier 2: neighbor signals for
service.name = inventory:- Request error rate: +600%.
- p99 latency: +30% (mild).
- CPU: -80% (collapsed). Memory: -60%.
- Pod restarts (k8s): 4 in fire window.
- Tier 3: logs for inventory in fire window. Top message: "OOMKilled restarting" (1,200 occurrences). Top trace error: graceful-shutdown exceptions.
- Output:
- What fired: alert 14 fired at 03:12 UTC for service inventory, sustained 22m, 240% over threshold.
- Likely causes (high): inventory pods OOM-killed and restarted 4 times during the window. Evidence: 1,200 OOM log lines, 4 pod restarts, CPU/memory dropped to zero between restarts, error logs spiked from restart noise rather than a true error rate change in the application.
- Next steps: check container memory limits for inventory pods; review recent deploys; consider whether the alert should exclude restart-related error patterns or whether the underlying OOM is the real concern.
## Additional resources

- `references/neighbor-signals.md` — lookup table mapping resource type (service / host / k8s) to the neighbor signals to pull in Tier 2.
- `references/baseline-comparison.md` — query templates that pair fire-window and baseline-window calls cleanly, including how to format `signoz:signoz_execute_builder_query` for both.
- `signoz-explaining-alerts` skill — to decode the rule before investigating, if the user is unfamiliar with what the alert monitors.
- `signoz-writing-clickhouse-queries` skill — for drill-down queries that need raw ClickHouse SQL.
- `signoz-generating-queries` skill — for ad-hoc follow-up queries on the same resource scope.
references/neighbor-signals.md— lookup table mapping resource type (service / host / k8s) to the neighbor signals to pull in Tier 2.references/baseline-comparison.md— query templates that pair fire-window and baseline-window calls cleanly, including how to formatsignoz:signoz_execute_builder_queryfor both.signoz-explaining-alertsskill — to decode the rule before investigating, if the user is unfamiliar with what the alert monitors.signoz-writing-clickhouse-queriesskill — for drill-down queries that need raw ClickHouse SQL.signoz-generating-queriesskill — for ad-hoc follow-up queries on the same resource scope.