# APM Health Summary

This is the first tool to reach for in vague-symptom investigations — "something feels off, where should I look?" It gives you a one-shot rollup: degraded services, top resource consumers, active anomalies, and a `data_coverage` report showing what backends contributed. From there, you pick the right follow-up tool.

## Prerequisites

| Signal | Required? | What happens without it |
| --- | --- | --- |
| Elastic APM | Required | Tool returns a warning and suggests `ml-anomalies` / `observe` / `manage-alerts` instead. |
| Kubernetes (kubeletstats) | Optional | Pods section is replaced by a note; service health still reported. |
| ML anomaly jobs | Optional | Anomalies section is replaced by a note; service health still reported. |

If the user is log-only or metrics-only (no APM), do not call this tool. Suggest `ml-anomalies` (for ML-backed anomaly detection) or `observe` / `manage-alerts` (both universal).

## Tools

| Tool | Purpose |
| --- | --- |
| `apm-health-summary` | The rollup. First call in most investigations. |
| `ml-anomalies` | Drill into anomalies flagged in the summary. |
| `apm-service-dependencies` | Map topology around any degraded service. |
| `k8s-blast-radius` | If the summary implicates a node (pod resource pressure), assess node impact. |
| `observe` | Post-investigation: watch for stabilization or follow-on anomalies. |

## How to call `apm-health-summary`

```json
{
  "namespace": "otel-demo",
  "lookback": "15m"
}
```
  • `namespace`: only if the user scopes to a K8s namespace. Omit for cross-namespace or non-K8s.
  • `lookback`: default `15m`. Use `5m` for "right now," `1h` for "since I noticed the issue."
  • `job_filter`: optional ML-job prefix, e.g. `k8s-`. Rarely needed.
  • `exclude_entities`: optional wildcard to hide known noise, e.g. `chaos-*`.
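
For illustration, a fuller call that combines the optional parameters (the values here are examples drawn from this doc, not requirements):

```json
{
  "namespace": "otel-demo",
  "lookback": "1h",
  "job_filter": "k8s-",
  "exclude_entities": "chaos-*"
}
```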

## After the tool returns

The tool renders an inline MCP App view — status badge, stat cards, anomaly-severity donut, top memory pods, service throughput list, and a next-step button row driven by `investigation_actions`. Use the view for the visual rollup; narrate findings below it.

Inspect `data_coverage` first — this tells you which signals contributed.

Then walk the output top-down:

  1. Overall health (healthy / degraded / critical): lead with this.
  2. Degraded services: name them with reasons (error rate, latency). These are the investigation targets.
  3. Pods (if present): top memory consumers — cross-reference with degraded services.
  4. Anomalies (if present): by-severity counts + top entities. Drives the ML follow-up.
  5. Next-step buttons: the view surfaces `investigation_actions` as clickable prompts (drill into the top pod, investigate the degraded service, check blast radius). Mention them in chat so the user knows they exist.
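
For orientation, a hedged sketch of a response matching that walkthrough. The field names are assumptions inferred from this doc, not a published schema — trust the live payload over this example:

```json
{
  "overall_health": "degraded",
  "data_coverage": { "apm": true, "kubernetes": false, "ml_anomalies": true },
  "degraded_services": [
    { "service": "checkout", "reason": "elevated error rate" }
  ],
  "anomalies": { "critical": 1, "major": 3 },
  "investigation_actions": [
    "Investigate degraded service checkout",
    "Drill into top memory pod"
  ]
}
```

In this sketch Kubernetes did not contribute, so per the prerequisites table the pods section would arrive as a note rather than data.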

Based on what you see, pick the next tool:

  • Degraded service named → `apm-service-dependencies` with `service: <name>` to map the neighborhood.
  • High anomaly count → `ml-anomalies` with matching lookback to drill in.
  • Pod resource pressure on a specific node → `k8s-blast-radius` with that node name.
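
Hedged sketches of those follow-up calls. The `service` and `lookback` parameters are named above; the `node` parameter for `k8s-blast-radius` is an assumption, so defer to each tool's own documentation. For a degraded service named `checkout` (hypothetical):

```json
{ "service": "checkout" }
```

For the anomaly drill-down, reuse the summary call's lookback:

```json
{ "lookback": "15m" }
```

For node impact, where `node` is an assumed parameter name and `worker-3` a hypothetical node:

```json
{ "node": "worker-3" }
```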

## Key principles

  • Start here, then narrow. Don't guess which service is the problem — let the rollup tell you.
  • Respect `data_coverage`. If K8s is absent, don't suggest `k8s-blast-radius`. If APM is absent, don't call this tool at all.
  • The overall health is coarse. "Healthy" doesn't mean nothing is wrong — it means nothing meets the degraded thresholds. Always scan the details.
  • Graceful degradation is by design. APM-only output is still useful — don't apologize for missing K8s or ML signals; just report what you have.