Kubernetes Blast Radius

Answers hypothetical and real node-failure questions with data. Categorizes every deployment touching the node as full-outage, degraded, or unaffected; totals the memory at risk; and checks whether the remaining cluster has the capacity to reschedule the displaced pods.

Prerequisites

  • Kubernetes (kubeletstats): Required. Without it the tool does not apply — suggest the user instrument with the kubeletstats receiver.
  • Elastic APM: Optional. Core node-impact analysis still works without it; the downstream_services section (user-facing services in affected namespaces) is omitted with a note.

If the user is not running Kubernetes, this tool does not apply. But a Kubernetes-only customer (no APM) still gets the full pod-level impact assessment and rescheduling feasibility — the majority of the value.
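
For instance, a Kubernetes-only response flags the gap in its coverage fields. A rough illustration (the field names match the response shape documented below; the note text is invented for illustration):

{
  "data_coverage": { "kubernetes": true, "apm": false },
  "downstream_services_note": "Elastic APM data not available; downstream service impact is omitted."
}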

Tools

  • k8s-blast-radius: Run the impact assessment for a specific node.
  • apm-health-summary: Before the assessment, check which services are already degraded.
  • apm-service-dependencies: After the assessment, map the downstream ripple for affected services.
  • ml-anomalies: After the assessment, check whether unusual behavior is already showing up on affected workloads.

How to call k8s-blast-radius

{
  "node": "gke-prod-pool-1-abc123",
  "layout": "summary"
}

Parameter-filling guidance:

  • node: must be exact. Matched literally against kubernetes.node.name. If the user describes a node ambiguously ("the noisy node", "the one running frontend"), ask them to confirm the exact node name before calling. Do not guess.
  • layout: defaults to summary (compact, collapsible sections). Use radial when the user wants a visual "impact-by-proximity" diagram; see the example below.
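
For example, a request for the radial view differs only in the layout value (reusing the node name from the call example above):

{
  "node": "gke-prod-pool-1-abc123",
  "layout": "radial"
}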

After the tool returns

Response shape (an abridged example follows this list):

  • status: AT RISK (full outage), PARTIAL RISK (degraded only), or SAFE (no impact).
  • data_coverage: which backends contributed (always kubernetes: true; apm: true|false).
  • pods_at_risk: count of pods on the node.
  • full_outage[]: deployments losing all replicas — lead with these.
  • degraded[]: deployments losing partial capacity.
  • unaffected / unaffected_count: deployments not touching the node.
  • rescheduling: memory required vs available, and whether it's feasible.
  • downstream_services[] (only if APM present): user-facing services whose namespace is affected.
  • downstream_services_note (only if APM absent): explains the gap.
  • investigation_actions: next-step prompts surfaced as click-to-send buttons in the view (includes a SPOF callout when a single-replica deployment is implicated).
  • render_instructions: HTML render spec — let the inline MCP App view handle visualization (floating summary card, radial affected-deployment sweep, safe-zone arc, hover tooltips).
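
To make the shape concrete, here is an abridged, illustrative response. The top-level keys follow the list above, but the element and sub-field names (replicas_total, memory_required_gb, and so on) and all values are placeholders, not a schema guarantee; render_instructions is omitted for brevity:

{
  "status": "AT RISK",
  "data_coverage": { "kubernetes": true, "apm": true },
  "pods_at_risk": 14,
  "full_outage": [
    { "deployment": "checkout", "replicas_on_node": 1, "replicas_total": 1 }
  ],
  "degraded": [
    { "deployment": "frontend", "replicas_on_node": 1, "replicas_total": 3 }
  ],
  "unaffected_count": 22,
  "rescheduling": {
    "memory_required_gb": 6.5,
    "memory_available_gb": 18.2,
    "feasible": true
  },
  "downstream_services": ["checkout-api"],
  "investigation_actions": [
    "SPOF: checkout has a single replica on this node. Scale it up before draining."
  ]
}

A single-replica deployment like checkout above is what triggers the SPOF callout in investigation_actions.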

Narrate in this order:

  1. Headline status: "AT RISK — 3 deployments lose all replicas if gke-prod-pool-1-abc123 goes offline."
  2. Full outage list: name the deployments. These are the critical ones.
  3. Degraded list: name them, note surviving replica counts.
  4. Rescheduling feasibility: "Cluster has X GB available across N nodes to absorb Y GB required — safe / not safe / marginal."
  5. Downstream services (if APM present): name the services in affected namespaces that might be user-visible.
  6. Recommend action: for AT RISK + infeasible reschedule, "don't drain this node without scaling up." For PARTIAL RISK + feasible, "safe to drain with PodDisruptionBudgets in place."

Key principles

  • Hypothetical framing. Unless the node is actually down, always present results as "if X goes offline, then Y" — not as current reality.
  • Rescheduling feasibility is a heuristic. It compares memory only — it does not account for CPU, storage, affinity rules, taints, or PodDisruptionBudgets. Note this caveat.
  • Full-outage >> degraded. A deployment with 1 replica on the node is a full outage; a deployment with 3 replicas losing 1 is degraded. Treat them very differently in recommendations.
  • Downstream services matter. Even if a deployment is degraded rather than down, user-facing services might see tail latency. Mention the downstream APM services.
  • Don't conflate "at risk" with "broken." The status reflects potential impact. The node may be fine.