# Kubernetes Blast Radius
Answers hypothetical and real node-failure questions with data. Categorizes every deployment touching the node as full-outage, degraded, or unaffected; totals the memory at risk; and checks whether the remaining cluster has capacity to reschedule the displaced pods.
## Prerequisites
| Signal | Required? | What you get without it |
|---|---|---|
| Kubernetes (kubeletstats) | Required | Tool does not apply — suggest the user instrument with the kubeletstats receiver (see the sketch below). |
| Elastic APM | Optional | Core node-impact analysis still works. The downstream_services section (user-facing services in affected namespaces) is omitted with a note. |
If the user is not running Kubernetes, this tool does not apply. But a Kubernetes-only customer (no APM) still gets the full pod-level impact assessment and rescheduling feasibility — the majority of the value.
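For reference, a minimal `kubeletstats` receiver block for the OpenTelemetry Collector might look like the following. It is written as JSON to keep this document to a single example language; the Collector is normally configured with the equivalent YAML. The interval and endpoint are common illustrative values, not something this skill requires:

```json
{
  "receivers": {
    "kubeletstats": {
      "collection_interval": "20s",
      "auth_type": "serviceAccount",
      "endpoint": "${env:K8S_NODE_NAME}:10250"
    }
  }
}
```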
## Tools
| Tool | Purpose |
|---|---|
| `k8s-blast-radius` | Run the impact assessment for a specific node. |
| `apm-health-summary` | Before: check which services are already degraded. |
| `apm-service-dependencies` | After: map downstream ripple for affected services. |
| `ml-anomalies` | After: is unusual behavior already showing up on affected workloads? |
## How to call `k8s-blast-radius`
```json
{
  "node": "gke-prod-pool-1-abc123",
  "layout": "summary"
}
```
Parameter-filling guidance:
- `node`: must be exact. Matched literally against `kubernetes.node.name`. If the user describes a node ambiguously ("the noisy node", "the one running frontend"), ask them to confirm the exact node name before calling. Do not guess.
- `layout`: defaults to `summary` (compact, collapsible sections). Use `radial` when the user wants a visual "impact-by-proximity" diagram, as in the example call below.
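For instance, the radial diagram is requested by changing only the `layout` value (same illustrative node name as above):

```json
{
  "node": "gke-prod-pool-1-abc123",
  "layout": "radial"
}
```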
## After the tool returns
Response shape:
- `status`: `AT RISK` (full outage), `PARTIAL RISK` (degraded only), or `SAFE` (no impact).
- `data_coverage`: which backends contributed (always `kubernetes: true`; `apm: true|false`).
- `pods_at_risk`: count of pods on the node.
- `full_outage[]`: deployments losing all replicas — lead with these.
- `degraded[]`: deployments losing partial capacity.
- `unaffected` / `unaffected_count`: deployments not touching the node.
- `rescheduling`: memory required vs. memory available, and whether rescheduling is feasible.
- `downstream_services[]` (only if APM is present): user-facing services whose namespace is affected.
- `downstream_services_note` (only if APM is absent): explains the gap.
- `investigation_actions`: next-step prompts surfaced as click-to-send buttons in the view (includes a SPOF callout when a single-replica deployment is implicated).
- `render_instructions`: HTML render spec — let the inline MCP App view handle visualization (floating summary card, radial affected-deployment sweep, safe-zone arc, hover tooltips).
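A sketch of a possible response with illustrative values. The top-level field names follow the shape above; the nested structure of the `degraded` and `rescheduling` entries is an assumption, not a documented schema:

```json
{
  "status": "AT RISK",
  "data_coverage": { "kubernetes": true, "apm": true },
  "pods_at_risk": 11,
  "full_outage": ["checkout", "payments", "session-cache"],
  "degraded": [
    { "deployment": "frontend", "surviving_replicas": 2, "total_replicas": 3 }
  ],
  "unaffected_count": 24,
  "rescheduling": {
    "memory_required_gb": 12.4,
    "memory_available_gb": 9.8,
    "feasible": false
  },
  "downstream_services": ["web-storefront"]
}
```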
Narrate in this order:
- Headline status: "AT RISK — 3 deployments lose all replicas if gke-prod-pool-1-abc123 goes offline."
- Full outage list: name the deployments. These are the critical ones.
- Degraded list: name them, note surviving replica counts.
- Rescheduling feasibility: "Cluster has X GB available across N nodes to absorb Y GB required — safe / not safe / marginal."
- Downstream services (if APM present): name the services in affected namespaces that might be user-visible.
- Recommend action: for AT RISK + infeasible reschedule, "don't drain this node without scaling up." For PARTIAL RISK + feasible, "safe to drain with PodDisruptionBudgets in place."
## Key principles
- Hypothetical framing. Unless the node is actually down, always present results as "if X goes offline, then Y" — not as current reality.
- Rescheduling feasibility is a heuristic. It compares memory only — it does not account for CPU, storage, affinity rules, taints, or PodDisruptionBudgets. Note this caveat (see the sketch after this list).
- Full-outage >> degraded. A deployment with 1 replica on the node is a full outage; a deployment with 3 replicas losing 1 is degraded. Treat them very differently in recommendations.
- Downstream services matter. Even if a deployment is degraded, not down, user-facing services might see tail latency. Mention the downstream APM services.
- Don't conflate "at risk" with "broken." The status reflects potential impact. The node may be fine.
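To make the memory-only caveat above concrete: the feasibility check is a straight comparison of gigabytes required versus gigabytes available, nothing more. Two illustrative outcomes (field names are hypothetical):

```json
[
  {
    "memory_required_gb": 6.0,
    "memory_available_gb": 18.5,
    "feasible": true,
    "note": "headroom exists, but taints or affinity rules could still block placement"
  },
  {
    "memory_required_gb": 12.4,
    "memory_available_gb": 9.8,
    "feasible": false,
    "note": "scale up before draining"
  }
]
```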