Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

When to Apply

Use this skill when:

User mentions: "incident", "outage", "emergency", "down", "not working"
Operations: emergency response, production issues, service degradation
Keywords: "urgent", "broken", "fix", "restore", "recover"

Priority Rules

Priority	Rule	Impact	Tools
1	Check control plane first	CRITICAL	`get_pods(namespace="kube-system")`
2	Assess node health	CRITICAL	`get_nodes`
3	Gather events before changes	HIGH	`get_events`
4	Document timeline	HIGH	Manual notes
5	Rollback if safe	MEDIUM	`rollback_deployment`

Quick Reference

Incident	First Tool	Next Steps
Pod failure	`get_pod_logs(previous=True)`	`describe_pod`, `get_events`
Node down	`describe_node`	Check kubelet logs
Service unreachable	`get_endpoints`	`get_network_policies`
Control plane	`get_pods(namespace="kube-system")`	Check API server logs

Incident Triage

Quick Health Check

get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)

Severity Assessment

Indicator	Severity	Action
Multiple nodes NotReady	Critical	Escalate immediately
kube-system pods failing	Critical	Treat as a control plane issue; follow Runbook: Control Plane Issues
Single pod CrashLoop	Medium	Debug pod
High latency	Medium	Check node and pod resource usage
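
The table above can be read as a small decision rule. The sketch below is one possible encoding, not part of the toolset: `assess_severity` and its three count parameters are hypothetical, and the counts are assumed to be tallied from `get_nodes` and `get_pods` output. High latency is left out because it has no simple count to key on.

```python
def assess_severity(not_ready_nodes: int,
                    failing_system_pods: int,
                    crashlooping_pods: int) -> str:
    """Map rough counts to the severity levels in the table (hypothetical helper)."""
    # Multiple NotReady nodes or any failing kube-system pod is critical.
    if not_ready_nodes > 1 or failing_system_pods > 0:
        return "critical"
    # A single NotReady node is not covered by the table; treating it as
    # medium is an assumption of this sketch.
    if not_ready_nodes == 1 or crashlooping_pods > 0:
        return "medium"
    return "none"
```

Checking the critical conditions first means a cluster-wide problem is never masked by a single noisy pod.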

Runbook: Pod Failures

CrashLoopBackOff

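get_pod_logs(name, namespace, previous=True)  # logs from the previous (crashed) container instance
describe_pod(name, namespace)  # last state, exit code, and restart count
get_events(namespace, field_selector="involvedObject.name=<pod>")  # events for this pod only
get_pod_metrics(name, namespace)  # compare memory usage against the limit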

Common Causes:

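OOMKilled → Container exceeded its memory limit; raise the limit or reduce memory use
Exit code 1 → Application error; check the logs
Exit code 137 → Killed by SIGKILL (128 + 9), often the OOM killer
Exit code 143 → Terminated by SIGTERM (128 + 15), usually a graceful shutdown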
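
Exit codes above 128 follow the usual shell convention of 128 plus the signal number, which is why 137 and 143 point at SIGKILL and SIGTERM. A minimal sketch of that decoding follows; `explain_exit_code` and `SIGNAL_NAMES` are hypothetical names introduced here, and the signal numbers are the Linux values.

```python
# Signal numbers as defined on Linux, where containers normally run.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Give a short reading of a container exit code (hypothetical helper)."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Convention: exit code = 128 + signal number.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    # Codes 1..128 are chosen by the application itself.
    return "application error, check the logs"
```

A SIGKILL reading alone does not prove an OOM kill; the container's last state reason in the `describe_pod` output is what confirms it.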

ImagePullBackOff

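describe_pod(name, namespace)  # look for the pull error in the pod's events
get_secrets(namespace)  # confirm the pod's imagePullSecrets exist in this namespace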

Pending Pod

describe_pod(name, namespace)
get_nodes()
get_events(namespace)

Runbook: Node Issues

Node NotReady

describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
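
When many nodes are involved, reading conditions by eye is slow. The sketch below assumes the node data is available as dictionaries shaped like the Kubernetes Node object (`status.conditions`, each with `type` and `status`); whether `get_nodes` or `describe_node` return that shape is an assumption, and both function names are hypothetical.

```python
def node_is_ready(node: dict) -> bool:
    """True only if the node reports a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    # No Ready condition reported at all: treat the node as not ready.
    return False


def active_problem_conditions(node: dict) -> list:
    """Names of non-Ready conditions currently "True", such as DiskPressure."""
    conditions = node.get("status", {}).get("conditions", [])
    return [c["type"] for c in conditions
            if c.get("type") not in (None, "Ready") and c.get("status") == "True"]
```

A Ready status of "Unknown" also counts as not ready here, which matches the case where the kubelet has stopped reporting.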

Node DiskPressure

describe_node(name)  # check conditions and allocatable ephemeral storage
get_pods(field_selector="spec.nodeName=<node>")  # pods scheduled on the affected node

Runbook: Network Issues

Service Not Accessible

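get_services(namespace)  # confirm the Service exists; note its selector and ports
get_endpoints(namespace)  # no addresses means no ready pod matches the selector
get_pods(namespace, label_selector="<service-selector>")  # are the backing pods running and ready?
get_network_policies(namespace)  # look for policies that block the traffic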
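
A Service with no endpoints is often a selector that does not match the pod labels. Service selectors are equality-based, so the comparison can be sketched in a few lines; `selector_matches` is a hypothetical helper, and the selector and labels are assumed to have been copied from the tool output into plain dictionaries.

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """Equality-based match: every selector key must be present with the same value."""
    # A Service without a selector gets no automatically managed endpoints,
    # so this sketch reports "no match" for an empty selector.
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

Extra labels on the pod do not matter; only the keys named in the selector are compared.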

DNS Resolution Failures

get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")

With Cilium

cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)

Pod Stuck in ContainerCreating

describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)

Runbook: Control Plane Issues

API Server Unavailable

get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")

etcd Issues

get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")

Emergency Actions

Force Delete Pod

delete_pod(name, namespace, grace_period=0, force=True)

Rollback Deployment

rollback_deployment(name, namespace, revision=0)  # revision=0 is assumed to mean "previous revision", as with kubectl rollout undo

Helm Rollback

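rollback_helm_release(name, namespace, revision=1)  # Helm revisions start at 1, so this targets the first release; confirm the target revision in the release history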

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Multi-Cluster Incident Response

Check all clusters:

for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
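
During an outage one of the clusters may be unreachable, and a plain loop stops at the first failed call. The sketch below keeps going and records the error instead. `sweep_clusters` is a hypothetical helper; it assumes each check is a callable that accepts a `context` keyword, as the tool calls above do.

```python
from typing import Any, Callable, Dict, Iterable


def sweep_clusters(contexts: Iterable[str],
                   checks: Dict[str, Callable[..., Any]]) -> Dict[str, Dict[str, Any]]:
    """Run every check against every context, recording errors instead of raising."""
    report: Dict[str, Dict[str, Any]] = {}
    for context in contexts:
        report[context] = {}
        for label, check in checks.items():
            try:
                report[context][label] = check(context=context)
            except Exception as exc:
                # Keep going: the remaining clusters still need to be checked.
                report[context][label] = f"error: {exc}"
    return report
```

A call might look like `sweep_clusters(["prod-1", "prod-2", "staging"], {"nodes": get_nodes})`, with `functools.partial` used to pin `namespace="kube-system"` on the pod and event checks.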

Post-Incident

Document Timeline

When did the incident start?
What was the impact?
What was the root cause?
What fixed it?
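
Answering these questions is easier when notes are timestamped as the incident unfolds. A minimal sketch of such a log follows; `IncidentTimeline` is a hypothetical class, not part of the toolset, and it assumes all timestamps are timezone-aware so that they sort together.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class IncidentTimeline:
    """Timestamped notes for one incident (hypothetical helper)."""
    title: str
    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def add(self, note: str, at: Optional[datetime] = None) -> None:
        # Default to the current UTC time; pass `at` to backfill an earlier event.
        when = at or datetime.now(timezone.utc)
        self.entries.append((when, note))

    def render(self) -> str:
        # Sort by timestamp so backfilled notes land in the right place.
        lines = [self.title]
        for when, note in sorted(self.entries, key=lambda entry: entry[0]):
            lines.append(f"{when.isoformat()}  {note}")
        return "\n".join(lines)
```

Calling `add` before each change, in line with the "Gather events before changes" rule, leaves a record that maps directly onto the four questions.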

Prevent Recurrence

Add monitoring/alerting
Improve resource limits
Add readiness probes
Document runbook

Related Skills

k8s-troubleshoot - Detailed debugging
k8s-security - Security incidents
