k8s-troubleshooter
Kubernetes Troubleshooter & Incident Response
Systematic approach to diagnosing and resolving Kubernetes issues in production environments.
When to Use This Skill
Use this skill when:
- Investigating pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, etc.)
- Responding to production incidents or outages
- Troubleshooting cluster health issues
- Diagnosing networking or service connectivity problems
- Investigating storage/volume issues
- Analyzing performance degradation
- Conducting post-incident analysis
Core Troubleshooting Workflow
Follow this systematic approach for any Kubernetes issue:
1. Gather Context
- What is the observed symptom?
- When did it start?
- What changed recently (deployments, config, infrastructure)?
- What is the scope (single pod, service, node, cluster)?
- What is the business impact (severity level)?
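The recent-changes and scope questions can often be answered straight from the API server; the commands below are a sketch (substitute your own resource names):
# Recent rollout revisions for a suspect workload
kubectl rollout history deployment/<deployment> -n <namespace>
# Cluster-wide events, newest last, to spot what changed and where
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30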
2. Initial Triage
Run the cluster health check:
python3 scripts/cluster_health.py
This provides an overview of:
- Node health status
- System pod health
- Pending pods
- Failed pods
- Crash loop pods
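If the script is not available in your environment, a rough manual equivalent (an approximation of the same checks, not the script's exact logic) is:
# Node and system pod health
kubectl get nodes
kubectl get pods -n kube-system
# Pending and failed pods across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
# Pods stuck in crash loops
kubectl get pods --all-namespaces | grep CrashLoopBackOff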
3. Deep Dive Investigation
Based on triage results, focus investigation:
For Namespace-Level Issues:
python3 scripts/check_namespace.py <namespace>
This provides a comprehensive view of namespace health:
- Pod status (running, pending, failed, crashlooping)
- Service health and endpoints
- Deployment availability
- PVC status
- Resource quota usage
- Recent events
- Actionable recommendations
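If the script is unavailable, the same information can be gathered by hand (a rough equivalent, not the script's exact checks):
kubectl get pods,deployments,services,endpoints,pvc -n <namespace>
kubectl describe resourcequota -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -10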
For Pod Issues:
python3 scripts/diagnose_pod.py <namespace> <pod-name>
This analyzes:
- Pod phase and readiness
- Container statuses and states
- Restart counts
- Recent events
- Resource usage
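Container states and restart counts can also be pulled directly with a jsonpath query (a sketch):
# Name, restart count, and current state for each container in the pod
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.state}{"\n"}{end}'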
For specific investigations:
- Review pod details: kubectl describe pod <pod> -n <namespace>
- Check logs: kubectl logs <pod> -n <namespace>
- Check previous logs if restarting: kubectl logs <pod> -n <namespace> --previous
- Check events: kubectl get events -n <namespace> --sort-by='.lastTimestamp'
4. Identify Root Cause
Consult references/common_issues.md for detailed information on:
- ImagePullBackOff / ErrImagePull
- CrashLoopBackOff
- Pending Pods
- OOMKilled
- Node issues (NotReady, DiskPressure)
- Networking failures
- Storage/PVC issues
- Resource quotas and throttling
- RBAC permission errors
Each issue includes:
- Symptoms
- Common causes
- Diagnostic commands
- Remediation steps
- Prevention strategies
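For example, a container's last termination reason and exit code (OOMKilled, a non-zero exit code, and so on) often point straight at the root cause; a quick way to check:
# Show why the container last terminated and with what exit code
kubectl describe pod <pod> -n <namespace> | grep -A5 'Last State'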
5. Apply Remediation
Follow the remediation steps in references/common_issues.md that match the identified root cause.
Always:
- Test fixes in non-production first if possible
- Document actions taken
- Monitor for effectiveness
- Have rollback plan ready
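For Deployment-managed workloads, the rollback plan can often be as simple as reverting to the previous revision (adjust for your own rollout tooling, such as Helm or GitOps):
# Revert to the previous Deployment revision and watch it converge
kubectl rollout undo deployment/<deployment> -n <namespace>
kubectl rollout status deployment/<deployment> -n <namespace>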
6. Verify & Monitor
After applying fix:
- Verify issue is resolved
- Monitor for recurrence (15-30 minutes minimum)
- Check related systems
- Update documentation
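A simple way to watch for recurrence during the monitoring window:
# Watch pod status and new events in the affected namespace
kubectl get pods -n <namespace> -w
kubectl get events -n <namespace> -w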
Incident Response
For production incidents, follow the structured response process in references/incident_response.md:
Severity Assessment:
- SEV-1 (Critical): Complete outage, data loss, security breach
- SEV-2 (High): Major degradation, significant user impact
- SEV-3 (Medium): Minor impairment, workaround available
- SEV-4 (Low): Cosmetic, minimal impact
Incident Phases:
- Detection - Identify and assess
- Triage - Determine scope and impact
- Investigation - Find root cause
- Resolution - Apply fix
- Post-Incident - Document and improve
Common Incident Scenarios:
- Complete cluster outage
- Service degradation
- Node failure
- Storage issues
- Security incidents
See references/incident_response.md for detailed playbooks.
Quick Reference Commands
Cluster Overview
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
Pod Diagnostics
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl get pod <pod> -n <namespace> -o yaml
Node Diagnostics
kubectl describe node <node>
kubectl top nodes
kubectl top pods --all-namespaces
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
Service & Network
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>
kubectl get networkpolicies --all-namespaces
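To test service DNS resolution from inside the cluster, a throwaway debug pod works well (a sketch; busybox is enough for DNS checks):
kubectl run dns-test -it --rm --restart=Never --image=busybox -- nslookup <service>.<namespace>.svc.cluster.local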
Storage
kubectl get pvc,pv --all-namespaces
kubectl describe pvc <pvc> -n <namespace>
kubectl get storageclass
Resource & Configuration
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
Diagnostic Scripts
cluster_health.py
Comprehensive cluster health check covering:
- Node status and health
- System pod status (kube-system, etc.)
- Pending pods across all namespaces
- Failed pods
- Pods in crash loops
Usage: python3 scripts/cluster_health.py
Best used as the first diagnostic step to get an overall snapshot of cluster health.
check_namespace.py
Namespace-level health check and diagnostics:
- Pod health (running, pending, failed, crashlooping, image pull errors)
- Service health and endpoints
- Deployment availability status
- PersistentVolumeClaim status
- Resource quota usage and limits
- Recent namespace events
- Health status assessment
- Actionable recommendations
Usage:
# Human-readable output
python3 scripts/check_namespace.py <namespace>
# JSON output for automation
python3 scripts/check_namespace.py <namespace> --json
# Include more events
python3 scripts/check_namespace.py <namespace> --events 20
Best used when troubleshooting issues in a specific namespace or assessing overall namespace health.
diagnose_pod.py
Detailed pod-level diagnostics:
- Pod phase and status
- Container states (waiting, running, terminated)
- Restart counts and patterns
- Resource configuration issues
- Recent events
- Actionable recommendations
Usage: python3 scripts/diagnose_pod.py <namespace> <pod-name>
Best used when investigating specific pod failures or behavior.
Reference Documentation
references/common_issues.md
Comprehensive guide to common Kubernetes issues with:
- Detailed symptom descriptions
- Root cause analysis
- Step-by-step diagnostic procedures
- Remediation instructions
- Prevention strategies
Covers:
- Pod issues (ImagePullBackOff, CrashLoopBackOff, Pending, OOMKilled)
- Node issues (NotReady, DiskPressure)
- Networking issues (pod-to-pod communication, service access)
- Storage issues (PVC pending, volume mount failures)
- Resource issues (quota exceeded, CPU throttling)
- Security issues (vulnerabilities, RBAC)
Read this when you have identified a specific issue type and need detailed remediation steps.
references/incident_response.md
Structured incident response framework including:
- Incident response phases (Detection → Triage → Investigation → Resolution → Post-Incident)
- Severity level definitions
- Detailed playbooks for common incident scenarios
- Communication guidelines
- Post-incident review template
- Best practices for prevention, preparedness, response, and recovery
Read this when responding to production incidents or planning incident response procedures.
references/performance_troubleshooting.md
Comprehensive performance diagnosis and optimization guide covering:
- High Latency Issues - API response time, request latency troubleshooting
- CPU Performance - Throttling detection, profiling, optimization
- Memory Performance - OOM issues, leak detection, heap profiling
- Network Performance - Latency, packet loss, DNS resolution
- Storage I/O Performance - Disk performance testing, optimization
- Application-Level Metrics - Prometheus integration, distributed tracing
- Cluster-Wide Performance - Control plane, scheduler, resource utilization
Read this when:
- Investigating slow application response times
- Diagnosing CPU or memory performance issues
- Troubleshooting network latency or connectivity
- Optimizing storage I/O performance
- Setting up performance monitoring
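A couple of quick starting points, assuming metrics-server is installed for kubectl top:
# Per-container CPU/memory usage to spot hot or near-limit containers
kubectl top pods -n <namespace> --containers
# Compare against the configured requests and limits
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'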
references/helm_troubleshooting.md
Complete guide to Helm troubleshooting including:
- Release Issues - Stuck releases, missing resources, state problems
- Installation Failures - Chart conflicts, validation errors, template rendering
- Upgrade and Rollback - Failed upgrades, immutable field errors, rollback procedures
- Values and Configuration - Values not applied, parsing errors, secret handling
- Chart Dependencies - Dependency updates, version conflicts, subchart values
- Hooks and Lifecycle - Hook failures, cleanup issues
- Repository Issues - Chart access problems, version mismatches
Read this when:
- Working with Helm-deployed applications
- Troubleshooting chart installations or upgrades
- Debugging Helm release states
- Managing chart dependencies
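Typical first checks for a misbehaving release (assuming Helm 3):
# Release state, revision history, and the values actually in effect
helm list -n <namespace> --all
helm status <release> -n <namespace>
helm history <release> -n <namespace>
helm get values <release> -n <namespace>
# Roll back to a known-good revision if needed
helm rollback <release> <revision> -n <namespace>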
Best Practices
Always:
- Start with a high-level health check before deep diving
- Document symptoms and findings as you investigate
- Check recent changes (deployments, config, infrastructure)
- Preserve logs and state before making destructive changes
- Test fixes in non-production when possible
- Monitor after applying fixes to verify resolution
Never:
- Make production changes without understanding impact
- Delete resources without confirming they're safe to remove
- Restart pods repeatedly without investigating root cause
- Apply fixes without documentation
- Skip post-incident review
Key Principles:
- Systematic over random troubleshooting
- Evidence-based diagnosis
- Fix root cause, not symptoms
- Learn and improve from each incident
- Prevention is better than reaction