k8s-troubleshooter
Kubernetes Troubleshooter & Incident Response
Systematic approach to diagnosing and resolving Kubernetes issues in production environments.
Core Troubleshooting Workflow
Follow this systematic approach for any Kubernetes issue:
1. Gather Context
- What is the observed symptom?
- When did it start?
- What changed recently (deployments, config, infrastructure)?
- What is the scope (single pod, service, node, cluster)?
- What is the business impact (severity level)?
2. Initial Triage
Run a cluster health check:
# Check node status and health
kubectl get nodes
# Find non-running pods across all namespaces
kubectl get pods -A --field-selector status.phase!=Running
# Check node resource usage
kubectl top nodes
This provides an overview of:
- Node health status
- Pods outside the Running phase (Pending, Failed, Succeeded, Unknown) across all namespaces
- Node resource utilization
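The phase filter above does not catch crash-looping pods, because their phase remains Running while a container sits in a waiting state. A minimal sketch of scanning the JSON output for both cases, assuming kubectl is on PATH; `find_unhealthy_pods`, `fetch_all_pods`, and the list of waiting reasons are illustrative names and choices, not part of this skill's scripts:

```python
import json
import subprocess

# Waiting reasons that usually indicate a problem even when the pod phase is "Running".
# Illustrative, not exhaustive.
PROBLEM_WAITING_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerConfigError",
}


def find_unhealthy_pods(pod_list):
    """Return (namespace, name, reason) tuples for pods that look unhealthy.

    pod_list is the parsed output of `kubectl get pods -A -o json`.
    """
    unhealthy = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Succeeded pods (completed Jobs) are treated as healthy here.
        reason = phase if phase not in ("Running", "Succeeded") else None
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PROBLEM_WAITING_REASONS:
                reason = waiting["reason"]
                break
        if reason:
            unhealthy.append((meta.get("namespace"), meta.get("name"), reason))
    return unhealthy


def fetch_all_pods():
    """Run kubectl and parse its JSON output (requires cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

With cluster access, `find_unhealthy_pods(fetch_all_pods())` would list the pods worth a closer look in step 3.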
3. Deep Dive Investigation
Based on the triage results, focus the investigation:
For Namespace-Level Issues:
python3 scripts/check_namespace.py <namespace>
This provides comprehensive namespace health:
- Pod status (running, pending, failed, crashlooping)
- Service health and endpoints
- Deployment availability
- PVC status
- Resource quota usage
- Recent events
- Actionable recommendations
For Pod Issues:
# Get full pod details (status, events, conditions, resource config)
kubectl describe pod <pod-name> -n <namespace>
# Check current and previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Get events specific to the pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
This reveals:
- Pod phase and readiness
- Container statuses and states
- Restart counts and exit codes
- Recent events and scheduling decisions
- Resource requests and limits
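When a pod has several containers, the same details can be pulled from the pod's JSON instead of reading the describe output by eye. A minimal sketch, assuming the input is the parsed output of `kubectl get pod <pod-name> -n <namespace> -o json`; `summarize_containers` is a hypothetical helper name:

```python
def summarize_containers(pod):
    """Summarize state, restarts, and last exit code for each container of a pod."""
    rows = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        # Normally exactly one of running / waiting / terminated is set.
        state_name = next(iter(state), "unknown")
        last_term = cs.get("lastState", {}).get("terminated") or {}
        rows.append({
            "container": cs.get("name"),
            "ready": cs.get("ready", False),
            "state": state_name,
            "state_reason": (state.get(state_name) or {}).get("reason"),
            "restarts": cs.get("restartCount", 0),
            # Exit code and reason of the previous run, if the container restarted.
            "last_exit_code": last_term.get("exitCode"),
            "last_exit_reason": last_term.get("reason"),
        })
    return rows
```

A `last_exit_reason` of OOMKilled together with exit code 137 points at the memory limit rather than at an application bug.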
For additional investigation, check all events in the namespace, sorted by time:
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
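In a noisy namespace it can help to see which warning reasons dominate before reading individual events. A minimal sketch, assuming the input is the parsed output of `kubectl get events -n <namespace> -o json`; `count_warning_reasons` is a hypothetical helper name:

```python
from collections import Counter


def count_warning_reasons(event_list):
    """Return (reason, occurrences) pairs for Warning events, most frequent first."""
    counts = Counter()
    for ev in event_list.get("items", []):
        if ev.get("type") == "Warning":
            # `count` aggregates repeats of the same event; it may be absent or null.
            counts[ev.get("reason", "Unknown")] += ev.get("count") or 1
    return counts.most_common()
```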
4. Identify Root Cause
Consult references/common_issues.md for detailed information on:
- ImagePullBackOff / ErrImagePull
- CrashLoopBackOff
- Pending Pods
- OOMKilled
- Node issues (NotReady, DiskPressure)
- Networking failures
- Storage/PVC issues
- Resource quotas and throttling
- RBAC permission errors
Each issue includes:
- Symptoms
- Common causes
- Diagnostic commands
- Remediation steps
- Prevention strategies
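The reasons collected during investigation can be mapped onto the topics listed above to decide which part of the reference to read first. The table below is an illustrative, incomplete mapping; `ISSUE_BY_REASON` and `suggest_issue_topics` are hypothetical names, and whether the topic strings match the headings in references/common_issues.md is an assumption:

```python
# Maps container, condition, or event reasons to the topics listed in this section.
ISSUE_BY_REASON = {
    "ImagePullBackOff": "ImagePullBackOff / ErrImagePull",
    "ErrImagePull": "ImagePullBackOff / ErrImagePull",
    "CrashLoopBackOff": "CrashLoopBackOff",
    "OOMKilled": "OOMKilled",
    "Unschedulable": "Pending Pods",
    "FailedScheduling": "Pending Pods",
    "NodeNotReady": "Node issues (NotReady, DiskPressure)",
    "FailedMount": "Storage/PVC issues",
    "FailedAttachVolume": "Storage/PVC issues",
    "Forbidden": "RBAC permission errors",
}


def suggest_issue_topics(reasons):
    """Return the matching topics in first-seen order, without duplicates."""
    topics = []
    for reason in reasons:
        topic = ISSUE_BY_REASON.get(reason)
        if topic and topic not in topics:
            topics.append(topic)
    return topics
```

Reasons that are not in the table are ignored, so an empty result means the symptom needs manual classification.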
5. Apply Remediation
Follow the remediation steps in references/common_issues.md for the root cause you identified.
Always:
- Test fixes in non-production first if possible
- Document actions taken
- Monitor for effectiveness
- Have a rollback plan ready
6. Verify & Monitor
After applying the fix:
- Verify that the issue is resolved
- Monitor for recurrence (15-30 minutes minimum)
- Check related systems
- Update documentation
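One way to check for recurrence during the monitoring window is to compare container restart counts between two snapshots taken some minutes apart. A minimal sketch, assuming each snapshot is the parsed output of `kubectl get pods -n <namespace> -o json`; `restart_counts` and `new_restarts` are hypothetical helper names:

```python
def restart_counts(pod_list):
    """Map (namespace, pod, container) to its current restart count."""
    counts = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            key = (meta.get("namespace"), meta.get("name"), cs.get("name"))
            counts[key] = cs.get("restartCount", 0)
    return counts


def new_restarts(before, after):
    """Return containers whose restart count grew between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

An empty result after the monitoring window suggests the fix is holding. Pods that were recreated under a new name appear as new keys and are reported only if they have already restarted.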
Incident Response
For production incidents, follow the structured response in references/incident_response.md:
Severity Assessment:
- SEV-1 (Critical): Complete outage, data loss, security breach
- SEV-2 (High): Major degradation, significant user impact
- SEV-3 (Medium): Minor impairment, workaround available
- SEV-4 (Low): Cosmetic, minimal impact
Incident Phases:
- Detection - Identify and assess
- Triage - Determine scope and impact
- Investigation - Find root cause
- Resolution - Apply fix
- Post-Incident - Document and improve
Common Incident Scenarios:
- Complete cluster outage
- Service degradation
- Node failure
- Storage issues
- Security incidents
See references/incident_response.md for detailed playbooks.
Quick Reference Commands
Cluster Overview
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
Pod Diagnostics
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl get pod <pod> -n <namespace> -o yaml
Node Diagnostics
kubectl describe node <node>
kubectl top nodes
kubectl top pods --all-namespaces
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
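Node conditions can also be checked across the whole cluster at once. A minimal sketch, assuming the input is the parsed output of `kubectl get nodes -o json`; `node_problems` is a hypothetical helper name:

```python
def node_problems(node_list):
    """Return (node, condition, status) for every condition in an unhealthy state.

    Ready is healthy when "True"; pressure-style conditions such as DiskPressure
    or MemoryPressure are healthy when "False".
    """
    problems = []
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name")
        for cond in node.get("status", {}).get("conditions", []):
            ctype, cstatus = cond.get("type"), cond.get("status")
            if ctype == "Ready":
                bad = cstatus != "True"
            else:
                bad = cstatus == "True"
            if bad:
                problems.append((name, ctype, cstatus))
    return problems
```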
Service & Network
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>
kubectl get networkpolicies --all-namespaces
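A Service with no ready endpoints usually means its selector matches no pods or the matching pods are failing readiness checks. A minimal sketch for spotting such Services, assuming the input is the parsed output of `kubectl get endpoints -n <namespace> -o json`; `services_without_ready_endpoints` is a hypothetical helper name:

```python
def services_without_ready_endpoints(endpoints_list):
    """Return names of Endpoints objects that have no ready addresses."""
    empty = []
    for ep in endpoints_list.get("items", []):
        # Only `addresses` counts as ready; `notReadyAddresses` is ignored on purpose.
        ready = sum(len(s.get("addresses") or []) for s in ep.get("subsets") or [])
        if ready == 0:
            empty.append(ep.get("metadata", {}).get("name"))
    return empty
```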
Storage
kubectl get pvc,pv --all-namespaces
kubectl describe pvc <pvc> -n <namespace>
kubectl get storageclass
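To narrow a long PVC listing down to the claims that need attention, the phase can be filtered from the JSON output. A minimal sketch, assuming the input is the parsed output of `kubectl get pvc --all-namespaces -o json`; `unbound_pvcs` is a hypothetical helper name:

```python
def unbound_pvcs(pvc_list):
    """Return (namespace, name, phase) for claims that are not Bound."""
    result = []
    for pvc in pvc_list.get("items", []):
        phase = pvc.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = pvc.get("metadata", {})
            result.append((meta.get("namespace"), meta.get("name"), phase))
    return result
```

Each claim returned here is a candidate for `kubectl describe pvc`, whose events normally explain why binding has not happened.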
Resource & Configuration
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
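Quota pressure is easier to judge as a percentage than as raw used and hard values. A minimal sketch, assuming the input is a single ResourceQuota object parsed from `kubectl get resourcequota <name> -n <namespace> -o json`; `parse_quantity` and `quota_usage` are hypothetical helpers, and the parser covers only the common suffixes:

```python
from decimal import Decimal

# Binary and decimal suffixes used by Kubernetes quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}


def parse_quantity(quantity):
    """Parse strings such as "500m", "2", or "4Gi" into a Decimal."""
    text = str(quantity)
    if text.endswith("m"):  # milli-units, e.g. CPU millicores
        return Decimal(text[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def quota_usage(quota):
    """Return {resource: percent_used} for one ResourceQuota object."""
    status = quota.get("status", {})
    hard, used = status.get("hard", {}), status.get("used", {})
    usage = {}
    for resource, limit in hard.items():
        limit_value = parse_quantity(limit)
        if limit_value == 0:
            continue  # a hard limit of 0 has no meaningful percentage
        used_value = parse_quantity(used.get(resource, "0"))
        usage[resource] = float(round(used_value / limit_value * 100, 1))
    return usage
```

Resources close to 100% are likely to block new pods from being created in the namespace.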
Diagnostic Scripts
check_namespace.py
Namespace-level health check and diagnostics:
- Pod health (running, pending, failed, crashlooping, image pull errors)
- Service health and endpoints
- Deployment availability status
- PersistentVolumeClaim status
- Resource quota usage and limits
- Recent namespace events
- Health status assessment
- Actionable recommendations
Usage:
# Human-readable output
python3 scripts/check_namespace.py <namespace>
# JSON output for automation
python3 scripts/check_namespace.py <namespace> --json
# Include more events
python3 scripts/check_namespace.py <namespace> --events 20
Best used when troubleshooting issues in a specific namespace or assessing overall namespace health.
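For automation, the script can be wrapped so that its JSON report is loaded into Python. A minimal sketch, assuming that `--json` and `--events` can be combined and that the report is written to stdout; `build_check_command` and `run_namespace_check` are hypothetical helpers, and no particular report schema is assumed:

```python
import json
import subprocess


def build_check_command(namespace, events=None, script="scripts/check_namespace.py"):
    """Build the argument list for a JSON-mode run of check_namespace.py."""
    cmd = ["python3", script, namespace, "--json"]
    if events is not None:
        cmd += ["--events", str(events)]
    return cmd


def run_namespace_check(namespace, events=None):
    """Run the script and return its parsed JSON report, or None if parsing fails."""
    proc = subprocess.run(
        build_check_command(namespace, events),
        capture_output=True, text=True,
    )
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
```

Inspect the keys of the returned object before relying on them, since the report layout is defined by the script itself.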
Cluster-Level Diagnostics (kubectl)
For cluster-wide health checks, use kubectl directly:
# Node health and status
kubectl get nodes
kubectl top nodes
# Find non-running pods across all namespaces
kubectl get pods -A --field-selector status.phase!=Running
# System pod health
kubectl get pods -n kube-system
Pod-Level Diagnostics (kubectl)
For detailed pod investigation, use kubectl directly:
# Full pod details (status, events, conditions, resource config)
kubectl describe pod <pod-name> -n <namespace>
# Current and previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Events specific to the pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
Reference Documentation
references/common_issues.md
Comprehensive guide to common Kubernetes issues with:
- Detailed symptom descriptions
- Root cause analysis
- Step-by-step diagnostic procedures
- Remediation instructions
- Prevention strategies
Covers:
- Pod issues (ImagePullBackOff, CrashLoopBackOff, Pending, OOMKilled)
- Node issues (NotReady, DiskPressure)
- Networking issues (pod-to-pod communication, service access)
- Storage issues (PVC pending, volume mount failures)
- Resource issues (quota exceeded, CPU throttling)
- Security issues (vulnerabilities, RBAC)
Read this when you identify a specific issue type but need detailed remediation steps.
references/incident_response.md
Structured incident response framework including:
- Incident response phases (Detection → Triage → Investigation → Resolution → Post-Incident)
- Severity level definitions
- Detailed playbooks for common incident scenarios
- Communication guidelines
- Post-incident review template
- Best practices for prevention, preparedness, response, and recovery
Read this when responding to production incidents or planning incident response procedures.
references/performance_troubleshooting.md
Comprehensive performance diagnosis and optimization guide covering:
- High Latency Issues - API response time, request latency troubleshooting
- CPU Performance - Throttling detection, profiling, optimization
- Memory Performance - OOM issues, leak detection, heap profiling
- Network Performance - Latency, packet loss, DNS resolution
- Storage I/O Performance - Disk performance testing, optimization
- Application-Level Metrics - Prometheus integration, distributed tracing
- Cluster-Wide Performance - Control plane, scheduler, resource utilization
Read this when:
- Investigating slow application response times
- Diagnosing CPU or memory performance issues
- Troubleshooting network latency or connectivity
- Optimizing storage I/O performance
- Setting up performance monitoring
references/helm_troubleshooting.md
Complete guide to Helm troubleshooting including:
- Release Issues - Stuck releases, missing resources, state problems
- Installation Failures - Chart conflicts, validation errors, template rendering
- Upgrade and Rollback - Failed upgrades, immutable field errors, rollback procedures
- Values and Configuration - Values not applied, parsing errors, secret handling
- Chart Dependencies - Dependency updates, version conflicts, subchart values
- Hooks and Lifecycle - Hook failures, cleanup issues
- Repository Issues - Chart access problems, version mismatches
Read this when:
- Working with Helm-deployed applications
- Troubleshooting chart installations or upgrades
- Debugging Helm release states
- Managing chart dependencies