Incident Response and Remediation
Patterns for diagnosing and fixing production issues.
Healer Mode Workflow
- Investigate - Gather metrics, logs, and system state
- Diagnose - Identify root cause before fixing
- Fix - Implement minimal targeted fix
- Validate - Confirm metrics improve after deployment
- Document - Store learnings for future incidents
Tool Usage Priority
- Observability Tools - Query Prometheus, Loki, and Grafana for metrics and logs
- Kubernetes Tools - Check pod status, events, deployments
- ArgoCD Tools - Verify GitOps sync status
- Memory Search - Look for similar past incidents
- Code Fix - Implement minimal targeted fix
Observability Queries
Prometheus Metrics
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)
# Memory usage
container_memory_working_set_bytes{pod=~"app-.*"}
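The queries above can also be run from a script through Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The following is a minimal sketch, not a definitive client: the server address `PROMETHEUS_URL` and the helper names are assumptions made for illustration, and the response is a canned example shaped like an instant-vector result rather than real data.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Hypothetical address; replace with your own Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

# The error-rate query from this section, joined into one expression.
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the instant-query endpoint with the PromQL URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def scalar_from_response(body: str) -> Optional[float]:
    """Return the first sample value of an instant-vector response, or None."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None  # no series matched, e.g. no traffic in the window
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


# Canned response for illustration only (not real measurements).
sample = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"0.0125"]}]}}'
)

print(instant_query_url(PROMETHEUS_URL, ERROR_RATE_QUERY))
print(scalar_from_response(sample))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `scalar_from_response` gives a single number that can be compared before and after a fix.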
Loki Log Queries
# Error-level log lines (set the time range, e.g. the last hour, on the query itself)
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"
# Stack traces
{namespace="production"} |~ "panic|stack trace"
# Slow requests
{namespace="production"} | json | latency_ms > 1000
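LogQL expressions carry no time range of their own; the window is passed separately as `start` and `end` on the request. As a hedged sketch, the "last hour" window could be built for Loki's `/loki/api/v1/query_range` endpoint like this. `LOKI_URL`, the function name, and the default `limit` are assumptions for illustration.

```python
import time
from urllib.parse import urlencode

# Hypothetical address; replace with your own Loki endpoint.
LOKI_URL = "http://loki.example.internal:3100"

ERROR_LOGS_QUERY = (
    '{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"'
)

NANOS_PER_SECOND = 1_000_000_000


def query_range_url(base_url: str, logql: str, end_s: int,
                    lookback_s: int = 3600, limit: int = 100) -> str:
    """Build a query_range URL covering the lookback_s seconds before end_s."""
    params = {
        "query": logql,
        # Timestamps are sent as nanosecond Unix epochs.
        "start": (end_s - lookback_s) * NANOS_PER_SECOND,
        "end": end_s * NANOS_PER_SECOND,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Errors from the hour ending now.
print(query_range_url(LOKI_URL, ERROR_LOGS_QUERY, end_s=int(time.time())))
```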
Kubernetes Diagnostics
# Pod status and events
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'
# Logs
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous # Logs from the previous (restarted) container instance
# Resource usage
kubectl top pods -n production
kubectl top nodes
# Deployment status
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
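When the event list is long, it can help to tally warnings by reason before reading individual messages. The sketch below assumes the input is the output of `kubectl get events -n production -o json`; the function name and the sample payload are illustrative, not real cluster data.

```python
import json
from collections import Counter
from typing import List, Tuple


def warning_reasons(events_json: str) -> List[Tuple[str, int]]:
    """Count Warning events by reason, most frequent first."""
    counts: Counter = Counter()
    for event in json.loads(events_json).get("items", []):
        if event.get("type") != "Warning":
            continue  # skip Normal events such as Pulled or Started
        # "count" can be missing or null; treat that as a single occurrence.
        counts[event.get("reason", "Unknown")] += event.get("count") or 1
    return counts.most_common()


# Made-up payload shaped like `kubectl get events -o json` output.
sample_events = json.dumps({
    "items": [
        {"type": "Warning", "reason": "BackOff", "count": 5},
        {"type": "Normal", "reason": "Pulled", "count": 3},
        {"type": "Warning", "reason": "FailedScheduling", "count": None},
    ]
})

for reason, count in warning_reasons(sample_events):
    print(f"{reason}: {count}")
```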
ArgoCD Status
# Application status
argocd app get myapp
argocd app diff myapp
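# Preview a sync without applying it
argocd app sync myapp --dry-run
# Rollback (takes a history ID, listed by `argocd app history`)
argocd app history myapp
argocd app rollback myapp <history-id>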
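For scripted checks, the sync and health fields can be read from the JSON form of the application. This is a minimal sketch that assumes the input comes from `argocd app get myapp -o json` and reads `status.sync.status` and `status.health.status`; the function name and the `needs_attention` rule are assumptions for illustration.

```python
import json
from typing import Dict, Union


def summarize_app(app_json: str) -> Dict[str, Union[str, bool]]:
    """Extract sync and health status from an Argo CD Application document."""
    status = json.loads(app_json).get("status", {})
    sync = status.get("sync", {}).get("status", "Unknown")
    health = status.get("health", {}).get("status", "Unknown")
    return {
        "sync": sync,
        "health": health,
        # Illustrative rule: anything other than Synced + Healthy needs a look.
        "needs_attention": sync != "Synced" or health != "Healthy",
    }


# Made-up document shaped like an Application resource.
sample_app = json.dumps({
    "metadata": {"name": "myapp"},
    "status": {"sync": {"status": "OutOfSync"}, "health": {"status": "Degraded"}},
})

print(summarize_app(sample_app))
```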
Common Issues and Solutions
High Error Rate
- Check recent deployments
- Review error logs for patterns
- Check dependency health
- Verify configuration changes
High Latency
- Check database query performance
- Review external service latency
- Check resource constraints (CPU/memory)
- Look for lock contention
OOMKilled Pods
- Increase memory limits
- Check for memory leaks
- Review recent code changes
- Consider horizontal scaling
CrashLoopBackOff
- Check logs for startup errors
- Verify secrets and configs exist
- Check health check endpoints
- Review recent deployments
ImagePullBackOff
- Verify image exists in registry
- Check image pull secrets
- Verify image tag is correct
- Check registry connectivity
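The three pod failure modes above can be detected from container statuses and paired with their checklists. The sketch below assumes the input is the output of `kubectl get pods -n production -o json`; the `CHECKLISTS` table copies the steps listed above, and the function name and sample payload are illustrative. Related reasons such as `ErrImagePull` are not covered.

```python
import json
from typing import List, Tuple

# Checklists taken from the sections above, keyed by the reason Kubernetes reports.
CHECKLISTS = {
    "OOMKilled": [
        "Increase memory limits",
        "Check for memory leaks",
        "Review recent code changes",
        "Consider horizontal scaling",
    ],
    "CrashLoopBackOff": [
        "Check logs for startup errors",
        "Verify secrets and configs exist",
        "Check health check endpoints",
        "Review recent deployments",
    ],
    "ImagePullBackOff": [
        "Verify image exists in registry",
        "Check image pull secrets",
        "Verify image tag is correct",
        "Check registry connectivity",
    ],
}


def find_failures(pods_json: str) -> List[Tuple[str, str, str]]:
    """Return (pod, container, reason) for every known failure reason found."""
    failures = []
    for pod in json.loads(pods_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # Current waiting reason (CrashLoopBackOff, ImagePullBackOff, ...).
            waiting = cs.get("state", {}).get("waiting", {}).get("reason")
            # Why the previous instance ended (OOMKilled, Error, ...).
            last_end = cs.get("lastState", {}).get("terminated", {}).get("reason")
            for reason in (waiting, last_end):
                if reason in CHECKLISTS:
                    failures.append((pod_name, cs["name"], reason))
    return failures


# Made-up payload shaped like `kubectl get pods -o json` output.
sample_pods = json.dumps({"items": [
    {"metadata": {"name": "app-1"}, "status": {"containerStatuses": [{
        "name": "web",
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"reason": "OOMKilled"}},
    }]}},
    {"metadata": {"name": "app-2"}, "status": {"containerStatuses": [{
        "name": "web", "state": {"running": {}}, "lastState": {},
    }]}},
]})

for pod_name, container, reason in find_failures(sample_pods):
    print(f"{pod_name}/{container}: {reason}")
    for step in CHECKLISTS[reason]:
        print(f"  - {step}")
```

A container that is restarting because it runs out of memory shows up under both `CrashLoopBackOff` and `OOMKilled`, so both checklists are printed for it.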
Healing Guidelines
- Diagnose first - Understand the root cause before fixing
- Minimal changes - Fix only what's broken
- Document findings - Store learnings in memory for future incidents
- Validate fix - Confirm metrics improve after deployment
- Roll back if needed - Don't hesitate to roll back if the fix doesn't work
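The validate-or-roll-back decision can be made explicit by comparing the same metric before and after the deployment. The following is only a sketch: the function name, the three verdict labels, and the 50% improvement threshold are assumptions chosen for illustration, not values from this guide.

```python
def fix_verdict(before: float, after: float, min_improvement: float = 0.5) -> str:
    """Compare an error ratio (0..1) measured before and after a fix.

    Returns "rollback" if the metric did not improve, "validated" if it
    dropped by at least min_improvement (as a fraction of the baseline),
    and "keep-watching" otherwise.
    """
    if before <= 0:
        raise ValueError("baseline must be positive; there was nothing to fix")
    if after >= before:
        return "rollback"
    relative_drop = (before - after) / before
    return "validated" if relative_drop >= min_improvement else "keep-watching"


# Error ratio fell from 10% to 1%: a 90% relative drop.
print(fix_verdict(0.10, 0.01))
# Error ratio rose from 10% to 12%: the fix made things worse.
print(fix_verdict(0.10, 0.12))
```

The inputs would typically come from the error-rate query in the Prometheus section, sampled over comparable windows on either side of the deployment.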
Post-Incident
- Update metrics/alerts if needed
- Document root cause and fix
- Store learnings in memory for similar incidents
- Consider preventive measures
- Update runbooks if applicable
Repository
5dlabs/cto
First Seen
Jan 24, 2026