Incident Response and Remediation

Patterns for diagnosing and fixing production issues.

Healer Mode Workflow

1. Investigate - Gather metrics, logs, and system state
2. Diagnose - Identify root cause before fixing
3. Fix - Implement minimal targeted fix
4. Validate - Confirm metrics improve after deployment
5. Document - Store learnings for future incidents

Tool Usage Priority

1. Observability Tools - Query Prometheus, Loki, Grafana for metrics and logs
2. Kubernetes Tools - Check pod status, events, deployments
3. ArgoCD Tools - Verify GitOps sync status
4. Memory Search - Look for similar past incidents
5. Code Fix - Implement minimal targeted fix

Observability Queries

Prometheus Metrics

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))

# Latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)

# Memory usage
container_memory_working_set_bytes{pod=~"app-.*"}
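
The queries above can also be run from a terminal during an incident. The following is a minimal sketch, assuming Prometheus is reachable at the hypothetical PROM_URL shown and that the query returns a single series. The helper names prom_query, prom_scalar, and exceeds_threshold are illustrative and not part of any tool. The sed extraction is a shortcut for one-series results; a JSON parser such as jq would be sturdier.

```shell
# Assumed endpoint; replace with the address of your Prometheus server.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query through the Prometheus HTTP API and print the raw JSON.
prom_query() {
  curl -sS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Read a single-series instant-query response on stdin and print its sample value.
prom_scalar() {
  sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p'
}

# Exit 0 when value > threshold. awk handles the floating-point comparison.
exceeds_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example (the 5% threshold is a hypothetical choice):
#   rate=$(prom_query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' | prom_scalar)
#   exceeds_threshold "$rate" 0.05 && echo "error rate above 5%"
```

Keeping the comparison in a separate function lets the same check be reused for latency or any other scalar metric.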

Loki Log Queries

# Error-level log lines (set the time range, e.g. the last hour, in the query client)
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"

# Stack traces
{namespace="production"} |~ "panic|stack trace"

# Slow requests
{namespace="production"} | json | latency_ms > 1000

Kubernetes Diagnostics

# Pod status and events
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'

# Logs
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous  # Logs from the previous container instance (e.g. after a crash)

# Resource usage
kubectl top pods -n production
kubectl top nodes

# Deployment status
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
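
When many pods are listed, filtering the `kubectl get pods` output helps narrow attention to the ones worth describing. This is a rough sketch, assuming the default column layout (NAME, READY, STATUS, RESTARTS, AGE); unhealthy_pods is a hypothetical helper name.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print
# "name status restarts" for pods that are not Running, are not fully
# ready, or have restarted at least once. Completed pods are skipped.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")          # READY column looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2] || $4 + 0 > 0) {
        print $1, $3, $4 + 0
      }
    }'
}

# Example:
#   kubectl get pods -n production -l app=myapp --no-headers | unhealthy_pods
```

Each pod it prints is a candidate for `kubectl describe pod` and `kubectl logs --previous`.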

ArgoCD Status

# Application status
argocd app get myapp
argocd app diff myapp

# Preview a sync without applying changes
argocd app sync myapp --dry-run

# Rollback (the ID comes from `argocd app history myapp`)
argocd app rollback myapp <history-id>
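
For a quick check before deciding on a rollback, the sync and health fields can be pulled out of the `argocd app get` output. This is a sketch only, assuming the plain-text output contains "Sync Status:" and "Health Status:" lines; the layout may differ between ArgoCD versions, and the JSON output is a sturdier source. argocd_status is a hypothetical helper name.

```shell
# Read `argocd app get <app>` text output on stdin and print
# "<sync status> <health status>", e.g. "Synced Healthy".
argocd_status() {
  awk -F': *' '
    /^Sync Status:/   { split($2, s, " "); sync = s[1] }
    /^Health Status:/ { split($2, h, " "); health = h[1] }
    END { print sync, health }'
}

# Example:
#   argocd app get myapp | argocd_status
```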

Common Issues and Solutions

High Error Rate

Check recent deployments
Review error logs for patterns
Check dependency health
Verify configuration changes

High Latency

Check database query performance
Review external service latency
Check resource constraints (CPU/memory)
Look for lock contention

OOMKilled Pods

Check for memory leaks
Review recent code changes
Increase memory limits if the workload legitimately needs more
Consider horizontal scaling
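
Before raising a limit, it helps to know how close the container was running to it. This is a small sketch, assuming the working set comes from container_memory_working_set_bytes and the limit from the pod spec; to_bytes and mem_pct are hypothetical helpers, and to_bytes handles only whole-number quantities with Ki, Mi, or Gi suffixes.

```shell
# Convert a Kubernetes memory quantity (e.g. 512Mi, 1Gi, or plain bytes) to bytes.
to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

# Print memory usage as a rounded percentage of the limit.
mem_pct() {
  awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.0f\n", used / limit * 100 }'
}

# Example: confirm the kill reason, then compare usage with the limit.
#   kubectl get pod <pod-name> -n production \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
#   mem_pct 943718400 "$(to_bytes 1Gi)"
```

A working set that climbs steadily toward the limit after each restart points to a leak; a stable value just under the limit points to an undersized limit.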

CrashLoopBackOff

Check logs for startup errors
Verify secrets and configs exist
Check health check endpoints
Review recent deployments

ImagePullBackOff

Verify image exists in registry
Check image pull secrets
Verify image tag is correct
Check registry connectivity

Healing Guidelines

Diagnose first - Understand the root cause before fixing
Minimal changes - Fix only what's broken
Document findings - Store learnings in memory for future incidents
Validate fix - Confirm metrics improve after deployment
Roll back if needed - Don't hesitate to roll back if the fix doesn't work
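
The validate-or-roll-back decision above can be made explicit by comparing the error rate before and after the deployment. This is one possible sketch, not a prescribed policy: fix_verdict is a hypothetical helper, and the ceiling value is an assumption to be set per service.

```shell
# Print "keep" when the post-deploy error rate is lower than the pre-deploy
# rate and no higher than an absolute ceiling; otherwise print "rollback".
fix_verdict() {
  awk -v before="$1" -v after="$2" -v ceiling="$3" 'BEGIN {
    if (after + 0 < before + 0 && after + 0 <= ceiling + 0) {
      print "keep"
    } else {
      print "rollback"
    }
  }'
}

# Example (rates as fractions, hypothetical 2% ceiling):
#   fix_verdict 0.08 0.01 0.02
```

Requiring both conditions avoids keeping a fix that improves the rate slightly but still leaves the service above its acceptable level.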

Post-Incident

Update metrics/alerts if needed
Document root cause and fix
Store learnings in memory for similar incidents
Consider preventive measures
Update runbooks if applicable
