# debug-prod

**Production Debugging.** Systematic production incident investigation. Mitigation comes before root cause — stop the bleeding first, then investigate why.

## When to Use
- Something is broken or degraded in production right now
- Users are reporting errors or unexpected behavior in a live environment
- An alert fired and you need to understand what's happening
- You want to do a structured post-incident investigation
## Iron Laws
- Mitigate before you investigate. Reducing user impact takes priority over finding root cause.
- Preserve evidence before changing anything. Capture logs, traces, and metrics snapshots before rolling back or restarting.
- Never guess — instrument and observe. Form a hypothesis, then get evidence. Never propose a fix based on intuition alone.
- Reproduce before fixing. If you cannot trigger the problem reliably, you cannot verify the fix.
- Protect the context window. All log output goes to a file. Never dump raw logs inline — pipe through `grep`/`tail` and show only the relevant lines.
- Clean before fixing. Run `git restore .` to remove all debug instrumentation before implementing the fix.
## Phase 1 — Triage and Mitigation (First 5 Minutes)
Assess blast radius (a quick sizing sketch follows this list):
- How many users/requests are affected? (error rate, not absolute count)
- Which services, endpoints, or features are impacted?
- Is the issue getting worse, stable, or recovering on its own?
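To put a rough number on impact quickly, here is a minimal sketch. The log path and format are assumptions (a plain access log where 5xx status codes mark failures); adapt it to whatever your stack actually writes.

```bash
# Hypothetical log path and format — adjust to your stack
LOG=/var/log/app/access.log
WINDOW=5000   # look at the most recent N requests

total=$(tail -n "$WINDOW" "$LOG" | wc -l)
errors=$(tail -n "$WINDOW" "$LOG" | grep -c ' 5[0-9][0-9] ')
awk -v e="$errors" -v t="$total" \
  'BEGIN { printf "error rate: %.2f%% (%d/%d)\n", (t ? e/t*100 : 0), e, t }'
```

Run it a couple of minutes apart — a rising rate changes the mitigation calculus.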
Immediate mitigation options — evaluate in order (a rollback sketch follows the list):
- Rollback recent deployment if the issue started after a deploy
- Feature flag to disable the broken feature
- Traffic reroute away from the affected service/instance
- Scale out if the issue is load-related
- Restart as a last resort (destroys evidence — capture logs first)
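What a rollback looks like is platform-specific. As one hedged example, assuming the affected service runs as a Kubernetes Deployment (the deployment name here is hypothetical):

```bash
# Assumes a Kubernetes Deployment — adjust to your deployment tooling
DEPLOY=checkout-api   # hypothetical service name

# Capture evidence first (next section), then roll back to the previous revision
kubectl rollout undo deployment/"$DEPLOY"

# Watch the rollback complete and confirm pods become Ready
kubectl rollout status deployment/"$DEPLOY"
```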
Preserve evidence before any mitigation:
```bash
# Save current error logs to file before rollback/restart
grep "ERROR\|FATAL\|Exception" <log-source> > /tmp/incident-evidence.txt
```
State your mitigation decision and rationale before proceeding to investigation.
## Phase 2 — Signal Correlation (Minutes 5–30)
Work through the observability funnel: Metrics → Traces → Logs
### Step 1: Metrics — What changed and when?
- Check error rate, latency (p50/p95/p99), and saturation dashboards
- Find the exact timestamp when the incident started
- Look for any correlated metric changes (memory, CPU, DB connections, cache hit rate)
- Cross-reference with deployment timestamps:
```bash
# Check recent deployments/commits relative to incident start time
git log --oneline --since="2 hours ago"
```
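If your metrics live in Prometheus, its HTTP API can pin down the exact minute the error rate moved. A sketch, where the host, metric name, and labels are all assumptions about your setup (and the date syntax is GNU date):

```bash
# Hypothetical Prometheus host and metric/label names — adjust to your setup
PROM=http://prometheus:9090
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

curl -sG "$PROM/api/v1/query_range" \
  --data-urlencode "query=$QUERY" \
  --data-urlencode "start=$(date -u -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode "step=60" > /tmp/error-rate.json

head -c 2000 /tmp/error-rate.json   # inspect only a slice, per the context-window rule
```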
### Step 2: Traces — Where is the failure?
- Pull distributed traces for failing requests from the incident window
- Find which service, operation, or span is erroring or slow
- Note the trace ID — use it to correlate logs across services
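Pulling traces is backend-specific. As one hedged example, if you run Jaeger, its query API can return recent traces for a service, which you can scan for error tags (the host, port, and service name below are assumptions; `jq` is required):

```bash
# Hypothetical Jaeger query endpoint and service name — adjust to your tracing backend
JAEGER=http://jaeger-query:16686
SERVICE=checkout-api

curl -s "$JAEGER/api/traces?service=$SERVICE&lookback=1h&limit=20" > /tmp/traces.json

# List trace IDs containing a span tagged error=true
jq -r '.data[] | select(.spans[].tags[]? | .key == "error" and .value == true) | .traceID' \
  /tmp/traces.json | sort -u
```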
### Step 3: Logs — What exactly happened?
- Use the trace ID to extract correlated logs:
```bash
grep "<trace-id>" <log-file> > /tmp/trace-logs.txt
cat /tmp/trace-logs.txt
```
- Find the exact error message, stack trace, and request payload
- Note the first occurrence timestamp — does it predate the alert?
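To answer the "does it predate the alert?" question, pull the first and last matching timestamps instead of scrolling. A small sketch over the trace-correlated log file, assuming the timestamp is the leading field on each line:

```bash
# First occurrence, last occurrence, and total count of the exact error
grep "<exact error message>" /tmp/trace-logs.txt | head -n 1
grep "<exact error message>" /tmp/trace-logs.txt | tail -n 1
grep -c "<exact error message>" /tmp/trace-logs.txt
```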
## Phase 3 — Root Cause Investigation (Minutes 30–120)

### Step 1: Form Hypotheses
Generate 3–5 ranked hypotheses based on the signal correlation evidence. Format each as:
```
H1 [most likely]: <specific, testable claim about root cause>
  Evidence for: <what supports this>
  Evidence against: <what would refute this>
H2: ...
H3: ...
```
### Step 2: Instrument at Layer Boundaries
For each active hypothesis, add targeted log instrumentation at the relevant layer boundary. Tag every debug log with the hypothesis number:
```
[DEBUG-PROD][H1-db-connection-pool] connections_active=47 connections_max=50
[DEBUG-PROD][H2-cache-miss] key="user:1234" result=miss latency_ms=450
```
All output goes to a file — never inline:
```bash
# Route instrumented output to file
<your-service-start-command> 2>&1 | tee /tmp/debug-prod.log &
grep '\[DEBUG-PROD\]\[H1' /tmp/debug-prod.log
```
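Once the instrumented run has produced data, a quick per-hypothesis tally shows which hypotheses are generating any signal at all; a small sketch over the same file:

```bash
# Count DEBUG-PROD lines per hypothesis tag, most active first
grep -o '\[DEBUG-PROD\]\[[^]]*\]' /tmp/debug-prod.log | sort | uniq -c | sort -rn
```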
Heisenbug signal: If adding instrumentation changes the behavior, you likely have a race or timing issue — pivot to timing analysis.
### Step 3: Trace Backward to Origin
Starting from the observed symptom, trace backward:
- What called the failing function/service?
- What was the input/state at each layer boundary?
- Keep asking "what triggered this?" until you reach the original cause
Stop at the original trigger, not just the first place the error is visible.
### Step 4: Confirm the Root Cause
Before moving to fix:
- State the confirmed root cause in one sentence
- List the evidence that proves it
- Eliminate the other hypotheses with counter-evidence
If 3+ hypotheses have been tested and failed → you likely have a systemic/architectural issue. Stop patching and escalate.
## Phase 4 — Fix
```bash
# Remove all debug instrumentation first
git restore .

# Implement the minimal fix targeting the root cause
# Run the reproduction case to confirm red → green
# Run full test suite for regressions
```
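What the red-to-green check looks like depends on your test runner. As a hedged illustration, assuming a hypothetical regression test was added for the reproduction case and the project uses pytest:

```bash
# Hypothetical test path and runner — substitute your project's commands
pytest tests/regression/test_incident_repro.py -x   # must fail before the fix, pass after
pytest                                              # full suite to catch regressions
```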
Commit message structure:
```
fix: <what was broken>

Root cause: <one sentence>
Evidence: <key signal that confirmed it>
Reproduction: <how to trigger it>
Fix: <what was changed and why>
```
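A filled-in example, with entirely hypothetical details, to show the intended level of specificity:

```
fix: checkout API returns 500s when the DB connection pool is exhausted

Root cause: pool ceiling of 50 connections is hit at normal peak load after the recent retry change.
Evidence: [DEBUG-PROD][H1-db-connection-pool] logs show connections_active pinned at 50 on every failing request.
Reproduction: replay 60 concurrent checkout requests against a pool capped at 50.
Fix: raise the pool ceiling and fail fast with a 503 instead of hanging until timeout.
```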
## Phase 5 — Recovery and Verification
- Gradually restore normal traffic while watching error rate and latency
- Confirm the alert resolves and metrics return to baseline
- Watch for 10+ minutes before declaring the incident resolved
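A simple polling loop keeps the error rate in front of you during the ramp-up. This reuses the hypothetical access-log check from Phase 1; swap in your own metric source:

```bash
# Poll the error rate every 30 seconds while restoring traffic (hypothetical log path/format)
while true; do
  printf '%s  ' "$(date -u +%H:%M:%S)"
  tail -n 2000 /var/log/app/access.log | \
    awk '{ t++ } / 5[0-9][0-9] / { e++ } END { printf "error rate: %.2f%% (%d/%d)\n", (t ? e/t*100 : 0), e, t }'
  sleep 30
done
```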
## Phase 6 — Findings Report

Generate a structured report at `.claude/incidents/<date>-<slug>.md`:
```markdown
# Incident: <title>

**Date:** <date>
**Severity:** SEV1 / SEV2 / SEV3
**Duration:** <start time> → <end time> (~N minutes)
**Status:** Resolved / Monitoring / Ongoing

## Summary
<2–3 sentences: what happened, who was affected, how it was resolved>

## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert fired / issue first reported |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Incident resolved |

## Root Cause
<Clear statement of root cause>

## Contributing Factors
- <Factor 1>
- <Factor 2>

## Evidence
- <Metric/trace/log that confirmed root cause>

## Fix
<What was changed and why>

## Action Items
| Item | Owner | Due |
|------|-------|-----|
| Add monitoring for X | | |
| Update runbook for Y | | |
| Fix underlying issue Z | | |
```
For SEV1/SEV2, a full postmortem (5 Whys, blameless analysis) should follow within 24–48 hours.
## Quick Reference
| Symptom Pattern | First Check |
|---|---|
| Error rate spike after deploy | git log recent commits, rollback candidate |
| Gradual latency increase | Database connections, N+1 queries, memory leak |
| Intermittent errors | Race condition, shared mutable state, async ordering |
| Single service down | Health checks, OOM, disk space, dependency timeouts |
| All services degraded | Shared dependency (DB, cache, message queue, DNS) |
| Only affecting some users | Feature flag state, A/B bucket, regional routing |