# incident-response-smart-fix (SKILL.md)

Intelligent Issue Resolution with Multi-Agent Orchestration
## Purpose

A systematic four-phase debugging and resolution pipeline that combines AI-assisted debugging tools with observability platforms to diagnose and resolve production issues.
## When to Use
- Investigating production incidents or outages
- Debugging complex multi-service failures
- Performing root cause analysis on recurring issues
- Resolving regressions after deployments
## When NOT to Use
- Simple bugs with obvious fixes
- Feature development without incidents
- Issues with no logs, traces, or reproduction steps
## Four-Phase Workflow
### Phase 1: Issue Analysis
Goal: Understand the full context of the failure.
- Collect error traces, logs, and reproduction steps
- Identify affected services and upstream/downstream impacts
- Check recent deployments, config changes, or dependency updates
- Establish timeline: when did it start? Is it intermittent?
Tools: Sentry, DataDog, OpenTelemetry, CloudWatch, structured logs
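The timeline step above can be sketched as a small script that buckets structured error logs per minute and finds the onset of a spike. The log shape (JSON lines with `level` and `timestamp` fields) and the 3x-median threshold are assumptions, not part of this skill:

```python
import json
from collections import Counter
from datetime import datetime

def error_onset(log_lines, level="ERROR"):
    """Bucket structured JSON log lines per minute and return the first
    minute whose error count spikes above the baseline (3x the median)."""
    buckets = Counter()
    for line in log_lines:
        event = json.loads(line)
        if event.get("level") != level:
            continue
        ts = datetime.fromisoformat(event["timestamp"])
        buckets[ts.replace(second=0, microsecond=0)] += 1
    if not buckets:
        return None
    counts = sorted(buckets.values())
    median = counts[len(counts) // 2]
    # Onset = first minute at or above 3x the median per-minute rate.
    for minute in sorted(buckets):
        if buckets[minute] >= 3 * max(median, 1):
            return minute
    return None
```

In practice the same query is usually run inside the observability platform; the point is to pin the onset to a minute you can correlate against the deploy and config-change timeline.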
### Phase 2: Root Cause Investigation
Goal: Isolate the exact failure mechanism.
- Deep code analysis around the failure point
- Run `git bisect` to identify the commit that introduced the failure
- Check dependency compatibility (version conflicts, breaking changes)
- Inspect state: database, cache, queue, external API responses
- Reproduce locally with minimal test case
Techniques:
- Distributed tracing to follow request flow across services
- Binary search through recent commits
- State inspection at each service boundary
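The binary search through commits can be sketched independently of git; `is_bad` stands in for whatever check you would hand to `git bisect run` (a test suite or repro script) and is an assumption here:

```python
def first_bad_commit(commits, is_bad):
    """Binary search an ordered commit list (oldest -> newest) for the
    first commit where is_bad(commit) is True. Mirrors what `git bisect`
    does, assuming a single clean good -> bad transition."""
    lo, hi = 0, len(commits) - 1
    # Sanity check: the oldest commit must be good, the newest bad.
    assert not is_bad(commits[lo]) and is_bad(commits[hi])
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

With real git you would run `git bisect start`, mark endpoints with `git bisect bad HEAD` and `git bisect good <sha>`, then let `git bisect run <test-cmd>` drive the same search automatically.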
### Phase 3: Fix Implementation
Goal: Implement minimal, safe fix with test coverage.
- Write failing test that reproduces the bug
- Implement minimal fix (smallest change that resolves the issue)
- Add unit + integration tests for the fix
- Add edge case tests for related scenarios
- Follow production-safe practices (feature flags, gradual rollout)
Principle: Understand root cause before fixing symptoms.
### Phase 4: Verification
Goal: Confirm fix resolves the issue without regressions.
- Run the full regression suite
- Run performance benchmarks (ensure no degradation)
- Run a security scan (if relevant)
- Deploy to staging, verify with production-like traffic
- Monitor for 24-48h after production deploy
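The post-deploy monitoring step can be reduced to a guardrail check comparing canary and baseline error rates. The 1.5x ratio and the 100-request minimum are illustrative thresholds, not prescribed values:

```python
def regression_detected(baseline_errors, baseline_requests,
                        canary_errors, canary_requests,
                        max_ratio=1.5, min_requests=100):
    """Flag the deploy if the canary error rate exceeds the baseline
    error rate by more than max_ratio, once enough traffic is seen."""
    if canary_requests < min_requests:
        return False  # not enough data yet; keep monitoring
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    return canary_rate > baseline_rate * max_ratio
```

Wired into a gradual rollout, a `True` result would halt promotion and trigger rollback rather than page a human first.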
## Post-Incident
- Write blameless postmortem documenting timeline, root cause, fix
- Add monitoring/alerting for the failure mode
- Implement preventive measures (type checks, validation, static analysis)
- Update runbooks with new failure pattern
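The "add monitoring/alerting for the failure mode" step often lands as an alerting rule. A Prometheus-style sketch is shown below; the metric name, `job` label, and 1% threshold are assumptions about a hypothetical service:

```yaml
groups:
  - name: checkout-errors
    rules:
      - alert: CheckoutErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="checkout", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout 5xx rate above 1% for 10 minutes"
```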
## Success Metrics
- MTTR (Mean Time to Recovery) — reduced over time
- Recurrence rate — same issue should not repeat
- Blast radius — fix should not introduce new issues
- Detection time — improved monitoring catches issues earlier
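MTTR, the first metric above, is just the mean of (resolved - detected) across incidents; the timestamps below are illustrative:

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean Time to Recovery: average of resolved - detected across
    incidents given as (detected, resolved) datetime pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)
```

Tracking this per quarter shows whether the workflow above is actually shortening recovery time.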
## Resources

See `resources/implementation-playbook.md` for detailed patterns and examples.
Repository: `sivag-lab/roth_mcp`