Incident Hotfix
Structured incident response, hotfix, and postmortem process.
Severity Levels
| Level | Impact | Response Time | Examples |
|---|---|---|---|
| P0 | System down | Immediate | Complete outage, data loss |
| P1 | Major feature broken | < 1 hour | Auth broken, payments failing |
| P2 | Feature degraded | < 4 hours | Slow performance, partial outage |
| P3 | Minor issue | < 24 hours | UI bug, non-critical error |
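The table above can be encoded as a small lookup so that tooling (for example a paging script) can derive a response deadline from a severity level. This is a minimal sketch: the function name `response_minutes` is hypothetical, "Immediate" is modeled as 0 minutes, and the other values are the upper bounds from the table.

```shell
# Map a severity level to its maximum response time in minutes.
# Values mirror the severity table; "Immediate" (P0) is modeled as 0.
response_minutes() {
  case "$1" in
    P0) echo 0 ;;
    P1) echo 60 ;;
    P2) echo 240 ;;
    P3) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Calling `response_minutes P1` prints `60`; an unrecognized level prints an error to stderr and returns a non-zero status.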
Incident Response
1. Assess
- Identify the symptom
- Determine severity level
- Check monitoring dashboards
- Review recent deployments
- Communicate status
2. Isolate
- Identify affected components
- Check error logs
- Review recent changes
- Determine blast radius
3. Mitigate
Choose one:
Rollback (safest)
- Revert to last known good state
- Apply when cause is unclear
Hotfix (targeted)
- Minimal change to fix issue
- Apply when cause is clear and fix is simple
Feature Flag (quick)
- Disable problematic feature
- Apply when feature is isolatable
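The three options above can be sketched as a decision helper. The conditions come from the bullets; the order in which they are checked (rollback when the cause is unclear, then feature flag, then hotfix, with rollback as the fallback) is an assumption, as is the function name `choose_mitigation`.

```shell
# choose_mitigation CAUSE_CLEAR FIX_SIMPLE FEATURE_ISOLATABLE
# Each argument is "yes" or "no". Prints the suggested mitigation.
choose_mitigation() {
  cause_clear=$1
  fix_simple=$2
  isolatable=$3
  if [ "$cause_clear" != "yes" ]; then
    # Cause unclear: revert to the last known good state.
    echo rollback
  elif [ "$isolatable" = "yes" ]; then
    # Assumed priority: disabling an isolatable feature is tried before a code change.
    echo feature-flag
  elif [ "$fix_simple" = "yes" ]; then
    echo hotfix
  else
    # Cause is clear but the fix is not simple: fall back to the safest option.
    echo rollback
  fi
}
```

For example, `choose_mitigation yes yes no` prints `hotfix`, while `choose_mitigation no yes yes` prints `rollback`.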
4. Verify
- Issue resolved
- Error rates normalized
- Performance restored
- No side effects
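"Error rates normalized" is easier to check against an explicit threshold. The sketch below compares a current error rate with a pre-incident baseline; the function name `error_rate_ok` and the default 10% tolerance are assumptions, and the two rates would come from whatever monitoring system is in use.

```shell
# error_rate_ok CURRENT BASELINE [TOLERANCE_PERCENT]
# Succeeds (exit status 0) when CURRENT is within TOLERANCE_PERCENT above BASELINE.
# The 10% default tolerance is an illustrative assumption.
error_rate_ok() {
  awk -v cur="$1" -v base="$2" -v tol="${3:-10}" \
    'BEGIN { exit !(cur <= base * (1 + tol / 100)) }'
}
```

A typical use is `error_rate_ok 0.52 0.5 && echo "error rate back to normal"`.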
5. Communicate
- Update status page
- Notify stakeholders
- Document timeline
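Documenting the timeline is simpler when events are recorded as they happen. The following sketch appends a UTC-stamped row in the same table format the postmortem timeline uses; the function name `log_event`, the `TIMELINE_FILE` variable, and the default file name `timeline.md` are all hypothetical.

```shell
# log_event DESCRIPTION
# Appends a "| HH:MM | DESCRIPTION |" row (UTC) to the timeline file.
# TIMELINE_FILE is a hypothetical variable; it defaults to timeline.md.
log_event() {
  printf '| %s | %s |\n' "$(date -u +%H:%M)" "$1" >> "${TIMELINE_FILE:-timeline.md}"
}
```

Running `log_event "Team alerted"` during the incident produces rows that can be pasted directly into the postmortem.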
Hotfix Process
1. Create Hotfix Branch
# From production/main branch
git checkout main
git pull origin main
git checkout -b hotfix/issue-description
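The `issue-description` part of the branch name can be derived from a free-text summary. This is a minimal sketch with a hypothetical helper, `hotfix_branch`, that lowercases the text and replaces every run of non-alphanumeric characters with a single hyphen.

```shell
# hotfix_branch DESCRIPTION
# Prints a branch name of the form hotfix/<slug>.
hotfix_branch() {
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-*//; s/-*$//')
  printf 'hotfix/%s\n' "$slug"
}

# Example: git checkout -b "$(hotfix_branch "Login 500 on /auth")"
```

With the input shown in the comment, the helper prints `hotfix/login-500-on-auth`.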
2. Minimal Fix
Rules for hotfix code:
- Smallest possible change
- No refactoring
- No unrelated changes
- Must pass tests
3. Verify
bun run lint:fix
bun run build
bun run test
# Test the specific fix
# Verify in staging if time permits
4. Deploy
# Merge to main
git checkout main
git merge hotfix/issue-description
# Tag and deploy
git tag -a v1.2.4 -m "Hotfix: issue description"
git push origin main --tags
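The tag `v1.2.4` above is an example; in practice the hotfix tag is usually the previous release with the patch number incremented. The sketch below assumes tags of the form `vMAJOR.MINOR.PATCH`, and the function name `bump_patch` is hypothetical.

```shell
# bump_patch VERSION
# Given a tag like v1.2.3, prints the next patch version (v1.2.4).
# Assumes the tag has exactly the form vMAJOR.MINOR.PATCH.
bump_patch() {
  version=${1#v}
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  patch=${rest#*.}
  echo "v${major}.${minor}.$((patch + 1))"
}

# Example: derive the next tag from the most recent one reachable from HEAD.
# next_tag=$(bump_patch "$(git describe --tags --abbrev=0)")
# git tag -a "$next_tag" -m "Hotfix: issue description"
```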
5. Backport
# Merge hotfix to develop branch
git checkout develop
git pull origin develop
git merge hotfix/issue-description
git push origin develop
Postmortem Template
Write a postmortem within 48 hours of resolution.
# Incident Postmortem: [Title]
## Summary
| Field | Value |
|-------|-------|
| Date | YYYY-MM-DD |
| Duration | X hours Y minutes |
| Severity | P0/P1/P2/P3 |
| Author | [Name] |
## Impact
- [Number of users affected]
- [Revenue impact if applicable]
- [Other business impact]
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Issue first detected |
| HH:MM | Team alerted |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Issue resolved |
## Root Cause
[Clear, technical explanation of what caused the incident]
## Detection
How was the incident detected?
- [ ] Monitoring alert
- [ ] Customer report
- [ ] Internal discovery
Could we have detected it earlier?
- [Analysis]
## Resolution
What fixed the issue?
- [Description of fix]
Was it a rollback, hotfix, or feature flag?
- [Details]
## Lessons Learned
### What Went Well
- [Point 1]
- [Point 2]
### What Went Wrong
- [Point 1]
- [Point 2]
### Where We Got Lucky
- [Point 1]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action 1] | [Name] | YYYY-MM-DD | [ ] |
| [Action 2] | [Name] | YYYY-MM-DD | [ ] |
## Prevention
How do we prevent this class of issue?
- [ ] Add monitoring for [X]
- [ ] Add test for [Y]
- [ ] Improve process for [Z]
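The Duration field in the template's Summary table can be computed from the first and last rows of the timeline. This sketch takes two `HH:MM` times in UTC; the function name `incident_duration` is hypothetical, and it assumes the incident lasted less than 24 hours (an end time earlier than the start time is treated as crossing midnight once).

```shell
# incident_duration START END
# START and END are HH:MM (UTC). Prints "X hours Y minutes".
incident_duration() {
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  start=$(( ${start_h#0} * 60 + ${start_m#0} ))
  end=$(( ${end_h#0} * 60 + ${end_m#0} ))
  # Assumption: an end time before the start time means the incident crossed midnight.
  if [ "$end" -lt "$start" ]; then
    end=$(( end + 1440 ))
  fi
  total=$(( end - start ))
  echo "$(( total / 60 )) hours $(( total % 60 )) minutes"
}
```

For example, `incident_duration 22:50 01:05` prints `2 hours 15 minutes`.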
Communication Templates
Initial Alert
🚨 [P1] Investigating issues with [service]
Impact: [Brief description]
Status: Investigating
ETA: Unknown
Updates to follow.
Update
🔄 [P1] Update on [service] issue
Status: Root cause identified, fix in progress
Impact: [Updated impact]
ETA: [Time estimate]
Next update in [X] minutes.
Resolution
✅ [P1] Resolved: [service] issue
Duration: [X hours Y minutes]
Resolution: [Brief description]
Postmortem to follow within 48 hours.
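The templates above can be filled in by a small helper so the wording stays consistent under pressure. The sketch below renders the Initial Alert only; the function name `initial_alert` and its argument order are assumptions, and posting the text to a chat or status tool is left out.

```shell
# initial_alert SEVERITY SERVICE IMPACT
# Prints the Initial Alert message with the placeholders filled in.
initial_alert() {
  printf '🚨 [%s] Investigating issues with %s\n' "$1" "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Status: Investigating\n'
  printf 'ETA: Unknown\n'
  printf 'Updates to follow.\n'
}

# Example: initial_alert P1 checkout "Payments failing for some users"
```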
Repository
gentamura/dotfiles
First Seen
Mar 1, 2026