incident-management
Incident Management
Incident Severity
| Level | Impact | Response Time |
|---|---|---|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Low impact | Next business day |
Incident Response
1. Detect
- Monitoring alerts
- Customer reports
- Error logs
2. Triage
- Assess severity
- Assign incident commander
- Create communication channel
3. Investigate
- Check recent changes
- Review logs and metrics
- Identify root cause
4. Mitigate
- Apply quick fix
- Rollback if needed
- Communicate status
5. Resolve
- Confirm fix
- Monitor for recurrence
- Close incident
6. Learn
- Post-mortem meeting
- Document findings
- Create action items
Post-Mortem Template
# Post-Mortem: [Incident Title]
## Summary
[Brief description of what happened]
## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]
## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]
## Root Cause
[What caused this incident]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## What Went Well
- [Positive 1]
- [Positive 2]
## What Could Be Improved
- [Improvement 1]
- [Improvement 2]
## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
Blameless Culture
- Focus on systems, not people
- "What failed?" not "Who failed?"
- Share learnings openly
- Celebrate near-misses
More from nguyenhuuca/assessment
compliance
Ensure regulatory compliance. Use when implementing GDPR, HIPAA, PCI-DSS, or SOC2 requirements. Covers compliance frameworks and controls.
17requirements-analysis
Analyze and refine product requirements. Use when clarifying scope, identifying gaps, or validating requirements. Covers requirement types and analysis techniques.
16security-review
Conduct security code reviews. Use when reviewing code for vulnerabilities, assessing security posture, or auditing applications. Covers security review checklist.
13identity-access
Implement identity and access management. Use when designing authentication, authorization, or user management. Covers OAuth2, OIDC, and RBAC.
12execution-roadmaps
Create execution roadmaps for projects. Use when planning multi-phase projects or feature rollouts. Covers phased delivery and milestone planning.
12cloud-native-patterns
Apply cloud-native architecture patterns. Use when designing for scalability, resilience, or cloud deployment. Covers microservices, containers, and distributed systems.
12