NYC
skills/bobmatnyc/claude-mpm-skills/emergency-release-workflow

emergency-release-workflow

SKILL.md

Emergency Release Workflow Skill

Summary

Fast-track workflow for critical production issues requiring immediate deployment. Covers urgency assessment, expedited PR process, deployment verification, and post-incident analysis.

When to Use

  • Critical production bugs affecting users
  • Security vulnerabilities (CVEs)
  • Urgent business requirements
  • Data integrity issues
  • Service outages
  • Payment processing failures

Urgency Assessment

Priority Levels

Level Type Response Time Deployment Example
P0 Security vulnerability < 2 hours Immediate to production Auth bypass, data leak, active exploit
P1 Production down < 4 hours Same day App crash, complete feature failure, payment down
P2 Major bug < 24 hours Next business day Critical feature broken, significant user impact
P3 Business critical < 48 hours Scheduled release Marketing campaign blocker, partner deadline

P0 Criteria (Immediate Action)

  • Authentication/authorization bypass
  • Data breach or exposure
  • Remote code execution vulnerability
  • Production service completely unavailable
  • Data corruption affecting multiple users
  • Payment processing completely broken

P1 Criteria (Same Day)

  • Critical feature completely broken
  • Error affecting majority of users
  • Revenue-impacting bug
  • Database connectivity issues
  • Third-party integration failure (critical service)

P2 Criteria (Next Business Day)

  • Major feature partially broken
  • Affects specific user segment
  • Workaround available but not ideal
  • Performance degradation (not complete failure)

Emergency Release Process

1. Create Hotfix Branch

# Branch from current production (main)
git checkout main
git pull origin main

# Create hotfix branch
git checkout -b hotfix/ENG-XXX-brief-description

# Example:
git checkout -b hotfix/ENG-1234-fix-auth-bypass

2. Implement Minimal Fix

⚠️ CRITICAL: Minimal change only

DO:
✅ Fix the immediate issue
✅ Add regression test
✅ Document root cause in comments

DON'T:
❌ Refactor surrounding code
❌ Fix unrelated issues
❌ Add new features
❌ Update dependencies (unless that's the fix)

3. Test Thoroughly

# Run full test suite
pnpm test

# Type check
pnpm tsc --noEmit

# Build verification
pnpm build

# Manual testing checklist:
# - [ ] Reproduce original issue
# - [ ] Verify fix resolves issue
# - [ ] Test happy path
# - [ ] Test edge cases
# - [ ] Verify no new issues introduced

4. Create PR with Labels

git add .
git commit -m "fix: [brief description of fix]

Fixes critical issue where [description].
Root cause: [explanation].

Ticket: ENG-XXX
Priority: P0"

git push origin hotfix/ENG-XXX-brief-description

Use clear labels in PR title:

  • [RELEASE] - Direct to production
  • [HOTFIX] - Critical fix, expedited review
  • [P0] or [P1] - Priority indicator

PR Template for Hotfixes

Hotfix PR Template

## 🚨 [RELEASE] ENG-XXX: Brief description of fix

### Urgency
- [x] P0 - Security vulnerability
- [ ] P1 - Production down
- [ ] P2 - Major bug
- [ ] P3 - Business critical

### Impact
**Users affected**: [All users / Premium tier / Specific region / etc.]

**Severity**: [Choose one]
- [ ] Service completely unavailable
- [ ] Critical feature broken
- [ ] Security vulnerability
- [ ] Data integrity issue
- [ ] Degraded performance

**User impact**:
Describe how this affects end users.

### Root Cause
[Brief explanation of what caused the issue]

**How it happened:**
1. [Step 1]
2. [Step 2]
3. [Result: issue manifested]

**Why it wasn't caught:**
- [ ] Missing test coverage
- [ ] Race condition in production
- [ ] External service behavior changed
- [ ] Recent deployment introduced regression
- [ ] Other: [explain]

### The Fix
[What this PR changes to resolve the issue]

**Changes made:**
- Modified `file.ts` to [specific change]
- Added validation for [specific case]
- Fixed logic in [specific function]

**Why this fixes it:**
[Explanation of how the change resolves the root cause]

### Testing
- [ ] ✅ Reproduced issue locally
- [ ] ✅ Verified fix resolves issue
- [ ] ✅ Regression test added
- [ ] ✅ No other functionality affected
- [ ] ✅ Tested edge cases
- [ ] ✅ Deployed to staging and verified

### Regression Test
```typescript
// Test added to prevent recurrence
describe('ENG-XXX: Auth bypass fix', () => {
  it('should reject expired tokens', async () => {
    const expiredToken = generateExpiredToken();
    const response = await fetch('/api/protected', {
      headers: { Authorization: `Bearer ${expiredToken}` }
    });
    expect(response.status).toBe(401);
  });
});

Rollback Plan

If this causes issues:

# Option 1: Revert commit
git revert <commit-hash>
git push origin main

# Option 2: Deploy previous version
vercel rollback  # or your platform's rollback command

# Option 3: Feature flag
Set FEATURE_FIX_XXX=false in environment

Monitoring:

  • Error rate in Sentry
  • API response times in monitoring dashboard
  • User reports in support channels

Deploy Checklist

  • PR approved by at least one reviewer (waive for P0 if necessary)
  • All CI checks pass
  • Deployed to staging and verified
  • Monitoring alerts configured
  • On-call engineer notified
  • Ready for production deployment

Post-Deploy Verification

Immediately after deploy:

  • Verify fix in production (test endpoint directly)
  • Check error tracking (Sentry, etc.)
  • Monitor for new errors
  • Confirm user reports stop coming in

Metrics to watch:

  • Error rate (should drop)
  • API latency (should remain stable)
  • User activity (should normalize)

Follow-Up

  • Update Linear ticket with resolution
  • Schedule post-incident review (if P0/P1)
  • Create tickets for proper fix (if this was a band-aid)
  • Update runbook/documentation

---

## Deployment Steps

### Pre-Deployment
```bash
# 1. Merge PR to main
# (After approval or P0 emergency waiver)

# 2. Pull latest
git checkout main
git pull origin main

# 3. Verify commit
git log -1
# Confirm this is your hotfix commit

# 4. Tag release (if using semantic versioning)
git tag -a v2.3.5 -m "Hotfix: Fix auth bypass vulnerability"
git push origin v2.3.5

Deployment (Platform-Specific)

Vercel

# Trigger production deployment
vercel --prod

# Or use Vercel dashboard:
# Deployments → Select commit → Deploy to Production

# Monitor deployment
vercel logs --follow

Netlify

# Deploy via CLI
netlify deploy --prod

# Or trigger from dashboard:
# Deploys → Select commit → Publish deploy

Railway

# Push to main triggers deployment automatically
# Monitor in dashboard: railway.app/project/logs

AWS/GCP/Azure

# Follow platform-specific deployment process
# Example for AWS Elastic Beanstalk:
eb deploy production --staged

# Monitor:
eb logs --follow

Post-Deployment Verification

1. Smoke Test

# Test the specific fix
curl -X POST https://api.production.com/auth/login \
  -H "Content-Type: application/json" \
  -d '{"token": "expired_token"}'

# Expected: 401 Unauthorized

2. Monitor Error Tracking

✅ Check Sentry/Rollbar/etc.:
- Error rate should drop
- No new errors introduced

⏱️ Monitor for 15-30 minutes after deployment

3. Verify Metrics

Check monitoring dashboard:
- API response times (should be normal)
- Error rates (should drop)
- Database performance (should be stable)
- Third-party service health

4. Check User Reports

Monitor support channels:
- Support tickets
- In-app chat
- Social media
- Status page comments

Communication

Internal Communication

Slack/Teams Message Template

🚨 **Production Hotfix Deployed**

**Issue**: [Brief description]
**Ticket**: ENG-XXX
**Priority**: P0
**Status**: ✅ Resolved

**Timeline:**
- Issue discovered: 14:23 UTC
- Fix deployed: 15:47 UTC
- Duration: 1h 24m

**Impact**:
[Who was affected and how]

**Root Cause**:
[Brief explanation]

**Fix**:
[What was changed]

**Verification**:
✅ Error rate dropped from 450/min to 0/min
✅ All systems operating normally

**PR**: https://github.com/org/repo/pull/XXX

**Follow-up**:
- [ ] Post-incident review scheduled for [date]
- [ ] Documentation updated

External Communication (if needed)

Status Page Update

🟢 Resolved - [Issue Title]

We've resolved an issue that was affecting [feature/service].

**What happened:**
Between 14:23 and 15:47 UTC, users experienced [specific issue].

**Current status:**
The issue has been fully resolved. All systems are operating normally.

**Next steps:**
We're conducting a thorough review to prevent similar issues in the future.

We apologize for any inconvenience.

Email to Affected Users (for serious issues)

Subject: Update on [Service] Issue - Resolved

Hi [User],

We're writing to update you on an issue that affected [feature/service] earlier today.

**What happened:**
Between [time] and [time], you may have experienced [specific issue].

**Resolution:**
Our team quickly identified and resolved the root cause. The service is now operating normally.

**What we're doing:**
We take these issues seriously and are:
- Conducting a full review of the incident
- Implementing additional safeguards
- Improving our monitoring

We apologize for any inconvenience this may have caused.

If you have any questions or concerns, please reach out to support@company.com.

Thank you for your patience.

The [Company] Team

Post-Incident

Immediate Actions (Within 24 hours)

  • Update Linear ticket with full resolution details
  • Add incident to incident log/spreadsheet
  • Document timeline of events
  • Identify metrics that should have alerted earlier
  • Create follow-up tickets for proper fix (if hotfix was temporary)

Post-Incident Review (PIR) - For P0/P1

Schedule within 72 hours

PIR Template

# Post-Incident Review: [ENG-XXX]
Date: YYYY-MM-DD
Severity: P0/P1
Duration: Xh Xm

## Summary
Brief description of the incident.

## Timeline (UTC)
- 14:23 - Issue first detected
- 14:25 - On-call engineer alerted
- 14:30 - Root cause identified
- 14:45 - Fix PR opened
- 15:20 - PR approved and merged
- 15:47 - Fix deployed to production
- 16:00 - Verified resolved

## Impact
- **Users affected**: ~5,000 users
- **Duration**: 1h 24m
- **User experience**: Unable to log in
- **Revenue impact**: Estimated $X in lost transactions
- **Reputation impact**: 23 support tickets, 5 social media mentions

## Root Cause
Detailed technical explanation of what caused the issue.

[Include code snippets, sequence diagrams if helpful]

## Resolution
What was changed to fix the issue.

## What Went Well
- ✅ Fast detection (2 minutes after deploy)
- ✅ Clear reproduction steps identified quickly
- ✅ Team collaborated effectively
- ✅ Fix deployed in under 90 minutes

## What Went Wrong
- ❌ Missing test coverage for expired token edge case
- ❌ Staging didn't catch the issue (different token expiry settings)
- ❌ No automatic rollback triggered
- ❌ Monitoring alert threshold too high

## Action Items
- [ ] ENG-XXX-1: Add test for expired token validation (@engineer, by YYYY-MM-DD)
- [ ] ENG-XXX-2: Align staging token expiry with production (@devops, by YYYY-MM-DD)
- [ ] ENG-XXX-3: Implement automatic rollback on error spike (@platform, by YYYY-MM-DD)
- [ ] ENG-XXX-4: Lower monitoring alert threshold (@observability, by YYYY-MM-DD)
- [ ] ENG-XXX-5: Add runbook for similar issues (@oncall, by YYYY-MM-DD)

## Prevention
How we'll prevent this from happening again:

1. **Testing**: Add test coverage for edge cases
2. **Monitoring**: Improve alerting thresholds
3. **Process**: Update deployment checklist
4. **Documentation**: Create runbook for on-call

## Lessons Learned
Key takeaways for the team.

Backport Strategy

When fix needs to go to multiple branches/environments:

Multiple Environment Deployment

# 1. Fix applied to main (production)
git checkout main
git cherry-pick <hotfix-commit-hash>

# 2. Backport to release candidate
git checkout release-candidate
git cherry-pick <hotfix-commit-hash>
git push origin release-candidate

# 3. Backport to develop
git checkout develop
git cherry-pick <hotfix-commit-hash>
git push origin develop

# Create PRs for each backport:
gh pr create --base release-candidate --head backport/rc/hotfix-ENG-XXX
gh pr create --base develop --head backport/dev/hotfix-ENG-XXX

Handling Merge Conflicts

# If cherry-pick fails due to conflicts
git cherry-pick <hotfix-commit-hash>
# CONFLICT in file.ts

# Resolve conflicts manually
# Then:
git add file.ts
git cherry-pick --continue

Alternative: Patch File

# Create patch from hotfix
git format-patch -1 <hotfix-commit-hash>
# Creates: 0001-fix-auth-bypass.patch

# Apply to other branch
git checkout release-candidate
git apply 0001-fix-auth-bypass.patch

Summary

Emergency Release Quick Reference

Decision Tree

Is production broken?
├─ Yes → Severity level?
│  ├─ P0 (security/down) → Deploy immediately, inform after
│  ├─ P1 (critical bug) → Fast-track PR, deploy same day
│  └─ P2 (major bug) → Standard expedited process
└─ No → Use normal deployment process

Time Targets

  • P0: Issue → Deploy in < 2 hours
  • P1: Issue → Deploy in < 4 hours
  • P2: Issue → Deploy in < 24 hours

Key Principles

  1. Minimal Change: Fix only the immediate issue
  2. Add Regression Test: Prevent recurrence
  3. Fast Feedback: Deploy to staging first (except P0)
  4. Clear Communication: Keep stakeholders informed
  5. Learn & Improve: Conduct PIR for P0/P1

Checklist

  • Assess urgency correctly
  • Create hotfix branch from production
  • Implement minimal fix with regression test
  • Use appropriate PR labels
  • Test thoroughly (staging for P1/P2)
  • Get approval (or waive for P0)
  • Deploy with monitoring
  • Verify fix in production
  • Communicate status
  • Schedule PIR (P0/P1)
  • Create follow-up tickets

Use this skill when production issues require immediate attention and fast-track deployment outside normal release processes.

Weekly Installs
34
First Seen
Jan 23, 2026
Installed on
claude-code27
gemini-cli21
antigravity20
codex19
opencode19
cursor18