Rollback Strategy Advisor
Provide safe and effective rollback strategies for failed deployments.
Core Capabilities
This skill helps recover from failed deployments by:
- Assessing failure impact - Identifying what failed and what needs rollback
- Recommending rollback strategy - Choosing the appropriate approach based on failure type
- Providing step-by-step guidance - Clear procedural instructions for execution
- Validating rollback success - Ensuring system returns to stable state
- Preventing data loss - Protecting critical data during rollback operations
Rollback Strategy Workflow
Step 1: Assess the Failure
Understand what failed and the scope of impact.
Key Questions:
- What component failed? (Application, database, infrastructure, configuration)
- When did the failure occur? (During deployment, post-deployment, gradual degradation)
- What is the current system state? (Partially deployed, fully deployed, crashed)
- Is the system serving traffic? (Production load, maintenance mode, offline)
- Are there data changes? (Database migrations applied, user data modified)
Gather Information:
# Check deployment status
docker ps -a # Container status
docker logs <container-name> --tail 100 # Recent logs
# Check system health
curl http://localhost:8080/health # Health endpoint
docker stats # Resource usage
# Identify deployment artifacts
docker images | grep <app-name> # Available images
git log --oneline -10 # Recent commits
Output: Failure Assessment
Component: Application container
Failure Time: 5 minutes post-deployment
System State: New version deployed, returning 500 errors
Traffic: Receiving production traffic (degraded service)
Data Changes: No database migrations in this deployment
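To fill in the assessment quickly, it can help to summarize recent logs rather than read them line by line. The sketch below is one possible approach; `summarize_logs` is a hypothetical helper, and it assumes the application writes the literal strings ERROR and WARN in its log lines.

```shell
# Hypothetical helper: counts ERROR and WARN lines read from stdin.
summarize_logs() {
  awk '
    /ERROR/ { errors++ }
    /WARN/  { warnings++ }
    END { printf "errors=%d warnings=%d lines=%d\n", errors, warnings, NR }
  '
}

# Example with inline sample data; in practice, pipe in
#   docker logs <container-name> --tail 100 2>&1
printf '%s\n' \
  "INFO  starting app" \
  "ERROR upstream returned 500" \
  "WARN  slow response" \
  "ERROR upstream returned 500" | summarize_logs
# prints: errors=2 warnings=1 lines=4
```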
Step 2: Choose Rollback Strategy
Select the appropriate strategy based on failure type and system state.
Strategy Decision Tree:
Is database migration involved?
├─ YES → See "Database Rollback Strategy" (Step 3.4)
└─ NO → Continue
Is infrastructure changed?
├─ YES → See "Infrastructure Rollback Strategy" (Step 3.3)
└─ NO → Continue
Is configuration changed?
├─ YES → See "Configuration Rollback Strategy" (Step 3.2)
└─ NO → Application Code Rollback (Step 3.1)
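The decision tree above can be sketched as a small shell function. `choose_strategy` and its yes/no arguments are illustrative names, not part of any tool; the order of checks mirrors the tree (database first, then infrastructure, then configuration).

```shell
# Hypothetical helper mirroring the decision tree above.
# Each argument is "yes" or "no":
#   $1 database migration involved, $2 infrastructure changed, $3 configuration changed
choose_strategy() {
  db_changed=$1
  infra_changed=$2
  config_changed=$3

  if [ "$db_changed" = "yes" ]; then
    echo "Step 3.4: Database Rollback"
  elif [ "$infra_changed" = "yes" ]; then
    echo "Step 3.3: Infrastructure Rollback"
  elif [ "$config_changed" = "yes" ]; then
    echo "Step 3.2: Configuration Rollback"
  else
    echo "Step 3.1: Application Code Rollback"
  fi
}

choose_strategy no no yes   # prints: Step 3.2: Configuration Rollback
```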
Common Strategies:
| Failure Type | Strategy | Risk Level | Downtime |
|---|---|---|---|
| Application code bug | Redeploy previous image | Low | Seconds |
| Configuration error | Restore previous config | Low | Seconds |
| Infrastructure change | Revert compose file | Medium | Minutes |
| Database migration | Reverse migration + app rollback | High | Minutes |
| Multiple components | Sequential rollback (reverse order) | High | Minutes |
Step 3: Execute Rollback
Perform the rollback with validation at each step.
Step 3.1: Application Code Rollback
Revert to the previous working application version.
Standard Procedure:
# 1. Identify previous working version
docker images | grep <app-name>
# Look for the previous tag (e.g., v1.2.3 if current is v1.2.4)
# 2. Stop current containers
docker-compose stop <service-name>
# 3. Update docker-compose.yml to previous version
# Change: image: myapp:v1.2.4 → image: myapp:v1.2.3
# 4. Start with previous version
docker-compose up -d <service-name>
# 5. Validate rollback (see Step 4)
curl http://localhost:8080/health
docker-compose logs --tail 50 <service-name>
Fast Rollback (if compose file unchanged):
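# Option A: run the previous image directly, outside Compose
# (temporary: remove this container before the next docker-compose up,
# otherwise the published port 8080 will conflict)
docker-compose stop <service-name>
docker run -d --name <service-name> \
--network <network-name> \
-p 8080:8080 \
myapp:v1.2.3
# Option B: update the compose file and restart
# (dots are escaped so the pattern matches only the literal tag)
sed -i 's/myapp:v1\.2\.4/myapp:v1.2.3/g' docker-compose.yml
docker-compose up -d <service-name>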
Considerations:
- Keep previous images available (don't prune immediately after deploy)
- Tag images with version numbers or git commit SHAs
- Test the previous version still works in staging first if possible
- Monitor resource usage during rollback
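When images are tagged with version numbers, the previous tag can be looked up mechanically instead of by eye. The following is a minimal sketch under two assumptions: tags follow a version-sortable scheme such as v1.2.3, and `sort -V` (version sort, as in GNU coreutils) is available. `previous_tag` is a hypothetical helper name.

```shell
# Hypothetical helper: given version tags on stdin (one per line) and the
# current tag as $1, print the tag that sorts immediately before it.
# Returns non-zero if the current tag is not found or has no predecessor.
previous_tag() {
  sort -V | awk -v current="$1" '
    $0 == current {
      found = 1
      if (prev != "") { print prev; exit 0 }
      exit 1
    }
    { prev = $0 }
    END { if (!found) exit 1 }
  '
}

# Inline sample data; note that v1.2.10 sorts after v1.2.4 with -V.
printf '%s\n' v1.2.10 v1.2.3 v1.2.4 | previous_tag v1.2.4   # prints: v1.2.3

# Against a real host (not executed here):
#   docker images --format '{{.Tag}}' myapp | previous_tag v1.2.4
```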
Step 3.2: Configuration Rollback
Restore previous configuration files or environment variables.
Configuration File Rollback:
# 1. Locate configuration backup or git history
git log -- config/app.conf
git show HEAD~1:config/app.conf > config/app.conf
# 2. Update mounted config in docker-compose.yml if needed
# Ensure volume mount points to correct config
# 3. Restart services to load previous config
docker-compose restart <service-name>
# 4. Validate configuration loaded correctly
docker-compose exec -T <service-name> cat /app/config/app.conf
curl http://localhost:8080/health
Environment Variable Rollback:
# 1. Edit docker-compose.yml to restore previous env vars
# Update the environment section or .env file
# 2. Recreate container with the restored env vars
docker-compose up -d --force-recreate <service-name>
# 3. Verify environment variables
docker-compose exec -T <service-name> env | grep APP_
Feature Flag Rollback:
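# If using feature flags, disable the problematic feature
# Update flag config or environment variable
# Example: FEATURE_NEW_CHECKOUT=false
# Flag stored in a mounted config file: restart to reload it
docker-compose restart <service-name>
# Flag stored as an environment variable: recreate the container
# (restart does not apply environment changes)
docker-compose up -d <service-name>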
Considerations:
- Keep configuration in version control (git)
- Use .env files for environment-specific configs
- Back up configs before deployment
- Validate config syntax before applying
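For configs that are not in version control, a timestamped copy taken before each deployment gives a simple restore point. This is a sketch only; `backup_config` and `restore_config` are hypothetical helpers, and the `.bak.<timestamp>` naming is an assumption.

```shell
# Hypothetical helper: copy a config file to <file>.bak.<timestamp>
# and print the path of the backup.
backup_config() {
  backup="$1.bak.$(date +%Y%m%d_%H%M%S)"
  cp -p "$1" "$backup" && echo "$backup"
}

# Hypothetical helper: restore a config file from a backup.
# Arguments: backup path, target path.
restore_config() {
  cp -p "$1" "$2"
}

# Typical use (not executed here):
#   backup=$(backup_config config/app.conf)     # before deploying
#   restore_config "$backup" config/app.conf    # if the deployment fails
#   docker-compose restart <service-name>
```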
Step 3.3: Infrastructure Rollback
Revert infrastructure changes like network configurations, volume mounts, or docker-compose structure.
Docker Compose Rollback:
# 1. Restore previous docker-compose.yml from git
git checkout HEAD~1 -- docker-compose.yml
# 2. Recreate infrastructure
docker-compose down
docker-compose up -d
# 3. Validate all services running
docker-compose ps
docker-compose logs --tail 50
Network Configuration Rollback:
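# If network configuration changed
# 1. Recreate previous network
docker network create --driver bridge <old-network>
# 2. Reconnect containers to it
docker network connect <old-network> <container-name>
# 3. Detach containers from the new network, then remove it
# (docker network rm fails while containers are still attached)
docker network disconnect <new-network> <container-name>
docker network rm <new-network>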
Volume Rollback:
# If volume mounts changed (be careful with data!)
# 1. Stop services
docker-compose stop
# 2. Update docker-compose.yml volume configuration
git checkout HEAD~1 -- docker-compose.yml
# 3. Restart services
docker-compose up -d
# Note: Data in named volumes persists across stop/up; only the mount configuration changes
# Do not use docker-compose down -v here: it deletes named volumes
Considerations:
- Infrastructure changes may affect multiple services
- Test in staging environment first if possible
- Document infrastructure dependencies
- Consider using infrastructure-as-code tools (Terraform)
Step 3.4: Database Rollback
Reverse database migrations and restore schema to previous state.
Migration Rollback (with Migration Tool):
# Using Alembic (Python); run it where the application code and alembic.ini live
docker exec <app-container> alembic downgrade -1 # Roll back one migration
docker exec <app-container> alembic downgrade <revision> # Roll back to a specific revision
# Using Flyway (Java); undo requires undo migrations and is not available in the Community edition
docker exec <app-container> flyway undo # Roll back last migration
# Using Django; <migration> is the migration to return to, not the one being undone
docker exec <app-container> python manage.py migrate <app> <migration>
# Using Rails
docker exec <app-container> rails db:rollback STEP=1
Manual Migration Rollback:
# 1. Identify the migration to reverse
docker exec <db-container> psql -U user -d dbname -c "\d+" # List tables
# 2. Execute reverse migration SQL
docker exec <db-container> psql -U user -d dbname -f /migrations/rollback_v1.2.4.sql
# 3. Verify schema state
docker exec <db-container> psql -U user -d dbname -c "\d table_name"
Database Rollback with Application:
# CRITICAL: Stop the application first, then roll back the database,
# then start the previous application version, so that neither version
# runs against a schema it does not expect
# 1. Stop application (prevent new requests)
docker-compose stop app-service
# 2. Back up current database state
docker exec <db-container> pg_dump -U user dbname > backup_$(date +%Y%m%d_%H%M%S).sql
# 3. Roll back migration in a one-off container of the current image
# (the service is stopped, and this image still contains the migration being reversed)
docker-compose run --rm app-service alembic downgrade -1
# 4. Roll back application to version compatible with old schema
sed -i 's/myapp:v1\.2\.4/myapp:v1.2.3/g' docker-compose.yml
docker-compose up -d app-service
# 5. Validate
curl http://localhost:8080/health
docker-compose logs --tail 50 app-service
Considerations:
- Always back up before rollback - Database changes are risky
- Coordinate app and DB rollback carefully
- Test rollback migrations in staging
- Consider data loss implications (irreversible data changes)
- For destructive migrations (dropped columns), you may need to restore data from backup
- Use database versioning tools (Alembic, Flyway, Liquibase)
See references/database_rollback_patterns.md for detailed migration rollback examples and data preservation strategies.
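Destructive migrations are worth detecting before a rollback starts, because reversing them restores the schema but not the data. The check below is a rough sketch: `is_destructive_migration` is a hypothetical helper, it assumes migrations are plain SQL files, and the pattern list is illustrative rather than exhaustive.

```shell
# Hypothetical check: succeeds (exit 0) if a SQL migration file contains
# statements that typically cannot be reversed without a data restore.
is_destructive_migration() {
  grep -Eiq 'DROP[[:space:]]+(TABLE|COLUMN)|TRUNCATE' "$1"
}

# Typical use (not executed here; the path is a placeholder):
#   if is_destructive_migration /migrations/v1.2.4.sql; then
#     echo "Destructive migration: verify a backup exists before rolling back"
#   fi
```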
Step 4: Validate Rollback Success
Confirm the system is working correctly after rollback.
Health Checks:
# 1. Container health
docker ps # All containers running?
docker-compose ps # Services in "Up" state?
# 2. Application health
curl http://localhost:8080/health
curl -I http://localhost:8080 # HTTP status code
# 3. Service logs
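docker logs <container-name> --tail 100 2>&1 | grep ERROR # 2>&1 also captures the container's stderr
docker logs <container-name> --tail 100 2>&1 | grep WARN
# 4. Database connectivity (psql ships with the database image; it is often absent from app images)
docker exec <db-container> psql -U user -d dbname -c "SELECT 1;"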
# 5. Resource usage
docker stats --no-stream
Functional Testing:
# Test critical user flows
curl -X POST http://localhost:8080/api/login -H 'Content-Type: application/json' -d '{"user":"test","pass":"test"}'
curl http://localhost:8080/api/users/1
# Run smoke tests if available
docker exec <app-container> pytest tests/smoke/
# Check monitoring dashboards
# - Response times back to normal?
# - Error rates dropped?
# - Traffic being served?
Validation Checklist:
- ✓ All containers running
- ✓ Health endpoints returning 200
- ✓ No error spikes in logs
- ✓ Database queries executing
- ✓ Critical API endpoints responding
- ✓ Monitoring shows normal metrics
- ✓ Users can access the application
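A service often needs a few seconds after restart before its health endpoint responds, so a single check right after rollback can fail spuriously. One way to handle this is a small retry wrapper; `retry_until` below is a hypothetical helper, and the attempt count and delay are arbitrary example values.

```shell
# Hypothetical helper: run a check command up to $1 times, sleeping $2
# seconds between attempts. Succeeds as soon as the check succeeds;
# fails if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Typical use (not executed here): poll the health endpoint for about 30 seconds.
#   retry_until 10 3 curl -fsS -o /dev/null http://localhost:8080/health \
#     || echo "Rollback validation failed: health endpoint not responding"
```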
Step 5: Document and Communicate
Record the incident and inform stakeholders.
Incident Report Template:
## Deployment Rollback - [Date/Time]
**Summary:** Brief description of what failed and rollback action taken
**Timeline:**
- [Time] - Deployment started (v1.2.4)
- [Time] - Failure detected (500 errors)
- [Time] - Rollback initiated
- [Time] - Rollback completed
- [Time] - System validated stable
**Root Cause:** What caused the deployment to fail
**Rollback Actions:**
1. Stopped application service
2. Reverted docker-compose.yml to v1.2.3
3. Restarted service
4. Validated health checks
**Impact:**
- Downtime: X minutes
- Affected users: Y requests failed
- Data loss: None
**Follow-up Actions:**
- [ ] Fix root cause in v1.2.5
- [ ] Add test coverage for failure scenario
- [ ] Update deployment checklist
- [ ] Review rollback procedure effectiveness
Communication:
Team notification (Slack/email):
🚨 Deployment Rollback Completed
We rolled back the v1.2.4 deployment due to [issue].
System is now stable on v1.2.3.
Impact: X minutes downtime
Status: Fully operational
Next steps: Root cause analysis, fix in v1.2.5
For details see: [link to incident report]
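The downtime figure and the notification text can be generated rather than typed under pressure. This sketch assumes Unix timestamps were captured with `date +%s` when the failure was detected and when validation passed; `downtime_minutes` and `rollback_notice` are hypothetical helpers, and the issue text in the example is sample data.

```shell
# Hypothetical helper: downtime in whole minutes, rounded up, between two
# Unix timestamps. Arguments: start, end.
downtime_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Hypothetical helper: render the team notification.
# Arguments: failed version, restored version, issue summary, downtime minutes.
rollback_notice() {
  cat <<EOF
Deployment Rollback Completed
We rolled back the $1 deployment due to $3.
System is now stable on $2.
Impact: $4 minutes downtime
Status: Fully operational
EOF
}

# Sample values: 420 seconds between detection and validation gives 7 minutes.
rollback_notice v1.2.4 v1.2.3 "500 errors after deploy" \
  "$(downtime_minutes 1700000000 1700000420)"
```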
Step 6: Prevent Future Failures
Analyze the incident and improve deployment practices.
Post-Incident Review:
- What went wrong?
  - Code bug not caught in testing
  - Configuration incompatibility
  - Missing database index caused performance degradation
  - Infrastructure resource limits exceeded
- Why wasn't it caught earlier?
  - Insufficient test coverage
  - Staging environment differs from production
  - Load testing not performed
  - Migration not tested with production data volume
- What can prevent this?
  - Add integration test for failure scenario
  - Improve staging/production parity
  - Implement canary deployments
  - Add automated rollback triggers
  - Enhance monitoring and alerting
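An automated rollback trigger usually reduces to a threshold check on a metric sampled after deployment. The sketch below uses error rate as that metric; `should_rollback` is a hypothetical helper, and the 5% threshold in the example is an arbitrary value to tune per service.

```shell
# Hypothetical trigger: succeeds (exit 0) when the error rate over a sample
# window exceeds a threshold.
# Arguments: failed requests, total requests, threshold in percent.
# Uses integer arithmetic only, so no external calculator is needed.
should_rollback() {
  failed=$1
  total=$2
  threshold_pct=$3
  [ "$total" -gt 0 ] || return 1   # no traffic in the window: no decision
  [ $((failed * 100)) -gt $((total * threshold_pct)) ]
}

# Example: 12 failures out of 100 requests against a 5% threshold.
if should_rollback 12 100 5; then
  echo "Error rate above threshold: trigger rollback"
fi
```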
Deployment Improvements:
# Implement health checks in docker-compose.yml
services:
  app:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
Rollback Automation:
#!/bin/bash
# rollback.sh - Quick rollback to previous version
# Create this script ahead of time for quick recovery (chmod +x rollback.sh)
set -euo pipefail
PREVIOUS_VERSION="${1:-}"
if [ -z "$PREVIOUS_VERSION" ]; then
  echo "Usage: ./rollback.sh <version>" >&2
  exit 1
fi
echo "Rolling back to $PREVIOUS_VERSION..."
docker-compose stop app
# Replace only the tag that follows "myapp:", leaving the rest of the line intact
sed -i "s/myapp:[^[:space:]]*/myapp:$PREVIOUS_VERSION/g" docker-compose.yml
docker-compose up -d app
docker-compose ps
echo "Rollback complete. Check logs: docker-compose logs app"
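A broad `sed` pattern can touch lines it should not, so it may be worth restricting the substitution to `image:` lines and testing it on a copy first. The following is a sketch, not a replacement for the script above: `set_image_tag` is a hypothetical helper, and it assumes each `image:` entry sits on its own line in the compose file.

```shell
# Hypothetical helper: set the tag of one image in a compose file.
# Arguments: compose file, image name (without tag), new tag.
# Only lines of the form "image: <name>:<tag>" are changed.
set_image_tag() {
  file=$1
  image=$(printf '%s' "$2" | sed 's/\./\\./g')   # escape dots for the regex
  tag=$3
  tmp=$(mktemp) || return 1
  sed -E "s|^([[:space:]]*image:[[:space:]]*${image}):[^[:space:]]*|\1:${tag}|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"                      # keeps the original file's permissions
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Typical use (not executed here):
#   set_image_tag docker-compose.yml myapp v1.2.3
#   docker-compose up -d app
```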
Best Practices:
- Maintain deployment history - Keep previous images, configs, and compose files
- Test rollback procedures - Practice rollbacks in staging regularly
- Automate health checks - Use docker healthchecks and monitoring
- Version everything - Tag images, version configs, track migrations
- Back up before risky changes - Database backups before migrations
- Document dependencies - Track what depends on what for coordinated rollbacks
- Gradual rollouts - Use canary or blue-green deployments when possible
- Monitor post-deployment - Watch metrics closely for 30+ minutes after deploy
Quick Reference
Rollback Decision Matrix
| Scenario | Strategy | Estimated Time |
|---|---|---|
| App code bug, no DB changes | Redeploy previous image | 1-2 minutes |
| Config error | Restore previous config | 1-2 minutes |
| Failed DB migration | Reverse migration + app rollback | 5-10 minutes |
| Infrastructure change | Revert compose file | 3-5 minutes |
| Multiple component failure | Sequential rollback (DB → App → Infra) | 10-15 minutes |
Common Rollback Commands
# Quick app rollback
docker-compose stop <service>
sed -i 's/myapp:v1\.2\.4/myapp:v1.2.3/g' docker-compose.yml
docker-compose up -d <service>
# Config rollback
git checkout HEAD~1 -- config/
docker-compose restart <service>
# Database migration rollback
docker exec <app-container> alembic downgrade -1
# Full infrastructure rollback
git checkout HEAD~1 -- docker-compose.yml
docker-compose down && docker-compose up -d
Resources
- references/database_rollback_patterns.md - Detailed database migration rollback strategies and data preservation techniques
- references/platform_guides.md - Docker and Docker Compose specific rollback procedures and best practices
Best Practices
- Always back up before rollback - Especially for database changes
- Test rollback in staging first - If time permits
- Stop traffic during risky rollbacks - Prevent inconsistent state
- Rollback in reverse order - Undo changes in opposite sequence of deployment
- Validate each step - Don't proceed if validation fails
- Document everything - Create audit trail for compliance and learning
- Communicate clearly - Keep stakeholders informed of status
- Practice rollbacks regularly - Ensure procedures work when needed
- Automate common rollbacks - Reduce human error and recovery time
- Learn from failures - Use incidents to improve deployment process