Rollback Strategy Advisor
"Just deploy the old version" only works if the new version didn't change anything the old version depends on. It usually did.
The reversibility question
Before choosing a strategy, classify what the bad deploy changed:
| Change type | Reversible by redeploying old code? | Why / why not |
|---|---|---|
| Stateless code only | ✅ Yes | Old code runs on old state; no state changed |
| Additive schema (new column, new table) | ✅ Yes | Old code ignores the new column, provided it is nullable or has a default |
| Destructive schema (drop column, rename) | ❌ No | Old code expects the column that's gone |
| Additive data (new rows) | ⚠️ Usually | Unless the new rows confuse old code's queries |
| Mutated data (UPDATE existing rows) | ❌ No | Old code expects old data shape; you need data repair |
| New external side effects (emails sent, payments made) | ❌ Never | Can't unsend. Compensating actions only. |
| Config only | ✅ Yes | Revert the config |
| Feature flag flip | ✅ Instantly | Flip it back — this is why flags exist |
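The table above can be encoded as a small lookup so a deploy that mixes several change types gets a single verdict. This is a minimal sketch: the key names, the `Reversibility` enum, and the "least reversible change wins" rule are assumptions made for illustration, not part of the original skill.

```python
from enum import Enum


class Reversibility(Enum):
    YES = "yes"
    USUALLY = "usually"
    NO = "no"
    NEVER = "never"


# Mirrors the table above. The string keys are hypothetical labels.
REVERSIBILITY = {
    "stateless_code": Reversibility.YES,
    "additive_schema": Reversibility.YES,
    "destructive_schema": Reversibility.NO,
    "additive_data": Reversibility.USUALLY,
    "mutated_data": Reversibility.NO,
    "external_side_effects": Reversibility.NEVER,
    "config": Reversibility.YES,
    "feature_flag": Reversibility.YES,
}

# Ordered from most to least reversible.
_SEVERITY = [
    Reversibility.YES,
    Reversibility.USUALLY,
    Reversibility.NO,
    Reversibility.NEVER,
]


def classify_deploy(change_types):
    """Return the least reversible class among a deploy's change types.

    Assumption: one irreversible change makes the whole deploy irreversible.
    """
    change_types = list(change_types)
    if not change_types:
        raise ValueError("a deploy must contain at least one change type")
    return max((REVERSIBILITY[c] for c in change_types), key=_SEVERITY.index)
```

For example, `classify_deploy(["config", "mutated_data"])` yields `Reversibility.NO`, because the data mutation dominates the config change.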
Decision tree
Is there a feature flag gating the bad behavior?
├─ YES → Kill the flag. Done in seconds. Investigate at leisure.
└─ NO →
   Did the deploy change data/schema?
   ├─ NO → Redeploy previous artifact. Done.
   └─ YES →
      Was the change additive-only?
      ├─ YES → Redeploy previous artifact. Clean up schema later.
      └─ NO →
         Can you roll FORWARD (fix is small, well-understood)?
         ├─ YES → Roll forward. Faster than untangling.
         └─ NO → You're in data repair. See below.
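The tree can also be written as a short function, which is handy for runbooks or tests. This is a sketch only: the `Deploy` fields and the `recommend` function are hypothetical names, and each boolean corresponds to one question in the tree.

```python
from dataclasses import dataclass


@dataclass
class Deploy:
    flag_gated: bool              # a feature flag gates the bad behavior
    changed_data_or_schema: bool  # the deploy touched data or schema
    additive_only: bool           # every data/schema change was additive
    small_known_fix: bool         # a roll-forward fix is small and well understood


def recommend(deploy: Deploy) -> str:
    """Walk the decision tree top to bottom and return the recommended action."""
    if deploy.flag_gated:
        return "flag kill"
    if not deploy.changed_data_or_schema:
        return "redeploy"
    if deploy.additive_only:
        return "redeploy"
    if deploy.small_known_fix:
        return "roll forward"
    return "data repair"
```

The order of the checks matters: the flag question comes first because it is the cheapest exit, and data repair is reached only when every other branch has been ruled out.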
Data repair (the hard case)
When old code can't run on new data:
- Stop the bleeding. Feature flag, maintenance page, or traffic drain — whatever stops more data from being mutated.
- Snapshot. Before you touch anything. `pg_dump` / volume snapshot. You will want this when your repair script has a bug.
- Assess scope. How many rows? `SELECT count(*) FROM <table> WHERE <mutated-condition>`. 10 rows is a manual fix. 10 million is a migration.
- Repair or compensate:
  - Reversible mutation → write the inverse UPDATE
  - Irreversible mutation → restore affected rows from snapshot/backup
  - External side effects → compensating action (refund, apology email, manual ticket)
- Then redeploy old code.
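The snapshot, scope, and repair steps might look like the following for a reversible mutation. This is a sketch under stated assumptions: SQLite stands in for the production database, a copied table stands in for `pg_dump` or a volume snapshot, and the `accounts` table and the "bad deploy uppercased `active` to `ACTIVE`" scenario are invented for illustration.

```python
import sqlite3


def repair_status(conn: sqlite3.Connection) -> int:
    """Snapshot, measure scope, then apply the inverse UPDATE.

    Returns the number of rows repaired.
    """
    # Snapshot first (stand-in for pg_dump or a volume snapshot).
    conn.execute("CREATE TABLE accounts_snapshot AS SELECT * FROM accounts")

    # Assess scope: how many rows did the bad deploy mutate?
    (scope,) = conn.execute(
        "SELECT count(*) FROM accounts WHERE status = 'ACTIVE'"
    ).fetchone()

    # Apply the inverse UPDATE in a transaction. If it touches a different
    # number of rows than we measured, the exception rolls the change back.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET status = 'active' WHERE status = 'ACTIVE'"
        )
        if cur.rowcount != scope:
            raise RuntimeError(
                f"expected to repair {scope} rows, touched {cur.rowcount}"
            )
    return scope


def demo() -> tuple[int, list[str]]:
    """Build a tiny mutated table, repair it, and return the outcome."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts (status) VALUES (?)",
        [("ACTIVE",), ("closed",), ("ACTIVE",)],
    )
    conn.commit()
    repaired = repair_status(conn)
    statuses = [
        row[0] for row in conn.execute("SELECT status FROM accounts ORDER BY id")
    ]
    return repaired, statuses
```

Comparing the row count against the measured scope is a cheap guard against a repair script whose condition is subtly wrong; the snapshot table remains available if the repair still turns out to be incorrect.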
Design-time advice — make rollbacks boring
If you're not mid-incident, the best advice is to make the next deploy reversible:
| Technique | Makes rollback trivial when |
|---|---|
| Feature flags | You're shipping a behavior change — gate it |
| Expand-contract migrations | You're changing schema — add new alongside old, migrate, remove old in a later deploy |
| Dual-write period | You're changing data format — write both formats until new code is stable |
| Immutable artifacts by SHA | You're deploying: image:abc123 always refers to the same build and can be re-deployed; image:latest is mutable and may no longer point at the old version |
| Backward-compatible APIs | You're changing an interface — new version reads old format |
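As an illustration of the expand phase combined with a dual-write period, the sketch below moves data from an old column to a new one while keeping old code runnable. Everything here is hypothetical: the `users` table, the `name` and `display_name` columns, and the helper functions; SQLite is used only so the example runs on its own.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Expand: add the new column alongside the old one. It is nullable,
    # so INSERTs from old code (which never mention it) keep working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def new_code_insert(conn: sqlite3.Connection, name: str) -> None:
    # Dual-write: new code fills both the old and the new column.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name)
    )


def old_code_insert(conn: sqlite3.Connection, name: str) -> None:
    # Old code only knows about the old column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def old_code_read(conn: sqlite3.Connection) -> list[str]:
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


def demo() -> list[str]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    old_code_insert(conn, "ada")    # before the deploy
    expand(conn)                    # migration shipped with the new deploy
    new_code_insert(conn, "grace")  # new code running
    old_code_insert(conn, "linus")  # after rolling back to old code
    return old_code_read(conn)
```

Because the old column is still written during the dual-write period, a rollback leaves old code with complete data. The contract step (dropping `name`) belongs in a later deploy, once the new code has proven stable.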
Worked example
Situation: Deployed v2.4.0 at 14:00. At 14:20, error rate spikes. v2.4.0 added a NOT NULL column users.tenant_id with a default, and the new code reads it.
Reversibility check: Schema change was additive (new column with default) → old code should ignore it. ✅ Reversible.
Wait — check the migration. It ran ALTER TABLE users ADD COLUMN tenant_id ... NOT NULL DEFAULT 1. But old code does INSERT INTO users (...) without tenant_id. Does NOT NULL DEFAULT allow that? Yes — the default fires. ✅ Still reversible.
Action: Redeploy v2.3.9 with kubectl rollout undo deployment/app (this reverts to the previous revision, so confirm that revision is v2.3.9). The column stays; old code ignores it. No cleanup is needed: the column is fine, and the bug is elsewhere in v2.4.0.
Post-mortem note: If the migration had been NOT NULL without a default, old code's INSERTs would fail. That would have been a non-reversible schema change masquerading as additive.
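The difference between the two cases can be checked before an incident by running the old code's INSERT against the new schema. The sketch below does this with SQLite as a stand-in for the real database. The `email` column and the `old_insert_succeeds` helper are hypothetical, and the column is declared in `CREATE TABLE` rather than via `ALTER TABLE` because SQLite refuses to add a NOT NULL column without a default.

```python
import sqlite3


def old_insert_succeeds(column_ddl: str) -> bool:
    """Return True if an old-style INSERT (no tenant_id) works on the new schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, " + column_ddl + ")"
    )
    try:
        # What old code does: it has never heard of tenant_id.
        conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()
    return True
```

With `"tenant_id INTEGER NOT NULL DEFAULT 1"` the insert succeeds because the default fires; with `"tenant_id INTEGER NOT NULL"` it fails with a constraint error, which is the non-reversible case described in the post-mortem note.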
Do not
- Do not roll back before you understand what changed. Rolling back into a broken state is worse than being broken in a known state.
- Do not roll forward under pressure with an untested fix. Rolling forward is for when you know the fix; it's not a license to deploy a guess.
- Do not skip the snapshot before data repair. The repair will have a bug. It always does.
- Do not `DROP` the new column/table during rollback. Leave it. It's harmless, and the next deploy attempt will need it.
- Do not design a rollback strategy during an incident. Design it when you design the deploy. → `cd-pipeline-generator` should include the undo path.
Output format
Incident mode:
## Reversibility
<change type> → <reversible: yes/no/partial>
## Recommended action
<flag kill | redeploy | roll forward | data repair>
## Steps
1. ...
## If this doesn't work
<next fallback>
Design mode:
## This deploy's reversibility class
<from the table>
## To make it cheaply reversible
<specific technique: flag / expand-contract / dual-write>
## Rollback command (pre-written — paste during incident)
<exact command>