quality-postmortem by petrkindlmann/qa-skills

Before starting: Check for .agents/qa-project-context.md in the project root. It contains quality goals, risk areas, and test suite details that provide essential context for any postmortem analysis.

Discovery Questions

Do you have a regular retro cadence? Per-sprint, monthly, or only after incidents? Regular cadence catches slow-burn problems. Incident-only cadence misses patterns until they explode.
What triggered this postmortem? A production incident? A pattern of escaped bugs? A feeling that the test suite is not catching enough? Test suite degradation? The trigger determines the focus.
What data is available? Bug tracker with severity and discovery phase? CI history with pass rates? Flaky test reports? Coverage trends? Without data, postmortems devolve into opinion sessions.
What happened with previous postmortem action items? Were they completed? Tracked? Forgotten? If past action items are abandoned, the team has learned that postmortems do not matter. Fix the follow-through before running another postmortem.
Who should participate? Engineers who worked on the affected area. QA who tested (or did not test) the affected area. Product owner if the impact was user-facing. Engineering manager if systemic changes are needed. Keep the group to 4-8 people.
What are the current test suite health concerns? Rising flakiness? Slow execution? Coverage gaps in critical areas? Stale quarantine? Health reviews are proactive postmortems -- they prevent incidents instead of reacting to them.

Core Principles

1. Blameless Means Systemic

Blameless does not mean "no one is accountable." It means the analysis focuses on systems, processes, and tools rather than individual performance. "Why did the system allow this defect to escape?" is a blameless question. "Why did the developer not write a test?" is a blame question that stops the analysis too early. The developer did not write a test because: the test framework was hard to use, the PR checklist did not require it, there was no pairing to transfer knowledge, or time pressure made it feel optional. Those are systemic issues with systemic fixes.

2. Focus on Patterns, Not Incidents

A single escaped bug is an anecdote. Three escaped bugs in the same feature area over two months is a pattern. Postmortems should aggregate incidents to find recurring themes: same root cause, same team, same test gap, same phase of the pipeline. Patterns are actionable. Individual incidents are just fire-fighting.

3. Every Postmortem = 1-3 Concrete Action Items

An action item is concrete when it has: a specific deliverable ("add integration tests for the coupon API"), an owner ("assigned to Alex"), a deadline ("by end of sprint 14"), and a verification method ("PR merged, tests passing in CI"). "Improve testing" is not an action item. "Write 5 integration tests for the payment service edge cases by March 30" is.

4. Track to Completion

Action items that are not tracked are not completed. Use the team's existing work tracker (Jira, Linear, GitHub Issues). Tag them (postmortem-action or equivalent). Review completion status at the start of the next postmortem. If items are consistently abandoned, either the items are too large (break them down) or they are not prioritized (make them sprint commitments).

5. Measure Improvement

After implementing action items, measure whether the problem recurred. If the postmortem identified a gap in payment testing and the action was to add integration tests, track: did another payment bug escape? If yes, the action was insufficient. If no, the postmortem worked. Without measurement, postmortems are rituals, not tools.

Bug Pattern Analysis

Categorizing Escaped Defects

When a bug reaches production, classify it along three dimensions to identify prevention opportunities.

Dimension 1: Root Cause Category

Category	Description	Example
Logic error	Business logic incorrect or incomplete	Discount not applied for edge case currency
Integration failure	Two components do not communicate correctly	API returns different format than frontend expects
Data issue	Unexpected data shape, null values, encoding	User with emoji in name breaks CSV export
Race condition	Timing-dependent behavior	Two concurrent checkouts oversell last item
Configuration	Environment-specific settings wrong	Feature flag enabled in staging, disabled in prod
Regression	Previously working behavior broken	Refactor removed null check, old bug returns
Missing requirement	Behavior not specified, gap in product spec	No error handling for expired OAuth tokens
Performance	Functional but too slow under load	Search timeout with 100K+ records

Dimension 2: Which Test Level Should Have Caught It

Level	What it catches	If it escaped this level
Unit	Logic errors, edge cases, boundary conditions	Tests exist but missing edge case? Or no tests at all?
Integration	API contracts, data flow, service interactions	Integration tests exist? Do they cover error responses?
E2E	User journey failures, UI state management	Is this critical path covered? Was the specific scenario tested?
Manual/Exploratory	Visual issues, usability problems, unusual workflows	Was exploratory testing performed? Was the area in scope?
Monitoring	Performance degradation, error rate spikes	Are alerts configured? Are thresholds correct?

Dimension 3: Prevention Opportunity

Opportunity	Action	Example
Add test	Write a test at the appropriate level	Add unit test for currency rounding edge case
Improve existing test	Existing test was too narrow	Extend checkout E2E to include coupon + international currency
Add quality gate	CI check would have caught it	Add schema validation for API responses in CI
Improve requirements	Spec was ambiguous or incomplete	Add acceptance criteria for error states to story template
Add monitoring	Detect sooner even if not prevented	Add alert for error rate > 1% on payment endpoint
Training/Process	Knowledge gap or process gap	Run a session on defensive coding for nullable fields

Bug Pattern Analysis Template

Escaped Bug Analysis: [BUG-ID] [Title]
═══════════════════════════════════════

Timeline:
  Introduced:    [commit/PR/date]
  Released:      [release version/date]
  Detected:      [date, by whom — user report, monitoring, internal]
  Resolved:      [date]
  Time to detect: [hours/days]
  Time to fix:    [hours]

Classification:
  Root cause:         [logic error / integration / data / race condition / ...]
  Should-catch level: [unit / integration / E2E / monitoring]
  Prevention:         [add test / improve test / add gate / improve spec / ...]

Existing Coverage:
  Were there tests for this area?    [yes / no / partial]
  If yes, why did they miss it?      [edge case not covered / wrong assertion / ...]
  If no, why not?                    [area not identified as risky / time pressure / ...]

Impact:
  Users affected:    [count or estimate]
  Revenue impact:    [none / minor / significant / critical]
  Brand impact:      [none / minor / significant / critical]

Action Items:
  1. [Action] — Owner: [name] — Due: [date]
  2. [Action] — Owner: [name] — Due: [date]

Aggregating Patterns Over Time

After analyzing 10+ escaped bugs, look for patterns:

Escaped Bug Summary: [Q1 2026]
═══════════════════════════════

Total escaped bugs: 14

By root cause:
  Logic error:          5  (36%)  ← unit tests needed
  Integration failure:  4  (29%)  ← API contract tests needed
  Data issue:           3  (21%)  ← input validation gaps
  Configuration:        2  (14%)  ← env parity issues

By area:
  Checkout:             6  (43%)  ← highest risk, needs investment
  User management:      4  (29%)
  Reporting:            2  (14%)
  Settings:             2  (14%)

By should-catch level:
  Unit:                 5  (36%)  ← developers not testing edge cases
  Integration:          4  (29%)  ← missing integration test layer
  E2E:                  3  (21%)
  Monitoring:           2  (14%)

Top action themes:
  1. Add integration tests for checkout API (covers 4 of 14 bugs)
  2. Mandate unit tests for all calculation/validation logic (covers 5 of 14)
  3. Add currency and encoding edge cases to test data fixtures (covers 3 of 14)

This aggregation reveals where investment has the highest return: fixing one systemic issue (integration tests for checkout) would have prevented 29% of all escaped bugs.

Test Suite Health Review

A proactive postmortem for the test suite itself. Conduct quarterly or when symptoms appear.

Flaky Test Trends

Flaky Test Trend Review
═══════════════════════

Current flaky rate: _____ % (target: <2%)
Trend (last 3 months):
  Month 1: _____ %
  Month 2: _____ %
  Month 3: _____ %
Direction: [ ] Improving  [ ] Stable  [ ] Worsening

Top 5 flakiest tests (by failure count):
  1. _____________________ — _____ failures — root cause: _____
  2. _____________________ — _____ failures — root cause: _____
  3. _____________________ — _____ failures — root cause: _____
  4. _____________________ — _____ failures — root cause: _____
  5. _____________________ — _____ failures — root cause: _____

Quarantine:
  Tests in quarantine:    _____ count
  Oldest quarantine:      _____ days (target: <14)
  Quarantine resolved this month: _____ count

Execution Time Trend

Track current full suite duration, 3-month trend, and the 5 slowest tests. If duration is increasing, check for: tests that can move to nightly, sequential stages that can parallelize, slow test data setup (use API instead of UI), large test files that need splitting for better shard distribution.

Coverage Gap Review

Track overall coverage (lines/branches), critical paths with insufficient coverage (payments, auth, data export should be 90%+), recently changed code without test updates (cross-reference git log --since="30 days" with coverage report), and features shipped without E2E coverage.

Disabled/Skipped Test Inventory

Audit all skipped/disabled tests by age and reason. Tests skipped < 1 week are likely in progress. Tests skipped 1-4 weeks need a ticket and timeline. Tests skipped 1-3 months are overdue -- fix or delete. Tests skipped > 3 months should be deleted -- they will never be fixed. For each: fix and unskip, delete (obsolete), or move to quarantine with ticket link.

Process Improvement Cycles

The Improvement Sprint

Dedicate a fixed portion of each sprint (10-15% of capacity) to quality improvement, drawn from postmortem action items and health review findings.

Structure:

1. IDENTIFY    — Top 3 pain points from latest retro/postmortem
2. ROOT CAUSE  — 5 Whys analysis for the #1 pain point
3. PROPOSE     — Solution with effort estimate (S/M/L)
4. IMPLEMENT   — One improvement per sprint (start small)
5. MEASURE     — Did the metric improve? By how much?
6. ITERATE     — If not improved, dig deeper. If improved, tackle #2.

5 Whys Root Cause Analysis

The 5 Whys technique peels back surface symptoms to reveal systemic causes. The key discipline: keep asking "why" until you reach a process, system, or structural cause -- not an individual's action.

Example: Payment bug escaped to production

Problem: Users were charged twice for a single purchase.

Why 1: The payment API was called twice on form submit.
Why 2: The submit button was not disabled after the first click.
Why 3: The frontend developer did not implement button disabling.
Why 4: The acceptance criteria did not mention double-submit prevention.
Why 5: The story refinement process does not include edge case review
       for payment-related stories.

Root cause: Process gap — payment stories are not reviewed for transaction
safety edge cases before development begins.

Action: Add a "Payment Safety Checklist" to the story template for any
story touching payment flows. Checklist includes: idempotency,
double-submit prevention, partial failure handling, timeout behavior.
Owner: [Product Manager] — Due: [Next sprint]

5 Whys guidelines:

Stop when you reach something the team can change (process, tool, structure). Asking "why is the budget limited?" goes too far.
The chain may branch -- one symptom may have multiple contributing causes. Follow the most impactful branch.
Verify each "why" with evidence, not assumption. "The developer did not write tests" -- is that true? Check the PR. Maybe tests existed but were insufficient.
If you reach "human error" as a root cause, you have not gone far enough. Humans make errors. The system should make errors difficult or detectable.

Proposing Solutions with Effort Estimates

For each root cause, propose 1-3 solutions at different effort levels:

Root Cause: Payment stories not reviewed for edge cases

Solution A (Small — 1 day):
  Add payment safety checklist to story template.
  + Quick to implement, low maintenance
  − Relies on manual adherence, may be skipped under pressure

Solution B (Medium — 1 sprint):
  Add payment safety checklist + automated linting rule that
  flags PRs touching payment code without corresponding tests.
  + Automated enforcement, catches gaps in CI
  − Requires CI config change, may have false positives

Solution C (Large — 2 sprints):
  Solution B + add integration test suite for all payment
  edge cases (idempotency, timeout, partial failure).
  + Comprehensive protection, catches regressions automatically
  − Significant effort, needs test data and mock infrastructure

Recommendation: Start with A immediately, implement B this sprint,
plan C for next quarter as strategic work.

Postmortem Template for Quality Incidents

Use this template when a significant quality incident occurs (P0/P1 production bug, data loss, security issue, extended outage caused by code change).

# Quality Incident Postmortem: [INCIDENT-ID]

## Summary
[One paragraph: what happened, who was affected, how it was resolved]

## Severity and Impact
- **Severity:** [P0 / P1 / P2]
- **Users affected:** [count or percentage]
- **Duration:** [from detection to resolution]
- **Business impact:** [revenue, reputation, compliance]

## Timeline (all times in UTC)
| Time | Event |
|------|-------|
| HH:MM | [Code change deployed / feature flag enabled] |
| HH:MM | [First user report / monitoring alert] |
| HH:MM | [Incident acknowledged by on-call] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix deployed / rollback completed] |
| HH:MM | [Incident resolved, monitoring confirms recovery] |

## Root Cause
[Technical description of what went wrong]

## 5 Whys
1. Why did [symptom]? Because [cause 1].
2. Why [cause 1]? Because [cause 2].
3. Why [cause 2]? Because [cause 3].
4. Why [cause 3]? Because [cause 4].
5. Why [cause 4]? Because [root cause].

## What Tests Existed
- [List relevant existing tests and why they did not catch this]

## What Tests Were Missing
- [Specific test scenarios that would have prevented this]

## Detection
- **How was it detected?** [User report / monitoring / internal testing]
- **Could it have been detected earlier?** [Yes/No — how?]
- **Time from deploy to detection:** [duration]

## Prevention Measures

### Immediate (this sprint)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| [Write regression test for this specific scenario] | [name] | [date] | [ ] |
| [Add monitoring alert for this error pattern] | [name] | [date] | [ ] |

### Short-term (next 2 sprints)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| [Add integration tests for related edge cases] | [name] | [date] | [ ] |
| [Update deployment checklist] | [name] | [date] | [ ] |

### Long-term (this quarter)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| [Improve test coverage for entire area] | [name] | [date] | [ ] |
| [Process change to prevent similar gaps] | [name] | [date] | [ ] |

## Lessons Learned
- [What went well in detection and response]
- [What could have been better]
- [What systemic issue does this reveal]

Retro Meeting Template

Use this format for regular quality retrospectives (as opposed to incident-specific postmortems). Conduct per-sprint or monthly.

Agenda (30-60 minutes)

Quality Retro: Sprint [N] / [Month Year]
═════════════════════════════════════════

1. Previous Action Items Review (5 min)
   - Review status of action items from last retro
   - Mark completed, carry forward incomplete, escalate blocked

2. Data Review (10 min)
   Present metrics since last retro:
   - Escaped bug count and classification
   - Flaky test rate trend
   - CI pass rate trend
   - Coverage change
   - Test suite duration change
   - Quarantine inventory

3. What Went Well (5 min)
   - Quality wins: bugs caught early, smooth releases, good test coverage
   - Process improvements that paid off

4. What Needs Improvement (10 min)
   - Quality pain points: escaped bugs, flaky tests, slow pipeline, gaps
   - Process friction: review bottlenecks, unclear ownership, tooling issues

5. Root Cause Discussion (10-15 min)
   - Pick the top 1-2 issues from "Needs Improvement"
   - Run 5 Whys or group brainstorming
   - Identify systemic causes

6. Action Items (5-10 min)
   - Define 1-3 specific, assigned, time-bound action items
   - Each item: what, who, when, how to verify
   - Add to team's work tracker with "retro-action" tag

7. Close (2 min)
   - Confirm next retro date
   - Thank participants

Facilitator Notes

Prepare data in advance. Do not spend meeting time pulling up dashboards. Have metrics ready in a shared doc.
Timebox strictly. Quality retros expand to fill available time. 30 minutes is sufficient for a sprint retro. 60 minutes for monthly or incident-triggered.
Rotate facilitation. Different facilitators bring different perspectives. Rotate among QA engineers, developers, and tech leads.
Follow up within 24 hours. Send a summary with action items to the team channel. Link to tracker tickets. This signals that the retro was real work, not a talking exercise.
Review action items at the START of the next retro. This is the accountability mechanism. If items are consistently incomplete, reduce the number of items or reduce their scope.

Anti-Patterns

Blame-Driven Postmortems

Focusing on who made the mistake rather than what system allowed the mistake to reach production. Blame creates fear. Fear creates hiding. Hiding creates bigger incidents. When the question is "who wrote this bug?" people learn to avoid visibility. When the question is "what process gap allowed this?" people learn to improve the process.

Postmortems Without Action Items

A cathartic discussion that produces understanding but no change. If the meeting ends without specific, assigned action items, the same problem will recur. Worse, the team learns that postmortems are therapy sessions, not improvement tools.

Action Items Without Follow-Through

Generating action items that go into a backlog and are never prioritized. This is worse than no action items because it creates the illusion of improvement. If postmortem actions are not completed within 2 sprints, escalate. If they are consistently deprioritized, either the items are too ambitious or the team does not value them -- both need addressing.

Postmortems Only After Incidents

Waiting for a production fire to conduct a quality review. Proactive health reviews (test suite health, coverage trends, flaky test inventory) prevent incidents. Conduct proactive reviews monthly. Reactive postmortems for incidents only supplements the proactive cadence.

Root Cause Analysis That Stops Too Early

"The developer did not write a test" is not a root cause. It is a symptom. Why did they not write a test? Was the framework hard to use? Was there no time? Was there no requirement? Was there no pairing or review? Stopping at the individual level prevents systemic improvement.

Vague Action Items

"Improve test coverage" and "be more careful with deployments" are not action items. They cannot be tracked, measured, or verified. Compare: "Add integration tests for payment webhook handling, covering success, failure, and timeout scenarios. Owner: Alex. Due: Sprint 14. Verification: PR merged with 3 new integration tests passing in CI."

Data-Free Retros

Running quality retrospectives based on feelings and opinions rather than data. "It feels like we have more bugs lately" might be true or might be recency bias. Check the data: is the escaped bug count actually increasing? Where are the bugs concentrated? Without data, the team solves the loudest problem, not the most important one.

Done When

Escaped defect timeline reconstructed (introduced, released, detected, resolved) with supporting evidence from commit history and bug tracker
5 Whys root cause analysis completed and stopped at a systemic cause, not at "developer didn't write a test"
Test gap identified and mapped to a specific coverage hole (missing test type, missing scenario, or missing area)
Action items assigned with named owners and due dates, added to the team's work tracker with a postmortem tag
Findings shared with the team in a written summary — not siloed in QA or lost in a private document

Related Skills

qa-metrics -- Provides the data (defect escape rate, flakiness rate, coverage trends) that postmortems analyze and act upon.
test-reliability -- Flaky test classification and quarantine management, which feeds into test suite health reviews.
test-strategy -- When postmortems reveal systemic gaps, the test strategy is the document that gets updated.
shift-left-testing -- Many postmortem action items are shift-left practices: earlier testing, better requirements, dev/QA pairing.
release-readiness -- Quality gates and release criteria should be updated based on postmortem findings.