risk-based-testing

Discovery Questions

Before building a risk model, gather context from stakeholders across engineering, product, and operations. Check .agents/qa-project-context.md first -- if it exists, use it as the foundation and skip questions already answered there.

Revenue-Critical Flows

  • Which user flows directly generate revenue? (checkout, subscription, billing, upgrades)
  • What is the revenue impact per hour of downtime for each flow?
  • Are there time-sensitive flows? (flash sales, market-hours trading, payroll deadlines)
  • Which flows have contractual SLAs with financial penalties?

Recent Failures

  • What broke in the last 3 releases? What escaped to production?
  • What were the root causes? (code defect, config error, third-party failure, data migration)
  • What was the blast radius of each incident? (users affected, revenue lost, reputation impact)
  • Were there near-misses caught late in testing that could have escaped?

Fragile Areas

  • Which parts of the codebase change most frequently? (high churn = high risk)
  • Which modules have the lowest test coverage today?
  • Which areas have the most complex business logic or the most conditional branches?
  • Which code was written by engineers who have since left the team?

Third-Party Dependencies

  • Which external services does the product depend on? (payment processors, auth providers, CDNs, APIs)
  • What is the historical reliability of each dependency?
  • What happens when each dependency goes down? (graceful degradation or hard failure?)
  • Are there single points of failure with no fallback?

Compliance and Data

  • What regulatory requirements apply? (GDPR, PCI-DSS, HIPAA, SOC2, SOX)
  • What data is most sensitive? (PII, financial, health, credentials)
  • What are the legal consequences of a data breach or compliance violation?
  • Are there audit requirements that mandate specific testing evidence?

Core Principles

1. Not All Features Are Equal

A bug in the checkout flow that prevents purchases is categorically different from a misaligned icon on a settings page. Testing effort must reflect this reality. Equal coverage across all features wastes resources on low-risk areas while leaving critical paths under-tested.

2. Risk = Impact x Probability

Risk is not a gut feeling. It is the product of two dimensions, each scored on an explicit scale: how bad it is if this fails (impact) and how likely it is to fail (probability). Assess the two independently and score them consistently across the product.
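
As a minimal sketch of this formula, assuming both dimensions are scored as integers from 1 to 5 (the scale defined under Phase 2), the composite score could be computed as below. The function name `risk_score` is illustrative and not part of any tool.

```python
def risk_score(impact: int, probability: int) -> int:
    """Return impact x probability; both inputs are assumed to be 1-5 scores."""
    for name, value in (("impact", impact), ("probability", probability)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * probability


# Same impact, different probability: the two dimensions are scored independently.
print(risk_score(impact=5, probability=4))  # 20
print(risk_score(impact=5, probability=2))  # 10
```

Rejecting out-of-range values keeps scores comparable across features, which is the point of scoring consistently.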

3. Risk Assessment Is Continuous

A risk model created once and never updated is dangerous because it creates false confidence. Risk changes when the product changes, when dependencies change, when the team changes, and after every production incident. Build reassessment into the development rhythm.

4. Near-Misses Are Data

A bug caught in staging that would have been catastrophic in production is not a success story -- it is a signal that the risk model underestimated that area. Track near-misses with the same rigor as production incidents.

5. Risk Informs Coverage, Not the Other Way Around

Do not start with "we need 80% coverage everywhere." Start with "where would a failure hurt most?" and let the risk model drive coverage targets per module.


Workflow

1. Identify → 2. Classify → 3. Analyze → 4. Heatmap → 5. Coverage → 6. Reassess → (repeat)
Phase 3 (Analyze) applies only to items scoring ≥ 10; if no item reaches that threshold, skip to Phase 4.

Phase 1: Risk Identification

Enumerate everything that could go wrong. Cast a wide net. Sources include:

  • Stakeholder interviews: Product managers know business-critical flows. Engineers know fragile code. Support knows recurring user complaints.
  • Incident history: Past failures predict future failures. Review post-mortems from the last 6-12 months.
  • Dependency mapping: List every external service, database, message queue, and third-party API. Each is a risk vector.
  • Change analysis: Areas with frequent code changes (use git log --stat) have higher defect probability.
  • Architecture review: Shared databases, single points of failure, synchronous chains, and tightly coupled modules amplify blast radius.

Output: A raw list of risk items, each describing what could fail and what the consequence would be.

Phase 2: Risk Classification

Categorize each risk item along two axes.

Impact categories (how bad is it):

| Score | Level        | Definition                                                    | Examples                                   |
|-------|--------------|---------------------------------------------------------------|--------------------------------------------|
| 5     | Catastrophic | Revenue loss, data breach, legal action, user safety          | Payment processing fails, PII exposed      |
| 4     | Major        | Significant user impact, SLA violation, major feature broken  | Login broken for segment, data corruption  |
| 3     | Moderate     | Workflow disrupted, workaround exists                         | Search returns wrong results, export fails |
| 2     | Minor        | Cosmetic or minor UX issue                                    | Alignment bug, slow non-critical page      |
| 1     | Negligible   | No user impact, internal only                                 | Admin tooltip wrong, log format issue      |

Probability categories (how likely is it):

| Score | Level    | Definition                            | Indicators                                        |
|-------|----------|---------------------------------------|---------------------------------------------------|
| 5     | Frequent | Expected in most releases             | High code churn, no tests, complex logic          |
| 4     | Likely   | Will probably happen within a quarter | Recent changes, partial coverage, known tech debt |
| 3     | Possible | Could happen, has happened before     | Moderate complexity, some coverage                |
| 2     | Unlikely | Improbable but not impossible         | Stable code, good coverage, simple logic          |
| 1     | Rare     | Requires exceptional circumstances    | Well-tested, rarely changed, simple               |
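
One way to keep scoring consistent is to encode the two scales as enumerations, so a score can only be one of the named levels. This is a sketch only: the class names `Impact`, `Probability`, and `ClassifiedRisk` are hypothetical, and the sample scores are the ones given in this document's real-world examples.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    NEGLIGIBLE = 1
    MINOR = 2
    MODERATE = 3
    MAJOR = 4
    CATASTROPHIC = 5


class Probability(IntEnum):
    RARE = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    FREQUENT = 5


@dataclass(frozen=True)
class ClassifiedRisk:
    name: str
    impact: Impact
    probability: Probability

    @property
    def score(self) -> int:
        # Composite score: impact multiplied by probability.
        return int(self.impact) * int(self.probability)


risks = [
    ClassifiedRisk("Authentication flows", Impact.CATASTROPHIC, Probability.UNLIKELY),
    ClassifiedRisk("E-commerce checkout", Impact.CATASTROPHIC, Probability.LIKELY),
    ClassifiedRisk("Content loading", Impact.MAJOR, Probability.POSSIBLE),
]

# Highest composite score first.
for risk in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{risk.score:>2}  {risk.name} ({risk.impact.name} x {risk.probability.name})")
```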

Phase 3: Failure Mode Analysis

For each high-risk item (score >= 10), perform a detailed failure mode analysis.

Failure Mode Analysis Template:

Feature/Component: [name]
Risk Score: [impact x probability]

Failure Mode 1: [what specifically can fail]
  Trigger:          [what causes this failure]
  Blast Radius:     [users affected, systems affected, data affected]
  Detection Method: [how would we know this happened -- monitoring, user report, test]
  Current Mitigation: [existing tests, monitoring, feature flags, fallbacks]
  Gap:              [what is missing from current mitigation]

Failure Mode 2: ...

Example -- E-commerce Checkout:

Feature/Component: Checkout Flow
Risk Score: 20 (Impact: 5, Probability: 4)

Failure Mode 1: Payment charge succeeds but order not recorded
  Trigger:          Race condition between payment API callback and order write
  Blast Radius:     Individual users; money charged but no order confirmation
  Detection Method: Payment reconciliation job (runs hourly), user complaint
  Current Mitigation: Idempotency key on payment, retry on order write
  Gap:              No automated test for the race condition; reconciliation delay is 1 hour

Failure Mode 2: Discount code applies incorrect amount
  Trigger:          Percentage discount on already-discounted item
  Blast Radius:     All users with stacked discounts; revenue leakage
  Detection Method: Margin monitoring alert (>5% deviation)
  Current Mitigation: Unit tests for single discounts
  Gap:              No tests for discount stacking; no tests for rounding edge cases

Failure Mode 3: Inventory not reserved during checkout
  Trigger:          Concurrent purchases of last-stock item
  Blast Radius:     Oversold items, fulfillment failure, customer trust
  Detection Method: Fulfillment team discovers during packing
  Current Mitigation: Database-level stock check on order creation
  Gap:              No load test simulating concurrent last-item purchases
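
The template can also be kept as structured data, which makes it easy to select the components that need analysis and to list their open gaps. The sketch below assumes Python 3.9 or later; the class and function names are hypothetical, and the sample values come from the checkout example above and the footer link in the populated heatmap.

```python
from dataclasses import dataclass, field

FMA_THRESHOLD = 10  # from the text: analyze items with score >= 10


@dataclass
class FailureMode:
    description: str
    trigger: str
    blast_radius: str
    detection_method: str
    current_mitigation: str
    gap: str


@dataclass
class Component:
    name: str
    impact: int
    probability: int
    failure_modes: list[FailureMode] = field(default_factory=list)

    @property
    def score(self) -> int:
        return self.impact * self.probability


def needs_analysis(components: list[Component]) -> list[Component]:
    """Keep only components whose risk score reaches the analysis threshold."""
    return [c for c in components if c.score >= FMA_THRESHOLD]


checkout = Component(
    name="Checkout Flow",
    impact=5,
    probability=4,
    failure_modes=[
        FailureMode(
            description="Payment charge succeeds but order not recorded",
            trigger="Race condition between payment API callback and order write",
            blast_radius="Individual users; money charged but no order confirmation",
            detection_method="Payment reconciliation job (runs hourly), user complaint",
            current_mitigation="Idempotency key on payment, retry on order write",
            gap="No automated test for the race condition; reconciliation delay is 1 hour",
        )
    ],
)
footer_link = Component(name="Footer link", impact=2, probability=1)

for component in needs_analysis([checkout, footer_link]):
    print(f"{component.name} (score {component.score})")
    for mode in component.failure_modes:
        print(f"  Gap: {mode.gap}")
```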

Phase 4: Risk Heatmap

Visualize all risk items on a 5x5 matrix to communicate priorities to stakeholders and drive coverage decisions.

Risk Heatmap Template

                    PROBABILITY
                    Rare(1)   Unlikely(2)  Possible(3)  Likely(4)   Frequent(5)
                   +----------+-----------+-----------+----------+-----------+
  Catastrophic(5)  |  5  MED  | 10  HIGH  | 15  CRIT  | 20 CRIT  | 25  CRIT  |
                   +----------+-----------+-----------+----------+-----------+
  Major(4)         |  4  LOW  |  8  MED   | 12  HIGH  | 16 CRIT  | 20  CRIT  |
I                  +----------+-----------+-----------+----------+-----------+
M  Moderate(3)     |  3  LOW  |  6  MED   |  9  MED   | 12 HIGH  | 15  CRIT  |
P                  +----------+-----------+-----------+----------+-----------+
A  Minor(2)        |  2  LOW  |  4  LOW   |  6  MED   |  8 MED   | 10  HIGH  |
C                  +----------+-----------+-----------+----------+-----------+
T  Negligible(1)   |  1  LOW  |  2  LOW   |  3  LOW   |  4 LOW   |  5  MED   |
                   +----------+-----------+-----------+----------+-----------+

Color coding and action mapping:

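| Zone     | Score Range | Color  | Testing Action                                                          |
|----------|-------------|--------|-------------------------------------------------------------------------|
| CRITICAL | 15-25       | Red    | Automate fully + monitor in production + load test + manual exploratory |
| HIGH     | 10-14       | Orange | Automate fully + periodic manual review                                 |
| MEDIUM   | 5-9         | Yellow | Automate happy path + key error cases                                   |
| LOW      | 1-4         | Green  | Manual testing on release or skip entirely                              |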
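
The zone boundaries above translate directly into a lookup. The following is a sketch, assuming Python 3.9 or later, with the thresholds and actions copied from the table; `ZONES` and `risk_zone` are illustrative names.

```python
# (lower bound, zone, testing action), ordered from highest to lowest threshold.
ZONES = [
    (15, "CRITICAL", "Automate fully + monitor in production + load test + manual exploratory"),
    (10, "HIGH", "Automate fully + periodic manual review"),
    (5, "MEDIUM", "Automate happy path + key error cases"),
    (1, "LOW", "Manual testing on release or skip entirely"),
]


def risk_zone(score: int) -> tuple[str, str]:
    """Map a composite score (1-25) to its (zone, testing action) pair."""
    if not 1 <= score <= 25:
        raise ValueError(f"score must be between 1 and 25, got {score}")
    for lower_bound, zone, action in ZONES:
        if score >= lower_bound:
            return zone, action
    raise AssertionError("unreachable: every valid score matches a zone")


zone, action = risk_zone(16)
print(zone)    # CRITICAL
print(action)  # Automate fully + monitor in production + load test + manual exploratory
```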

Populated Heatmap Example

                    Rare(1)   Unlikely(2)  Possible(3)  Likely(4)   Frequent(5)
                   +----------+-----------+-----------+----------+-----------+
  Catastrophic(5)  |          | Auth      | Payments  | Checkout |           |
                   |          | bypass    | fail      | crash    |           |
                   +----------+-----------+-----------+----------+-----------+
  Major(4)         |          |           | Data      | Search   | User      |
                   |          |           | export    | broken   | upload    |
                   +----------+-----------+-----------+----------+-----------+
  Moderate(3)      |          | Report    | Email     | Profile  |           |
                   |          | format    | delivery  | edit     |           |
                   +----------+-----------+-----------+----------+-----------+
  Minor(2)         | Footer   | Tooltip   | Theme     |          |           |
                   | link     | text      | switch    |          |           |
                   +----------+-----------+-----------+----------+-----------+
  Negligible(1)    | Admin    |           |           |          |           |
                   | label    |           |           |          |           |
                   +----------+-----------+-----------+----------+-----------+
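
A populated heatmap is a grouping of risk items by their (impact, probability) cell. The sketch below shows that grouping for a few items from the example above and prints a rough text grid; the names `build_heatmap` and `IMPACT_LABELS` are hypothetical, and the output is not meant to reproduce the exact layout shown.

```python
from collections import defaultdict

IMPACT_LABELS = {5: "Catastrophic", 4: "Major", 3: "Moderate", 2: "Minor", 1: "Negligible"}

# (name, impact, probability) for a few items from the populated example.
items = [
    ("Checkout crash", 5, 4),
    ("Payments fail", 5, 3),
    ("Auth bypass", 5, 2),
    ("User upload", 4, 5),
    ("Footer link", 2, 1),
]


def build_heatmap(entries):
    """Group item names by (impact, probability) cell."""
    grid = defaultdict(list)
    for name, impact, probability in entries:
        grid[(impact, probability)].append(name)
    return grid


grid = build_heatmap(items)
for impact in range(5, 0, -1):  # highest impact on the top row
    cells = [", ".join(grid.get((impact, p), [])) or "-" for p in range(1, 6)]
    print(f"{IMPACT_LABELS[impact]:<13} | " + " | ".join(f"{c:<14}" for c in cells))
```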

Phase 5: Test Coverage Alignment

Map test density to risk level. Every risk zone gets a prescribed testing approach.

Coverage Requirements by Risk Zone

| Risk Zone        | Unit Tests           | Integration Tests      | E2E Tests                       | Manual Testing           | Monitoring                         |
|------------------|----------------------|------------------------|---------------------------------|--------------------------|------------------------------------|
| CRITICAL (15-25) | 90%+ branch coverage | All service boundaries | Full user journey + error paths | Exploratory each release | Real-time alerts, synthetic checks |
| HIGH (10-14)     | 80%+ branch coverage | Key interactions       | Happy path + top 3 error paths  | Spot checks              | Dashboard + daily review           |
| MEDIUM (5-9)     | 70%+ branch coverage | Happy path only        | Happy path only                 | On major changes         | Weekly review                      |
| LOW (1-4)        | Basic happy path     | None required          | None required                   | On initial build         | None required                      |
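
These requirements can be held as a lookup keyed by zone, which lets a script check measured coverage against the target. This is a sketch: the values are copied from the table, while `CoverageRequirement`, `REQUIREMENTS`, and `unit_target_met` are hypothetical names, and the coverage figures in the usage lines are made up for illustration.

```python
from typing import NamedTuple, Optional


class CoverageRequirement(NamedTuple):
    unit_branch_pct: Optional[int]  # None: no percentage target ("Basic happy path")
    integration: str
    e2e: str
    manual: str
    monitoring: str


REQUIREMENTS = {
    "CRITICAL": CoverageRequirement(
        90, "All service boundaries", "Full user journey + error paths",
        "Exploratory each release", "Real-time alerts, synthetic checks",
    ),
    "HIGH": CoverageRequirement(
        80, "Key interactions", "Happy path + top 3 error paths",
        "Spot checks", "Dashboard + daily review",
    ),
    "MEDIUM": CoverageRequirement(
        70, "Happy path only", "Happy path only",
        "On major changes", "Weekly review",
    ),
    "LOW": CoverageRequirement(
        None, "None required", "None required",
        "On initial build", "None required",
    ),
}


def unit_target_met(zone: str, actual_branch_pct: float) -> bool:
    """True when measured branch coverage satisfies the zone's unit-test target."""
    target = REQUIREMENTS[zone].unit_branch_pct
    return target is None or actual_branch_pct >= target


print(unit_target_met("CRITICAL", 85.0))  # False: below the 90% target
print(unit_target_met("MEDIUM", 74.5))    # True
```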

Gap Identification

Compare current coverage against required coverage per risk zone:

Gap Analysis Worksheet:

Feature: [name]
Risk Zone: [CRITICAL / HIGH / MEDIUM / LOW]
Risk Score: [number]

Required Coverage:
  Unit:        [target %]     Current: [actual %]     Gap: [delta]
  Integration: [required?]    Current: [exists? y/n]  Gap: [missing scenarios]
  E2E:         [required?]    Current: [exists? y/n]  Gap: [missing flows]
  Monitoring:  [required?]    Current: [exists? y/n]  Gap: [missing alerts]

Priority: [P0 / P1 / P2 / P3]
Estimated Effort: [hours / story points]
Owner: [name]
Target Sprint: [sprint number]
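
The worksheet's gap columns can be derived rather than filled in by hand. The sketch below, assuming Python 3.9 or later, models only the coverage fields of the worksheet; `GapWorksheet` is a hypothetical name and the current-coverage number is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GapWorksheet:
    feature: str
    risk_zone: str
    unit_target_pct: float
    unit_current_pct: float
    missing_integration: list[str] = field(default_factory=list)
    missing_e2e: list[str] = field(default_factory=list)
    missing_alerts: list[str] = field(default_factory=list)

    @property
    def unit_gap_pct(self) -> float:
        # Delta between target and current coverage; never negative.
        return max(0.0, self.unit_target_pct - self.unit_current_pct)

    @property
    def has_gap(self) -> bool:
        return bool(
            self.unit_gap_pct
            or self.missing_integration
            or self.missing_e2e
            or self.missing_alerts
        )


checkout = GapWorksheet(
    feature="Checkout Flow",
    risk_zone="CRITICAL",
    unit_target_pct=90.0,
    unit_current_pct=72.0,  # illustrative figure, not measured data
    missing_integration=["inventory reservation under concurrency"],
)
print(checkout.unit_gap_pct)  # 18.0
print(checkout.has_gap)       # True
```

Priority, effort, owner, and target sprint remain human decisions and are left out of the sketch.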

Phase 6: Monitoring and Reassessment

Risk assessment is not a one-time activity. Build reassessment into the team's rhythm.

Reassessment triggers:

  • After every production incident (within 48 hours)
  • When a new feature area is introduced
  • When a critical dependency changes (API version, provider switch)
  • When team composition changes significantly
  • Quarterly at minimum, even without triggers
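
The two time-based triggers (48 hours after an incident, quarterly at minimum) can be turned into a deadline check. This is a sketch under stated assumptions: a quarter is approximated as 91 days, the function name is hypothetical, and the event-based triggers (new feature area, dependency change, team change) are not modeled.

```python
from datetime import datetime, timedelta
from typing import Optional

INCIDENT_WINDOW = timedelta(hours=48)   # from the text: within 48 hours of an incident
QUARTERLY_CADENCE = timedelta(days=91)  # assumption: a quarter approximated as 91 days


def next_reassessment_deadline(
    last_assessed: datetime, last_incident: Optional[datetime] = None
) -> datetime:
    """Return the latest acceptable time for the next reassessment."""
    deadline = last_assessed + QUARTERLY_CADENCE
    if last_incident is not None and last_incident > last_assessed:
        # An incident since the last assessment pulls the deadline forward.
        deadline = min(deadline, last_incident + INCIDENT_WINDOW)
    return deadline


assessed = datetime(2026, 1, 5, 9, 0)
incident = datetime(2026, 2, 10, 14, 30)
print(next_reassessment_deadline(assessed, incident))  # 2026-02-12 14:30:00
```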

Continuous risk signals to monitor:

  • Code churn by module: git log --since="3 months ago" --format='' --name-only | sort | uniq -c | sort -rn | head -20 (this counts changes per file; group the paths by directory for a per-module view)
  • Defect clustering: Which modules produce the most bugs? Track with issue labels.
  • Near-miss frequency: How often do staging/QA catches prevent production incidents?
  • Dependency health: Monitor status pages and uptime of critical third-party services.
  • Coverage trends: Is coverage increasing or decreasing in high-risk areas?
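
For the churn signal, file-level counts usually need to be rolled up to modules. The sketch below assumes a module corresponds to a directory prefix of configurable depth and takes the changed paths as input, for example the lines printed by git log --name-only. The function does not call git itself, and `churn_by_module` is a hypothetical name.

```python
from collections import Counter


def churn_by_module(changed_paths, depth=1):
    """Count changed files per directory prefix, most-changed first."""
    counts = Counter()
    for path in changed_paths:
        path = path.strip()
        if not path:
            continue  # skip blank lines in the log output
        directories = path.split("/")[:-1]  # drop the file name
        module = "/".join(directories[:depth]) or "(root)"
        counts[module] += 1
    return counts.most_common()


sample = [
    "src/checkout/cart.py",
    "src/checkout/pay.py",
    "src/search/index.py",
    "",
    "README.md",
]
for module, changes in churn_by_module(sample, depth=2):
    print(f"{changes:>3}  {module}")
```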

Real-World Examples

Example 1: E-commerce Checkout

Risk profile: Impact 5, Probability 4, Score 20 (CRITICAL)

Failure modes identified:

  • Payment charged but order not created (race condition)
  • Discount stacking applies incorrect total
  • Inventory oversold under concurrent load
  • Shipping calculator returns wrong rate for international addresses
  • Tax calculation wrong for specific jurisdictions

Test coverage prescribed:

  • Unit tests: discount calculation (all combinations), tax rules (per jurisdiction), inventory decrement logic
  • Integration tests: payment gateway communication (success, failure, timeout, duplicate), order creation pipeline, inventory reservation under concurrency
  • E2E tests: full checkout flow (guest + logged in), checkout with discount, checkout with international shipping, checkout retry after payment failure
  • Load tests: 100 concurrent checkouts for last-stock item
  • Monitoring: real-time order completion rate, payment-to-order reconciliation every 5 minutes, revenue anomaly detection

Example 2: Content Loading (Media Platform)

Risk profile: Impact 4, Probability 3, Score 12 (HIGH)

Failure modes identified:

  • CDN cache miss causes origin overload
  • Video transcoding fails silently for specific codecs
  • Thumbnail generation timeout leaves blank images
  • Content recommendation engine returns stale or empty results

Test coverage prescribed:

  • Unit tests: transcoding pipeline input validation, recommendation scoring algorithm
  • Integration tests: CDN purge/refresh flow, transcoding job queue processing, thumbnail generation for each supported format
  • E2E tests: content upload through playback, content discovery through recommendation click
  • Monitoring: CDN hit ratio, transcoding failure rate, thumbnail generation latency p99

Example 3: Third-Party API Integration

Risk profile: Impact 4, Probability 4, Score 16 (CRITICAL)

Failure modes identified:

  • API rate limit exceeded during peak traffic
  • API response schema changes without notice (breaking deserialization)
  • API timeout causes cascade failure in synchronous call chain
  • API returns 200 with error body (non-standard error handling)

Test coverage prescribed:

  • Unit tests: response parser for all known response shapes including malformed responses, rate limit backoff calculation, circuit breaker state transitions
  • Integration tests: API contract tests (validate response schema against expected shape), timeout handling, retry behavior, circuit breaker activation
  • E2E tests: user flow when API is slow (degraded but functional), user flow when API is down (graceful fallback)
  • Monitoring: API response time p50/p95/p99, error rate, rate limit proximity, circuit breaker state
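
To illustrate the unit-test layer for the "200 with error body" failure mode, here is a sketch of a response parser with a few checks. Everything about the upstream API is an assumption: the JSON body, the top-level "error" key, and the names `ApiError` and `parse_response` are hypothetical and would need to match the real provider's contract.

```python
import json


class ApiError(Exception):
    """Raised when the upstream response signals a failure."""


def parse_response(status: int, body: str) -> dict:
    """Return the decoded payload, or raise ApiError for any failure shape."""
    if status != 200:
        raise ApiError(f"unexpected status {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ApiError("malformed JSON body") from exc
    if not isinstance(payload, dict):
        raise ApiError("unexpected response shape")
    if "error" in payload:
        # Assumed convention: failures are reported inside a 200 body under "error".
        raise ApiError(str(payload["error"]))
    return payload


def raises_api_error(status: int, body: str) -> bool:
    """Small helper for the checks below."""
    try:
        parse_response(status, body)
    except ApiError:
        return True
    return False


assert parse_response(200, '{"id": 1}') == {"id": 1}
assert raises_api_error(200, '{"error": "quota exceeded"}')  # 200 with error body
assert raises_api_error(200, "not json")                      # malformed response
assert raises_api_error(503, "{}")                            # non-200 status
```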

Example 4: Authentication Flows

Risk profile: Impact 5, Probability 2, Score 10 (HIGH)

Failure modes identified:

  • Session token not invalidated on password change
  • OAuth callback race condition allows account takeover
  • MFA bypass through API endpoint that skips MFA check
  • Rate limiting not enforced on login endpoint (brute force)

Test coverage prescribed:

  • Unit tests: token generation and validation, password hashing, MFA code verification, rate limit counter logic
  • Integration tests: full auth flow (register, login, logout, password reset), session invalidation on credential change, OAuth flow with all supported providers, MFA enrollment and verification
  • E2E tests: login flow (valid credentials, invalid, locked account), password reset flow, MFA flow
  • Security tests: brute force attempt (verify rate limiting), session fixation, token reuse after logout
  • Monitoring: failed login rate spike, unusual session patterns, MFA bypass attempts

Anti-Patterns

Testing Everything Equally

Applying the same coverage targets and test density to every feature regardless of risk. A 90% coverage target on a settings page wastes effort that should go toward payment processing or authentication. Let the risk model drive allocation.

One-Time Risk Assessment

Creating a risk matrix during planning and never updating it. The product, team, and dependencies all change continuously. A risk model from 6 months ago does not reflect today's reality. Schedule reassessment and enforce it.

Ignoring Near-Misses

Treating bugs caught in staging or QA as pure successes. Near-misses are risk signals. If a critical bug was caught only by manual testing in staging, that means the automated safety net has a gap. Document near-misses and adjust the risk model.

Risk Theater

Going through the motions of risk assessment (filling in matrices, creating heatmaps) without actually changing test allocation. If the risk heatmap exists but test coverage does not align to it, the exercise was wasted. Verify alignment quarterly.

Anchoring on Historical Risk

Over-weighting past incidents and under-weighting new risk vectors. A module that failed 2 years ago and has since been rewritten may no longer be high risk. Conversely, a new integration with a third party has unknown risk that deserves attention.

Confusing Severity with Priority

Severity measures how bad a failure is. Priority measures how urgently to test it. A catastrophic but extremely rare failure (earthquake destroys data center: 5 x 1 = 5, MEDIUM) can rank lower than a moderate but frequent failure (search results occasionally wrong: 3 x 5 = 15, CRITICAL). The risk matrix accounts for both dimensions -- use the composite score, not impact alone.


Done When

  • A risk matrix exists with every in-scope feature scored on both impact (1-5) and probability (1-5) axes
  • Each feature's composite risk score places it in a named zone (CRITICAL, HIGH, MEDIUM, or LOW) with a corresponding testing action assigned
  • Features scoring 10+ have a completed failure mode analysis with blast radius, detection method, and coverage gap documented
  • Test coverage requirements per risk zone are mapped against current coverage, with gaps explicitly listed and assigned to an owner and target sprint
  • Reassessment triggers and cadence are documented (quarterly minimum, plus post-incident)

Related Skills

  • test-strategy -- The overall QA strategy document that risk-based testing feeds into; risk assessment is one component of a broader strategy.
  • test-planning -- Sprint-level test planning uses risk priorities to decide what to test in each iteration.
  • release-readiness -- Release go/no-go decisions should reference the risk heatmap to ensure critical areas are covered.
  • qa-metrics -- Defect escape rate and defect clustering metrics feed back into risk reassessment.
  • qa-project-context -- The project context file captures risk-relevant information (critical flows, known fragile areas, dependencies) that this skill consumes.
First seen: Apr 1, 2026