testing-in-production
Discovery Questions
Check .agents/qa-project-context.md first. If it exists, use it as context and skip questions already answered there.
Feature flag system:
- Do you have a feature flag platform? (LaunchDarkly, Unleash, Flagsmith, Split, custom, none)
- How are flags managed? (Dashboard, config file, environment variables)
- Can flags target specific users, percentages, or segments?
- How many active flags exist today? Is there a cleanup process?
Rollout capability:
- Can you deploy to a subset of traffic? (Canary infrastructure, weighted routing, feature flags)
- How long does a deployment take? How long does a rollback take?
- Do you have blue-green or rolling deployments?
- Can you route traffic by region, user cohort, or percentage?
Monitoring maturity:
- What observability is in place? (APM, logging, error tracking, metrics)
- Do you have dashboards for error rate, latency, and business metrics?
- Are alerts configured with appropriate thresholds?
- Can you compare metrics between canary and baseline in real time?
Production access and safety:
- Who has production access? Is there an approval process?
- Are there dedicated test accounts in production?
- Can you run operations in production without affecting real user data?
- Is there a production incident response process?
Core Principles
1. Production is the final test environment
Staging approximates production. It does not replicate production's data volume, traffic patterns, third-party integrations, infrastructure quirks, or user behavior. Testing in production is not reckless -- it is realistic. The question is not whether to test in production, but how to do it safely.
2. Safety through blast radius control
Every production test must answer: "If this goes wrong, how many users are affected?" The answer must be as small as possible. Feature flags, canary deploys, and traffic splitting exist to shrink the blast radius from 100% to 1% or less.
3. Always have a rollback plan
Before any production test begins, the rollback mechanism must be identified, tested, and fast. "Disable the flag" is a good rollback plan. "Redeploy the previous version" is acceptable. "We'll figure it out" is not a plan.
4. Monitoring is a prerequisite, not a nice-to-have
You cannot test in production without monitoring. If you cannot measure error rates, latency, and business metrics in real time, you cannot detect problems. Fix monitoring gaps before adding production tests.
5. Production tests must be non-destructive
Production tests must never corrupt real user data, send real notifications to real users, charge real payment methods, or create side effects that require manual cleanup. Synthetic accounts, test flags, and isolated resources are mandatory.
Feature Flag Testing
Feature flags are the safest mechanism for production testing. They decouple deployment from release and provide instant rollback.
Test with flags ON and OFF
Every flagged feature needs tests in both states. The flag-off path is the rollback path and must work flawlessly.
// Test the feature-on experience
test('new checkout flow renders when flag is enabled', async ({ page }) => {
  await setFeatureFlag('new-checkout', true, { userId: TEST_USER_ID });
  await page.goto('/checkout');
  await expect(page.getByRole('heading', { name: 'Express Checkout' })).toBeVisible();
  await expect(page.getByRole('button', { name: 'Pay with saved card' })).toBeEnabled();
});
// Test the feature-off fallback
test('legacy checkout flow renders when flag is disabled', async ({ page }) => {
  await setFeatureFlag('new-checkout', false, { userId: TEST_USER_ID });
  await page.goto('/checkout');
  await expect(page.getByRole('heading', { name: 'Checkout' })).toBeVisible();
  await expect(page.getByRole('form', { name: 'Payment details' })).toBeVisible();
});
Flag lifecycle testing
Flags are not just on or off. They transition through states, and each transition must be validated.
Flag lifecycle:
Created → Targeting internal users → Canary (1%) → Partial (10-50%) → Full (100%) → Cleanup (removed)
Test at each stage:
- Internal: Feature works for internal accounts, hidden from external
- Canary: Metrics are comparable between flag-on and flag-off cohorts
- Partial: No performance degradation at scale
- Full: All user segments work correctly
- Cleanup: Code with flag removed behaves identically to flag-on
Stale flag cleanup
Flags left in code become technical debt. Run a weekly CI job that queries the flag provider for flags that are 100% rolled out and older than 14 days. These are candidates for code cleanup -- the flag branching logic should be removed and only the enabled path retained.
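The weekly cleanup check can be sketched as a pure filter over flag records. This is a minimal illustration, not a provider integration: the `FlagRecord` shape and the `fullyRolledOutAt` field are hypothetical, and it assumes "older than 14 days" means 14 days since the flag reached 100%. A real job would first map the provider's API response into this shape.

```typescript
// Hypothetical flag record; real flag providers expose different field names.
interface FlagRecord {
  key: string;
  rolloutPercentage: number; // 0 to 100
  fullyRolledOutAt: Date | null; // null until the flag first reaches 100%
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the keys of flags that have been at 100% for longer than staleAfterDays.
function findStaleFlags(
  flags: FlagRecord[],
  now: Date,
  staleAfterDays: number = 14,
): string[] {
  const cutoffMs = now.getTime() - staleAfterDays * MS_PER_DAY;
  return flags
    .filter((flag) => {
      if (flag.rolloutPercentage !== 100 || flag.fullyRolledOutAt === null) {
        return false;
      }
      return flag.fullyRolledOutAt.getTime() < cutoffMs;
    })
    .map((flag) => flag.key);
}
```

The CI job can print the returned keys or open a cleanup ticket per flag; failing the build outright is usually too aggressive for a weekly report.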
Flag combination testing
When multiple flags interact, test the combinations that matter. Do not test all 2^N combinations -- focus on flags that affect the same user flow.
// Identify interacting flags by feature area
const checkoutFlags = ['new-checkout', 'express-pay', 'discount-engine-v2'];
// Test the critical combinations
const criticalCombinations = [
  { 'new-checkout': true, 'express-pay': true, 'discount-engine-v2': true }, // all new
  { 'new-checkout': true, 'express-pay': false, 'discount-engine-v2': true }, // mixed
  { 'new-checkout': false, 'express-pay': false, 'discount-engine-v2': false }, // all legacy
];
for (const combo of criticalCombinations) {
  test(`checkout with flags: ${JSON.stringify(combo)}`, async ({ page }) => {
    for (const [flag, value] of Object.entries(combo)) {
      await setFeatureFlag(flag, value, { userId: TEST_USER_ID });
    }
    await page.goto('/checkout');
    // Assert checkout completes without errors
    await page.getByRole('button', { name: /place order/i }).click();
    await expect(page.getByText(/order confirmed/i)).toBeVisible();
  });
}
Progressive Rollout
Canary stages: 1% to 100%
A structured rollout with explicit promotion criteria at each stage.
| Stage | Traffic | Hold Time | Key Checks |
|---|---|---|---|
| Canary | 1% | 15-30 min | Error rate, crash rate, exceptions |
| Early adopters | 10% | 1-2 hours | Latency P95, conversion rate |
| Partial | 50% | 2-4 hours | All guardrails, business metrics |
| Full | 100% | 24 hours monitoring | Long-tail issues, batch job compatibility |
Automated promotion criteria
Define machine-checkable conditions for advancing between stages. Human override remains available but should be rare.
# rollout-policy.yaml
canary_to_10_percent:
  hold_duration: 30m
  conditions:
    - metric: error_rate_5xx
      comparison: less_than
      threshold: 0.5%
      window: 15m
    - metric: latency_p95
      comparison: less_than
      threshold: 500ms
      window: 15m
    - metric: crash_rate
      comparison: equals
      threshold: 0
      window: 15m
10_percent_to_50_percent:
  hold_duration: 2h
  conditions:
    - metric: error_rate_5xx
      comparison: less_than
      threshold: 0.5%
      window: 1h
    - metric: latency_p95
      comparison: less_than
      threshold: 500ms
      window: 1h
    - metric: conversion_rate
      comparison: within_percentage
      baseline: pre_deploy_average
      tolerance: 5%
      window: 1h
50_percent_to_100_percent:
  hold_duration: 4h
  conditions:
    - metric: error_rate_5xx
      comparison: less_than
      threshold: 0.3%
      window: 2h
    - metric: all_guardrails
      comparison: passing
      window: 2h
    - metric: customer_reported_issues
      comparison: equals
      threshold: 0
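To show how such a policy might be evaluated, here is a minimal sketch of a promotion check. It assumes the policy has already been parsed into numeric conditions (percentages as plain numbers, durations resolved elsewhere) and that observed metrics arrive as a map keyed by metric name. The `Condition` type and the `canPromote` function are illustrative names, not part of any rollout tool.

```typescript
// Illustrative condition shapes mirroring three comparisons used in the policy above.
type Condition =
  | { metric: string; comparison: 'less_than'; threshold: number }
  | { metric: string; comparison: 'equals'; threshold: number }
  | { metric: string; comparison: 'within_percentage'; baseline: number; tolerancePct: number };

// Checks a single observed value against its condition.
function conditionPasses(condition: Condition, observed: number): boolean {
  switch (condition.comparison) {
    case 'less_than':
      return observed < condition.threshold;
    case 'equals':
      return observed === condition.threshold;
    case 'within_percentage': {
      if (condition.baseline === 0) return observed === 0;
      const deviationPct =
        (Math.abs(observed - condition.baseline) / Math.abs(condition.baseline)) * 100;
      return deviationPct <= condition.tolerancePct;
    }
  }
}

// Promotion requires every condition to pass; a missing metric counts as a failure.
function canPromote(
  conditions: Condition[],
  observed: Map<string, number>,
): { promote: boolean; failed: string[] } {
  const failed: string[] = [];
  for (const condition of conditions) {
    const value = observed.get(condition.metric);
    if (value === undefined || !conditionPasses(condition, value)) {
      failed.push(condition.metric);
    }
  }
  return { promote: failed.length === 0, failed };
}
```

Treating a missing metric as a failure is a deliberate choice here: a gap in monitoring should block promotion rather than silently pass it.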
Rollback triggers
Automatic rollback fires when guardrails are breached. No human approval needed.
automatic_rollback:
  - condition: error_rate_5xx > 2x_baseline
    for: 5m
    action: rollback_to_previous
    notify: [oncall-slack, pagerduty]
  - condition: latency_p99 > 3x_baseline
    for: 5m
    action: rollback_to_previous
    notify: [oncall-slack]
  - condition: crash_rate > 0.1%
    for: 2m
    action: rollback_to_previous
    notify: [oncall-slack, pagerduty, engineering-leads]
  - condition: health_check_failures > 3_consecutive
    action: rollback_immediately
    notify: [oncall-slack, pagerduty]
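The `for:` clause in these triggers means the breach must be sustained, not momentary. One way to evaluate that, sketched under the assumption that metric samples arrive as a time-ordered list, is to measure how long the current unbroken run of breaching samples has lasted. The `Sample` type and both function names are hypothetical.

```typescript
// A single metric reading; samples are assumed to be sorted by ascending timestamp.
interface Sample {
  timestampMs: number;
  value: number;
}

// Returns how long (in ms) the metric has been continuously above the limit as of nowMs.
// Any sample at or below the limit resets the run.
function sustainedBreachMs(samples: Sample[], limit: number, nowMs: number): number {
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > limit) {
      if (breachStart === null) breachStart = sample.timestampMs;
    } else {
      breachStart = null;
    }
  }
  return breachStart === null ? 0 : nowMs - breachStart;
}

// Evaluates a trigger of the form "metric > multiplier x baseline for forMs".
function shouldRollback(
  samples: Sample[],
  baseline: number,
  multiplier: number,
  forMs: number,
  nowMs: number,
): boolean {
  return sustainedBreachMs(samples, baseline * multiplier, nowMs) >= forMs;
}
```

Requiring an unbroken run keeps a single noisy spike from triggering a rollback, at the cost of reacting a few minutes later to a real regression.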
Production Smoke Tests
Post-deploy critical path tests
Run immediately after every deployment. These verify that the application's core functionality works with production configuration, data, and infrastructure.
// production-smoke.spec.ts
import { test, expect } from '@playwright/test';
const PROD_URL = process.env.PRODUCTION_URL!;
const SMOKE_USER = process.env.SMOKE_TEST_EMAIL!;
const SMOKE_PASS = process.env.SMOKE_TEST_PASSWORD!;
test.describe('Production Smoke', () => {
  test.describe.configure({ retries: 1, timeout: 30_000 });
  test('application loads and responds', async ({ request }) => {
    const health = await request.get(`${PROD_URL}/api/health`);
    expect(health.ok()).toBeTruthy();
    const body = await health.json();
    expect(body.status).toBe('healthy');
    expect(body.version).toBeDefined();
  });
  test('authentication flow works', async ({ page }) => {
    await page.goto(`${PROD_URL}/login`);
    await page.getByLabel('Email').fill(SMOKE_USER);
    await page.getByLabel('Password').fill(SMOKE_PASS);
    await page.getByRole('button', { name: 'Sign in' }).click();
    await expect(page).toHaveURL(/dashboard/);
    await expect(page.getByRole('heading', { name: /dashboard/i })).toBeVisible();
  });
  test('core data loads correctly', async ({ page }) => {
    // Assumes auth state from storageState
    await page.goto(`${PROD_URL}/dashboard`);
    await expect(page.getByRole('table')).toBeVisible();
    await expect(page.getByRole('row')).not.toHaveCount(0);
  });
  test('search returns results', async ({ page }) => {
    await page.goto(`${PROD_URL}/search`);
    await page.getByRole('searchbox').fill('test query');
    await page.getByRole('button', { name: 'Search' }).click();
    await expect(page.getByRole('listitem')).not.toHaveCount(0);
  });
});
Synthetic user accounts
Production test accounts must be clearly distinguishable from real users.
Synthetic account conventions:
- Email pattern: smoke-test+{env}@yourcompany.com
- Display name: "[SYNTHETIC] Smoke Test User"
- Flag: is_synthetic = true in user record
- Excluded from: analytics, billing, email campaigns, support queues
- Data isolation: test data uses reserved ID ranges or namespaces
Account management:
- Create accounts via admin API, not through the UI
- Rotate credentials quarterly
- Store credentials in secrets manager (Vault, AWS Secrets Manager)
- Never reuse synthetic accounts across different test suites
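The conventions above are easier to keep honest when they are checked in code. The sketch below assumes a hypothetical `UserRecord` shape carrying the three markers from the list (email pattern, display-name prefix, `is_synthetic` flag). It reports accounts whose markers disagree, and filters synthetic accounts out of a user list the way an analytics or billing job might.

```typescript
// Hypothetical user record carrying the three synthetic-account markers.
interface UserRecord {
  email: string;
  displayName: string;
  is_synthetic: boolean;
}

const SYNTHETIC_NAME_PREFIX = '[SYNTHETIC]';
// Assumes env names are lowercase letters, digits, and hyphens.
const SYNTHETIC_EMAIL_PATTERN = /^smoke-test\+[a-z0-9-]+@yourcompany\.com$/;

// Builds an address following the smoke-test+{env}@yourcompany.com convention.
function syntheticEmail(env: string): string {
  return `smoke-test+${env}@yourcompany.com`;
}

// Lists the ways a record's markers disagree; an empty array means it is consistent.
function conventionViolations(user: UserRecord): string[] {
  const violations: string[] = [];
  const emailLooksSynthetic = SYNTHETIC_EMAIL_PATTERN.test(user.email);
  const nameLooksSynthetic = user.displayName.startsWith(SYNTHETIC_NAME_PREFIX);
  if (user.is_synthetic) {
    if (!emailLooksSynthetic) violations.push('synthetic account without smoke-test email');
    if (!nameLooksSynthetic) violations.push('synthetic account without [SYNTHETIC] display name');
  } else if (emailLooksSynthetic || nameLooksSynthetic) {
    violations.push('account looks synthetic but is_synthetic is false');
  }
  return violations;
}

// Drops synthetic accounts, e.g. before computing analytics or billing totals.
function excludeSynthetic(users: UserRecord[]): UserRecord[] {
  return users.filter((user) => !user.is_synthetic);
}
```

The `is_synthetic` flag is treated as the source of truth for exclusion; the email and display-name checks only catch records where the markers have drifted apart.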
Non-destructive assertions
Production smoke tests should read, not write. When a write is unavoidable, clean up immediately, even if the test fails.
// Pattern: create-verify-cleanup
test('can create and delete a draft', async ({ page, request }) => {
  await page.goto(`${PROD_URL}/documents`);
  // Create
  await page.getByRole('button', { name: 'New document' }).click();
  await page.getByLabel('Title').fill('[SMOKE TEST] Auto-cleanup');
  await page.getByRole('button', { name: 'Save draft' }).click();
  try {
    // Verify
    await expect(page.getByText('[SMOKE TEST] Auto-cleanup')).toBeVisible();
  } finally {
    // Cleanup -- runs even when the assertion above fails.
    // Hooks such as test.afterEach cannot be registered inside a test body,
    // so the cleanup lives in a finally block instead.
    // Assumes the saved draft's URL ends with its document ID.
    const docId = new URL(page.url()).pathname.split('/').pop();
    await request.delete(`${PROD_URL}/api/documents/${docId}`, {
      headers: { Authorization: `Bearer ${process.env.SMOKE_TEST_TOKEN}` },
    });
  }
});
Guardrail Metrics
What to monitor during rollout
| Category | Metric | Comparison Method | Alert Threshold |
|---|---|---|---|
| Errors | HTTP 5xx rate | vs. pre-deploy baseline | >2x baseline for 5 min |
| Errors | Unhandled exception count | vs. pre-deploy baseline | Any new exception type |
| Latency | P50 response time | vs. pre-deploy baseline | >1.5x baseline |
| Latency | P95 response time | vs. pre-deploy baseline | >2x baseline |
| Latency | P99 response time | vs. pre-deploy baseline | >3x baseline |
| Business | Conversion rate | vs. 7-day average | Drop >5% |
| Business | Revenue per session | vs. 7-day average | Drop >10% |
| Client | Crash rate (mobile) | vs. previous release | >0.1% increase |
| Client | JavaScript error rate | vs. pre-deploy baseline | >2x baseline |
| Infra | CPU utilization | absolute | >80% sustained |
| Infra | Memory utilization | absolute | >85% sustained |
Baseline comparison
Compare canary metrics against a control group running the previous version, not against historical data alone.
Comparison approaches (best to worst):
1. Canary vs. control: split traffic, compare groups in real time (best)
2. Before/after: compare post-deploy metrics to pre-deploy window (good)
3. Historical: compare to same time last week (acceptable for trends)
4. Absolute thresholds: fixed thresholds regardless of baseline (fragile)
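As an illustration of the first approach, the sketch below compares the error rate of a canary cohort against a control cohort measured over the same window. The `CohortStats` shape and the `maxRatio` parameter are assumptions. The default of 2 mirrors the ">2x baseline" threshold in the guardrail table.

```typescript
// Request and error counts for one cohort over the same time window.
interface CohortStats {
  requests: number;
  errors: number;
}

interface CanaryVerdict {
  canaryRate: number;
  controlRate: number;
  ratio: number; // canaryRate / controlRate; Infinity if only the canary has errors
  healthy: boolean;
}

function errorRate(stats: CohortStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

// Flags the canary as unhealthy when its error rate exceeds maxRatio times the control's.
function compareCanary(
  canary: CohortStats,
  control: CohortStats,
  maxRatio: number = 2,
): CanaryVerdict {
  const canaryRate = errorRate(canary);
  const controlRate = errorRate(control);
  let ratio: number;
  if (controlRate === 0) {
    ratio = canaryRate === 0 ? 1 : Infinity;
  } else {
    ratio = canaryRate / controlRate;
  }
  return { canaryRate, controlRate, ratio, healthy: ratio <= maxRatio };
}
```

A ratio check like this says nothing about whether the cohorts are large enough to trust, so it needs to be paired with a minimum sample size.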
Statistical significance
For business metrics (conversion, revenue), small sample sizes produce noisy results. Wait for statistical significance before drawing conclusions.
Minimum sample sizes for rollout decisions:
- Error rate: 1,000 requests (errors are rare events, need volume)
- Latency: 500 requests (more stable, converges faster)
- Conversion rate: 5,000 sessions (business metrics have high variance)
- Crash rate: 10,000 app launches (crashes are rare events)
Rule of thumb: if you don't have enough traffic at 1% to reach
significance in 30 minutes, increase to 5% or extend the hold window.
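The rule of thumb is simple arithmetic: divide the minimum sample size by the traffic the cohort actually receives per minute. A minimal sketch, with hypothetical function names and traffic expressed as events per minute across all users:

```typescript
// Minutes needed for a cohort at trafficPercent to accumulate minSample events.
function minutesToReachSample(
  minSample: number,
  totalEventsPerMinute: number,
  trafficPercent: number,
): number {
  const cohortPerMinute = (totalEventsPerMinute * trafficPercent) / 100;
  return cohortPerMinute > 0 ? minSample / cohortPerMinute : Infinity;
}

// True when the hold window is long enough to collect the minimum sample.
function holdIsLongEnough(
  minSample: number,
  totalEventsPerMinute: number,
  trafficPercent: number,
  holdMinutes: number,
): boolean {
  return minutesToReachSample(minSample, totalEventsPerMinute, trafficPercent) <= holdMinutes;
}
```

For example, a site with 2,000 sessions per minute sends 20 sessions per minute to a 1% canary, so collecting 5,000 sessions takes 250 minutes. Raising the canary to 5% still needs 50 minutes, which means the 30-minute hold would have to be extended as well.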
Dark Launches
Dark launches deploy new functionality to production but hide it from users. Real production traffic exercises the new code path without user-visible impact.
Traffic shadowing
Duplicate incoming requests to the new service. Compare responses without returning the new response to the user.
Request flow:
User → Load Balancer → Production Service (returns response to user)
                     ↘ Shadow Service (processes request, logs result, discards)
What to compare:
- Response status codes: shadow should match production
- Response body: diff for semantic equivalence (ignore timestamps, IDs)
- Latency: shadow should not be significantly slower
- Error rate: shadow should not produce more errors
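A comparison along these lines might look like the sketch below. It is an illustration under stated assumptions: the set of volatile keys to ignore (`id`, `timestamp`, and so on), the `CapturedResponse` shape, and the latency tolerance of 1.5x are all placeholders to adapt to the real payloads.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Keys ignored during comparison; this list is an assumption, adjust it per API.
const VOLATILE_KEYS = new Set(['id', 'requestId', 'timestamp', 'createdAt', 'updatedAt']);

// Recursively drops volatile keys and sorts object keys so key order does not matter.
function normalize(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => normalize(item));
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      const child = value[key];
      if (!VOLATILE_KEYS.has(key) && child !== undefined) out[key] = normalize(child);
    }
    return out;
  }
  return value;
}

function semanticallyEqual(a: Json, b: Json): boolean {
  return JSON.stringify(normalize(a)) === JSON.stringify(normalize(b));
}

interface CapturedResponse {
  status: number;
  body: Json;
  latencyMs: number;
}

// Compares a shadow response against the production response for the same request.
function compareShadow(
  production: CapturedResponse,
  shadow: CapturedResponse,
  maxLatencyRatio: number = 1.5,
): { statusMatch: boolean; bodyMatch: boolean; latencyOk: boolean } {
  return {
    statusMatch: production.status === shadow.status,
    bodyMatch: semanticallyEqual(production.body, shadow.body),
    latencyOk: shadow.latencyMs <= production.latencyMs * maxLatencyRatio,
  };
}
```

Array order is still treated as significant here; if the API returns unordered collections, sort them before comparing.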
Parallel execution
For migrations (new database, new algorithm, new service), run both the old and new path in production. The old path returns the result to the user; the new path runs asynchronously, logs differences, and discards its result. Track the match rate over time -- target 99%+ match before cutting over.
Shadow launch timeline:
Week 1: Deploy shadow, start comparing, expect <50% match
Week 2: Fix mismatches, match rate should climb to 90%+
Week 3: Match rate stable at 99%+, handle remaining edge cases
Week 4: Cut over: shadow becomes primary, old becomes shadow
Week 5: Remove old path after 1 week of stability
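The old-path/new-path pattern can be sketched as a small wrapper plus a match-rate tracker. Everything here is illustrative: `MatchTracker`, `runWithShadow`, and the 1,000-sample minimum are assumptions, and a production version would persist counts in a metrics system rather than in memory.

```typescript
// Counts how often the new path agrees with the old one.
class MatchTracker {
  private total = 0;
  private matched = 0;

  record(isMatch: boolean): void {
    this.total += 1;
    if (isMatch) this.matched += 1;
  }

  sampleCount(): number {
    return this.total;
  }

  matchRate(): number {
    return this.total === 0 ? 0 : this.matched / this.total;
  }

  // Cutover is only considered once enough comparisons exist and the target rate is met.
  readyForCutover(targetRate: number = 0.99, minSamples: number = 1000): boolean {
    return this.total >= minSamples && this.matchRate() >= targetRate;
  }
}

// Returns the old path's result; the new path runs afterwards without blocking the caller.
async function runWithShadow<T>(
  oldPath: () => Promise<T>,
  newPath: () => Promise<T>,
  isEqual: (primary: T, candidate: T) => boolean,
  tracker: MatchTracker,
): Promise<T> {
  const primary = await oldPath();
  // Fire and forget: a failure or exception in the new path is recorded as a mismatch
  // and never reaches the user.
  void Promise.resolve()
    .then(newPath)
    .then((candidate) => tracker.record(isEqual(primary, candidate)))
    .catch(() => tracker.record(false));
  return primary;
}
```

Counting new-path errors as mismatches keeps the match rate honest: a path that throws on 2% of requests cannot reach the 99% target.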
Anti-Patterns
Testing in production without monitoring
Running production tests without dashboards and alerts is flying blind. You will not know if your tests caused an issue until a user reports it.
Fix: Monitoring is a prerequisite. Before adding any production test, verify you can see error rates, latency, and key business metrics in real time. Set up alerts before the first test runs.
No rollback plan
"We'll deploy a fix if something goes wrong" is not a rollback plan. Under pressure, fixes take longer, introduce new bugs, and extend the outage.
Fix: Every production test or rollout must have a documented rollback mechanism that takes less than 5 minutes to execute. Acceptable mechanisms are disabling a feature flag, redeploying the previous version, or rerouting traffic.
Destructive operations in production tests
Production tests that create real orders, send real emails, or modify real user data are not tests -- they are incidents waiting to happen.
Fix: Use synthetic accounts flagged as test data. Use sandbox modes for payment and email. Clean up any created data immediately. If a test cannot be made non-destructive, it does not belong in production.
Testing in production instead of pre-production
Production testing supplements pre-production testing. It does not replace it. If your staging environment is broken and you are "testing in production" because it is the only working environment, fix staging first.
Fix: Maintain a working pre-production environment. Use production testing for what only production can validate: real traffic, real data volumes, real third-party integrations.
Canary deploys without comparison
Deploying to 1% of traffic but not comparing canary metrics against a control group misses the entire point. You are just deploying slowly, not detecting problems.
Fix: Always compare canary metrics against a baseline. Use side-by-side dashboards or automated canary analysis tools (Kayenta, Argo Rollouts analysis).
Stale feature flags
Flags that are fully rolled out but never removed accumulate. After a year, you have 200 flags with unknown interactions, and every code path has branching logic that nobody understands.
Fix: Every flag gets an expiration date at creation time. After full rollout + 2 weeks of stability, remove the flag. Track flag age and alert when flags exceed their expiration.
Done When
- Feature flag rollout plan is documented with explicit percentage steps (1% → 10% → 50% → 100%) and named guardrail metrics at each stage
- Canary analysis is configured with automated pass/fail criteria so promotion and rollback decisions do not require manual metric comparison
- Production smoke tests run as a post-deploy pipeline stage on every deploy, not only as a pre-deploy CI check
- Rollback trigger conditions are defined, documented, and verified to fire correctly (e.g., tested in staging before first production use)
- Production test data strategy is documented, specifying whether synthetic users or anonymized real users are used and how they are excluded from analytics and billing
Related Skills
| Skill | Relationship |
|---|---|
| release-readiness | Production testing is part of the post-deploy verification in the release process |
| synthetic-monitoring | Ongoing production validation after the rollout is complete |
| observability-driven-testing | Traces and logs from production inform which tests to write |
| qa-metrics | Guardrail metrics and rollout criteria feed into QA dashboards |
| ci-cd-integration | Production smoke tests run as a CI pipeline stage post-deployment |
| test-environments | Production testing complements, not replaces, pre-production environments |