Testing: Write Tests That Catch Real Bugs
Write, structure, and maintain tests across unit, integration, E2E, accessibility, and performance layers. The goal is tests that catch regressions, document behavior, and run fast in CI - not tests that exist to inflate coverage numbers.
Target versions (April 2026):
- Vitest 4.1.2, Jest 30.3.0
- Playwright 1.59.0, Cypress 15.13.0
- pytest 9.0.2, pytest-cov 7.1.0
- Go 1.26.1 (testing stdlib, `testing/synctest` GA)
- Rust 1.94.1 (`cargo test`, cargo-nextest 0.9.132)
- Testing Library 16.3.2 (`@testing-library/react`)
- axe-core 4.11.2 (`@axe-core/playwright`)
- Grafana k6 1.7.1 (load testing)
When to use
- Writing new tests (unit, integration, E2E, accessibility, performance)
- Debugging flaky or failing tests
- Designing test architecture for a project (fixture strategies, factory patterns, test data)
- Setting up test infrastructure in CI (parallelization, sharding, coverage gates)
- Choosing testing tools or migrating between test frameworks
- Implementing TDD workflow
- Adding accessibility or visual regression tests to an existing suite
When NOT to use
- Reviewing existing test quality or correctness as part of a code review - use code-review
- Security-specific testing (penetration testing, OWASP checks) - use security-audit
- Cleaning up verbose/sloppy test code - use anti-slop
- Ad-hoc web browsing, scraping, or page interaction outside of tests - use browse
- CI/CD pipeline architecture (test jobs run inside pipelines, but pipeline design is ci-cd's domain) - use ci-cd
- Database testing patterns at the engine level - use databases
- Writing or refining LLM prompts (use prompt-generator)
- Infrastructure or configuration validation outside tests (use terraform, ansible, or kubernetes)
AI Self-Check
AI tools consistently produce the same testing mistakes. Before returning any generated test code, verify against this list:
- Tests assert behavior, not implementation - no testing private methods or internal state
- Each test has exactly one reason to fail (single assertion concept, not single `assert` call)
- Test names describe the scenario and expected outcome, not the method name
- Mocks/stubs are scoped to the test - no shared mutable mock state across tests
- No hardcoded ports, paths, or timestamps that break on other machines or in CI
- Async tests properly await all promises/futures - no fire-and-forget assertions
- Test data is isolated - each test creates its own state, no dependency on test execution order
- Cleanup happens even when assertions fail (use `afterEach`/teardown/`t.Cleanup`/`Drop`)
- No `sleep()` or fixed delays for async waits - use polling, retries, or event-based waits
- Coverage threshold is realistic (80% line coverage is a good default; 100% is a lie)
- Snapshot tests have been reviewed manually before committing (blind `--update` is a bug factory)
- E2E selectors use `data-testid`, `role`, or accessible names - not CSS classes or DOM structure
Workflow
Step 1: Determine scope
Based on context:
- New feature -> write tests alongside or before the code (TDD when appropriate)
- Bug fix -> write a failing test first that reproduces the bug, then fix
- Existing untested code -> prioritize critical paths, not 100% coverage
- Test infrastructure -> set up runners, CI config, coverage gates
Identify the project's existing test framework from config files (vitest.config.ts, jest.config.*, pyproject.toml, Cargo.toml, *_test.go, playwright.config.ts). Match it. Don't introduce a second test runner without a reason.
Step 2: Choose the test layer
| Layer | Tests what | Speed | When to use |
|---|---|---|---|
| Unit | Single function/module in isolation | ms | Pure logic, utilities, data transforms, state machines |
| Integration | Multiple modules, real dependencies | seconds | API handlers, database queries, service boundaries |
| E2E | Full user flows through the UI | seconds-minutes | Critical paths, checkout flows, auth, onboarding |
| Accessibility | WCAG compliance, screen reader compat | seconds | Every user-facing component/page |
| Visual | Screenshot comparison | seconds | UI components after style changes |
| Performance | Load, latency, throughput | minutes | Before releases, after arch changes |
The testing pyramid still holds: many unit tests, fewer integration tests, fewest E2E tests. Invert it and your CI takes 45 minutes and everyone ignores test failures.
Step 3: Write the test
Follow the language-specific patterns below. Universal principles:
Arrange-Act-Assert (or Given-When-Then):
```js
// Arrange: set up test data and dependencies
// Act: call the thing being tested
// Assert: verify the outcome
```
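A minimal sketch of the pattern in Python; `apply_discount` is a hypothetical function invented for illustration:

```python
def apply_discount(total: float) -> float:
    # Hypothetical function under test: 10% off orders of 100 or more.
    return total * 0.9 if total >= 100 else total

def test_apply_discount_reduces_total_when_cart_exceeds_100():
    # Arrange: set up test data
    total = 200.0
    # Act: call the thing being tested
    discounted = apply_discount(total)
    # Assert: verify the outcome
    assert discounted == 180.0
```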
Test naming: describe the scenario, not the function.
```text
# Bad:  test_calculate_total
# Good: test_calculate_total_applies_discount_when_cart_exceeds_100
# Good: it("returns 401 when token is expired")
```
Step 4: Validate
- Run the full test suite: failures in other tests may indicate your change broke something
- Check coverage delta: new code should be covered, but don't chase vanity numbers
- Run in CI if possible - tests that pass locally but fail in CI are the worst kind
TDD Workflow
Use TDD when the behavior is well-defined upfront. Skip it when exploring or prototyping.
- Red: write a test that fails (confirm it fails for the right reason)
- Green: write the minimum code to make the test pass (ugly is fine)
- Refactor: clean up without changing behavior (tests still pass)
TDD works best for: pure functions, data transformations, state machines, API contracts, bug reproduction.
TDD works poorly for: UI layout, exploratory prototyping, integration with undocumented APIs.
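The red-green loop applied to a bug fix, sketched in Python with a hypothetical `slugify` helper (names and the bug itself are illustrative): the failing test is written first, then the minimal fix.

```python
import re

# Red: this test reproduces the reported bug first
# (repeated spaces used to produce "hello---world").
def test_slugify_collapses_repeated_spaces():
    assert slugify("hello   world") == "hello-world"

# Green: minimum code to pass - collapse whitespace runs, then lowercase.
def slugify(text: str) -> str:
    return re.sub(r"\s+", "-", text.strip().lower())
```

Confirm the test fails for the right reason before writing the fix; a test that fails on an import error proves nothing.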
Mocking Strategy
Mock at boundaries, not everywhere. Over-mocking produces tests that pass while the real code is broken.
| What to mock | What NOT to mock |
|---|---|
| External APIs (HTTP, gRPC) | Your own pure functions |
| Database (when unit testing) | Data transformations |
| Time/dates, random values | Simple utility code |
| File system (when impractical) | The module under test |
| Third-party SDKs | Standard library functions |
Prefer fakes over mocks when possible. An in-memory database implementation tests more real behavior than a mock that returns canned responses.
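A sketch of the fake-over-mock idea in Python, using a hypothetical in-memory repository (all names are illustrative). The fake honors the same save/find contract a real store would, so the test exercises real lookup-then-update behavior instead of canned mock responses:

```python
class InMemoryUserRepo:
    """Fake: real insert/lookup semantics, no database."""
    def __init__(self):
        self._users = {}

    def save(self, user_id: str, name: str) -> None:
        self._users[user_id] = name

    def find(self, user_id: str):
        return self._users.get(user_id)

def rename_user(repo, user_id: str, new_name: str) -> bool:
    # Code under test depends only on the repo interface.
    if repo.find(user_id) is None:
        return False
    repo.save(user_id, new_name)
    return True

def test_rename_user_updates_existing_user():
    repo = InMemoryUserRepo()
    repo.save("u1", "Alice")
    assert rename_user(repo, "u1", "Bob") is True
    assert repo.find("u1") == "Bob"
```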
Injectable clock for TTL/time-dependent tests - pass a clock dependency rather than calling Date.now() or time.Now() directly:
```ts
// Production: clock = () => Date.now()
// Test: clock = () => FIXED_TS + offset
function isExpired(createdAt: number, ttlMs: number, clock = Date.now): boolean {
  return clock() - createdAt > ttlMs;
}

// In test: advance virtual time without sleeping
const fakeNow = vi.fn().mockReturnValue(START);
expect(isExpired(START, 1000, fakeNow)).toBe(false);
fakeNow.mockReturnValue(START + 1001);
expect(isExpired(START, 1000, fakeNow)).toBe(true);
```
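The same injectable-clock pattern sketched in Python (function names are illustrative):

```python
import time

def is_expired(created_at: float, ttl_s: float, clock=time.time) -> bool:
    # clock defaults to the real time source; tests inject a fake
    return clock() - created_at > ttl_s

def test_is_expired_with_fake_clock():
    now = 1_000_000.0
    fake_clock = lambda: now  # closure reads the current value of `now`
    assert is_expired(1_000_000.0, 10, clock=fake_clock) is False
    now += 10.5  # advance virtual time past the TTL - no sleeping
    assert is_expired(1_000_000.0, 10, clock=fake_clock) is True
```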
Read references/language-patterns.md for language-specific mocking idioms (Vitest vi.mock, Jest jest.mock, pytest monkeypatch, Go interfaces, Rust trait objects).
Test Data and Fixtures
Factory pattern (preferred)
Build test data with sensible defaults and per-test overrides:
```ts
// TypeScript - factory function
function buildUser(overrides: Partial<User> = {}): User {
  return { id: randomUUID(), name: "Test User", email: "test@example.com", ...overrides };
}
```

```python
# Python - factory function
def build_user(**overrides) -> User:
    defaults = {"id": uuid4(), "name": "Test User", "email": "test@example.com"}
    return User(**(defaults | overrides))
```
Fixture rules
- Isolate per test. Shared mutable fixtures cause order-dependent failures.
- Use builders/factories over raw object literals - defaults prevent test brittleness.
- Database fixtures: use transactions that roll back after each test (pytest `db` fixture, Jest `beforeEach` with rollback). Seeded test databases beat shared staging data.
- File fixtures: use temp directories (`tmp_path` in pytest, `os.MkdirTemp` in Go, `tempfile` in Rust). Clean up in teardown.
Accessibility Testing
Catch WCAG violations automatically. Not a replacement for manual testing, but catches the mechanical stuff (missing alt text, broken ARIA, contrast ratios, keyboard traps).
Use `@axe-core/playwright` - run `new AxeBuilder({ page }).withTags(["wcag2a", "wcag2aa"]).analyze()` and assert zero violations. Run axe scans on every page/component. Exclude known issues with `.exclude()` and track them as tech debt, not permanent exceptions.
Read references/e2e-accessibility.md for Playwright E2E patterns, visual regression setup, and CI accessibility gates.
Performance Testing
Two categories: micro-benchmarks (is this function fast enough?) and load tests (does the system handle traffic?).
Micro-benchmarks
- Go:
func BenchmarkX(b *testing.B)- built into the stdlib - Rust:
cargo benchwith criterion (criterion = "0.6") - JS/TS:
vitest benchortinybench - Python:
pytest-benchmarkortimeit
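For the Python case, a minimal `timeit` micro-benchmark sketch; `parse_csv_line` is a hypothetical function under measurement, and the run counts are illustrative:

```python
import timeit

def parse_csv_line(line: str) -> list[str]:
    # Hypothetical function being benchmarked.
    return [field.strip() for field in line.split(",")]

# Take the best of several repeats - it filters out scheduler noise
# better than a single measurement does.
runs = timeit.repeat(lambda: parse_csv_line("a, b ,c"), number=10_000, repeat=5)
best = min(runs)
print(f"best of 5: {best:.4f}s for 10k calls")
```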
Load testing (k6)
```js
// k6 load test
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "30s", target: 50 }, // ramp up
    { duration: "1m", target: 50 },  // sustain
    { duration: "10s", target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500"], // 95th percentile under 500ms
  },
};

export default function () {
  const res = http.get("http://localhost:3000/api/health");
  check(res, { "status 200": (r) => r.status === 200 });
  sleep(1);
}
```
Don't run load tests against production without explicit approval. Don't run them in CI unless you have dedicated infrastructure for it.
CI Integration
Test parallelization
- Vitest/Jest: built-in worker parallelism. Vitest uses Vite's module graph for smart test file distribution.
- Playwright:
--shard=1/4for splitting across CI runners.--workers=4for parallel within a runner. - pytest:
pytest-xdistwith-n autofor CPU-based parallelism. - Go:
go test -parallel Nper package,-p Nfor package-level parallelism. - Rust:
cargo nextest runfor per-test process isolation and parallelism.
Flaky test management
Flaky tests erode trust. Fix or quarantine immediately.
- Identify: track test stability over time (most CI systems have flaky test dashboards)
- Quarantine: move to a separate job that doesn't block merges. Tag with `@flaky` or `skip`.
- Fix root causes - common culprits by framework:
  - Playwright/Cypress: race conditions on navigation or animation. Use `waitForLoadState`, `waitForSelector`, or Playwright's auto-waiting. Avoid `page.waitForTimeout`. Stub network requests to eliminate backend variability. Headless mode (CI) has different rendering timing than headed - animations may be skipped or font metrics differ; use `--headed` locally to reproduce CI-only failures.
  - Vitest/Jest: shared module state between test files. Use `--pool forks` (Vitest) or `--runInBand` (Jest) to isolate. Check for leaked timers (`vi.useFakeTimers` not restored).
  - pytest: database state leaking between tests. Use `@pytest.mark.usefixtures("db")` with transactional rollback. Check for global state mutation in fixtures.
  - Go: `t.Parallel()` tests sharing package-level state. Use `t.Cleanup` for teardown. Check for goroutine leaks with `goleak`.
- Retry with caution: `--retries 2` (Playwright) or `--reruns 2` (pytest-rerunfailures) is a bandaid, not a fix
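The durable fix for timing flakiness is usually replacing fixed delays with a polling wait. A minimal helper, sketched in Python (the name `wait_until` and its defaults are illustrative):

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll until condition() is truthy or the timeout elapses.

    The short sleep here is a polling interval, not a fixed delay:
    the wait ends as soon as the condition holds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Usage: wait for a side effect instead of sleep(2)
events = []
events.append("done")  # stands in for background work completing
assert wait_until(lambda: "done" in events)
```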
Coverage thresholds
Set coverage gates in CI. Reasonable defaults:
| Metric | Threshold | Why |
|---|---|---|
| Line coverage | 80% | Catches obvious gaps |
| Branch coverage | 70% | Catches untested conditions |
| New code coverage | 90% | Prevents coverage erosion |
Enforce via `vitest --coverage --coverage.thresholds.lines=80`, `pytest --cov --cov-fail-under=80`, or `go test -coverprofile` plus a threshold script.
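The Go route needs that small threshold script; a sketch in Python, assuming the summary line format that `go tool cover -func=cover.out` prints (verify against your toolchain's actual output):

```python
import re

def check_coverage(cover_output: str, threshold: float = 80.0) -> bool:
    # `go tool cover -func=cover.out` ends with a line like:
    #   total:  (statements)  83.1%
    match = re.search(r"total:\s+\(statements\)\s+([\d.]+)%", cover_output)
    if not match:
        raise ValueError("no total coverage line found")
    pct = float(match.group(1))
    print(f"coverage {pct:.1f}% (threshold {threshold}%)")
    return pct >= threshold

# In CI: pipe the cover output into this check and exit non-zero on False.
assert check_coverage("total:\t(statements)\t83.1%") is True
```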
Minimal CI example (pytest + GitHub Actions):
```yaml
- run: pip install pytest pytest-xdist pytest-cov
- run: pytest -n auto --cov=src --cov-fail-under=80 --tb=short
```
Reference Files
- `references/language-patterns.md` - language-specific test patterns for JS/TS (Vitest, Jest), Python (pytest), Go (testing stdlib), and Rust (cargo test). Covers mocking, table-driven tests, async testing, snapshot testing, and framework-specific idioms.
- `references/e2e-accessibility.md` - E2E testing with Playwright, visual regression (screenshot comparison, component snapshots), accessibility testing patterns, and CI integration for browser tests.
Related Skills
- code-review - reviews test quality and correctness as part of code reviews. This skill writes the tests; code-review evaluates whether they actually test the right things.
- security-audit - handles security-specific testing (OWASP, penetration testing, credential scanning). This skill handles functional testing.
- anti-slop - cleans up verbose, over-abstracted, or AI-generated test code. If the test works but reads like a novel, route to anti-slop.
- ci-cd - designs the pipeline that runs tests. This skill writes the tests and configures test runners; ci-cd handles the pipeline structure around them.
- databases - covers database engine testing and configuration. This skill handles application-level database test patterns (transactions, fixtures, test data).
Rules
- Test behavior, not implementation. Tests coupled to internal structure break on every refactor and catch zero bugs. If a test mocks 8 things and asserts a method was called with specific args, it's testing the mock, not the code.
- No `sleep()` in tests. Use `waitFor`, `Eventually`, `poll`, retry loops, or event-based synchronization. Fixed delays are flaky by definition.
- Isolate test state. Each test creates its own data, runs independently, and cleans up after itself. Shared mutable state between tests is the #1 cause of order-dependent failures.
- Fix or quarantine flaky tests immediately. A test suite people ignore is worse than no test suite. Track flaky tests, fix root causes, don't just retry.
- Don't test the framework. Testing that React renders a div, or that Express routes to a handler, is testing someone else's code. Test YOUR logic.
- Run the AI self-check. Every generated test gets verified against the checklist before returning. AI-generated tests love to test implementation details, use `sleep()`, and share state.
- Match the existing framework. Don't introduce Vitest into a Jest project or pytest into a unittest project without the user explicitly asking for a migration.
- Snapshot tests require manual review. Never auto-update snapshots (`-u`/`--update`) without reviewing the diff. Blind snapshot updates are equivalent to deleting the test.