test-data-management by petrkindlmann/qa-skills

Discovery Questions

Before designing a test data strategy, understand the current state. Check .agents/qa-project-context.md first -- if it exists, use it as the foundation and skip questions already answered there.

Current Data Practices

How is test data created today? (manually, scripts, copy of production, none)
Do tests share data or does each test create its own?
How is test data cleaned up? (truncate, rollback, manual, never)
Are there seed scripts? Are they idempotent?

Privacy and Compliance

Does the product handle PII? (names, emails, addresses, phone numbers, SSNs)
Are there GDPR, HIPAA, PCI-DSS, or other data protection requirements?
Is production data ever used in test environments?

Scale and Complexity

How large are the test datasets? (dozens of records, thousands, millions)
How complex are the data relationships? (simple CRUD, deep nested hierarchies, polymorphic)
Are there cross-service data dependencies? (microservices sharing data)

Core Principles

1. Each Test Owns Its Data

Tests that rely on pre-existing shared data are fragile. When Test A modifies shared data, Test B breaks. Every test should create exactly the data it needs, verify against that data, and clean up after itself. This enables parallel execution and eliminates ordering dependencies.

2. Factories Over Fixtures for Dynamic Data

Static fixtures (JSON/YAML files) are appropriate for reference data that does not change (country codes, currency lists). For entity data that tests create and manipulate (users, orders, products), use factory functions that generate fresh instances with sensible defaults and allow per-test overrides.

3. Anonymize Production Data Before Use

Production databases contain the most realistic data, but they also contain real user information. Never copy production data to test environments without anonymization. Replace PII with synthetic equivalents while preserving data distributions and relationships.

4. Deterministic Data Enables Reproducible Tests

Tests should produce the same results regardless of when or where they run. Avoid Math.random(), Date.now(), or auto-increment IDs in assertions. Use seeded random generators, fixed timestamps, and explicit IDs where possible.

5. Minimize Data, Maximize Signal

Create only the data each test needs. A test for user search does not need a complete user profile with billing address, payment method, and order history. Over-specified test data obscures the intent of the test and increases maintenance burden.

Factory Patterns

Factories are functions that produce test data with sensible defaults, allowing individual tests to override only what matters for their scenario.

Fishery (TypeScript)

Fishery is the recommended factory library for TypeScript projects. It provides type safety, traits, sequences, associations, and transient parameters.

npm install --save-dev fishery @faker-js/faker

// tests/factories/user.factory.ts
import { Factory } from 'fishery';
import { faker } from '@faker-js/faker';

interface User {
  id: string;
  email: string;
  name: string;
  role: 'admin' | 'member' | 'viewer';
  organizationId: string;
  createdAt: Date;
  isActive: boolean;
}

export const userFactory = Factory.define<User>(({ sequence, params }) => ({
  id: `user-${sequence}`,
  email: `user-${sequence}@test.example.com`,
  name: faker.person.fullName(),
  role: params.role ?? 'member',
  organizationId: params.organizationId ?? `org-${sequence}`,
  createdAt: new Date('2025-01-15T10:00:00Z'),
  isActive: true,
}));

// Trait variants
const adminUser = userFactory.params({ role: 'admin' });
const orgMembers = userFactory.params({ organizationId: 'org-shared' });

Using in Tests

import { userFactory } from '../factories/user.factory';

const user = userFactory.build();                                        // Sensible defaults
const admin = userFactory.build({ role: 'admin' });                      // Override specific fields
const users = userFactory.buildList(5);                                   // Build multiple
const orgMembers = userFactory.buildList(3, { organizationId: 'org-1' }); // With associations

Associations Between Factories

// tests/factories/order.factory.ts
import { Factory } from 'fishery';
import { userFactory } from './user.factory';

interface Order {
  id: string;
  userId: string;
  items: Array<{ productId: string; quantity: number; unitPrice: number }>;
  totalCents: number;
  status: 'pending' | 'paid' | 'shipped' | 'delivered' | 'cancelled';
}

export const orderFactory = Factory.define<Order>(({ sequence }) => {
  const items = [{ productId: `prod-${sequence}`, quantity: 2, unitPrice: 1999 }];
  return {
    id: `order-${sequence}`,
    userId: userFactory.build().id,
    items,
    totalCents: items.reduce((sum, i) => sum + i.quantity * i.unitPrice, 0),
    status: 'pending',
  };
});

FactoryBot (Ruby)

# spec/factories/users.rb
FactoryBot.define do
  factory :user do
    sequence(:email) { |n| "user-#{n}@test.example.com" }
    name { Faker::Name.name }
    role { :member }
    organization
    trait :admin do role { :admin } end
    trait :inactive do is_active { false } end
  end
end

# Usage: create(:user), create(:user, :admin), create_list(:user, 3, :inactive)

Factory Boy (Python)

# tests/factories.py
import factory
from myapp.models import User

class UserFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = User
    email = factory.Sequence(lambda n: f"user-{n}@test.example.com")
    name = factory.Faker("name")
    role = "member"
    class Params:
        admin = factory.Trait(role="admin")
        inactive = factory.Trait(is_active=False)

# Usage: UserFactory(), UserFactory(admin=True), UserFactory.create_batch(3, inactive=True)

When to Use Factories vs Fixtures

Scenario	Factories	Static Fixtures
Entity data that tests create/modify	Yes	No
Reference data (countries, currencies, configs)	No	Yes
Data with many variations per test	Yes	No -- file explosion
Data with complex relationships	Yes -- associations	No -- hard to maintain
API response mocks	No	Yes -- JSON fixtures
Snapshot/golden file comparisons	No	Yes

Decision rule: If the data has a lifecycle (created, modified, deleted during tests), use a factory. If the data is read-only reference material, use a fixture file.

Fixture Strategies

Static Fixtures (JSON/YAML)

Best for API response mocks, configuration data, and golden file comparisons.

// Using JSON fixtures in Playwright tests
import productsResponse from '../fixtures/data/api-responses/products.json';

test('displays products from API', async ({ page }) => {
  await page.route('**/api/products*', async (route) => {
    await route.fulfill({ status: 200, contentType: 'application/json', body: JSON.stringify(productsResponse) });
  });
  await page.goto('/products');
  await expect(page.getByText('Widget')).toBeVisible();
});

Dynamic Fixtures (Playwright)

Use Playwright fixtures to create and clean up data per test:

// e2e/fixtures/data.fixture.ts
import { test as base, expect } from '@playwright/test';

export const test = base.extend<{ testOrder: { id: string; userId: string } }>({
  testOrder: async ({ request }, use) => {
    const response = await request.post('/api/test/orders', {
      data: { userId: `test-user-${Date.now()}`, items: [{ productId: 'prod-1', quantity: 1 }] },
    });
    expect(response.ok()).toBeTruthy();
    const order = await response.json();
    await use(order);
    await request.delete(`/api/test/orders/${order.id}`);
  },
});

Fixture Composition

Compose fixtures from smaller, reusable pieces by combining factory-generated data with Playwright fixtures:

// e2e/fixtures/composed.fixture.ts
import { test as base } from '@playwright/test';
import { userFactory } from '../factories/user.factory';
import { orderFactory } from '../factories/order.factory';

export const test = base.extend<{ seedData: { user: { id: string }; orders: Array<{ id: string }> } }>({
  seedData: async ({ request }, use) => {
    const resp = await request.post('/api/test/seed', {
      data: { user: userFactory.build(), orders: orderFactory.buildList(3) },
    });
    const seedData = await resp.json();
    await use(seedData);
    await request.post('/api/test/cleanup', { data: { userId: seedData.user.id } });
  },
});

Data Anonymization

When production data is needed for realistic testing, anonymize it before use.

PII Masking Rules

Data Type	Anonymization Method	Example
Email	Faker email with original domain pattern	`jane.doe@acme.com` -> `user-7291@test.example.com`
Full name	Faker name	`Jane Doe` -> `Alice Johnson`
Phone number	Faker phone, preserve format	`+1-555-123-4567` -> `+1-555-987-6543`
Address	Faker address, preserve country/region	`123 Main St, NYC` -> `456 Oak Ave, NYC`
SSN/National ID	Test pattern	`123-45-6789` -> `000-00-0001`
Credit card	Test card numbers	`4111-...` -> `4242-4242-4242-4242`
Date of birth	Shift by fixed offset	`1990-03-15` -> `1987-07-22`

Anonymization with Faker.js

// scripts/anonymize.ts
import { faker } from '@faker-js/faker';
faker.seed(42); // Deterministic output across runs

function anonymizeUser(user: Record<string, unknown>, index: number) {
  return {
    ...user,
    email: `user-${index + 1}@test.example.com`,
    name: faker.person.fullName(),
    phone: faker.phone.number(),
    dateOfBirth: faker.date.birthdate({ min: 18, max: 80, mode: 'age' }),
    ssn: `000-00-${String(index + 1).padStart(4, '0')}`,
  };
}

Referential Integrity During Anonymization

Anonymizing a user's email must also update their email in orders, comments, audit logs, and every other table that references it. Build an anonymization pipeline that:

Maps original values to anonymized values in a lookup table
Processes parent records first, then child records using the same lookup
Validates referential integrity after anonymization
Runs in a transaction so partial anonymization cannot occur

GDPR Compliance Checklist

No real PII exists in any non-production environment
Anonymization is irreversible (no lookup table mapping back to originals is stored)
Anonymization preserves data distributions (age ranges, geographic spread) for realistic testing
Anonymized data cannot be re-identified through combination of quasi-identifiers
Data retention policies apply to test environments (auto-delete after N days)
The anonymization pipeline runs automatically, not manually (eliminates human error)

Database Seeding

Idempotent Seed Scripts

Seed scripts should be safe to run multiple times without duplicating data. Use upsert patterns (ON CONFLICT ... DO UPDATE) for idempotency.

Per-Test vs Per-Suite Data

Strategy	When to Use	Pros	Cons
Per-test setup/teardown	Tests that modify data	Full isolation, parallel-safe	Slower, more setup code
Per-suite seed	Read-only reference data	Fast, simple	Cannot be modified by tests
Per-worker seed	Playwright parallel workers	Balances speed and isolation	Requires worker-scoped fixtures
Global seed	Environment bootstrap	Runs once, sets up baseline	Must be idempotent, shared state risk

Cleanup Strategies

Transaction Rollback (Fastest): Wrap each test in a transaction and roll back after. Works for unit and integration tests with direct DB access.

let tx: Transaction;
beforeEach(async () => { tx = await db.beginTransaction(); });
afterEach(async () => { await tx.rollback(); });

Truncation (Thorough): Delete all data from test tables between suites. Use TRUNCATE TABLE ... CASCADE for efficiency.

API-Based Cleanup (E2E Tests): For E2E tests that cannot access the database directly, register resources for cleanup via a fixture:

export const test = base.extend<{ cleanup: (id: string, type: string) => void }>({
  cleanup: async ({ request }, use) => {
    const toClean: Array<{ id: string; type: string }> = [];
    await use((id, type) => toClean.push({ id, type }));
    for (const r of toClean.reverse()) {
      await request.delete(`/api/test/${r.type}/${r.id}`);
    }
  },
});

Synthetic Data Generation

Edge Case Distributions

Factories should make it easy to generate edge case data:

// tests/factories/edge-cases.ts
export const edgeCaseStrings = [
  '',                               // Empty string
  '  leading and trailing  ',       // Whitespace
  'a'.repeat(10_000),               // Very long string
  '<script>alert("xss")</script>',  // XSS attempt
  "Robert'); DROP TABLE users;--",  // SQL injection
  '\u0000\u0001\u0002',             // Null/control characters
  '\u202Eoverride\u202C',           // RTL override
];

export const edgeCaseDates = [
  new Date('1970-01-01T00:00:00Z'), // Unix epoch
  new Date('2038-01-19T03:14:07Z'), // 32-bit overflow
  new Date('2024-02-29T00:00:00Z'), // Leap day
  new Date('2025-03-09T02:30:00-05:00'), // During DST transition
];

Boundary Value Generation

export function boundaryValues(min: number, max: number): number[] {
  return [min - 1, min, min + 1, Math.floor((min + max) / 2), max - 1, max, max + 1];
}

// Usage
test.each(boundaryValues(1, 100).map(v => [v]))(
  'validates quantity %i correctly',
  (quantity) => {
    const result = validateQuantity(quantity);
    if (quantity >= 1 && quantity <= 100) {
      expect(result.valid).toBe(true);
    } else {
      expect(result.valid).toBe(false);
    }
  }
);

Anti-Patterns

Shared Mutable Test Data

Multiple tests reading and writing the same database rows. Test A creates a user, Test B modifies it, Test C asserts on the original state and fails. Fix by having each test create its own data through factories.

Production Data Without Anonymization

Copying the production database to staging for "realistic testing." This violates GDPR, risks data breaches in less-secured environments, and creates compliance liability. Always anonymize before use, or generate synthetic data that matches production distributions.

Non-Deterministic Data

Using Math.random() or Date.now() in test data creation without seeding. Tests pass on Monday and fail on Tuesday because the random name generated happens to exceed a field length limit. Use seeded Faker instances and fixed timestamps.

No Cleanup Strategy

Tests that create data and never clean it up. The test database grows until it affects performance, or stale data causes false positives in other tests. Every data creation must have a corresponding cleanup.

Fixture File Explosion

Creating a separate JSON fixture file for every test variation. Instead of user-admin.json, user-inactive.json, user-admin-inactive.json, use a factory with traits. Fixtures should be reserved for static reference data and API response mocks.

Over-Specified Test Data

Creating a complete user object with 30 fields when the test only cares about role. This obscures intent and makes tests brittle. Factories with sensible defaults solve this: override only what the test cares about.

Hard-Coded IDs

Using userId: '1' in tests. This couples tests to database state and breaks when running in parallel (ID collision) or against a database with existing data. Use factory sequences or UUIDs.

Done When

Factory or fixture functions cover all entity types needed by the test suite
Test data isolated per test — no shared mutable state between tests
Seed scripts runnable in CI without manual intervention
No real PII used in test fixtures — all sensitive data anonymized or synthetic
Data cleanup verified after each test run (no orphaned records accumulate)

Related Skills

unit-testing -- Unit tests are the primary consumer of factory-generated data; this skill provides the data layer.
api-testing -- API tests use both factories (for request bodies) and fixtures (for mocked responses).
playwright-automation -- E2E tests need test data seeded via API or fixtures before browser interaction.
test-reliability -- Deterministic test data eliminates a major source of test flakiness.
ci-cd-integration -- Database seeding and cleanup must be integrated into CI pipeline stages.