AI Testing Guidelines

Principles for writing tests that are reliable, maintainable, and actually catch bugs.

1. Test Design

Define Purpose and Scope

Each test should have a clear, singular purpose:

Unit test: Isolated function logic
Integration test: Component interactions
Error path test: Specific failure handling
Success path test: Happy path behavior

Don't mix concerns. A test that checks both success and error handling is two tests.

Test Behaviors, Not Methods

Focus each test on a single, specific behavior:

# Good: Tests one behavior
def test_withdrawal_with_insufficient_funds_raises_error():
    account = Account(balance=50)
    with pytest.raises(InsufficientFundsError):
        account.withdraw(100)

# Bad: Tests multiple behaviors
def test_account():
    account = Account(balance=100)
    account.withdraw(50)
    assert account.balance == 50
    with pytest.raises(InsufficientFundsError):
        account.withdraw(100)
    account.deposit(200)
    assert account.balance == 250

Test the Right Layer

Layer	Test Type	Test Double Strategy
Business logic	Unit test	Fake or mock all I/O
API endpoints	Integration test	Fake external services
Database queries	Integration test	Use test database or fake
Full workflows	E2E test	Minimal faking

Test via Public APIs

Exercise code through its intended public interface:

# Good: Uses public API
result = service.process_order(order)
assert result.status == "completed"

# Bad: Testing private implementation
service._validate_order(order)  # Don't test private methods
service._internal_cache["key"]  # Don't inspect internals

If a private method needs testing, refactor it into its own component with a public API.
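
As a minimal sketch of that refactoring, the validation rules below move out of the service into a separate component with its own public method. `Order`, `OrderValidator`, and `OrderService` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Order:
    item_id: str
    quantity: int


class OrderValidator:
    """Hypothetical component extracted from a private _validate_order method."""

    def validate(self, order: Order) -> list[str]:
        errors = []
        if not order.item_id:
            errors.append("item_id is required")
        if order.quantity <= 0:
            errors.append("quantity must be positive")
        return errors


class OrderService:
    def __init__(self, validator: OrderValidator):
        self._validator = validator

    def process_order(self, order: Order) -> str:
        # The service delegates instead of owning the validation rules
        return "rejected" if self._validator.validate(order) else "completed"


def test_validator_rejects_non_positive_quantity():
    errors = OrderValidator().validate(Order(item_id="sku-1", quantity=0))
    assert errors == ["quantity must be positive"]


def test_process_order_completes_valid_order():
    service = OrderService(OrderValidator())
    assert service.process_order(Order(item_id="sku-1", quantity=2)) == "completed"
```

The validation rules now have direct tests through `validate`, and the service test still goes through `process_order` only.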

Verify Assumptions Explicitly

Don't assume framework behavior. If you expect validation to reject bad input, write a test that proves it.
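
A common example of such an assumption: dataclass type annotations are not enforced at runtime, so a wrong type is accepted unless the code checks for it. The sketch below uses a hypothetical `SignupRequest` and proves the rejection with a test instead of assuming it.

```python
from dataclasses import dataclass


@dataclass
class SignupRequest:
    email: str
    age: int

    def __post_init__(self):
        # Annotations alone do not reject bad input, so check explicitly
        if not isinstance(self.age, int):
            raise TypeError("age must be an int")


def test_signup_rejects_non_integer_age():
    try:
        SignupRequest(email="alice@example.com", age="thirty")
    except TypeError:
        return  # the behavior we relied on is now proven
    raise AssertionError("expected TypeError for a non-integer age")
```

With pytest available, `pytest.raises(TypeError)` expresses the same check more compactly.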

2. Test Structure

Arrange, Act, Assert

Organize every test into three distinct phases:

def test_create_user_success():
    # Arrange: Set up preconditions
    user_data = {"name": "Alice", "email": "alice@example.com"}
    service = UserService(fake_database)

    # Act: Execute the operation
    user = service.create_user(user_data)

    # Assert: Verify outcomes
    assert user.id is not None
    assert user.name == "Alice"
    assert fake_database.get_user(user.id) == user

Keep phases distinct. Don't do more setup after the Act phase.
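
To illustrate, here is a sketch of an interleaved test next to the same checks split into separate tests. `Cart` is a hypothetical class used only for this example.

```python
class Cart:
    """Hypothetical shopping cart used only for this illustration."""

    def __init__(self):
        self._prices = []

    def add(self, price: int) -> None:
        self._prices.append(price)

    def total(self) -> int:
        return sum(self._prices)


# Bad: more setup happens after the first act, so a failure is hard to attribute
def test_cart_interleaved():
    cart = Cart()
    cart.add(10)
    assert cart.total() == 10
    cart.add(5)
    assert cart.total() == 15


# Good: each test has one arrange, one act, and one assert phase
def test_total_of_single_item():
    cart = Cart()
    cart.add(10)

    total = cart.total()

    assert total == 10


def test_total_sums_all_items():
    cart = Cart()
    cart.add(10)
    cart.add(5)

    total = cart.total()

    assert total == 15
```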

No Logic in Tests

Tests should be trivially correct upon inspection:

# Good: Explicit expected value
assert calculate_total(items) == 150.00

# Bad: Logic that could have bugs
expected = sum(item.price * item.quantity for item in items)
assert calculate_total(items) == expected

Avoid loops, conditionals, and calculations in test bodies. If you need complex setup, extract it to a helper function (and test that helper).
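
A sketch of what an extracted helper and its own test might look like. `make_items` is a hypothetical helper, and `calculate_total` here is only a stand-in for the code under test.

```python
from dataclasses import dataclass


@dataclass
class Item:
    price: int      # price in cents, to avoid float rounding
    quantity: int


def make_items(*pairs):
    """Hypothetical test helper: build Items from (price, quantity) pairs."""
    return [Item(price=p, quantity=q) for p, q in pairs]


def calculate_total(items):
    # Stand-in for the code under test
    return sum(item.price * item.quantity for item in items)


# The helper contains logic, so it gets its own test
def test_make_items_builds_one_item_per_pair():
    assert make_items((500, 2), (250, 1)) == [Item(500, 2), Item(250, 1)]


# The test body stays free of logic: literal inputs, literal expected value
def test_calculate_total_multiplies_price_by_quantity():
    items = make_items((500, 2), (250, 1))
    assert calculate_total(items) == 1250
```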

Clear Naming

Test names should describe the behavior and expected outcome:

test_[method_name_]context_expected_result

test_withdraw_insufficient_funds_raises_error
test_create_user_duplicate_email_returns_conflict
test_get_order_not_found_returns_none

Setup Methods and Fixtures

Use fixtures for implicit, safe defaults shared by many tests:

@pytest.fixture
def fake_database():
    return FakeDatabase()

@pytest.fixture
def user_service(fake_database):
    return UserService(fake_database)

Critical: Tests must not rely on specific values from fixtures. If a test cares about a specific value, set it explicitly in the test.
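
The sketch below shows the difference, using a plain factory function in the role a fixture would play. `make_user` and `can_rent_car` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    age: int


def make_user(**overrides):
    """Hypothetical shared default, playing the role a fixture would."""
    defaults = {"name": "Default User", "age": 30}
    return User(**{**defaults, **overrides})


def can_rent_car(user: User) -> bool:
    # Stand-in rule for the code under test
    return user.age >= 21


# Fragile: silently depends on the default age being 30
def test_can_rent_car_relies_on_default():
    assert can_rent_car(make_user())


# Better: the value the test cares about is visible in the test
def test_adult_can_rent_car():
    assert can_rent_car(make_user(age=21))


def test_minor_cannot_rent_car():
    assert not can_rent_car(make_user(age=20))
```

If someone later changes the default age, the first test changes meaning without anyone noticing, while the other two keep testing what their names say.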

3. Test Doubles: Prefer Fakes Over Mocks

The Test Double Hierarchy

Real implementations - Use when fast, deterministic, simple
Fakes - Lightweight working implementations (preferred)
Stubs - Predetermined return values (use sparingly)
Mocks - Record interactions (use as last resort)

Why Fakes Beat Mocks

Aspect	Fakes	Mocks
State verification	✅ Check actual state	❌ Only verify calls made
Brittleness	Low - tests the contract	High - coupled to implementation
Readability	Clear state setup	Complex mock configuration
Realism	Behaves like real thing	May hide real issues
Maintenance	Centralized, reusable	Scattered across tests

Fake Example

class FakeUserRepository:
    def __init__(self):
        self._users = {}

    def save(self, user: User) -> None:
        self._users[user.id] = user

    def get(self, user_id: str) -> User | None:
        return self._users.get(user_id)

    def delete(self, user_id: str) -> bool:
        if user_id in self._users:
            del self._users[user_id]
            return True
        return False

# Test using fake - can verify actual state
def test_create_user_saves_to_repository():
    fake_repo = FakeUserRepository()
    service = UserService(fake_repo)

    user = service.create_user({"name": "Alice"})

    # State verification - check actual result
    saved_user = fake_repo.get(user.id)
    assert saved_user is not None
    assert saved_user.name == "Alice"

When Mocks Are Acceptable

No fake exists and creating one would take disproportionate effort
The order or number of interactions is the behavior under test (e.g., caching behavior)
The contract of the system under test (SUT) is defined by interactions (e.g., event emission)
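
As a sketch of the event-emission case, where the interaction itself is the contract, a mock is a reasonable fit. `SignupService` and the event name are hypothetical.

```python
from unittest.mock import Mock


class SignupService:
    """Hypothetical service whose contract is 'publish an event on signup'."""

    def __init__(self, publish):
        self._publish = publish

    def register(self, email: str) -> None:
        self._publish("user_registered", {"email": email})


def test_register_emits_user_registered_event():
    publish = Mock()
    service = SignupService(publish)

    service.register("alice@example.com")

    # The emitted event is the observable behavior here
    publish.assert_called_once_with(
        "user_registered", {"email": "alice@example.com"}
    )
```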

Mock Anti-Patterns

Don't mock types you don't own:

# Bad: Mocking external library directly
mock_requests = MagicMock()
mock_requests.get.return_value.json.return_value = {"data": "value"}

# Good: Create a wrapper you own, mock that
class HttpClient:
    def get_json(self, url: str) -> dict:
        return requests.get(url).json()

# Now mock your wrapper
mock_client = MagicMock(spec=HttpClient)
mock_client.get_json.return_value = {"data": "value"}

Don't mock value objects:

# Bad: Mocking simple data
mock_date = MagicMock()
mock_date.year = 2024

# Good: Use real value
from datetime import date
real_date = date(2024, 1, 15)

Don't mock indirect dependencies:

# Bad: Mocking a dependency of a dependency
mock_connection = MagicMock()  # Used by repository, not by service
service = UserService(UserRepository(mock_connection))

# Good: Mock or fake the direct dependency
fake_repo = FakeUserRepository()
service = UserService(fake_repo)

4. Assertions

Assert Observable Behavior

Base assertions on actual outputs and side effects:

# Good: Assert observable result
assert result.status == "completed"
assert len(result.items) == 3

# Bad: Assert implementation detail
assert service._internal_cache["key"] == expected

Prefer State Verification Over Interaction Verification

# Good: Verify the actual state change
user = service.create_user(data)
assert fake_repo.get(user.id) is not None

# Avoid: Only verifying a call was made
mock_repo.save.assert_called_once()  # Doesn't prove it worked

Only Verify State-Changing Interactions

If you must use mocks, verify calls that change external state:

# Good: Verify state-changing call
mock_email_service.send.assert_called_once_with(
    to="user@example.com",
    subject="Welcome"
)

# Bad: Verify read-only call
mock_repo.get_user.assert_called_with(user_id)  # Who cares?

Make Assertions Resilient

# Brittle: Exact string match
assert str(error) == "Failed to connect to database at localhost:5432"

# Robust: Essential content (Python 3 exceptions have no .message attribute)
assert "Failed to connect" in str(error)
assert isinstance(error, ConnectionError)

Critical: Avoid asserting exact log messages. Their wording changes often and is rarely part of the contract.
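
When a log record does need checking, one option is to assert on its level and on the key data it carries instead of the full text. The sketch below collects records with a small custom handler; `ListHandler` and `connect` are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collects emitted records in a list for later inspection."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def connect(host: str, log: logging.Logger) -> bool:
    # Stand-in for code under test that logs a failure
    log.error("Failed to connect to database at %s", host)
    return False


def test_connect_failure_is_logged_as_error():
    log = logging.getLogger("example.connect")
    log.propagate = False
    handler = ListHandler()
    log.addHandler(handler)
    try:
        connect("localhost:5432", log)
    finally:
        log.removeHandler(handler)

    assert len(handler.records) == 1
    assert handler.records[0].levelno == logging.ERROR
    assert "localhost:5432" in handler.records[0].getMessage()
```

Under pytest, the built-in `caplog` fixture can replace the custom handler.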

5. Mocking Techniques (When Necessary)

Match Real Signatures Exactly

# Real function
async def fetch_user(user_id: str, include_profile: bool = False) -> User:
    ...

# Mock must be awaitable and accept the same arguments as the real function
mock_fetch = AsyncMock()  # Not MagicMock for async
mock_fetch.return_value = User(id="123", name="Test")
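
One way to have the mock enforce the real signature is `unittest.mock.create_autospec`. The sketch below assumes Python 3.8 or later, where autospeccing an async function yields an awaitable mock. The `User` dataclass and the body of `fetch_user` are stand-ins.

```python
import asyncio
from dataclasses import dataclass
from unittest.mock import create_autospec


@dataclass
class User:
    id: str
    name: str


async def fetch_user(user_id: str, include_profile: bool = False) -> User:
    raise NotImplementedError("stand-in for the real network call")


async def demo():
    # create_autospec copies the signature of fetch_user onto the mock
    mock_fetch = create_autospec(fetch_user)
    mock_fetch.return_value = User(id="123", name="Test")

    user = await mock_fetch("123", include_profile=True)  # matches the signature

    try:
        await mock_fetch("123", timeout=5)  # not a parameter of fetch_user
    except TypeError:
        rejected = True
    else:
        rejected = False
    return user, rejected


user, rejected = asyncio.run(demo())
```

A bare `AsyncMock()` would have accepted `timeout=5` silently, and the mismatch would only surface against the real function.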

Understand Before Mocking

Never guess at behavior. Inspect real objects first:

# Write a temporary script to inspect real behavior
from external_api import Client

client = Client()
response = client.get_user("123")
print(type(response))      # <class 'User'>
print(dir(response))       # ['id', 'name', 'email', ...]
print(repr(response))      # User(id='123', name='Alice', ...)

Base your mock on observed reality, not assumptions.

Patch Where Used, Not Where Defined

# src/services/user_service.py
from external_api import Client  # Imported here

class UserService:
    def __init__(self):
        self.client = Client()

# In tests - patch where it's used
from unittest.mock import patch

@patch("src.services.user_service.Client")  # Not "external_api.Client"
def test_user_service(mock_client_class):
    # UserService() calls Client(), so it receives this return value
    mock_client_class.return_value = fake_client
    ...

Handle Dependency Injection Caching

# If the app caches clients, clear the cache in tests
from unittest.mock import MagicMock

import src.dependencies as deps

def test_with_mock_client(monkeypatch):
    mock_client = MagicMock()

    # Patch the class where instantiated
    monkeypatch.setattr(deps, "ExternalClient", lambda: mock_client)

    # Clear any cached instance; monkeypatch restores the old value after the test
    monkeypatch.setattr(deps, "_cached_client", None, raising=False)

6. Debugging Test Failures

Investigate Internal Code First

When a test fails, assume the bug is in your code:

Check logic errors in the code under test
Verify assumptions about data
Look for unexpected interactions between components
Trace the call flow

Only blame external libraries after exhausting internal causes.

Add Granular Logging

# Temporarily add detailed logging to understand flow
import logging
log = logging.getLogger(__name__)

def process_order(self, order):
    log.debug(f"[process_order] Entry: order={repr(order)}")

    validated = self._validate(order)
    log.debug(f"[process_order] After validate: {repr(validated)}")

    result = self._save(validated)
    log.debug(f"[process_order] After save: {repr(result)}")

    return result

Use repr() to reveal hidden characters in strings.
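
A short illustration of why this helps: two strings can look the same when printed and still compare unequal. The order ID below is made up.

```python
expected = "ORD-42"
from_file = "ORD-42\r"  # trailing carriage return, e.g. from a CRLF file

print(from_file)        # usually looks the same as "ORD-42" in a terminal
print(repr(from_file))  # 'ORD-42\r'

assert from_file != expected
assert from_file.strip() == expected
```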

Verify Library Interfaces

When a library call raises a TypeError, read the source:

# Error: TypeError: fetch() got an unexpected keyword argument 'timeout'

# Don't guess. Check the actual signature in the library:
# def fetch(self, url: str, *, max_retries: int = 3) -> Response:
#                           ^
#                           No timeout parameter!
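
For functions written in Python, `inspect.signature` can answer the same question without opening the source. The sketch below uses the standard library's `shutil.copyfile` as a stand-in for the library call. Some functions implemented in C do not expose a signature, in which case reading the documentation or source is still needed.

```python
import inspect
import shutil

# Parameters actually declared by the function
params = inspect.signature(shutil.copyfile).parameters
print(list(params))

assert "follow_symlinks" in params  # a real keyword-only option
assert "timeout" not in params      # passing timeout=... would raise TypeError
```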

Systematic Configuration Debugging

For logging/config issues, verify each step:

Config loading (env vars, arguments)
Logger setup (handlers, levels)
File paths and permissions
Execution environment (CWD, potential redirection)

Test in isolation before blaming the framework.
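
For the logger-setup step, a small inspection helper can show what is actually configured. This is a sketch; `describe_logger` and the logger name are hypothetical.

```python
import logging


def describe_logger(name: str) -> dict:
    """Summarize the settings that decide whether a record is emitted."""
    logger = logging.getLogger(name)
    return {
        "level": logging.getLevelName(logger.level),
        "effective_level": logging.getLevelName(logger.getEffectiveLevel()),
        "handlers": [type(h).__name__ for h in logger.handlers],
        "propagate": logger.propagate,
        "disabled": logger.disabled,
    }


# Example setup to inspect
app_log = logging.getLogger("example.app")
app_log.setLevel(logging.DEBUG)
app_log.addHandler(logging.NullHandler())

print(describe_logger("example.app"))
```

An empty handler list combined with `propagate` set to False, or an effective level above the level being logged, would each explain missing output.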

7. Test Organization

Split Large Test Files

tests/
  users/
    test_create_user.py
    test_update_user.py
    test_delete_user.py
    test_user_errors.py
  orders/
    test_create_order.py
    test_order_workflow.py

Verify All Parameterized Variants

@pytest.mark.parametrize("backend", ["asyncio", "trio"])
async def test_connection(backend):
    # Fix must work for ALL variants
    ...

Re-run After Every Change

Critical: After any modification, including linter fixes, formatting, or "trivial" changes, re-run the tests. Linter autofixes are not always behaviorally neutral.
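
One case where a style fix changes behavior: rewriting `== None` as `is None` (the comparison flagged by pycodestyle's E711) on an object that overloads `==`. The `Column` class below is a hypothetical stand-in for a query-builder type.

```python
class Column:
    """Hypothetical query-builder column that overloads == to build an expression."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other):
        if other is None:
            return f"{self.name} IS NULL"
        return f"{self.name} = {other!r}"


deleted_at = Column("deleted_at")

before_fix = deleted_at == None  # noqa: E711  (calls __eq__, builds an expression)
after_fix = deleted_at is None   # identity check, never calls __eq__

print(before_fix)  # deleted_at IS NULL
print(after_fix)   # False
```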

8. Testing Fakes Themselves

Fakes need tests to ensure fidelity with real implementations:

# Contract test - runs against both real and fake
class UserRepositoryContract:
    """Tests that run against any UserRepository implementation."""

    def test_save_and_retrieve(self, repo):
        user = User(id="1", name="Alice")
        repo.save(user)
        assert repo.get("1") == user

    def test_get_nonexistent_returns_none(self, repo):
        assert repo.get("nonexistent") is None

# Run against fake
class TestFakeUserRepository(UserRepositoryContract):
    @pytest.fixture
    def repo(self):
        return FakeUserRepository()

# Run against real (in integration tests)
class TestRealUserRepository(UserRepositoryContract):
    @pytest.fixture
    def repo(self, database):
        return UserRepository(database)

9. Common Pitfalls

Pitfall	Solution
Asserting exact error messages	Assert error type + key substring
Mocking with wrong signature	Copy signature from real code
Using MagicMock for async	Use AsyncMock
Testing implementation details	Test observable behavior only
Mocking types you don't own	Create wrapper, mock wrapper
Mocking value objects	Use real instances
Mocking indirect dependencies	Mock direct dependencies only
Verifying read-only calls	Only verify state-changing calls
Complex mock setup	Switch to fakes
Skipping re-run after changes	Always re-run tests
Blaming framework first	Exhaust internal causes first
Guessing at library behavior	Inspect real objects first
Scope creep during test fixes	Stick to defined scope
