test-design-reviewer
ABOUTME: Test quality assessment using Farley's 8 Properties of Good Tests
ABOUTME: Detects tautological tests, mock theatre, and structural test weaknesses
Test Design Reviewer
Quality Notes
- Read every test file thoroughly before scoring
- Quality over speed: analyze what each test actually verifies
- Do not skip the Tautology Theatre check
Process
Step 1: Collect test files
Identify all test files in scope. Use language-appropriate patterns:
- Go: *_test.go
- Python: test_*.py, *_test.py
- Ruby: *_spec.rb
- JS/TS: *.test.ts, *.spec.ts
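The patterns above can be applied with a small file walker. This is a minimal sketch, not part of the skill itself: the names `TEST_PATTERNS`, `classify_test_file`, and `collect_test_files` are hypothetical, and matching is done on the file name only.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Dict, List, Optional

# Filename patterns from the list above, keyed by language.
TEST_PATTERNS: Dict[str, List[str]] = {
    "go": ["*_test.go"],
    "python": ["test_*.py", "*_test.py"],
    "ruby": ["*_spec.rb"],
    "js/ts": ["*.test.ts", "*.spec.ts"],
}


def classify_test_file(filename: str) -> Optional[str]:
    """Return the language whose pattern matches the file name, else None."""
    for language, patterns in TEST_PATTERNS.items():
        if any(fnmatch(filename, pattern) for pattern in patterns):
            return language
    return None


def collect_test_files(root: str) -> Dict[str, List[Path]]:
    """Walk `root` recursively and group matching test files by language."""
    found: Dict[str, List[Path]] = {language: [] for language in TEST_PATTERNS}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            language = classify_test_file(path.name)
            if language is not None:
                found[language].append(path)
    return found
```

Projects that keep tests in non-standard locations or use other naming schemes would need extra entries in the pattern table.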
Step 2: Score against Farley's 8 Properties
Rate each property 0-10 across the test suite. Provide evidence.
| # | Property | Question to ask | Red flags |
|---|---|---|---|
| 1 | Understandable | Can you tell what's being tested in 5 seconds? | Cryptic names, no arrange/act/assert structure, shared state |
| 2 | Maintainable | Will this break when implementation changes? | Testing private methods, brittle selectors, hardcoded values |
| 3 | Repeatable | Same result every run, any order, any machine? | Time-dependent, filesystem-dependent, test ordering, shared DB state |
| 4 | Atomic | One reason to fail? | Multiple assertions testing different behaviors, setup-heavy |
| 5 | Necessary | Does this test earn its keep? | Duplicate coverage, testing framework/language behavior |
| 6 | Granular | Pinpoints the failure location? | Coarse assertions (assert result), catch-all tests |
| 7 | Fast | Runs in milliseconds? | Real HTTP calls, sleep/wait, full DB setup per test |
| 8 | First | Written before production code? | Tests that mirror implementation structure, not behavior |
Scoring methodology:
- Static scoring: compute 0-10 per property using sigmoid normalization on signal densities (negative signals / test methods, positive signals / test methods). Use lib/cli_calculator.py for deterministic math (JSON in, JSON out).
- LLM scoring: assess holistically per property, focusing on semantic aspects static analysis misses (naming quality, assertion appropriateness, tautology theatre)
- Blend: final_property_score = 0.60 * static_score + 0.40 * llm_score per property
- Conservative default: when no signals detected for a property, default to 5.0 (unknown quality, not good quality)
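The scoring steps above can be illustrated as follows. The 60/40 blend and the 5.0 default come from the text; the exact sigmoid used by lib/cli_calculator.py is not given here, so the logistic curve, the net-density input, and the `steepness` parameter below are assumptions chosen only to show the shape of the calculation.

```python
import math

STATIC_WEIGHT = 0.60
LLM_WEIGHT = 0.40
DEFAULT_SCORE = 5.0  # unknown quality, not good quality


def static_score(neg_count: int, pos_count: int, total_methods: int,
                 steepness: float = 4.0) -> float:
    """Map signal densities to a 0-10 score with a logistic curve.

    ASSUMPTION: the real calculator's curve and constants are not documented
    here; this form is illustrative only.
    """
    if total_methods <= 0 or (neg_count == 0 and pos_count == 0):
        # No signals detected: fall back to the conservative default.
        return DEFAULT_SCORE
    # Positive density minus negative density, per test method.
    net_density = (pos_count - neg_count) / total_methods
    return 10.0 / (1.0 + math.exp(-steepness * net_density))


def blend(static: float, llm: float) -> float:
    """final_property_score = 0.60 * static_score + 0.40 * llm_score."""
    return STATIC_WEIGHT * static + LLM_WEIGHT * llm
```

With this shape, a property whose positive and negative signals cancel out also lands at 5.0, which keeps the default and the midpoint of the curve aligned.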
Per-property scoring rubrics (anchor scores to these bands):
| Property | 9-10 | 7-8 | 5-6 | 3-4 | 1-2 |
|---|---|---|---|---|---|
| U | Reads like specs; behavior clear without reading impl | Clear with minor ambiguities | Requires code inspection to understand | Cryptic; relies on impl details | test1/test2; magic numbers throughout |
| M | Proper abstractions; verifies behavior not impl | Good separation; occasional brittleness | Some impl coupling; some over-specified mocks | Tightly coupled; verify with exact counts | Reflection for private fields; mirrors impl exactly |
| R | Fully deterministic; no external deps | Rarely flaky; minimal env deps | Occasional flakiness; timing deps | Filesystem, timing, env deps present | sleep, file I/O, network, system time, unseeded random |
| A | Fully isolated; no shared state; parallelizable | Mostly isolated; minor shared setup | Some shared state; order sometimes matters | Heavy interdeps; must run in order | Shared mutable statics; ordering annotations |
| N | Every test adds unique value; parameterized for variations | Most tests valuable; minor redundancy | Checkbox exercises; moderate redundancy | Redundant tests; framework testing; mock tautologies | assertTrue(true); disabled tests; tests verify only mocks |
| G | Each test verifies single outcome; pinpoints issues | Focused; occasional logical assertion groups | Multiple behaviors; failure diagnosis takes effort | Sprawling; multiple unrelated assertions | 20+ assertions; testEverything() methods |
| F | Pure computation; no I/O; milliseconds | Quick; minor optimization opportunities | Some slow tests; noticeable suite time | File I/O or database calls | sleep, network calls, heavy setup/teardown |
| T | Clear test-first evidence; tests drive design | Likely test-first; good design influence | Unclear; tests may be afterthoughts | Mirrors impl; likely test-after; mock-heavy | Clearly written after code; coverage patches |
Aggregation methodology:
- Per-test-method: collect signals at individual method level
- Per-test-file: mean for positive signals, P90 for negative signals (worst offenders must surface)
- Per-test-suite: LOC-weighted mean across files
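The three aggregation levels can be sketched like this. The function names and input shapes are hypothetical, and the nearest-rank percentile method is an assumption, since the text only says "P90".

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def p90(values: Sequence[float]) -> float:
    """Nearest-rank 90th percentile (ASSUMPTION: nearest-rank method)."""
    if not values:
        raise ValueError("p90 requires at least one value")
    ordered = sorted(values)
    rank = -(-9 * len(ordered) // 10)  # ceil(0.9 * n) using integer math
    return ordered[rank - 1]


def aggregate_file(method_signals: List[Tuple[float, float]]) -> Dict[str, float]:
    """Combine per-method (positive, negative) signals into file-level values."""
    positives = [positive for positive, _ in method_signals]
    negatives = [negative for _, negative in method_signals]
    # Mean for positive signals, P90 for negative signals.
    return {"positive": mean(positives), "negative": p90(negatives)}


def aggregate_suite(file_scores: List[Tuple[float, int]]) -> float:
    """LOC-weighted mean over (score, lines_of_code) pairs, one per file."""
    total_loc = sum(loc for _, loc in file_scores)
    if total_loc <= 0:
        raise ValueError("total LOC must be positive")
    return sum(score * loc for score, loc in file_scores) / total_loc
```

Using a high percentile for negative signals means a handful of badly behaved test methods still moves the file-level value, where a mean would dilute them.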
Sampling for large suites:
- 50 or fewer test files: analyze all
- Over 50: SHA-256 deterministic selection (30%) plus all files exceeding 100 test methods
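One way to make the 30% selection deterministic is to hash each file path and keep the files whose hash falls in a fixed bucket range. This is a sketch under assumptions: the text does not say what is hashed or how the hash maps to a selection, so hashing the path and taking the first 8 bytes modulo 100 are illustrative choices, as is treating a suite of exactly 50 files as small. The selection is about 30% on average, not exactly 30%.

```python
import hashlib
from typing import Dict, List

SAMPLE_PERCENT = 30
FULL_ANALYSIS_LIMIT = 50   # suites up to this size are analyzed in full (assumed inclusive)
LARGE_FILE_METHODS = 100   # files above this method count are always included


def in_sample(path: str) -> bool:
    """Deterministically decide whether a path belongs to the sample."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < SAMPLE_PERCENT


def select_files(method_counts: Dict[str, int]) -> List[str]:
    """Pick files to analyze from a {path: test_method_count} mapping."""
    if len(method_counts) <= FULL_ANALYSIS_LIMIT:
        return sorted(method_counts)
    return sorted(
        path
        for path, count in method_counts.items()
        if count > LARGE_FILE_METHODS or in_sample(path)
    )
```

Because the decision depends only on the path, repeated reviews of the same suite analyze the same files, which keeps scores comparable between runs.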
Weighted Farley Index = (U*1.5 + M*1.5 + R*1.25 + A*1.0 + N*1.0 + G*1.0 + F*0.75 + T*1.0) / 9.0
Divisor is 9.0 (sum of weights), not 8 (number of properties). U/M weighted highest (readability, coupling); F weighted lowest (speed is contextual).
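The formula above translates directly into a few lines of Python. The weights are taken from the formula; the function name `farley_index` is hypothetical, and in practice the arithmetic is meant to be delegated to lib/cli_calculator.py.

```python
from typing import Dict

# Weights from the Weighted Farley Index formula; they sum to 9.0.
WEIGHTS: Dict[str, float] = {
    "U": 1.5, "M": 1.5, "R": 1.25, "A": 1.0,
    "N": 1.0, "G": 1.0, "F": 0.75, "T": 1.0,
}


def farley_index(scores: Dict[str, float]) -> float:
    """Weighted mean of the eight blended property scores (each 0-10)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError("missing property scores: %s" % sorted(missing))
    weighted_sum = sum(scores[prop] * weight for prop, weight in WEIGHTS.items())
    return weighted_sum / sum(WEIGHTS.values())


example = {"U": 8.5, "M": 7.0, "R": 9.0, "A": 8.0,
           "N": 7.5, "G": 8.0, "F": 6.0, "T": 7.0}
print(round(farley_index(example), 2))  # 69.5 / 9.0 -> 7.72
```

Dividing by the sum of the weights keeps the index on the same 0-10 scale as the individual properties; dividing by 8 would allow values above 10.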
| Range | Rating | Interpretation |
|---|---|---|
| 9.0-10.0 | Exemplary | Model suite; tests serve as living documentation |
| 7.5-8.9 | Excellent | High quality with minor improvement opportunities |
| 6.0-7.4 | Good | Solid foundation with clear areas for improvement |
| 4.5-5.9 | Fair | Functional but needs significant attention to test design |
| 3.0-4.4 | Poor | Tests provide limited value; major refactoring needed |
| 0.0-2.9 | Critical | Tests may be harmful; consider rewriting from scratch |
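The rating table can be expressed as a lookup on lower bounds. The bands are printed to one decimal place, so a value such as 8.95 falls between two rows; this sketch assumes such values belong to the lower band, which may differ from how the real get-rating command rounds. The names `RATING_BANDS` and `rating_for` are hypothetical.

```python
from typing import List, Tuple

# (lower bound, rating), highest band first, taken from the table above.
RATING_BANDS: List[Tuple[float, str]] = [
    (9.0, "Exemplary"),
    (7.5, "Excellent"),
    (6.0, "Good"),
    (4.5, "Fair"),
    (3.0, "Poor"),
    (0.0, "Critical"),
]


def rating_for(index: float) -> str:
    """Return the rating whose lower bound the index meets or exceeds."""
    if not 0.0 <= index <= 10.0:
        raise ValueError("Farley Index must be between 0.0 and 10.0")
    for lower_bound, rating in RATING_BANDS:
        if index >= lower_bound:
            return rating
    raise AssertionError("unreachable: 0.0 is the lowest bound")
```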
Step 3: Tautology Theatre Detection
The critical question: "Would this test still pass if all production code were deleted?"
Scan for these 4 patterns:
| Pattern | What it looks like | Example |
|---|---|---|
| Mock tautology | Test verifies that a mock returns what it was told to return | mock.return_value = 42; assert service.get() == 42 (only tests the mock) |
| Mock-only test | Every dependency is mocked, nothing real executes | Test with 5 mocks and zero real objects |
| Trivial tautology | Assertion is always true regardless of code | assert isinstance(result, dict) when function signature guarantees dict |
| Framework test | Tests framework behavior, not application logic | Testing that Rails validations work, that pytest fixtures inject |
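To make the mock tautology pattern concrete, the sketch below contrasts it with a test that exercises real logic. `PriceService` and its repository are hypothetical classes invented for illustration; only `unittest.mock.Mock` is a real library API.

```python
from unittest.mock import Mock


class PriceService:
    """Hypothetical production class: adds 20% tax to a base price in cents."""

    def __init__(self, repository):
        self._repository = repository

    def price_with_tax_cents(self, item_id: str) -> int:
        return self._repository.base_price_cents(item_id) * 120 // 100


def test_mock_tautology():
    # Anti-pattern: the object under test is itself a mock, so the assertion
    # only checks that the mock returns what it was told to return.
    service = Mock()
    service.get.return_value = 42
    assert service.get() == 42  # still passes if PriceService is deleted


def test_real_behavior():
    # Only the boundary (the repository) is mocked; the tax calculation in
    # PriceService actually runs and would fail if it were removed or broken.
    repository = Mock()
    repository.base_price_cents.return_value = 1000
    service = PriceService(repository)
    assert service.price_with_tax_cents("sku-1") == 1200
```

The first test passes the "would this still pass if all production code were deleted?" check, which is exactly why it should be flagged; the second does not.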
Also scan for mock interaction anti-patterns (affect Maintainable score):
| Pattern | What it looks like |
|---|---|
| Over-specified interactions | verify with exact call counts, call ordering, verifyNoMoreInteractions |
| Testing internal details | ArgumentCaptor deep inspection, verify(never()) mirroring branches, high verify-to-assert ratio |
For each tautology or anti-pattern found: report the file, line, pattern type, and why it's problematic.
Step 4: Report
## Test Design Review
### Farley Index: X.X / 10.0 (Rating)
| Property | Static | LLM | Blended | Weight | Weighted | Key Evidence |
|----------|--------|-----|---------|--------|----------|--------------|
| Understandable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Maintainable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Repeatable | X.X | X.X | X.X | 1.25x | X.XX | ... |
| Atomic | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Necessary | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Granular | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Fast | X.X | X.X | X.X | 0.75x | X.XX | ... |
| First (TDD) | X.X | X.X | X.X | 1.00x | X.XX | ... |
### Tautology Theatre Analysis
Each subsection always present; use "None detected." when empty.
#### Mock Tautologies
| Test Method | Line | Mock Setup | Assertion |
#### Mock-Only Tests
| Test Method | Line | Evidence |
#### Trivial Tautologies
| Test Method | Line | Assertion |
#### Framework Tests
| Test Method | Line | Assertion | What It Actually Tests |
**Summary**: {total} instances across {affected}/{total_methods} test methods.
### Top 3 Improvements
1. [Highest-impact fix targeting weakest high-weight property]
2. [Second priority]
3. [Third priority]
### Methodology Notes
- Static/LLM blend: 60/40
- Files analyzed: {count} ({sampling note})
- Language: {lang}, Framework: {framework}
Integration with Review Pipeline
This skill is invoked by the orchestrator when test files are in scope (see orchestrator-protocol.md, review routing step). It can also be invoked directly via /test-design-reviewer.
Deterministic Scoring Calculator
lib/cli_calculator.py provides JSON-in, JSON-out math for reproducible scores. Delegate all Farley Index arithmetic to this CLI to avoid LLM rounding drift.
Commands: normalize-property, blend-scores, compute-farley, get-rating, aggregate-file, aggregate-suite, full-pipeline.
# Normalize a single property from signal counts
python lib/cli_calculator.py normalize-property '{"prop":"U","neg_count":2,"pos_count":8,"total_methods":20}'
# Compute Farley Index from 8 blended scores
python lib/cli_calculator.py compute-farley '{"U":8.5,"M":7.0,"R":9.0,"A":8.0,"N":7.5,"G":8.0,"F":6.0,"T":7.0}'
# End-to-end: raw signals + optional LLM scores -> index + rating
python lib/cli_calculator.py full-pipeline '{"properties":{"U":{"neg_count":2,"pos_count":8,"total_methods":20},...},"llm_scores":{"U":8.0,...}}'
Common Issues
| Issue | Solution |
|---|---|
| No test files found | Check file patterns; some projects use non-standard locations |
| High mock count ≠ bad | Mocks are fine when testing boundaries; flag only when they replace all real logic |
| Property scores vary by test type | Score unit tests and integration tests separately if the suite is mixed |
| Legacy test suite scores low | Focus improvements on the top 3, not a full rewrite |