ABOUTME: Test quality assessment using Farley's 8 Properties of Good Tests

ABOUTME: Detects tautological tests, mock theatre, and structural test weaknesses

# Test Design Reviewer

## Quality Notes

- Read every test file thoroughly before scoring
- Quality over speed: analyze what each test actually verifies
- Do not skip the Tautology Theatre check

## Process

### Step 1: Collect test files

Identify all test files in scope. Use language-appropriate patterns:

- Go: `*_test.go`
- Python: `test_*.py`, `*_test.py`
- Ruby: `*_spec.rb`
- JS/TS: `*.test.ts`, `*.spec.ts`
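For orientation, a minimal collection pass might look like the following sketch (the glob patterns are taken from the list above; the `collect_test_files` helper and its signature are illustrative):

```python
from pathlib import Path

# Globs for the languages listed above; extend per project as needed.
TEST_GLOBS = ["**/*_test.go", "**/test_*.py", "**/*_test.py",
              "**/*_spec.rb", "**/*.test.ts", "**/*.spec.ts"]

def collect_test_files(root: str) -> list[Path]:
    """Return a sorted, de-duplicated list of test files under root."""
    root_path = Path(root)
    files = {p for glob in TEST_GLOBS for p in root_path.glob(glob)}
    return sorted(files)
```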

### Step 2: Score against Farley's 8 Properties

Rate each property 0-10 across the test suite. Provide evidence.

| # | Property | Question to ask | Red flags |
|---|----------|-----------------|-----------|
| 1 | Understandable | Can you tell what's being tested in 5 seconds? | Cryptic names, no arrange/act/assert structure, shared state |
| 2 | Maintainable | Will this break when implementation changes? | Testing private methods, brittle selectors, hardcoded values |
| 3 | Repeatable | Same result every run, any order, any machine? | Time-dependent, filesystem-dependent, test ordering, shared DB state |
| 4 | Atomic | One reason to fail? | Multiple assertions testing different behaviors, setup-heavy |
| 5 | Necessary | Does this test earn its keep? | Duplicate coverage, testing framework/language behavior |
| 6 | Granular | Pinpoints the failure location? | Coarse assertions (`assert result`), catch-all tests |
| 7 | Fast | Runs in milliseconds? | Real HTTP calls, sleep/wait, full DB setup per test |
| 8 | First | Written before production code? | Tests that mirror implementation structure, not behavior |
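To make one of the red flags concrete, here is a hypothetical Repeatable violation (time-dependent assertion) and a deterministic rewrite; the function and names are invented for illustration:

```python
import datetime

# Red flag (Repeatable): the result depends on the machine's clock.
def test_report_header_flaky():
    report = f"Report {datetime.date.today().year}"
    assert report == "Report 2026"  # passes only while the calendar cooperates

# Better: make time an explicit input so the test gives the same result anywhere.
def build_report_header(today: datetime.date) -> str:
    return f"Report {today.year}"

def test_report_header_deterministic():
    assert build_report_header(datetime.date(2026, 3, 1)) == "Report 2026"
```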

Scoring methodology:

- Static scoring: compute 0-10 per property using sigmoid normalization of signal densities (negative signals per test method, positive signals per test method). Delegate the arithmetic to `lib/cli_calculator.py` for deterministic math (JSON in, JSON out).
- LLM scoring: assess each property holistically, focusing on the semantic aspects static analysis misses (naming quality, assertion appropriateness, tautology theatre).
- Blend: `final_property_score = 0.60 * static_score + 0.40 * llm_score` per property.
- Conservative default: when no signals are detected for a property, default to 5.0 (unknown quality, not good quality). See the sketch after this list.
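A minimal sketch of the static score, blend, and conservative default. The canonical arithmetic lives in `lib/cli_calculator.py`; the logistic steepness (6.0) and the use of net signal density here are assumptions made only for illustration:

```python
import math

def static_score(neg_count: int, pos_count: int, total_methods: int) -> float:
    """Map signal densities to a 0-10 score; default to 5.0 when no signals."""
    if total_methods == 0 or (neg_count == 0 and pos_count == 0):
        return 5.0  # conservative default: unknown quality, not good quality
    net_density = (pos_count - neg_count) / total_methods
    return 10.0 / (1.0 + math.exp(-6.0 * net_density))  # illustrative sigmoid

def blend(static: float, llm: float) -> float:
    """Per-property blend: 60% static, 40% LLM."""
    return 0.60 * static + 0.40 * llm
```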

Per-property scoring rubrics (anchor scores to these bands):

| Property | 9-10 | 7-8 | 5-6 | 3-4 | 1-2 |
|----------|------|-----|-----|-----|-----|
| U | Reads like specs; behavior clear without reading impl | Clear with minor ambiguities | Requires code inspection to understand | Cryptic; relies on impl details | test1/test2; magic numbers throughout |
| M | Proper abstractions; verifies behavior not impl | Good separation; occasional brittleness | Some impl coupling; some over-specified mocks | Tightly coupled; verify with exact counts | Reflection for private fields; mirrors impl exactly |
| R | Fully deterministic; no external deps | Rarely flaky; minimal env deps | Occasional flakiness; timing deps | Filesystem, timing, env deps present | sleep, file I/O, network, system time, unseeded random |
| A | Fully isolated; no shared state; parallelizable | Mostly isolated; minor shared setup | Some shared state; order sometimes matters | Heavy interdeps; must run in order | Shared mutable statics; ordering annotations |
| N | Every test adds unique value; parameterized for variations | Most tests valuable; minor redundancy | Checkbox exercises; moderate redundancy | Redundant tests; framework testing; mock tautologies | assertTrue(true); disabled tests; tests verify only mocks |
| G | Each test verifies single outcome; pinpoints issues | Focused; occasional logical assertion groups | Multiple behaviors; failure diagnosis takes effort | Sprawling; multiple unrelated assertions | 20+ assertions; testEverything() methods |
| F | Pure computation; no I/O; milliseconds | Quick; minor optimization opportunities | Some slow tests; noticeable suite time | File I/O or database calls | sleep, network calls, heavy setup/teardown |
| T | Clear test-first evidence; tests drive design | Likely test-first; good design influence | Unclear; tests may be afterthoughts | Mirrors impl; likely test-after; mock-heavy | Clearly written after code; coverage patches |

Aggregation methodology:

- Per test method: collect signals at the individual method level
- Per test file: mean for positive signals, P90 for negative signals (worst offenders must surface)
- Per test suite: LOC-weighted mean across files
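A minimal sketch of the file- and suite-level aggregation, assuming per-method signal scores are already available; the nearest-rank P90 and the helper names are illustrative:

```python
import math

def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile; enough for an illustrative sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.9 * len(ordered)))
    return ordered[rank - 1]

def aggregate_file(pos: list[float], neg: list[float]) -> tuple[float, float]:
    """Per file: mean of positive-signal scores, P90 of negative-signal scores."""
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    worst_neg = p90(neg) if neg else 0.0
    return mean_pos, worst_neg

def aggregate_suite(file_scores: list[float], file_locs: list[int]) -> float:
    """Per suite: LOC-weighted mean of per-file scores."""
    total_loc = sum(file_locs)
    return sum(score * loc for score, loc in zip(file_scores, file_locs)) / total_loc
```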

Sampling for large suites:

- 50 or fewer test files: analyze all of them
- More than 50: SHA-256 deterministic selection of 30%, plus all files exceeding 100 test methods
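One way to implement the deterministic selection, assuming the SHA-256 hash is taken over the file path (the hashed input is not specified above, so treat that detail as an assumption):

```python
import hashlib
from pathlib import Path

def sample_files(files: list[Path], method_counts: dict[Path, int]) -> list[Path]:
    """Deterministic ~30% sample by SHA-256 of the path, plus all large files."""
    if len(files) <= 50:
        return list(files)
    selected = []
    for path in files:
        digest = hashlib.sha256(str(path).encode("utf-8")).hexdigest()
        in_sample = int(digest, 16) % 100 < 30          # stable 30% bucket
        is_large = method_counts.get(path, 0) > 100      # always include big files
        if in_sample or is_large:
            selected.append(path)
    return selected
```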

Weighted Farley Index = (U*1.5 + M*1.5 + R*1.25 + A*1.0 + N*1.0 + G*1.0 + F*0.75 + T*1.0) / 9.0

Divisor is 9.0 (sum of weights), not 8 (number of properties). U/M weighted highest (readability, coupling); F weighted lowest (speed is contextual).
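For example, with the scores used in the `compute-farley` CLI example below (U=8.5, M=7.0, R=9.0, A=8.0, N=7.5, G=8.0, F=6.0, T=7.0):

(8.5*1.5 + 7.0*1.5 + 9.0*1.25 + 8.0*1.0 + 7.5*1.0 + 8.0*1.0 + 6.0*0.75 + 7.0*1.0) / 9.0 = 69.5 / 9.0 ≈ 7.72, which lands in the 7.5-8.9 "Excellent" band.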

| Range | Rating | Interpretation |
|-------|--------|----------------|
| 9.0-10.0 | Exemplary | Model suite; tests serve as living documentation |
| 7.5-8.9 | Excellent | High quality with minor improvement opportunities |
| 6.0-7.4 | Good | Solid foundation with clear areas for improvement |
| 4.5-5.9 | Fair | Functional but needs significant attention to test design |
| 3.0-4.4 | Poor | Tests provide limited value; major refactoring needed |
| 0.0-2.9 | Critical | Tests may be harmful; consider rewriting from scratch |
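The CLI's `get-rating` command performs this lookup; a plain-Python equivalent of the table, for reference only (not the CLI's actual code):

```python
RATING_BANDS = [
    (9.0, "Exemplary"), (7.5, "Excellent"), (6.0, "Good"),
    (4.5, "Fair"), (3.0, "Poor"), (0.0, "Critical"),
]

def get_rating(index: float) -> str:
    """Map a Farley Index to its rating band (mirrors the table above)."""
    for lower_bound, rating in RATING_BANDS:
        if index >= lower_bound:
            return rating
    return "Critical"
```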

### Step 3: Tautology Theatre Detection

The critical question: "Would this test still pass if all production code were deleted?"

Scan for these 4 patterns:

| Pattern | What it looks like | Example |
|---------|--------------------|---------|
| Mock tautology | Test verifies that a mock returns what it was told to return | `mock.return_value = 42; assert service.get() == 42` (only tests the mock) |
| Mock-only test | Every dependency is mocked, nothing real executes | Test with 5 mocks and zero real objects |
| Trivial tautology | Assertion is always true regardless of code | `assert isinstance(result, dict)` when function signature guarantees dict |
| Framework test | Tests framework behavior, not application logic | Testing that Rails validations work, that pytest fixtures inject |
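A hypothetical mock tautology next to a version that exercises real behavior; the `PriceService` example is invented for illustration, not taken from any reviewed codebase:

```python
from unittest.mock import Mock

# Mock tautology: the assertion can only ever see the value the mock was told
# to return, so it still passes if the production code were deleted.
def test_price_mock_tautology():
    gateway = Mock()
    gateway.fetch_price.return_value = 42
    assert gateway.fetch_price("SKU-1") == 42  # verifies the mock, not the code

# Better: run the real logic and mock only the boundary.
class PriceService:
    def __init__(self, gateway):
        self.gateway = gateway

    def price_with_tax(self, sku: str) -> float:
        return round(self.gateway.fetch_price(sku) * 1.2, 2)

def test_price_with_tax_runs_real_logic():
    gateway = Mock()
    gateway.fetch_price.return_value = 100.0
    assert PriceService(gateway).price_with_tax("SKU-1") == 120.0
```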

Also scan for mock interaction anti-patterns (affect Maintainable score):

| Pattern | What it looks like |
|---------|--------------------|
| Over-specified interactions | `verify` with exact call counts, call ordering, `verifyNoMoreInteractions` |
| Testing internal details | `ArgumentCaptor` deep inspection, `verify(never())` mirroring branches, high verify-to-assert ratio |
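The same smell expressed with Python's `unittest.mock` (the Mockito idioms above have direct equivalents; `sync_users` is a toy function invented for the illustration):

```python
from unittest.mock import Mock, call

def sync_users(repo, users=("alice", "bob")):
    """Toy function under test (illustrative)."""
    for user in users:
        repo.save(user)

# Over-specified: the test pins exact call counts, ordering, and the absence of
# other calls, so harmless refactors of sync_users break it (hurts Maintainable).
def test_sync_over_specified():
    repo = Mock()
    sync_users(repo)
    repo.save.assert_has_calls([call("alice"), call("bob")], any_order=False)
    assert repo.save.call_count == 2
    assert repo.delete.call_count == 0  # mirrors an internal branch

# Better: assert the observable outcome, not the interaction choreography.
def test_sync_saves_all_users():
    repo = Mock()
    sync_users(repo)
    assert {c.args[0] for c in repo.save.call_args_list} == {"alice", "bob"}
```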

For each tautology or anti-pattern found: report the file, line, pattern type, and why it's problematic.

### Step 4: Report

```markdown
## Test Design Review

### Farley Index: X.X / 10.0 (Rating)

| Property | Static | LLM | Blended | Weight | Weighted | Key Evidence |
|----------|--------|-----|---------|--------|----------|--------------|
| Understandable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Maintainable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Repeatable | X.X | X.X | X.X | 1.25x | X.XX | ... |
| Atomic | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Necessary | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Granular | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Fast | X.X | X.X | X.X | 0.75x | X.XX | ... |
| First (TDD) | X.X | X.X | X.X | 1.00x | X.XX | ... |

### Tautology Theatre Analysis

Each subsection is always present; use "None detected." when empty.

#### Mock Tautologies
| Test Method | Line | Mock Setup | Assertion |
#### Mock-Only Tests
| Test Method | Line | Evidence |
#### Trivial Tautologies
| Test Method | Line | Assertion |
#### Framework Tests
| Test Method | Line | Assertion | What It Actually Tests |

**Summary**: {total} instances across {affected}/{total_methods} test methods.

### Top 3 Improvements
1. [Highest-impact fix targeting weakest high-weight property]
2. [Second priority]
3. [Third priority]

### Methodology Notes
- Static/LLM blend: 60/40
- Files analyzed: {count} ({sampling note})
- Language: {lang}, Framework: {framework}
```

## Integration with Review Pipeline

This skill is invoked by the orchestrator when test files are in scope (see orchestrator-protocol.md, review routing step). It can also be invoked directly via `/test-design-reviewer`.

## Deterministic Scoring Calculator

`lib/cli_calculator.py` provides JSON-in, JSON-out math for reproducible scores. Delegate all Farley Index arithmetic to this CLI to avoid LLM rounding drift.

Commands: `normalize-property`, `blend-scores`, `compute-farley`, `get-rating`, `aggregate-file`, `aggregate-suite`, `full-pipeline`.

```bash
# Normalize a single property from signal counts
python lib/cli_calculator.py normalize-property '{"prop":"U","neg_count":2,"pos_count":8,"total_methods":20}'

# Compute Farley Index from 8 blended scores
python lib/cli_calculator.py compute-farley '{"U":8.5,"M":7.0,"R":9.0,"A":8.0,"N":7.5,"G":8.0,"F":6.0,"T":7.0}'

# End-to-end: raw signals + optional LLM scores -> index + rating
python lib/cli_calculator.py full-pipeline '{"properties":{"U":{"neg_count":2,"pos_count":8,"total_methods":20},...},"llm_scores":{"U":8.0,...}}'
```
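If a caller needs the result programmatically, one option is to shell out and parse stdout, assuming only the "JSON in, JSON out" contract stated above (the exact output keys are defined by the CLI itself):

```python
import json
import subprocess

scores = {"U": 8.5, "M": 7.0, "R": 9.0, "A": 8.0,
          "N": 7.5, "G": 8.0, "F": 6.0, "T": 7.0}

# Invoke the calculator and parse whatever JSON object it prints on stdout.
proc = subprocess.run(
    ["python", "lib/cli_calculator.py", "compute-farley", json.dumps(scores)],
    capture_output=True, text=True, check=True,
)
result = json.loads(proc.stdout)
print(result)  # structure depends on the CLI; see lib/cli_calculator.py
```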

## Common Issues

| Issue | Solution |
|-------|----------|
| No test files found | Check file patterns; some projects use non-standard locations |
| High mock count ≠ bad | Mocks are fine when testing boundaries; flag only when they replace all real logic |
| Property scores vary by test type | Score unit tests and integration tests separately if the suite is mixed |
| Legacy test suite scores low | Focus improvements on the top 3, not a full rewrite |