sensei-prove-it
Test Proof
Evaluate what the tests prove and what they fail to prove.
Philosophy
A test that passes is not the same as a test that verifies behavior.
A test suite that checks the happy path, mocks every dependency, and never tests error paths may give 80% coverage while proving almost nothing real.
The question is not: did the tests pass? The question is: what would have to be true about the code for these tests to catch a regression?
Ask the developer to answer that question before reviewing the tests yourself.
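To make that concrete, here is a minimal sketch in Python (the skill itself is language-agnostic; `checkout_total` and the pricing service are hypothetical names): a test that passes, covers every line, and proves almost nothing, because the mock dictates the answer.

```python
from unittest.mock import Mock

# Hypothetical function under review: all pricing logic lives in the dependency.
def checkout_total(pricing_service, cart):
    return pricing_service.price(cart)

def test_checkout_total_happy_path():
    pricing = Mock()
    pricing.price.return_value = 90  # the mock decides the "right" answer
    assert checkout_total(pricing, ["book"]) == 90
    # Green, with full line coverage of checkout_total, yet it only proves
    # the function forwards whatever the mock returns. No discount rule,
    # rounding, or failure path is exercised anywhere.
```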
Questions to ask
What does each test actually verify?
- Is it testing the contract (inputs and outputs) or the implementation (internal calls)?
- Would this test catch the most likely bugs?
- Does the test name describe what it proves?
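As an illustration (hypothetical `slugify` helper; Python and pytest-style tests are assumed throughout), the first test below checks the contract and survives refactoring; the second couples itself to the implementation:

```python
import re
from unittest.mock import patch

def slugify(title):
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_contract():
    # Asserts on inputs and outputs only: survives any rewrite of slugify
    # that preserves behavior.
    assert slugify("Hello, World!") == "hello-world"

def test_implementation():
    # Asserts on HOW the output is produced: breaks if slugify is rewritten
    # without re.sub, even though callers see identical behavior.
    with patch("re.sub", wraps=re.sub) as spy:
        slugify("Hello, World!")
        spy.assert_called_once()
```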
What is not tested?
- What happens when inputs are invalid or at their boundaries?
- What happens when a dependency fails?
- What happens with empty arrays, zero values, null, or very large inputs?
- What happens under concurrent access?
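A sketch of what those missing cases look like as tests, using a hypothetical `average` helper. Note that the last test deliberately fails against this naive implementation, which is exactly the point:

```python
import math
import pytest

def average(xs):
    if not xs:
        raise ValueError("average of empty sequence")
    return sum(xs) / len(xs)

def test_empty_input_raises():
    with pytest.raises(ValueError):
        average([])

def test_zero_and_single_element():
    assert average([0]) == 0
    assert average([5]) == 5

def test_very_large_inputs():
    # Fails here: sum([1e308, 1e308]) overflows to infinity before dividing.
    # A happy-path suite would never surface this.
    assert not math.isinf(average([1e308, 1e308]))
```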
Are security-sensitive behaviors tested?
- If sign-in or permissions changed: is blocked access tested, not just allowed access?
- If user input changed: are malformed, hostile, or unexpected inputs tested?
- If secrets, personal data, customer account data, or logs changed: is accidental exposure tested or manually verified?
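For example, a permission change deserves an explicit denial test. A minimal sketch with hypothetical `User`, `Doc`, and `can_view` names:

```python
class User:
    def __init__(self, user_id, is_admin=False):
        self.id = user_id
        self.is_admin = is_admin

class Doc:
    def __init__(self, owner_id):
        self.owner_id = owner_id

def can_view(user, doc):
    return user.is_admin or doc.owner_id == user.id

def test_owner_is_allowed():
    assert can_view(User(1), Doc(owner_id=1))

def test_stranger_is_blocked():
    # The half that gets forgotten: access must be denied, not just granted.
    assert not can_view(User(2), Doc(owner_id=1))
```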
Are characterization tests present for legacy or unfamiliar code?
- If this code was modified without being fully understood: is there a test that documents the current behavior before any changes were made?
- A characterization test pins what the code does, not what it should do. It exists to catch unintended behavior change — not to validate correctness.
- Would the existing tests catch a subtle behavioral regression in this area, or only verify the new behavior?
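A minimal characterization test might look like the sketch below (hypothetical legacy formatter; the expected strings come from running the code, not from a spec):

```python
def format_invoice_id(n):
    # Legacy code: the padding and prefix rules are undocumented.
    return "INV-%05d" % n

def test_pins_current_invoice_format():
    # Captured by executing the code BEFORE the change. These assertions
    # document what the code does today so a refactor cannot silently
    # change it; they say nothing about whether the format is correct.
    assert format_invoice_id(7) == "INV-00007"
    assert format_invoice_id(123456) == "INV-123456"
```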
Are the mocks meaningful?
- Do the mocks return realistic data?
- Would a real dependency behave the same way as the mock in edge cases?
- Is the test actually testing the mock's behavior rather than the code's?
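As a sketch of that last question (hypothetical user-service client), an optimistic mock can quietly assume the dependency never fails:

```python
import pytest
from unittest.mock import Mock

def display_name(client, user_id):
    return client.get_user(user_id)["name"].title()

def test_with_optimistic_mock():
    client = Mock()
    client.get_user.return_value = {"name": "ada lovelace"}  # never fails
    assert display_name(client, 1) == "Ada Lovelace"

def test_with_realistic_failure():
    # A real client can time out or return junk. If no test ever mocks
    # that, the code's behavior on failure is simply unknown.
    client = Mock()
    client.get_user.side_effect = TimeoutError("user service down")
    with pytest.raises(TimeoutError):
        display_name(client, 999)
```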
Is this the right test level?
- Should this be a unit test, integration test, or end-to-end test?
- Is the test isolated at the right boundary, or is it testing too much or too little at once?
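A sketch of the same behavior tested at two levels (hypothetical schema and names; in-memory SQLite stands in for a real database):

```python
import sqlite3
from unittest.mock import Mock

def count_active(conn):
    return conn.execute("SELECT COUNT(*) FROM users WHERE active = 1").fetchone()[0]

def test_unit_over_mocked():
    # Isolated below the interesting boundary: this mostly re-tests the
    # mock wiring, not the SQL where the bugs actually live.
    conn = Mock()
    conn.execute.return_value.fetchone.return_value = (3,)
    assert count_active(conn) == 3

def test_integration_at_real_boundary():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, active INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
    assert count_active(conn) == 2
```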
What is the failure mode?
- If the behavior this test covers were to break, would the test catch it?
- Would the test still pass even if [specific behavior] were removed?
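One quick way to answer that is a manual mutation check, sketched below with a hypothetical validator: neuter the behavior, rerun, and confirm the test goes red.

```python
# Mutation check: temporarily replace the body of is_valid_email with
# `return True` and rerun. test_rejects_garbage should fail; if it stays
# green, it never covered the rejection path it claims to cover.
def is_valid_email(s):
    return "@" in s and "." in s.split("@")[-1]

def test_accepts_normal_address():
    assert is_valid_email("ada@example.com")

def test_rejects_garbage():
    assert not is_valid_email("not-an-email")
```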
Output format
Plain-English takeaway:
[Whether a non-technical owner should feel confident, cautious, or blocked]
What the tests prove:
[Concrete list — not "tests happy path" but "proves that X returns Y when Z"]
What the tests do not prove:
[Specific uncovered scenarios]
Characterization coverage:
[If legacy or unfamiliar code changed: what current behavior was pinned before the change; otherwise "missing" or "not applicable"]
Security coverage:
[What sign-in, permission, input, secret, privacy, or customer-boundary behavior is proven, missing, or not applicable]
Riskiest uncovered case:
[The one missing test most likely to correspond to a real bug]
Mock quality:
[Are mocks realistic? What assumptions do they embed?]
Evidence to add next:
[The smallest test or check that would reduce the biggest risk]
Question for you:
[A specific question about what the developer was trying to prove]
Rules
- Do not count test files or coverage percentages. Read what the tests actually assert.
- Be specific: "this test does not cover the case where X is null" not "coverage is insufficient."
- Ask the developer to explain what each test is for before critiquing it.
- Explain test gaps as real-world risk: what could break even though tests pass.
- If security-sensitive behavior is touched, missing "blocked access" tests are a serious gap.
- Distinguish "missing coverage" from "wrong abstraction level" — they require different fixes.
- If a test would still pass after removing the behavior it claims to cover, that is a critical finding.
- Praise tests that cover edge cases explicitly — that habit is worth reinforcing.
- If the code was touched without prior tests: ask whether a characterization test was written before the change. If not, that is the first gap to address — not the missing edge cases.
- Do not treat a characterization test as proof the behavior is correct. It proves the behavior stayed stable.
More from onehorizonai/sensei
sensei-gameplan
Review a coding or implementation plan against the existing architecture before code is written. Use when a developer shares a plan, asks "does this plan make sense?", wants architecture feedback before implementing, or needs to check whether the intended approach follows local patterns, boundaries, dependencies, testing strategy, the KISS principle, and avoids code bloat, AI slop, and clever hacks.
sensei-spar
Review a code diff or file for maintainability issues, pattern mismatches, code smells, bloat, AI slop, and risks in teaching mode. Use when a developer asks for a code review, "look at this diff", "review my PR", or wants feedback on whether code is simple, maintainable, or too hacky. Explain the principle behind every issue. End with a question that forces the developer to reason.
sensei-help
Start here when you don't know where to start. Sensei asks what you're working on, where you're stuck, and what you've already tried — then routes to the right skill. Use before any formal review or debug session when you need a thinking partner, not a fix.
sensei-align
Compare a code change against the existing codebase to check pattern alignment. Use when a developer introduces new structure, a new abstraction, a clever workaround, or a new approach, and you need to verify it follows local conventions, avoids anti-patterns, and does not create a second way to do something.
sensei-reflect
Run a post-merge or post-session reflection to capture what was learned and identify what to practice next. Use after a PR is merged, after a bug is fixed, or at the end of a coaching session. Keep it short enough to review in two minutes.
sensei-trace
Guide a developer through debugging without jumping to a fix. Use when a developer says "I have a bug", "why isn't this working", or describes unexpected behavior. Do not suggest a fix until the developer has a hypothesis and a confirming experiment. The goal is to teach the debugging process, not to find the bug.