sensei-prove-it

Test Proof

Evaluate what the tests prove and what they fail to prove.

Philosophy

Tests that pass are not the same as tests that verify the behavior.

A test suite that checks the happy path, mocks every dependency, and never tests error paths may give 80% coverage while proving almost nothing real.
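
To make that concrete, here is a minimal Python sketch. The function and names (`charge_customer`, `gateway`) are hypothetical, invented purely for illustration:

```python
from unittest.mock import Mock

def charge_customer(gateway, amount):
    # Hypothetical function under test: delegates to a payment gateway.
    return gateway.charge(amount)

def test_charge_customer_happy_path():
    gateway = Mock()
    gateway.charge.return_value = {"status": "ok"}

    result = charge_customer(gateway, 100)

    # This assertion only reads back the canned value configured above.
    # It says nothing about negative amounts, gateway failures, or what
    # a real gateway returns, yet it counts toward coverage.
    assert result["status"] == "ok"
```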

The question is not: did the tests pass? The question is: what would have to be true about the code for these tests to catch a regression?

Ask the developer to answer that question before reviewing the tests yourself.

Questions to ask

What does each test actually verify?

  • Is it testing the contract (inputs and outputs) or the implementation (internal calls)? The sketch after this list contrasts the two.
  • Would this test catch the most likely bugs?
  • Does the test name describe what it proves?
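
A minimal sketch of the contract/implementation distinction, using a hypothetical `get_display_name` function (all names invented for illustration):

```python
from unittest.mock import Mock

def get_display_name(repo, user_id):
    # Hypothetical function under test.
    user = repo.find(user_id)
    return f"{user['first']} {user['last']}".strip()

def test_contract_full_name():
    # Contract test: asserts on the output for a given input.
    # Survives refactoring; fails only when observable behavior changes.
    repo = Mock()
    repo.find.return_value = {"first": "Ada", "last": "Lovelace"}
    assert get_display_name(repo, 1) == "Ada Lovelace"

def test_implementation_detail():
    # Implementation test: asserts which internal call was made.
    # It keeps passing even if the name formatting breaks.
    repo = Mock()
    repo.find.return_value = {"first": "Ada", "last": "Lovelace"}
    get_display_name(repo, 1)
    repo.find.assert_called_once_with(1)
```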

What is not tested?

  • What happens when inputs are invalid or at their boundaries?
  • What happens when a dependency fails?
  • What happens with empty arrays, zero values, null, or very large inputs? (One such boundary case is sketched after this list.)
  • What happens under concurrent access?
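
For example, a sketch of a boundary-case test over a hypothetical `average` function:

```python
import pytest

def average(values):
    # Hypothetical function under test.
    return sum(values) / len(values)

def test_average_happy_path():
    assert average([2, 4]) == 3

def test_average_empty_list():
    # The boundary case a happy-path suite never reaches: today an
    # empty list raises ZeroDivisionError. Pinning it forces the team
    # to decide whether that is the intended contract.
    with pytest.raises(ZeroDivisionError):
        average([])
```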

Are security-sensitive behaviors tested?

  • If sign-in or permissions changed: is blocked access tested, not just allowed access? (A blocked-access test is sketched after this list.)
  • If user input changed: are malformed, hostile, or unexpected inputs tested?
  • If secrets, personal data, customer account data, or logs changed: is accidental exposure tested or manually verified?
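
A minimal sketch of the blocked-access case, assuming a hypothetical `can_view` permission check:

```python
def can_view(user, document):
    # Hypothetical permission check.
    return user["role"] == "admin" or user["id"] == document["owner_id"]

def test_owner_is_allowed():
    assert can_view({"id": 7, "role": "member"}, {"owner_id": 7})

def test_non_owner_is_blocked():
    # Suites often stop at the allowed case above. This denial case
    # is the one that catches a broken permission check.
    assert not can_view({"id": 8, "role": "member"}, {"owner_id": 7})
```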

Are characterization tests present for legacy or unfamiliar code?

  • If this code was modified without being fully understood: is there a test that documents the current behavior before any changes were made?
  • A characterization test pins what the code does, not what it should do. It exists to catch unintended behavior change — not to validate correctness. (A minimal example follows this list.)
  • Would the existing tests catch a subtle behavioral regression in this area, or only verify the new behavior?
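
As an illustration, a characterization test over a hypothetical `legacy_slugify` function (names invented; the point is pinning, not correctness):

```python
def legacy_slugify(title):
    # Stand-in for unfamiliar legacy code.
    return title.strip().lower().replace(" ", "-")

def test_characterize_current_slug_behavior():
    # These assertions record what the code does today, including the
    # arguably wrong double hyphen. They claim stability, not correctness.
    assert legacy_slugify("  Hello World ") == "hello-world"
    assert legacy_slugify("a  b") == "a--b"
```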

Are the mocks meaningful?

  • Do the mocks return realistic data?
  • Would a real dependency behave the same way as the mock in edge cases?
  • Is the test actually testing the mock's behavior rather than the code's? (This failure mode is sketched after this list.)
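
A short sketch of a test that exercises only its own mock, using a hypothetical HTTP-style `client`:

```python
from unittest.mock import Mock

def fetch_username(client, user_id):
    # Hypothetical function under test.
    return client.get(f"/users/{user_id}")["name"]

def test_only_exercises_the_mock():
    client = Mock()
    client.get.return_value = {"name": "ada"}
    # The assertion reads back the canned value. If the real API
    # returned {"data": {"name": ...}} or raised on a 404, this
    # test would never notice.
    assert fetch_username(client, 1) == "ada"
```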

Is this the right test level?

  • Should this be a unit test, integration test, or end-to-end test?
  • Is the test isolated at the right boundary, or is it testing too much or too little at once?

What is the failure mode?

  • If the behavior this test covers were to break, would the test catch it?
  • Would the test still pass even if [specific behavior] were removed? (Illustrated after this list.)
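
A sketch of that check, with a hypothetical `normalize` function:

```python
def normalize(email):
    # Hypothetical function under test.
    return email.strip().lower()

def test_normalize_weak():
    # This test still passes if .lower() is removed, because the
    # input is already lowercase. It proves trimming, not lowercasing.
    assert normalize(" ada@example.com ") == "ada@example.com"

def test_normalize_proves_lowercasing():
    # This variant fails the moment .lower() is removed.
    assert normalize(" Ada@Example.COM ") == "ada@example.com"
```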

Output format

Plain-English takeaway:
[Whether a non-technical owner should feel confident, cautious, or blocked]

What the tests prove:
[Concrete list — not "tests happy path" but "proves that X returns Y when Z"]

What the tests do not prove:
[Specific uncovered scenarios]

Characterization coverage:
[If legacy or unfamiliar code changed: what current behavior was pinned before the change; otherwise "missing" or "not applicable"]

Security coverage:
[What sign-in, permission, input, secret, privacy, or customer-boundary behavior is proven, missing, or not applicable]

Riskiest uncovered case:
[The one missing test most likely to correspond to a real bug]

Mock quality:
[Are mocks realistic? What assumptions do they embed?]

Evidence to add next:
[The smallest test or check that would reduce the biggest risk]

Question for you:
[A specific question about what the developer was trying to prove]

Rules

  • Do not count test files or coverage percentages. Read what the tests actually assert.
  • Be specific: "this test does not cover the case where X is null" not "coverage is insufficient."
  • Ask the developer to explain what each test is for before critiquing it.
  • Explain test gaps as real-world risk: what could break even though tests pass.
  • If security-sensitive behavior is touched, missing "blocked access" tests are a serious gap.
  • Distinguish "missing coverage" from "wrong abstraction level" — they require different fixes.
  • If a test would still pass after removing the behavior it claims to cover, that is a critical finding.
  • Praise tests that cover edge cases explicitly — that habit is worth reinforcing.
  • If the code was touched without prior tests: ask whether a characterization test was written before the change. If not, that is the first gap to address — not the missing edge cases.
  • Do not treat a characterization test as proof the behavior is correct. It proves the behavior stayed stable.