Test-Guided Bug Detector
When tests fail, the failure set itself is a signal. One failure tells you where to look; the pattern across many failures tells you what kind of thing broke.
Step 1 — Triage by failure shape
| Failure pattern | Most likely cause | First move |
|---|---|---|
| One test fails | Localized bug in the code that test covers | Read the assertion; → bug-localization |
| Many tests fail with the same error | Shared dependency broke (fixture, helper, import) | Find the shared thing — not the individual tests |
| Many tests fail with different errors | Environment/infra (DB down, fixture not loading) | Check setup/teardown logs, not test bodies |
| All tests in one file fail | Module-level import/fixture in that file | Check the file's top-level, not the tests |
| Tests fail only in CI, not locally | Env difference: version, path, timezone, locale, parallelism | Diff CI env vs local env, not the code |
| Tests fail only when run together | Test pollution — one test mutates shared state | Bisect the test order; find the polluter |
| Same tests intermittently fail | Flake — timing, network, randomness | Do NOT chase the code — stabilize the test |
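For the "fail only when run together" row, bisecting the test order can be mechanized. A minimal sketch, assuming pytest, a known victim test (one that fails only in the full run but passes alone), and the original execution order of the tests that ran before it; the node ids below are placeholders, not real paths:

```python
import subprocess

def run_passes(test_ids):
    """Run exactly these tests, in this order; True if the whole selection passes."""
    return subprocess.run(["pytest", "-x", *test_ids]).returncode == 0

def find_polluter(prior_tests, victim):
    """Binary-search the tests that ran before `victim` for the one that breaks it.

    Assumes a single polluter whose side effect persists for the rest of the
    session, and that `victim` passes when run on its own.
    """
    candidates = list(prior_tests)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the victim still fails after only the first half, the polluter is in it;
        # otherwise keep looking in the second half.
        candidates = half if not run_passes(half + [victim]) else candidates[len(half):]
    if candidates and not run_passes(candidates + [victim]):
        return candidates[0]
    return None

# Hypothetical node ids, listed in their original run order:
prior = [
    "tests/test_users.py::test_create",
    "tests/test_users.py::test_delete",
    "tests/test_cache.py::test_warm",
]
print(find_polluter(prior, victim="tests/test_billing.py::test_invoice"))
```

Each iteration halves the suspect list, so even a few hundred prior tests need only a handful of pytest runs.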
Step 2 — Spectrum-based fault localization (when many tests fail)
The classic move: code executed by failing tests but not by passing tests is suspicious.
- Run with coverage. Record which lines each test hits.
- For each line, compute suspiciousness (Ochiai): fail_hits / sqrt(total_fails × (fail_hits + pass_hits))
- Sort descending. The top lines are where the fault probably lives.
This is mechanical but surprisingly effective. You need ≥3 failing and ≥3 passing tests for the signal to separate from noise.
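A minimal sketch of that computation, assuming per-test coverage is already available as a mapping from test name to the set of (file, line) pairs it executed (e.g. exported from coverage.py dynamic contexts); the data shapes here are illustrative, not a fixed API:

```python
import math
from collections import defaultdict

def ochiai_ranking(coverage, failing):
    """Rank executed lines by Ochiai suspiciousness.

    coverage: dict of test name -> set of (file, line) tuples that test executed
    failing:  set of test names that failed
    """
    fail_hits = defaultdict(int)  # line -> how many failing tests hit it
    pass_hits = defaultdict(int)  # line -> how many passing tests hit it
    total_fails = len(failing)

    for test, lines in coverage.items():
        bucket = fail_hits if test in failing else pass_hits
        for line in lines:
            bucket[line] += 1

    def suspiciousness(line):
        f, p = fail_hits[line], pass_hits[line]
        denom = math.sqrt(total_fails * (f + p))
        return f / denom if denom else 0.0

    # Only lines hit by at least one failing test are candidates; most suspicious first.
    return sorted(fail_hits, key=suspiciousness, reverse=True)

# Toy usage (real suites need several failing and passing tests for a clean signal):
coverage = {
    "test_a": {("billing.py", 10), ("billing.py", 20)},  # passes
    "test_b": {("billing.py", 10)},                       # passes
    "test_c": {("billing.py", 10), ("billing.py", 42)},  # fails
}
print(ochiai_ranking(coverage, failing={"test_c"}))
# -> [('billing.py', 42), ('billing.py', 10)]  (line 42 is hit only by the failing test)
```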
Step 3 — Cluster failures
Before debugging, group failures that share a root cause. Debugging 20 failures that are secretly 1 bug is 19× wasted effort.
Cluster by, in order:
- Identical exception type + message → almost certainly same bug
- Identical top-of-stack frame → very likely same bug
- Same file-under-test → likely
- Same fixture used → possible (if the fixture is the bug)
Pick the largest cluster. Fix it. Re-run. Repeat.
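A minimal sketch of the first two clustering keys, assuming failures have already been collected as small records (test name, exception type, message, top stack frame); how you extract those from your runner's report is up to you:

```python
from collections import defaultdict

def cluster_failures(failures):
    """Group failures that probably share a root cause.

    failures: iterable of dicts with keys 'test', 'exc_type', 'message', 'top_frame'
              (top_frame like 'conftest.py:31'). Returns clusters, largest first.
    """
    clusters = defaultdict(list)
    for f in failures:
        # Strongest key first: identical exception type + message.
        # Fall back to the top-of-stack frame when the message is empty or noisy.
        key = (f["exc_type"], f["message"]) if f["message"] else ("frame", f["top_frame"])
        clusters[key].append(f["test"])
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)

# Hypothetical failures pulled from a test report:
failures = [
    {"test": "test_login",   "exc_type": "KeyError",       "message": "'tenant_id'", "top_frame": "conftest.py:31"},
    {"test": "test_logout",  "exc_type": "KeyError",       "message": "'tenant_id'", "top_frame": "conftest.py:31"},
    {"test": "test_invoice", "exc_type": "AssertionError", "message": "",            "top_frame": "billing.py:87"},
]
for key, tests in cluster_failures(failures):
    print(f"{len(tests)} failure(s) under {key}: {tests}")
```

Sorting largest-first means the first cluster you look at is the one whose fix clears the most red from the board.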
Worked example
Input: 47 tests failing after a merge.
Triage:
- 41 fail with KeyError: 'tenant_id' → same error → one cluster
- 5 fail with various assertion mismatches in test_billing.py → file-local → one cluster
- 1 fails with ConnectionRefused → infra → ignore for now
Cluster 1 (41 tests): All 41 use the @with_authenticated_user fixture. Fixture source: it creates a User dict. Grep the diff: tenant_id was added as a required field in User.__init__ but the fixture wasn't updated.
Root cause: One line in conftest.py. 41 failures → 1 bug.
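A hypothetical reconstruction of what that one-line fix might look like, using the names from the example above (the real conftest.py is not shown in this skill, so treat this purely as illustration):

```python
# conftest.py (hypothetical reconstruction)
import pytest

@pytest.fixture
def with_authenticated_user():
    # The merge made tenant_id a required field; the fixture never set it,
    # which produced the 41 identical KeyErrors.
    return {
        "id": 1,
        "email": "user@example.com",
        "tenant_id": "acme-corp",  # <- the one-line fix
    }
```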
Cluster 2 (5 tests): After fixing cluster 1, re-run. 3 of the 5 now pass (they were also blocked by the fixture). 2 remain. Both assert on a dollar amount that's off by exactly the tax rate. The merge also changed tax calculation.
47 → 2 root causes.
Edge cases
- Every single test fails: Your test runner is broken, not your code. Import error at conftest.py / setup.py level, or the test DB didn't come up.
- The failing assertion is True == True or a similar tautology: The test itself is broken — pytest collected an accidentally-named non-test function, or someone committed an assert True  # TODO placeholder.
- New tests fail, old tests pass: The new tests might be wrong. Don't assume the product code is at fault until you've read the new test.
Do not
- Do not debug failures one at a time without clustering first. You will fix the same bug five times.
- Do not assume the test is right. The test is a claim; verify the claim against spec before chasing the code.
- Do not re-run a flaky test until it passes and call it fixed. Mark it, quarantine it, move on — but don't lie to yourself (a marker-based sketch follows this list).
- Do not skip spectrum analysis because it sounds fancy. It's three lines of script and it cuts search time in half.
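One way to mark and quarantine without hiding the flake, sketched with a plain custom pytest marker (the marker name quarantine and the test names are assumptions, not conventions from this skill):

```python
# conftest.py -- register the marker so pytest doesn't warn about an unknown mark.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "quarantine: known-flaky test, excluded from the gating run"
    )

# test_sync.py
import pytest

@pytest.mark.quarantine  # flaky: races a background sync thread -- do not chase yet
def test_sync_completes_quickly():
    ...
```

Gate on pytest -m "not quarantine" and run pytest -m quarantine in a separate, non-blocking job, so the flaky tests stay visible instead of quietly disappearing.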
Output format
## Clusters
1. <N> failures — <shared root: exception/fixture/file>
Suspected fault: <file:line> (<how you narrowed it>)
2. ...
## Recommended order
Fix cluster <N> first (<reason: biggest / blocks others / fastest>)
## Quarantine
- <test name>: flaky, <mechanism> — do not chase