fix-tests
Who you are: If `.helpmetest/SOUL.md` exists, read it — it defines your character.
No MCP? Use the `helpmetest <command>` CLI instead. See README for CLI reference.
🔴 YOU WRITE THE TEST FIRST.
Changed code → run the tests. New feature → write the test before the code. The test is the spec. The test is done when it's green. No test = not done.
Narrate Your Actions
Never silently create a test or artifact, or run a test. Always tell the user:
- Before: what you are about to do and why (what scenario it covers, what risk it guards against)
- After: what happened — result, what the artifact contains, why a test failed
- Next: what you will do next and what decision point is coming
Silence means the user has no idea what you did or why.
Fix Tests
One skill for everything wrong with your test suite. Reads the situation, picks the right mode.
Prerequisites — Always Do This First
```
helpmetest_status()
helpmetest_search_artifacts({ query: "" })
helpmetest_search_artifacts({ query: "Memory" })
how_to({ type: "context_discovery" })
how_to({ type: "interactive_debugging" })
how_to({ type: "debugging_self_healing" })
```
Check git state:
```
git log --oneline -10
git diff --stat HEAD
```
Read the Situation → Pick the Mode
After orient, classify:
| Signal | Mode |
|---|---|
| "Something broke" / "it stopped working" / vague signal | Triage first (see below) |
| One specific test named by user, or one test failing | Debug |
| Multiple tests failing after a deploy or UI change | Heal |
| Tests passing but code changed — drift suspected | Sync |
| "Is this test any good?" / reviewing test quality | Validate |
| Mixed (failures + drift + quality issues) | All modes, in order |
Triage (when you don't know what's wrong)
Gather fast, diagnose specifically, then switch to the right mode.
Collect everything in parallel:
```
helpmetest_status()      // failing tests, health checks
git log --oneline -10    // recent commits
git diff --stat HEAD     // uncommitted changes
```
Map what you find to a root cause:
- Test issue — test fails but feature works. Selector changed, timing off, stale after refactor → Debug or Heal mode
- App bug — feature itself is broken. 500 errors, missing data, broken flow → document in Feature.bugs[], tell user
- Regression — worked before a specific commit. Identify the commit, scope blast radius → Debug mode + recommend rollback or hotfix
- Environment — auth state expired, proxy down, env var missing → fix setup, re-run auth test
- Coverage gap — "it's broken" but no test exists → create a Feature artifact, run `/tdd`
State the diagnosis once before acting: "Based on [evidence], the problem is [specific cause]. The fix is [action]." Then switch to the right mode.
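To make the mapping concrete, here is a minimal TypeScript sketch of the evidence-to-mode decision; the `Evidence` shape and its field names are hypothetical, not part of the helpmetest API:

```typescript
// Hypothetical evidence shape gathered during triage — illustrative only.
interface Evidence {
  failingTests: string[];        // from helpmetest_status()
  featureWorksManually: boolean; // verified interactively in the browser
  brokeAfterCommit?: string;     // commit hash, from git log correlation
  envHealthy: boolean;           // auth state, proxy, env vars all fine
  testExists: boolean;           // is there any coverage at all?
}

// Mirrors the root-cause table above: most specific signal wins.
function pickMode(e: Evidence): string {
  if (!e.testExists) return "Coverage gap → create Feature artifact, run /tdd";
  if (!e.envHealthy) return "Environment → fix setup, re-run auth test";
  if (e.brokeAfterCommit) return "Regression → Debug mode + recommend rollback or hotfix";
  if (!e.featureWorksManually) return "App bug → document in Feature.bugs[]";
  return e.failingTests.length > 1 ? "Heal mode" : "Debug mode";
}
```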
Mode: Debug — One Test, Root Cause
Golden Rule: Always reproduce interactively before fixing. Never guess.
Tasks Artifact
Create before starting:
```json
{
  "type": "Tasks",
  "name": "Tasks: Debug [test name]",
  "content": {
    "overview": "Debug failing test [test-id]. Root cause → fix or document bug.",
    "tasks": [
      { "id": "1.0", "title": "Understand the failure", "status": "pending", "priority": "critical" },
      { "id": "2.0", "title": "Reproduce interactively", "status": "pending", "priority": "critical" },
      { "id": "3.0", "title": "Determine root cause", "status": "pending", "priority": "critical" },
      { "id": "4.0", "title": "Fix test OR document bug", "status": "pending", "priority": "critical" }
    ]
  }
}
```
Phase 1: Understand
- `helpmetest_open_test` + `helpmetest_status({ id, testRunLimit: 10 })`
- Read the error. Classify: selector? timing? assertion? state? API?
- Check recent git changes — map changed files to likely failure causes
- Load the Feature artifact the test belongs to
Phase 2: Reproduce Interactively
Run steps one at a time via `helpmetest_run_interactive_command` (a sketch follows the list below):
```
As <auth_state>
Go To <url>
# → observe after each step
```
Stop at the failing step. Investigate based on error type:
- Element not found: Try alternate selectors — is element gone (bug) or selector changed (test issue)?
- Not interactable: Check visibility, scroll, multiple matches, disabled state
- Assertion failed: What's actually displayed? Behavior changed intentionally?
- Timeout: App slow or broken?
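A minimal TypeScript sketch of this step-by-step loop; the tool's argument shape is assumed here, and the steps and selector are illustrative:

```typescript
// Assumed signature — the real MCP tool may accept a different shape.
declare function helpmetest_run_interactive_command(
  args: { command: string }
): Promise<unknown>;

// Hypothetical reproduction of a login failure, one observable step at a time.
async function reproduce(): Promise<void> {
  const steps = [
    "As admin_user",
    "Go To https://app.example.com/login",
    "Click [data-testid='submit-btn']", // suppose the recorded run fails here
  ];
  for (const step of steps) {
    const result = await helpmetest_run_interactive_command({ command: step });
    console.log(step, "→", result); // observe after each step; stop at the first failure
  }
}
```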
Phase 3: Root Cause
- Selector changed → fix selector
- Timing → add wait (see the sketch after this list)
- State/auth → verify auth state restoration
- API error → document bug
- Test isolation (alternating PASS/FAIL, shared state) → make idempotent
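For the timing case, the fix is usually an explicit wait before the assertion, not a fixed sleep. A hypothetical before/after, written as step strings (selectors illustrative):

```typescript
// Before: races the async load, so the test fails intermittently.
const flaky = [
  "Go To /search",
  "Get Text [data-testid='results']",
];

// After: waits for the element, then asserts against loaded content.
const fixed = [
  "Go To /search",
  "Wait For [data-testid='results']",
  "Get Text [data-testid='results']",
];
```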
Phase 4A: Fix Test
- Validate fix interactively first — run the complete corrected flow
- Update via `helpmetest_upsert_test`
- Run via `helpmetest_run_test` to confirm
- Update Feature artifact
Phase 4B: Document Bug
Add to Feature.bugs[]:
```json
{
  "name": "Brief description",
  "given": "Precondition",
  "when": "Action taken",
  "then": "Expected outcome",
  "actual": "What actually happens",
  "severity": "blocker|critical|major|minor",
  "url": "http://example.com/page",
  "tags": []
}
```
Update Feature.status → "broken" or "partial".
Mode: Heal — Bulk Failures After Deploy
Don't fix blindly — classify first, then fix fast.
Tasks Artifact
```json
{
  "type": "Tasks",
  "name": "Tasks: Heal Session [date]",
  "content": {
    "overview": "Healing [N] failing tests.",
    "tasks": [
      { "id": "1.0", "title": "[test-id]: [test name]", "status": "pending", "priority": "critical",
        "notes": "[error summary from last run]" }
    ],
    "notes": ["SelfHealing artifact: self-healing-log"]
  }
}
```
Startup: Fix All Existing Failures
- Get all failing tests from `helpmetest_status`
- For each failing test:
  - Classify failure type
  - Fixable (selector change, timing, form structure): investigate → fix → verify → document in SelfHealing artifact
  - Not fixable (auth broken, 500 errors, missing pages): document as bug in Feature artifact
- After processing all failures, enter monitoring mode
Fixable vs Not:
- Fixable: selector changed, timing issue, form added/removed, button moved, test isolation
- Not fixable: auth broken, server errors, missing features, API endpoints removed
Monitoring Mode
```
listen_to_events({ type: "test_run_completed" })
```
When a test fails: classify → fix if fixable → document if not → resume listening.
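Conceptually the loop looks like this — a sketch assuming `listen_to_events` can be consumed as an async stream; the event shape and the helper functions are hypothetical:

```typescript
// Hypothetical event shape and helpers — the real API may differ.
interface TestRunEvent { testId: string; passed: boolean; error?: string }

declare function listen_to_events(args: { type: string }): AsyncIterable<TestRunEvent>;
declare function classify(e: TestRunEvent): "fixable" | "not_fixable";
declare function fixAndVerify(testId: string): Promise<void>; // fix → re-run → log in SelfHealing
declare function documentBug(e: TestRunEvent): Promise<void>; // record in Feature.bugs[]

async function monitor(): Promise<void> {
  for await (const event of listen_to_events({ type: "test_run_completed" })) {
    if (event.passed) continue; // only failures need action
    if (classify(event) === "fixable") {
      await fixAndVerify(event.testId);
    } else {
      await documentBug(event);
    }
    // loop continues → resume listening
  }
}
```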
SelfHealing Artifact
```json
{
  "type": "SelfHealing",
  "id": "self-healing-log",
  "name": "SelfHealing: Test Maintenance Log",
  "content": {
    "fixed": [
      { "test_id": "test-login", "pattern_detected": "selector_change",
        "fix_applied": "Updated selector to [data-testid='submit-btn']",
        "verification_result": "Test passed on re-run", "timestamp": "..." }
    ],
    "not_fixed": [
      { "test_id": "test-checkout", "issue_type": "server_error",
        "error_message": "500 on POST /api/checkout",
        "why_not_fixable": "Application bug, not a test issue",
        "recommendation": "Investigate checkout API endpoint" }
    ],
    "summary": { "total_processed": 5, "fixed": 3, "not_fixable": 2, "last_run": "..." }
  }
}
```
Mode: Sync — Drift Audit After Refactor
Tests may be passing but wrong — stale assertions, removed features, changed behavior.
Discrepancy Types
Failure-based:
1. Code Broke It — test was passing, code change caused regression → fix code
2. Test Is Stale — code intentionally changed, test hasn't caught up → fix test
3. Not Deployed — fix in local code, not shipped yet → tag pending-deploy
4. Removed Feature — test exercises what no longer exists → delete test

Passing but suspicious:

5. False Positive — passes but assertions too weak to verify anything
6. Flaky — passes sometimes, fails sometimes with no code change
7. Duplicate Coverage — two tests cover the exact same scenario
Coverage gaps:
8. Missing Test — feature exists, no test coverage
9. Scenario Gap — Feature artifact has scenario but test_ids is empty
10. Scenario Drift — tests and code agree but Feature artifact documents old behavior
11. Selector / Schema Drift — test's selectors or API shape no longer matches code
Workflow
- Run all tests: `helpmetest_status` → get IDs → run each
- For each test and each Feature artifact, check for the discrepancy types above
- Record: type, test, Feature, what the test expects vs what the code does, git evidence (a record sketch follows this list)
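A uniform record shape keeps the report easy to assemble — a hypothetical TypeScript sketch, not a prescribed helpmetest schema:

```typescript
// Hypothetical record for one sync finding — field names are illustrative.
interface Discrepancy {
  type:
    | "code_broke_it" | "test_is_stale" | "not_deployed" | "removed_feature"
    | "false_positive" | "flaky" | "duplicate_coverage"
    | "missing_test" | "scenario_gap" | "scenario_drift" | "selector_drift";
  testId?: string;     // absent for coverage gaps
  feature: string;     // Feature artifact the finding belongs to
  testExpects: string; // what the test asserts
  codeDoes: string;    // what the code actually does now
  gitEvidence: string; // commit hash or diff hunk that explains the drift
}
```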
Sync Report (present before resolving)
```
🔄 Sync Report · <project> · <date>
<N> failing · <N> flaky · <N> gaps · <N> passing

💥 Failures
Code Broke It · <N> tests
  <test name>
  issue: <one line>

🕳 Gaps
Missing Test · <N>
  <feature name> — <what it does>
```
Wait for user to confirm, then resolve one by one.
Resolution Options (per discrepancy)
```
#3 of 12 · TEST IS STALE
📋 <test name>
   expects <what test asserts>
   code now <what code does> · <file> · <commit>

1 · Fix the test [code leads]
2 · Fix the code [test leads]
3 · Skip
4 · Delete test
5 · Document bug
6 · Not deployed
```
If user says "fix all selector drifts" — apply across the category without asking per item.
Mode: Validate — Test Quality Review
The core question: would this test fail if the feature broke? If not → reject.
The Business Value Test (MOST IMPORTANT)
- "What business capability does this test verify?"
- "If this test passes but the feature is broken, is that possible?"
If answer to #2 is YES → IMMEDIATE REJECTION
Anti-Patterns (Auto-Reject)
- Only navigation + element counting
- Click + wait for element that was already visible
- Form field presence check without filling + submitting
- Page load + title check only
- UI element visible without verifying it works
Minimum Quality Requirements
- ≥ 5 meaningful steps
- ≥ 2 assertions (Get Text, Should Be, Wait For)
- Verifies state change (before/after OR API response OR persistence)
- Has `[Documentation]` with a `PROTECTS:` line naming the specific user complaint
- Uses stable selectors
- Tags: `priority:?` and `feature:?` required (a concrete example follows this list)
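As an illustration of a test that clears this bar — the steps, selectors, and assertion spelling are hypothetical, written in the step style used earlier:

```typescript
// 7 meaningful steps, 2 assertions, verifies persistence across a reload.
const profileRenameTest = [
  "As registered_user",
  "Go To /profile",
  "Fill Text [data-testid='display-name'] Ada",
  "Click [data-testid='save-btn']",
  "Wait For [data-testid='toast-saved']",         // assertion 1: state change confirmed
  "Go To /profile",                               // reload to check persistence
  "Get Text [data-testid='display-name'] == Ada", // assertion 2: value survived the reload
];
```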
Mutation Resistance Check
Mentally introduce a realistic bug (e.g. "save button onClick removed") and ask: does this test catch it? If not → score 7+.
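For instance, suppose the save button's onClick handler is removed — a hypothetical pair of tests, one that survives the mutation and one that catches it:

```typescript
// Survives the mutation: the button still renders, so this stays green → score 7+.
const presenceOnly = [
  "Go To /profile",
  "Wait For [data-testid='save-btn']",
];

// Catches the mutation: nothing saves, the confirmation never appears → test fails.
const behavioral = [
  "Go To /profile",
  "Fill Text [data-testid='display-name'] Ada",
  "Click [data-testid='save-btn']",
  "Wait For [data-testid='toast-saved']",
];
```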
Bullshit Score (1–10)
| Score | Meaning |
|---|---|
| 1–3 | Solid — behavioral assertions, mutation-resistant |
| 4–6 | Mediocre — some value but weak |
| 7–9 | Mostly bullshit — navigation only, no real behavior |
| 10 | Pure bullshit — single Go To, Sleep with no assertion |
Score ≤ 4 → PASS. Score ≥ 5 → REJECT
Output: Single Test
```
[score]/10 — ✅ PASS / ❌ REJECT
Test ID: [id]
Reason: [one sentence]
[What to fix if rejected]
```
Output: Batch
Table grouped by tier (Solid / Mediocre / Bullshit), then action menu:
```
Reply with numbers to act:
1. Delete [N] score-10 tests
2. Fix [N] misleading test names
3. Fix [N] vacuous assertions
4. Rewrite [N] mediocre tests
5. Investigate [N] failing tests
all — do everything
```
When user replies: execute without asking further. Delete score-10 tests immediately. For rewrites, show diff then call helpmetest_upsert_test.
Key Principles
- Reproduce before fixing — never guess, always verify interactively
- Code may not be deployed — check `git diff HEAD` before calling something broken
- Tests and code are both sources of truth — neither wins automatically
- Don't weaken assertions to make tests pass — fix the root cause
- All findings go into Feature artifacts — a bug mentioned only in chat doesn't exist
- Update Feature.status after any change: "working" | "broken" | "partial"
Version: 0.1