qa-test
QA Test
Verify implemented features in a real browser. Exercise each acceptance criterion, verify via snapshots, report results.
Context-efficient design: browser testing runs in a sub-agent so snapshot/interaction data stays out of the main thread. Main thread only sees compact pass/fail summaries.
Process
- Pre-flight (sub-agent) — gather criteria, resolve URL, check environment
- Interactive setup — human steers browser for hard-to-automate steps (login, drag, etc.)
- Browser testing (sub-agent) — exercises all criteria in isolated context
- Report results — main thread receives compact summary only
- Handle failures — retry failed criteria after manual intervention if needed
1. Pre-flight Sub-agent
Launch an Explore sub-agent before any browser interaction to gather all context in parallel.
Sub-agent prompt:
Gather QA pre-flight context for testing. Return a structured JSON block with:
1. **acceptance_criteria**: List of testable criteria. Check these sources in order,
stop at the first that has criteria:
- The user's prompt (if criteria were given explicitly)
- Current PR description: run `gh pr view --json body` via Bash
- Current branch diff: run `git diff main...HEAD --stat` then read changed files
to infer what user-visible behavior changed
- Linked issue: check PR body for issue references, fetch with `gh issue view`
2. **test_url**: Where to test. Check in order:
- `.tap/tap-audit.md` Environments section
- `package.json` scripts for `dev`, `start`, or similar
- Common defaults: localhost:3000, :5173, :4321, :6886
3. **app_running**: Try to fetch the test_url via `curl -s -o /dev/null -w '%{http_code}'`.
Return the status code. If not running, return the dev command that would start it.
4. **test_pages**: List of specific page URLs/routes to visit based on the changed files
(e.g., if `modules/campaigns/` changed, the test page is likely `/campaigns/...`)
5. **db_available**: Check if postgres MCP tools are available (search for
`mcp__postgres__execute_sql` or similar). Return true/false.
6. **has_async_flows**: Based on the changed files, flag whether the feature involves
background jobs (Temporal workflows, queues, webhooks) that need async verification.
7. **needs_login**: Whether the app requires authentication. Check for login pages,
auth middleware, or session requirements in the codebase.
Return results as a structured summary, not raw tool output.
Using the pre-flight results:
- If `app_running` is not 200, start the dev server (background) and wait for it
- Use `acceptance_criteria` as the test plan
- Use `test_pages` to know where to navigate first
- If `db_available`, include database verification steps
- If `has_async_flows`, use the async testing pattern
- If `needs_login`, prompt user for interactive setup before launching browser sub-agent
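Consuming the pre-flight results might look like this minimal sketch. Field names follow the pre-flight prompt; the JSON values themselves are hypothetical:

```python
import json

# Hypothetical pre-flight output; field names match the sub-agent prompt.
preflight = json.loads("""{
  "acceptance_criteria": ["Campaign form saves and shows a success toast"],
  "test_url": "http://localhost:3000",
  "app_running": 404,
  "test_pages": ["/campaigns/new"],
  "db_available": false,
  "has_async_flows": false,
  "needs_login": true
}""")

def plan_next_steps(p):
    """Translate pre-flight fields into the setup steps described above."""
    steps = []
    if p["app_running"] != 200:
        steps.append("start dev server in background and wait for 200")
    if p["needs_login"]:
        steps.append("interactive login setup in main thread")
    steps.append("launch browser testing sub-agent")
    return steps

print(plan_next_steps(preflight))
```

With `app_running` at 404 and `needs_login` true, the plan is: start the server, do interactive login, then launch the browser sub-agent.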
Fallback (no sub-agent): If sub-agents are unavailable, gather criteria and resolve URL sequentially.
Gather acceptance criteria from (in priority order):
- Explicit criteria provided in the prompt
- Current ticket/issue (if referenced)
- PR description
- `.tap/tap-audit.md` for environment context
If no criteria found, ask in human mode. In agent mode, infer from the diff.
Resolve test URL (in priority order):
- URL provided in the prompt
- `.tap/tap-audit.md` → Environments section
- `package.json` scripts → `dev`, `start`, or similar
- Common defaults: `http://localhost:3000`, `http://localhost:5173`, `http://localhost:4321`
Verify the app is running before proceeding.
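The fallback resolution order reduces to a first-match-wins chain. In this sketch the candidate sources are passed as hypothetical arguments; a real implementation would read them from the prompt, `.tap/tap-audit.md`, and `package.json`:

```python
def resolve_test_url(prompt_url=None, audit_url=None, package_json_url=None):
    """Return the first available URL in priority order, else a common default."""
    defaults = ["http://localhost:3000", "http://localhost:5173", "http://localhost:4321"]
    for candidate in (prompt_url, audit_url, package_json_url):
        if candidate:
            return candidate
    # No source resolved; fall back to the first common default.
    return defaults[0]

print(resolve_test_url(audit_url="http://localhost:6886"))  # → http://localhost:6886
```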
2. Interactive Setup (Main Thread)
Before launching the browser testing sub-agent, handle anything that's hard to automate in the main thread. The browser state persists since the sub-agent connects to the same Chrome instance.
When to prompt for interactive setup:
- `needs_login` is true → ask the user: "App requires login. Want me to navigate to the login page so you can sign in, or should I attempt automated login?"
- Complex drag-and-drop or gesture-based preconditions
- Multi-factor auth, CAPTCHAs, OAuth popups
What to do:
- Navigate to the relevant page via Chrome MCP
- Tell the user what action is needed
- Wait for user confirmation that setup is complete
- Then launch the browser testing sub-agent
If no interactive setup is needed, skip directly to step 3.
3. Browser Testing (Sub-agent)
Launch a general-purpose sub-agent for all browser interaction. This keeps snapshot/interaction data out of the main thread context.
Sub-agent prompt template:
You are running browser-based QA tests. The browser is already open and may already
be logged in / set up.
Test URL: {test_url}
Acceptance criteria to verify:
{numbered list of criteria}
Additional context:
- DB tools available: {db_available}
- Has async flows: {has_async_flows}
- Test pages: {test_pages}
## How to test
Use Chrome MCP tools (`mcp__chrome-devtools__*`).
**Snapshot-first workflow** — use `take_snapshot` for BOTH finding elements AND
verifying results. Do NOT use `take_screenshot` unless a criterion fails and you
need visual debugging evidence.
**For each criterion:**
1. Navigate to the relevant page
2. `take_snapshot` → get element UIDs and current state
3. Interact via UIDs (`click`, `fill`, `hover`)
4. `take_snapshot` → verify state changed as expected
5. Check `list_console_messages` for errors
6. Check `list_network_requests` for failed requests (4xx, 5xx)
**Important**: UIDs are ephemeral — always take a fresh snapshot before interacting.
**On failure only**: `take_screenshot` and save to `./qa-evidence/` for debugging.
**React/SPA hover interactions:**
Chrome DevTools `hover` only triggers CSS `:hover`, NOT JS `mouseenter`/`mouseover`.
If a UI element only appears via React's `onMouseEnter`:
1. Try `click` directly in the area
2. If that fails, `evaluate_script` to dispatch mouseenter event
3. `take_snapshot` to confirm
**Testing patterns:**
- Form submission: fill → submit → snapshot to verify success + check no errors
- Navigation: click → snapshot to verify new state + check URL
- State changes: trigger action → snapshot to verify → reload → snapshot to verify persistence
- Async: trigger → snapshot for intermediate state → poll snapshots → verify final state
- Error states: trigger invalid input → snapshot to verify error messaging
**Always check:**
- Console errors (JS exceptions)
- Failed network requests (4xx, 5xx)
## Report format
Return ONLY a compact summary in this exact format:
RESULT: [PASS / FAIL / PARTIAL]
CRITERIA:
1. [criterion] — PASS/FAIL — [one-line observation]
2. [criterion] — PASS/FAIL — [one-line observation]
...
ERRORS: [any console errors or failed network requests, or "none"]
FAILURES: [for any failed criterion: what happened, what was expected,
screenshot path if captured]
NEEDS_MANUAL: [any criteria that couldn't be tested due to automation
limitations — e.g., drag-and-drop, complex gestures]
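The async testing pattern in the prompt (trigger → poll snapshots → verify final state) reduces to a bounded polling loop. Here `check` stands in for a snapshot-and-inspect step; the simulated flow is hypothetical:

```python
import time

def poll_until(check, timeout=10.0, interval=0.5):
    """Re-run `check` (e.g. take a snapshot and inspect it) until it passes or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Simulated async flow: the "job" completes on the third snapshot.
snapshots = iter([False, False, True])
print(poll_until(lambda: next(snapshots), timeout=5, interval=0.01))  # → True
```

A bounded deadline matters here: an async flow that never reaches its final state should surface as a FAIL, not hang the sub-agent.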
4. Report Results
The sub-agent returns a compact summary. Present it to the user.
Human mode: Show the summary. If any failures, ask: "Want me to fix this and re-test, or is this expected?"
Agent mode: If all pass, proceed (e.g., open PR). If any fail, attempt fix-and-retest.
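Branching on the summary can be sketched as a small parser. The layout follows the report format above; the sample summary text is hypothetical:

```python
import re

SUMMARY = """RESULT: PARTIAL
CRITERIA:
1. Campaign form saves — PASS — success toast shown
2. Validation errors shown — FAIL — no error message rendered
ERRORS: none
"""

def parse_result(summary):
    """Extract the overall RESULT and the numbers of failed criteria."""
    result = re.search(r"^RESULT:\s*(\w+)", summary, re.M).group(1)
    failed = [int(m.group(1)) for m in re.finditer(r"^(\d+)\..*—\s*FAIL", summary, re.M)]
    return result, failed

print(parse_result(SUMMARY))  # → ('PARTIAL', [2])
```

In agent mode, a non-empty failed list is what triggers the fix-and-retest path in step 5.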
5. Failure Handling
Automation failures (NEEDS_MANUAL):
- The user performs the manual action in the browser (main thread)
- Launch a new sub-agent to verify only the remaining criteria
- The new sub-agent picks up the browser state left by the user
Code failures (FAIL):
Agent mode:
- Fix the code
- Launch new sub-agent to re-test only failed criteria
- Max 2 fix-and-retest cycles
Human mode:
- Present failures
- Ask: "Want me to fix this and re-test, or is this expected behavior?"
Optional: Database Verification
Include in the sub-agent prompt when `db_available` is true. For features that create or modify data:
- Record creation: Verify expected rows exist with correct values
- Relational data: Confirm junction table rows were created
- Status transitions: Confirm async workflows completed
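A hedged sketch of the record-creation check, building the parameterized SQL the sub-agent could pass to the postgres MCP tool. Table and column names here are hypothetical; the real ones come from the feature's schema:

```python
def creation_check(table, expected):
    """Build a count query asserting a row with the expected values exists (illustrative only)."""
    clauses = " AND ".join(f"{col} = %({col})s" for col in expected)
    return f"SELECT count(*) FROM {table} WHERE {clauses};", expected

sql, params = creation_check("campaigns", {"name": "Spring Launch", "status": "active"})
print(sql)
```

Parameterized values (rather than string interpolation) keep the check safe even when the expected values come from user-entered test data.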
Boundaries
- Does NOT write unit tests (that's implement-acceptance-tests)
- Does NOT review code quality (that's CLAUDE.md / code review)
- Does NOT assess blast radius (that's /blast-radius)
- Tests user-visible behavior in the browser, with optional database verification
- Does NOT modify acceptance criteria — tests what was specified