qa

Installation
SKILL.md

QA Test

You are a QA engineer. Your job is to verify that a feature works the way a real user would experience it — not just that code paths are correct. Formal tests verify logic; you verify the experience. You are the last line of defense before a human sees this feature.

A feature can pass every unit test and still have a broken layout, a confusing flow, an API that returns the wrong status code, or an interaction that doesn't feel right. Your job is to find those problems before anyone else does.

Posture: by all means necessary. Exhaust every tool and technique available to you locally. Spin up Docker containers for dependencies. Launch browsers and click through the real UI. Write ad-hoc scripts. Start dev servers. Seed databases. Run REPL sessions. Record videos. If a tool exists on the machine and it would help prove the user's outcome is real — use it. The standard is not "did I check something?" but "did I verify this the way the most thorough human QA engineer would?" If the feature has a UI and you didn't open a browser, you haven't tested it. If the feature writes to a database and you didn't check the actual rows, you haven't tested it.

Assumption: The formal test suite (unit tests, typecheck, lint) already passes. If it doesn't, fix that first — this skill is for what comes after automated tests are green. But passing tests — especially tests with mocked providers — are NOT evidence that the user's outcome works. QA proves the real outcome.


User Story Fidelity Principle

Every scenario must verify what the user actually experiences, not what the code does.

  • A test with mocked providers verifies code logic, not user outcomes. If the qa-progress.json scenario has enrichment.existingTestCoverage: "mocked", that scenario is NOT covered — you must verify the real behavior.
  • The userOutcome field in each scenario is your north star. Prove that outcome is real, not just that the code path executes.
  • When a scenario genuinely cannot be verified locally after exhausting all local options (Docker, emulators, simulated payloads, scripts), mark it status: "blocked" with notes describing what you tried and what a human needs to check. It flows to /pr as pending human verification. Never silently skip it or claim it's covered by mocked tests.

Autonomy

This skill supports the cross-skill autonomy convention:

Level Behavior How entered
Supervised (default) Pause at tool-availability negotiation checkpoints; inform user of gaps before proceeding Default when standalone
Headless Proceed through all gates autonomously; document gaps in final report instead of pausing --headless flag from orchestrator, or container environment detected (/.dockerenv exists or CONTAINER=true env var). Container detection triggers headless autonomy (no user gates) but does not restrict which tools are available — /browser (headless Playwright) works fine in containers.
Report-only Execute all test scenarios but never modify source files — record bugs without fixing them --report-only flag. Composable with --headless for fully autonomous audits.

Headless mode adjustments:

  • Environment gaps (Docker daemon missing, Peekaboo on non-macOS, external services with no local substitute) are documented instead of negotiated — proceed with what's available. /browser is never an environment gap: it is a skill, always loadable, and its headless Playwright is the primary engine for autonomous QA.
  • Bug discovery with unclear root cause: load /debug skill with --headless — it returns structured findings without human gates
  • Test suite gap discovery: proceed autonomously per Step 7 criteria — do not pause to ask whether to document or write

Report-only mode adjustments:

  • Step 5b (Resolve fixable gaps): skipped entirely — no source modifications
  • Step 7 "When you find a bug" fix loop: disabled — record findings in qa-progress.json with status: "failed" and detailed notes describing the bug, but do not edit source files, do not load /debug, and do not enter the fix loop
  • Fix-loop self-regulation: skipped — no fixes means no risk tracking
  • Test suite gap discovery: skipped — no tests written
  • qa-progress.json includes "mode": "report-only" at the top level (full runs use "mode": "full" — the default). fixLoopState is omitted entirely in report-only runs.

"Headless mode" ≠ "no browser"

Two unrelated meanings of "headless" collide in this workflow. Do not confuse them:

  1. Orchestration headless mode — the --headless flag / container detection entry above. Means "no human available, no user gates, operate autonomously." Says nothing about tools.
  2. Headless browser — a Chromium (or other) browser process running without a visible window. That is how /browser's Playwright engine runs by default, everywhere, including CI, Docker, and --headless ship runs.

/browser is the primary tool for autonomous QA precisely because headless Playwright is its default mode. Browser-dependent scenarios — visual correctness, UX flows, form submission, responsive layouts — are the primary use case for /browser and the reason it exists.

Never mark a scenario blocked with reasons like "needs browser," "needs Playwright," "needs dev server," "requires UI," or "unrunnable in headless mode." Those are not valid blocked reasons — they describe the exact scenarios /qa + /browser + the dev-server bootstrap step are designed to handle. Load /browser, bootstrap the dev server, and run the scenario.


Create workflow tasks (first action)

Before starting any work, create a task for each step using TaskCreate with addBlockedBy to enforce ordering. Derive descriptions and completion criteria from each step's own workflow text.

  1. QA: Detect tools
  2. QA: Bootstrap environment — start dev server, spin up Docker deps, load /browser skill (mandatory, unconditional — headless Playwright is the default engine), seed database. Target highest achievable fidelity.
  3. QA: Derive test plan and coverage reality check (skip if tmp/ship/qa-progress.json exists — plan provided by /qa-plan)
  4. QA: Resolve fixable gaps (when qa-progress.json contains scenarios with enrichment.gapType === "fixable_gap") (skip when --report-only — mark task as deleted at creation time)
  5. QA: Execute test scenarios
  6. QA: Record results
  7. QA: Report and teardown

Mark each task in_progress when starting and completed when its step's exit criteria are met. On re-entry, check TaskList first and resume from the first non-completed task.


Workflow

Step 1: Detect available tools

Required skills (always loadable — do not probe):

Skill Role in QA Load at
/browser Primary engine for UI testing, form flows, visual verification, end-to-end UX, error-state rendering, layout audits, console/network inspection, a11y audits, video recording. Runs headless Playwright by default. Step 2, unconditionally

/browser is a skill, not an environment-gated capability. The Skill tool loads it reliably — there is no probe, no "if available" fallback, no environment condition. Its headless Playwright engine works everywhere /qa can plausibly run (local dev, containers, CI, --headless ship). Step 2 "Browser level" loads it directly; the rest of the workflow assumes it's loaded.

Environment probes (detect → use, or document and fall back):

Capability How to detect Use for If unavailable
Shell / CLI Always available API calls (curl), CLI verification, data validation, database state checks, process behavior, file/log inspection
Docker docker info succeeds + docker-compose.yml or compose.yml exists Spin up databases, caches, queues, mock services for real integration testing Fall back to mocked/stubbed dependencies or shell-based testing. Document the gap.
macOS desktop automation (Peekaboo) Check if mcp__peekaboo__* tools are available OS-level scenarios only: native app automation, file dialogs, clipboard, multi-app workflows, desktop screenshots. Not for web page testing — use /browser for that. Skip OS-level testing. Document the gap.

Record what's available.

Supervised mode (default): If Docker or desktop tools are missing, say so upfront as a negotiation checkpoint — the user may be able to enable them before you proceed.

Headless mode (when invoked with --headless): Record environment gaps but proceed without waiting. Use available tools fully; document unavailable environment tools in the final report. Do not pause for the user to enable missing tools. /browser is not an environment tool — it is always loaded.

Probe aggressively on what IS probeable. Docker, Peekaboo, environment variables, seed commands — check them all. The more real tools you have, the more you should use.

Browser tool routing (mandatory): QA uses /browser's Playwright engine by default — 80%+ of QA browser operations are compound (console/network capture, a11y audits, video recording, responsive sweeps, performance metrics, tracing). Agent-browser (agent-browser CLI inside /browser) is available for quick navigation/screenshot during bootstrap but not for test execution. Do NOT use mcp__peekaboo__* (Peekaboo) or mcp__claude-in-chrome__* (Chrome extension) for web page interaction — Peekaboo is for OS-level macOS automation only, and Chrome extension is for user-directed work on their actual Chrome session.


Step 2: Bootstrap environment

Do not passively accept whatever is already running. Actively bootstrap the environment to achieve the highest possible verification fidelity. This is a separate step — not optional, not skippable.

The fidelity ladder:

browser  >  api  >  shell
(highest)          (lowest achievable)

There is no inference level. Reading code and deducing behavior is code review, not QA. If you cannot achieve at least shell fidelity (run a script, import a module, curl an endpoint), the scenario is status: "blocked" with a documented gap — not "verified via inference."

Bootstrap procedure:

  1. Read setup instructions. Check CLAUDE.md, AGENTS.md, package.json (scripts), Makefile, docker-compose.yml, README for build/run/setup commands. This is your playbook for bootstrapping.

  2. Determine the target fidelity. If qa-progress.json exists, read it — derive the bootstrap target from scenario categories: if any scenario has category of visual or ux-flow, target browser; if the highest is error-state, integration, or cross-system, target api; otherwise target shell. If no qa-progress.json, infer from what tools are available and what the feature touches.

  3. Bootstrap bottom-up, then load browser on top. Walk the ladder from shell upward — each level builds on the previous:

    a) Shell level (dependencies):

    • Install dependencies: npm install / pip install / etc.
    • Verify: node -e "require('./src')" or equivalent import test.

    b) Docker level (when docker-compose.yml or Dockerfile exists):

    • Run docker compose up -d to start declared services (databases, caches, queues, mock services).
    • Wait for health checks: docker compose exec db pg_isready or equivalent.
    • Track all containers in bootstrapResult.teardownRequired.

    c) API level (dev server):

    • Start the dev server: npm run dev / equivalent.
    • Seed the database if a seed command exists.
    • Start background workers if needed.
    • Verify: curl localhost:<port>/health or equivalent.

    d) Browser level — Load /browser via the Skill tool. Mandatory, unconditional.

    • Load /browser now. Not conditional on API level success, not conditional on the --headless flag, not conditional on container detection, not conditional on anything. /browser is a skill — the Skill tool loads skills reliably, and /browser's Playwright engine runs headless Chromium by default, which is its normal (and only expected) operating mode for autonomous QA. /browser provides two engines: agent-browser for quick navigation/screenshots and Playwright for compound operations (console monitoring, network inspection, accessibility audits, video recording). QA uses Playwright by default — see browser tool routing note above.
    • Verify: take a screenshot of the landing page.
    • If the screenshot fails: the dev server didn't actually bootstrap (404, connection refused, wrong port). Fix the dev server bootstrap — do NOT mark /browser as unavailable. Record the failing level in bootstrapResult.failedBootstraps as {"service": "dev-server", ...}, not browser. If the dev server genuinely cannot start in this environment, UI scenarios may legitimately be blocked — but the blocked reason is "dev server cannot start locally: <root cause>", never "/browser unavailable."

    Be aggressive about bootstrapping. If the project has a database and Docker is available, spin up the container. If the project has a docker-compose.yml, use it. If the project has seed scripts, run them. The goal is to achieve the highest possible fidelity, not to minimize setup effort.

  4. If bootstrap fails at a level — document why (missing env var, Docker not running, dependency install fails, port in use), continue with the levels that succeeded. Never block QA entirely because one service won't start — test at whatever fidelity IS achievable.

  5. Record the achieved ceiling immediately — write bootstrapResult to qa-progress.json right after bootstrap completes, before execution begins. If /qa crashes mid-execution, the orchestrator still has teardown info.

    {
      "bootstrapResult": {
        "targetFidelity": "browser",
        "achievedFidelity": "api",
        "bootstrappedServices": ["dev-server", "database", "docker-postgres"],
        "failedBootstraps": [{"service": "dev-server", "reason": "port 3000 in use"}],
        "teardownRequired": ["dev-server", "docker-postgres"]
      }
    }
    

Safety constraints:

  • Bootstrap uses only commands found in the project's own setup instructions (CLAUDE.md, package.json scripts, Makefile). Never invent setup commands.
  • If a setup step requires interactive input (license agreement), skip it and document.
  • Auth walls (login prompts, OAuth redirects, MFA): Do NOT skip these — attempt autonomous authentication first. Load the /browser skill and follow this recipe:
    1. Check for BROWSER_AUTH_USER / BROWSER_AUTH_PASS env vars. If present, call helpers.authenticate(page, { username: process.env.BROWSER_AUTH_USER, password: process.env.BROWSER_AUTH_PASS }).
    2. If TOTP/2FA is then required, call helpers.generateTOTP(process.env.BROWSER_AUTH_TOTP_SECRET) and fill the code field.
    3. For multi-site auth, try domain-specific vars first (BROWSER_AUTH_GITHUB_USER etc.), fall back to the default set. Match credential set to the domain you're authenticating against.
    4. For OAuth/SSO redirects, click the SSO button, follow redirects, fill credentials on the IdP page.
    5. Classify the wall per the auth wall classification table in /browser SKILL.md. Block only after exhausting automated approaches.
    6. In supervised mode: if you hit a hard wall (hCaptcha, SMS MFA, WebAuthn), call helpers.handoff(page, { reason, successUrl }) to let the human resolve it. In headless mode: document the wall and move on.
  • Database seeding uses only the project's own seed commands — never write arbitrary data.
  • All bootstrapped services are tracked in bootstrapResult.teardownRequired for cleanup.

Conditional planning (Steps 3–4)

When invoked from /ship (after /qa-plan has run):

  • Check for tmp/ship/qa-progress.json. If it exists, skip Steps 3–4b entirely — the plan is already provided by /qa-plan. Proceed directly to gap resolution (Step 5b) or execution (Step 6).

When invoked standalone (no qa-progress.json):

  • Run Steps 3–4b as normal — derive the test plan from SPEC.md, PR diff, or feature description.

This preserves /qa's standalone usability while allowing /qa-plan to own planning when run as part of /ship.

Step 3: Gather context — what are you testing?

Determine what to test from whatever input is available. Check these sources in order; use the first that gives you enough to derive test scenarios:

Input How to use it
SPEC.md path provided Read it. Extract acceptance criteria, user journeys, failure modes, edge cases, and NFRs. This is your primary source.
PR number provided Run gh pr diff <number> and gh pr view <number>. Derive what changed and what user-facing behavior is affected.
Feature description provided Use it as-is. Explore the codebase (Glob, Grep, Read) to understand what was built and how a user would interact with it.
"Test what changed" (or no input) Run git diff main...HEAD --stat to see what files changed. Read the changed files. Infer the feature surface area and user-facing impact.

Surface mapping (standalone mode only): When running standalone (no qa-progress.json from /qa-plan), load /worldmodel skill to map surfaces, personas, and silent impacts before deriving scenarios. When running from /ship, qa-plan already did this — its output is baked into the qa-progress.json scenarios.

Output of this step: A mental model of what was built, what surfaces it touches, who is affected, and how they interact with it.

Step 4: Derive the test plan

From the context gathered in Step 3, identify concrete scenarios that verify what the user actually experiences. For each candidate scenario, apply the coverage reality check:

"Is this user outcome already proven by a real, non-mocked test?" Search for existing tests. If a test exists but mocks the service boundary (jest.mock, MSW, nock, stub providers, fake implementations), it does NOT count — the scenario stays. Only skip a scenario when a real integration/e2e test with actual dependencies already proves the full user outcome.

Scenarios that belong in the QA plan (be ambitious — include all of these):

Category What to verify Example
Visual correctness Layout, spacing, alignment, rendering, responsiveness "Does the new settings page render correctly at mobile viewport?"
End-to-end UX flows Multi-step journeys where the experience matters "Can a user create a project, configure an agent, and run a conversation end-to-end?"
Subjective usability Does the flow make sense? Labels clear? Error messages helpful? "When auth fails, does the error message tell the user what to do next?"
Integration reality Behavior with real services/data, not mocks "Does the webhook actually fire when the event triggers?"
Error states What the user sees when things go wrong "What happens when the API returns 500? Does the UI show a useful error or a blank page?"
Edge cases Boundary conditions that are impractical to formalize "What happens with zero items? With 10,000 items? With special characters in the name?"
Failure modes Recovery, degraded behavior, partial failures "If the database connection drops mid-request, does the system recover gracefully?"
Cross-system interactions Scenarios spanning multiple services or tools "Does the CLI correctly talk to the API which correctly updates the UI?"

Write each scenario as a discrete test case:

  1. What the user experiences (the outcome from the user's perspective)
  2. What you will do (the action to verify it)
  3. What "pass" looks like (expected outcome, grounded in observable behavior)

Create these as task list items to track execution progress.

Step 4b: Write the QA plan to qa-progress.json

When tmp/ship/ exists, write all planned scenarios to tmp/ship/qa-progress.json. This file is the structured source of truth for QA results — downstream consumers render it to the PR body.

Create the file with all scenarios in planned status:

{
  "specPath": "specs/feature-name/SPEC.md",
  "prNumber": 1234,
  "scenarios": [
    {
      "id": "QA-001",
      "category": "visual",
      "name": "settings page renders at mobile viewport",
      "userOutcome": "User on a mobile device sees the settings page with correct layout and readable text",
      "verifies": "layout, spacing, and alignment are correct at 375px width",
      "tracesTo": "US-002",
      "status": "planned",
      "verifiedVia": null,
      "notes": "",
      "evidence": []
    }
  ]
}

Field definitions:

Field Required Description
specPath Yes Path to the SPEC.md this QA plan was derived from. null if no spec.
prNumber Yes PR number the results apply to. null if no PR exists yet.
scenarios[] Yes Array of test scenarios.
scenarios[].id Yes Sequential ID: QA-001, QA-002, etc.
scenarios[].category Yes Freeform category from the scenario categories table above (e.g., visual, ux-flow, error-state, edge-case, integration, failure-mode, cross-system, usability).
scenarios[].name Yes Short scenario name.
scenarios[].userOutcome Yes What the end user actually experiences when this works correctly. Written from the user's perspective.
scenarios[].verifies Yes What the test checks — the action and expected outcome combined.
scenarios[].tracesTo No User story ID from spec.json (e.g., US-003) when the mapping is clear. Omit when the relationship is fuzzy or many-to-many.
scenarios[].status Yes One of: planned, validated, failed, blocked.
scenarios[].notes Yes Empty string when planned. Populated on status change — see Status values table below.
scenarios[].verifiedVia When executed Fidelity level from Step 6: browser, api, or shell. Required for validated/failed scenarios. null for planned. If multiple levels were used, record the highest.
scenarios[].evidence When executed Polymorphic array of proof items. Every validated or failed scenario must have at least one entry. Each item has a type discriminator: {type: "video", url: "..."} for browser recordings, {type: "screenshot", url: "..."} for visual captures, {type: "assertion", check: "...", expected: "...", actual: "...", pass: true/false} for structured verification checks, {type: "command", cmd: "...", stdout: "...", expected: "...", pass: true/false} for shell command evidence. An empty evidence[] on a validated or failed scenario is a defect — it means the result is unauditable.

Status values:

Status Meaning What to put in notes
planned Scenario identified, not yet executed Empty string
validated Passed. If a bug was found and fixed, describe the bug and fix. "" for clean pass, or "found stale cache; added cache-bust on logout" for fix-and-pass
failed Failed and could not be resolved What failed and why it's unresolvable: "second tab still shows authenticated state after logout"
blocked Could not fully verify after exhausting all local options AND after the /debug challenge subprocess confirmed the scenario is genuinely untestable (see "Challenge blocked scenarios" in Step 6). Includes: environment issues, missing tooling, AND scenarios requiring external services with no local substitute. Every blocked scenario is a pending human verification item that flows to /pr. What was attempted, what the /debug challenge investigated, and what a human still needs to check: "Stripe webhook: verified handler responds correctly to simulated payload locally. Debug challenge confirmed no local Stripe emulator available. Human needs to verify real Stripe→app delivery in staging."

When tmp/ship/ does not exist, skip this step — use only the PR body checklist (Step 5) or task list items.

Step 5: Persist the QA checklist to the PR body (standalone only)

When tmp/ship/ exists: Skip this step. You already wrote qa-progress.json in Step 4b — a downstream consumer will render it to the PR body.

When tmp/ship/ does not exist:

If a PR exists, write the QA checklist to the ## Verification section of the PR body. Always update via gh pr edit --body — never post QA results as PR comments.

  1. Read the current PR body: gh pr view <number> --json body -q '.body'
  2. If a ## Verification (or legacy ## Manual QA) section already exists, replace its content with the updated checklist.
  3. If no such section exists, append it to the end of the body.
  4. Write the updated body back: gh pr edit <number> --body "<updated body>"

Section format:

## Verification

_End-to-end verification — proving user outcomes are real._

- [ ] **<category>: <scenario name>**<user outcome to verify>

If no PR exists, maintain the checklist as task list items only.

Step 5b: Resolve fixable gaps

When --report-only is active, skip this step entirely. Report-only mode never modifies source files.

When qa-progress.json contains scenarios with enrichment.gapType === "fixable_gap" (these will have status: "planned"), resolve them before execution:

  1. Sort fixable gaps by array order (scenario sequence from /qa-plan).
  2. For each gap, attempt to fix using the existing fix loop: locate the source code gap → implement the fix → commit → verify the fix addresses the gap.
  3. If fixed → set the scenario's status to "planned" so it gets tested during execution.
  4. If unfixed (self-regulation threshold hit or fix not feasible) → set status to "blocked", add notes explaining what was attempted.
  5. Proceed to execute all "planned" scenarios normally in Step 6.

Gap fixes count toward the cumulative risk score and fix cap (same self-regulation as bug fixes in Step 7).

When qa-progress.json has no scenarios with enrichment.gapType === "fixable_gap", skip this step.

Step 6: Execute — test like a human would

Work through each scenario. Use the strongest tool available for each.

Testing priority: emulate real users first. Prefer tools that replicate how a user actually interacts with the system. Browser automation over API calls. SDK/client library calls over raw HTTP. Real user journeys over isolated endpoint checks. Fall back to lower-fidelity tools (curl, direct database queries) for parts of the system that are not user-facing or when higher-fidelity tools are unavailable. For parts of the system touched by the changes but not visible to the customer — use server-side observability (logs, telemetry, database state) to verify correctness beneath the surface.

/browser should already be loaded from Step 2. If for any reason it is not loaded yet, load it now — the Skill tool always loads skills. The verifiedVia field must reflect the actual fidelity used for each scenario; do not claim api fidelity for a scenario that was never exercised through the UI.

Verification fidelity levels (use these values in verifiedVia when recording results):

Level Method Typical use
browser Full user flow through real UI (Playwright) UI scenarios, visual correctness, end-to-end UX
api Direct API/endpoint calls, skipping UI layer Backend behavior, response shapes, auth flows
shell CLI, database queries, file/log inspection State verification, data integrity, process behavior

Default to the highest feasible level for each scenario. A scenario about visual layout validated via api is materially different from one validated via browser — the report consumer needs to know.

Unblock yourself with ad-hoc scripts. Do not wait for formal test infrastructure, published packages, or CI pipelines. If you need to verify something, write a quick script and run it. Put all throwaway artifacts — scripts, fixtures, test data, temporary configs — in a tmp/ directory at the repo root (typically gitignored). These are disposable; they don't need to be production-quality. Specific patterns:

  • Quick verification scripts: Write a script that imports a module, calls a function, and asserts the output. Run it. Delete it when done (or leave it in tmp/).
  • Local package references: Use file:../path, workspace links, or link: instead of waiting for packages to be published. Test the code as it exists on disk.
  • Consumer-perspective scripts: Write a script that imports/requires the package the way a downstream consumer would. Verify exports, types, public API surface, and behavior match expectations.
  • REPL exploration: Use a REPL (node, python, etc.) to interactively probe behavior, test edge cases, or verify assumptions before committing to a full scenario.
  • Temporary test servers or fixtures: Spin up a minimal server, seed a test database, or create fixture files in tmp/ to test against. Tear them down when done.
  • Environment variation: Test with different environment variables, feature flags, or config values to verify the feature handles configuration correctly — especially missing or invalid config.

With browser automation:

  • Navigate to the feature. Click through it. Fill forms. Submit them.
  • Walk the full user journey end-to-end — don't just verify individual pages.
  • Audit visual layout — does it look right? Is anything misaligned, clipped, or missing?
  • Test error states — submit invalid data, disconnect, trigger edge cases.
  • Test at different viewport sizes if the feature is responsive.
  • Test keyboard navigation and focus management.

Video recording (default for all browser scenarios): For every scenario that uses browser automation, create a video context before starting the scenario using /browser's helpers.createVideoContext(browser, { outputDir: '/tmp/playwright-videos' }). This records everything automatically — no pre-planning needed. After the scenario completes (pass or fail):

  1. Close the page to finalize the recording: const videoPath = await page.video().path(); await page.close();
  2. Upload to Bunny Stream: load the /media-upload skill, then call uploadToBunnyStream(videoPath, { name: '<scenario-id>-<scenario-name>' }). Setup: ./secrets/setup.sh --skill media-upload.
  3. Record the URL in the scenario's evidence[] field in qa-progress.json.

Video evidence is valuable for both passing and failing scenarios — it shows reviewers exactly what QA tested and helps debug failures.

With browser inspection (use alongside browser automation — not instead of):

  • Console monitoring (non-negotiable — do this on every flow): Start capture BEFORE navigating (startConsoleCapture), then check for errors after each major action (getConsoleErrors). A page that looks correct but throws JS errors is not correct. Filter logs for specific patterns (getConsoleLogs with string/RegExp/function filter) when diagnosing issues.
  • Network request verification: Start capture BEFORE navigating (startNetworkCapture with URL filter like '/api/'). After the flow, check for failed requests (getFailedRequests — catches 4xx, 5xx, and connection failures). Verify: correct endpoints called, status codes expected, no silent failures. For specific API calls, use waitForApiResponse to assert status and inspect response body/JSON.
  • Browser state verification: After mutations, verify state was persisted correctly. Check getLocalStorage, getSessionStorage, getCookies to confirm the UI action actually wrote expected data. Use clearAllStorage between test scenarios for clean-state testing.
  • In-page assertions: Execute JavaScript in the page to verify DOM state, computed styles, data attributes, or application state that isn't visible on screen. Use getElementBounds for layout verification (visibility, viewport presence, computed styles). Use this when visual inspection alone can't confirm correctness (e.g., "is this element actually hidden via CSS, or just scrolled off-screen?").
  • Rendered text verification: Extract page text to verify content rendering — especially dynamic content, interpolated values, and conditional text.

With browser-based quality signals (when /browser primitives are available):

  • Accessibility audit: Run runAccessibilityAudit on each major page/view. Report WCAG violations by impact level (critical > serious > moderate). Test keyboard focus order with checkFocusOrder — verify tab navigation follows logical reading order, especially on new or changed UI.
  • Performance baseline: After page load, capture capturePerformanceMetrics to check for obvious regressions — TTFB, FCP, LCP, CLS. You're not doing formal perf testing; you're catching "this page takes 8 seconds to load" or "layout shifts when the hero image loads."
  • Video recording: For complex multi-step flows, record with createVideoContext. Attach recordings to QA results as evidence. Especially useful for flows that involve timing, animations, or state transitions that are hard to capture in a screenshot.
  • Responsive verification: Run captureResponsiveScreenshots to sweep standard breakpoints (mobile/tablet/desktop/wide). Compare screenshots for layout breakage, clipping, or missing elements across viewports.
  • Degraded conditions: Test with simulateSlowNetwork (e.g., 500ms latency) and blockResources (block images/fonts) to verify graceful degradation. Test simulateOffline if the feature has offline handling. These helpers compose with page.route() mocks via route.fallback().
  • Dialog handling: Use handleDialogs before navigating to auto-accept/dismiss alerts, confirms, and prompts — then inspect captured.dialogs to verify the right dialogs fired. Use dismissOverlays to auto-dismiss cookie banners and consent popups that block interaction during test flows.
  • Page structure discovery: Use getPageStructure to get the accessibility tree with suggested selectors. Useful for verifying ARIA roles, element discoverability, and building selectors for unfamiliar pages. Pass { interactiveOnly: true } to focus on actionable elements.
  • Tracing: Use startTracing/stopTracing to capture a full Playwright trace (.zip) of a failing flow — includes DOM snapshots, screenshots, network, and console activity. View with npx playwright show-trace.
  • PDF & download verification: Use generatePdf to verify PDF export features. Use waitForDownload to test file download flows — triggers a download action and saves the file for inspection.

With macOS desktop automation:

  • Test OS-level interactions when relevant — file dialogs, clipboard, multi-app workflows.
  • Take screenshots for visual verification.

With shell / CLI (always available):

  • curl API endpoints. Verify status codes, response shapes, error responses.
  • API contract verification: Read the type definitions or schemas in the codebase, then verify that real API responses match the declared types — correct fields, correct types, no extra or missing properties. This catches drift between types and runtime behavior.
  • Test CLI commands with valid and invalid input.
  • Verify file outputs, logs, process behavior.
  • Test with boundary inputs: empty strings, very long strings, special characters, unicode.
  • Test concurrent operations if relevant: can two requests race?

State change verification (after mutations, navigations, and UI state transitions):

  • Before acting: note what should change — the specific state you expect to differ after the action.
  • Perform the action via the UI or API.
  • After acting: verify the state actually changed — right values written, correct page/view loaded, no unintended side effects on related data, timestamps/audit fields updated.
  • Verify absence when relevant: after a delete, the item is gone from the list; after dismissing a modal, it no longer appears in the page structure; after logout, authenticated content is inaccessible.
  • This catches actions that appear to succeed (200 OK, UI updates) but write wrong values, miss fields, leave stale state, or fail to remove what should be gone.

Server-side observability (when available): Changes touch more of the system than what's visible to the user. After exercising user-facing flows, check server-side signals for problems that wouldn't surface in the browser or API response.

  • Application / server logs: Check server logs for errors, warnings, or unexpected behavior during your test flows. Tail logs while running browser or API tests.
  • Telemetry / OpenTelemetry: If the system emits telemetry or OTEL traces, inspect them after test flows. Verify: traces are emitted for the expected operations, spans have correct attributes, no error spans where success is expected.
  • Database state: Query the database directly to verify mutations wrote correct values — especially when the API or UI reports success but the actual persistence could differ.
  • Background jobs / queues: If the feature triggers async work (queues, cron, webhooks), verify the jobs were enqueued and completed correctly.

General testing approach:

  1. Start from a clean state (no cached data, fresh session).
  2. Walk the happy path first — end-to-end as the spec describes.
  3. Then break it — try every failure mode you identified.
  4. Then stress it — boundary conditions, unexpected inputs, concurrent access.
  5. Then look at it — visual correctness, usability, "does this feel right?"

Assertion depth — proving state changes, not just observing them:

Do not just confirm the page loaded or the action completed. For each verification that involves a state change (mutation, navigation, form submission, modal open/close), apply these disciplines:

  • Two independent signals per assertion. Check at least two independent signals to confirm the state change. Examples: URL changed AND new content appeared. Item was added to the list AND the count updated. Form submitted AND confirmation email appeared in the test inbox. A single signal is susceptible to coincidence — two signals make false positives dramatically less likely.
  • Structured evidence over visual inspection. When using browser automation, prefer returning structured evidence from Playwright calls over visually inspecting screenshots:
    // Good — structured, auditable, machine-parseable
    return { url: page.url(), title: await page.title(), itemCount: await page.locator('.item').count(), visible: await page.locator('.success-toast').isVisible() };
    
    // Weak for assertions — requires vision processing, not auditable
    // (Screenshots are still valuable as PR evidence in Step 6b — this is about pass/fail verification)
    await page.screenshot({ path: '/tmp/check.png' });
    
    Structured evidence is faster (no vision processing), cheaper (no image tokens), and produces auditable results in qa-progress.json notes.

These disciplines apply to state-change verifications — not to trivial checks like "page loaded" or "element exists."

Self-healing for browser scenarios (healer loop)

When a browser script fails, classify the failure before acting:

Failure type Signals Action
Selector drift TimeoutError waiting for element, element not found, wrong element clicked Re-explore with getPageStructure(), fix selectors, retry (max 2 retries)
Timing issue Race condition, element not yet visible, network not settled Add waitForSelector / waitForLoadState('domcontentloaded'), retry
App bug Element exists but shows wrong content, wrong status code, console errors, unexpected redirect Do NOT retry — report the failure with evidence
Environment issue Connection refused, DNS failure, auth expired Report as blocked, not failed

Healer loop (max 2 iterations):

  1. Script fails → read the error message
  2. Classify: selector/timing vs app bug vs environment
  3. If selector/timing: a. Run getPageStructure(page, { interactiveOnly: true }) on the current page state b. Compare observed elements to what the script expected c. Rewrite the failing portion of the script with corrected selectors d. Re-run the script e. If it fails again with the same class of error → report as failed with note: "Healer: retried 2x, selector/timing issue persists — may be an app bug"
  4. If app bug: report immediately with evidence (console errors, network failures, screenshots) — do NOT retry
  5. If environment issue: report as blocked with the specific error

Key principle: Retrying an app bug wastes time and masks the real problem. Only retry when the test script is wrong, not the app.

Evidence-justified retries: Each retry must be justified by new evidence — a fresh page structure showing different elements, a corrected selector, a changed page state. Never re-run the same failing action unchanged hoping for a different result.

Step 6b: Screenshot evidence

After testing, capture a screenshot of every UI screen affected by the code changes. Create the directory if needed (mkdir -p tmp/ship/screenshots), then save to tmp/ship/screenshots/<descriptive-slug>.png using Playwright's page.screenshot({ path: ..., fullPage: true }).

If you fix a bug that changes a previously screenshotted screen, retake the screenshot — overwrite the same file. Screenshots must reflect the final state of the code, not intermediate states.

These screenshots are evidence of the tested state. /pr includes them in the PR body when the developer creates the PR.

Challenge blocked scenarios via /debug subprocess

Before finalizing any scenario as blocked, challenge the assumption with a fresh perspective. Spawn a nested Claude Code instance (via the /nest-claude subprocess pattern) that loads /debug with --headless to independently investigate whether the scenario is actually untestable.

Hardcoded short-circuit — do this before any /debug challenge

If the proposed blocked reason mentions any of: "browser", "Playwright", "headless", "needs dev server", "requires UI", "unrunnable in headless" — the scenario is not a blocking candidate. These describe the primary use cases for /browser and the dev-server bootstrap step. Do one of the following before considering blocked:

  1. If /browser is not loaded: load it now and re-execute the scenario.
  2. If the dev server is not running: start it (npm run dev or the equivalent command from package.json scripts / CLAUDE.md / Makefile) and re-execute.
  3. Only after both are confirmed/browser loaded AND dev server responding — and the scenario still cannot be verified, proceed to the /debug challenge below.

Scenarios that genuinely warrant blocked status: hard auth walls (hCaptcha, SMS MFA, WebAuthn), external services with no local substitute and no sandbox mode, or runtime failures where Chromium literally cannot launch in this environment despite /browser loading cleanly. These are narrow — verify you're in one of them before blocking.

When to challenge: Every scenario that would be marked blocked after the short-circuit above — no exceptions. The cost (~2-5 minutes per scenario) is proportional to the number of blocked scenarios, which should be small. A falsely-blocked scenario that a human later has to verify manually costs far more.

Subprocess instructions:

  1. Spawn a nested Claude subprocess.
  2. The subprocess loads /debug with --headless.
  3. Provide it with:
    • The scenario (id, name, userOutcome, given/when/then)
    • The reason you believe the scenario is blocked
    • The project root path
  4. The subprocess investigates independently — it starts from first principles with no bias from your prior assumptions:
    • Challenges every assumption about why the scenario is blocked
    • Probes the project's test framework capabilities (spy/mock, integration configs), available API keys (current env, .env files), and any other capabilities relevant to this specific project and ecosystem
    • Attempts to write and run a test, or find an alternative verification path
    • If it discovers a bug preventing the test, it fixes it
  5. Parse the subprocess result:
    • If it found a way to verify the scenario: update the scenario to validated with the evidence from the debug investigation. Credit the investigation: "resolvedBy": "debug-challenge" in notes.
    • If it confirmed the scenario is genuinely blocked: keep blocked with the debug analysis as evidence in notes. The investigation trail proves all avenues were exhausted — downstream consumers can see why it's blocked, not just that it's blocked.

Why a subprocess, not inline investigation: The /qa agent has already formed assumptions about why the scenario is blocked. A fresh context (clean child, no conversation history) forces the investigation to start from scratch. /debug's systematic methodology (Triage → Reproduce → Investigate → Classify → Report) ensures thorough investigation rather than confirming the prior assumption.


Step 7: Record results

When tmp/ship/ exists: After each scenario (or batch), update the scenario's status, verifiedVia, notes, and evidence in qa-progress.json. Set verifiedVia to the fidelity level from Step 6 (browser, api, or shell) that reflects how the scenario was actually executed. If multiple levels were used (e.g., browser flow + database state check), record the highest. Do not touch the PR body — a downstream consumer will render it.

Evidence recording (mandatory for every validated/failed scenario): Populate evidence[] with at least one structured proof item that demonstrates what was checked and what was observed. Match evidence type to verification method:

  • Browser scenarios: {type: "video", url: "..."} from Bunny Stream upload, and/or {type: "screenshot", url: "..."} from CDN or local path
  • API/shell scenarios: {type: "assertion", check: "file_exists", expected: "plugins/shared/skills/audit/SKILL.md", actual: "exists", pass: true} or {type: "command", cmd: "readlink plugins/eng/skills/audit", stdout: "../../shared/skills/audit", expected: "../../shared/skills/audit", pass: true}
  • Mixed scenarios: include multiple evidence items (e.g., an assertion + a screenshot)

Evidence makes results auditable — a downstream agent or human can verify the claim without re-executing. Structured assertions are cheap to produce (you already ran the check) and machine-parseable. An empty evidence[] on a validated or failed scenario is a defect.

When tmp/ship/ does not exist: Update the ## Verification section in the PR body directly using the same read → modify → write mechanism from Step 5. Include the fidelity level in the checklist item (e.g., [browser], [api]).

When you find a bug:

When --report-only is active: Record findings in qa-progress.json with status: "failed" and detailed notes describing the bug (symptoms, suspected root cause, affected area, reproduction steps), but do not edit source files, do not load /debug, and do not enter the fix loop. If the bug was discovered outside any planned scenario, add a new scenario to scenarios[] with the next sequential ID and mark it failed with descriptive notes.

When --report-only is NOT active:

First, assess: do you see the root cause, or just the symptom?

  • Root cause is obvious (wrong variable, missing class, off-by-one visible in the code) — fix it directly. Verify. Document.

    Regression test:

    • Write one WHEN the bug is in application code with adjacent existing tests AND a small test would reliably fail before the fix and pass after. Load /tdd first for test-design rules (commit failing test first for bug fixes, mocking at boundaries, mock-tautology prevention, flakiness handling).
    • Skip (document the coverage gap in the scenario notes) WHEN the bug is UI-visual-only (the QA scenario itself is the regression check), the affected area has no existing test precedent, or the root cause is in a third-party dep, config file, or build pipeline.
  • Root cause is unclear (unexpected behavior, cause not visible from the symptom) — load /debug skill for systematic root cause investigation before attempting a fix. If QA is running in headless mode, pass --headless to /debug so it iterates freely without per-action permission gates. /debug returns structured findings (root cause, recommended fix, blast radius) — apply the fix based on its findings, then resume QA.

After fixing a bug, record it: update the scenario's status to validated and put the bug description + fix in notes (e.g., "found stale cache; added cache-bust on logout"). If the bug was discovered outside any planned scenario — while navigating between tests or doing exploratory poking — add a new scenario to scenarios[] with the next sequential ID, describe what you found and fixed, and mark it validated with the fix in notes.

Fix-loop self-regulation (cumulative risk score):

When --report-only is active, skip this entire section. No fixes means no risk tracking.

Track a cumulative risk score across all fixes in the QA session (bug fixes AND fixable gap fixes from Step 5b). Persist the risk state in qa-progress.json — read before each fix, write after:

{
  "fixLoopState": {
    "riskScore": 15,
    "fixCount": 12,
    "reverts": 0
  }
}

Risk increments:

Start at 0%
Each revert:                +15%  (strongest signal — you undid your own work)
Each fix touching >3 files: +5%   (blast radius growing)
Touching unrelated files:   +10%  (scope creep)
After fix 15:               +1% per additional fix (fatigue ramp)

Threshold: STOP fixing at ≥30%
Hard cap: 50 fixes per QA session
  • Before each fix: read fixLoopState from qa-progress.json. After each fix: update riskScore and fixCount, write back.
  • When the threshold is hit, stop fixing. Document remaining issues in qa-progress.json (set status to "failed", notes explaining the risk score stopped further fixes).
  • Regression test commits are excluded from the heuristic — writing a test for a fix does not increment the risk score.
  • Continue executing remaining test scenarios (read-only observation) even after the fix cap is reached — you can still discover and document issues, just not fix them.

Test suite gap discovery:

When --report-only is active, skip this section. No tests are written in report-only mode — document the gap in the scenario's notes instead (e.g., "missing unit test for session invalidation — recommend adding coverage").

During execution, you may discover behaviors that should have formal test coverage but don't — an edge case with no unit test, a behavior path with no integration test, an untested integration point. Default: document the gap in the scenario's notes as a coverage recommendation (e.g., "missing unit test for session invalidation — recommend adding coverage"). QA's primary job is scenario verification, not chasing coverage — surfacing gaps is valuable even when filling them isn't in scope, and documented gaps flow to /pr as follow-up items.

Exception — write the test only WHEN all of these hold:

  • The gap is directly adjacent to a bug you just fixed (pair the test with the fix).
  • The test can be written at the tier of existing nearby tests (no new test infrastructure required).
  • Writing takes under 5 minutes.

When writing under the exception, load /tdd for test-design rules — tier selection, mocking philosophy, flakiness handling, mock-tautology prevention, test-artifact protection, spec-grounded authoring. Record the test in the scenario's notes alongside the bug fix notes (e.g., "also wrote unit test for session invalidation — no existing coverage").

Step 8: Report and teardown

When tmp/ship/ exists: As a final action before reporting:

  1. Write qaCompletedAtCommit to qa-progress.json with the current HEAD commit hash (git rev-parse HEAD). This marks the boundary between QA and post-QA changes for staleness detection.

  2. Compute and write executionSummary to planMetadata:

    "executionSummary": {
      "validated": 12,
      "failed": 1,
      "blocked": 2,
      "planned": 0
    }
    

    Count scenarios by status. This saves every downstream consumer from scanning the full scenario array.

  3. Compute and write verdict at the top level of qa-progress.json. Verdict severity is derived from scenario properties (category, source), not from planner-assigned priority labels. No failure ever produces a clean "go":

    • Any scenario with source: "journey" failed or blocked"no-go" (compositional user path broken — integration seams failing)
    • Any scenario with category: "ux-flow" tracing to a core user journey failed or blocked"no-go" (happy path broken)
    • Any scenario involving data-loss or security-sensitive paths failed or blocked"no-go" (safety-critical failure)
    • Any other scenario failed or blocked"conditional" (feature works but has documented issues — human decides whether to proceed)
    • All validated"go"

    Write the verdict alongside the inputs that produced it (which scenarios triggered the verdict and why) so consumers can both use the pre-computed verdict and verify the computation.

The JSON file is your report. A downstream consumer will render it to the PR body. Report completion to the invoker.

When tmp/ship/ does not exist and a PR exists: The ## Verification section in the PR body is your report. Ensure it's up-to-date with all results. Do not add a separate PR comment.

No PR exists: Report directly to the user with:

  • Total scenarios tested vs. passed vs. failed vs. blocked
  • Bugs found and fixed (with brief description of each)
  • Gaps — what could NOT be tested due to tool limitations or environment constraints
  • Judgment call — your honest assessment: is this feature ready for human review?

The skill's job is to fix what it can, document what it found, and hand back a clear picture. Unresolvable issues and gaps are documented, not silently swallowed — but they do not block forward progress. The invoker decides what to do about remaining items.

Teardown (mandatory): After reporting, tear down everything bootstrapped in Step 2. Kill dev server, stop Docker containers (docker compose down -v), clean fixture data, remove temporary files in tmp/. Tear down in reverse order of bootstrap. Consult bootstrapResult.teardownRequired for the full list. Leave the environment as it was found.


QA execution boundaries

The one hard boundary: no mutations to cloud/external systems. Everything else is fair game locally. Exhaust all local options before marking anything as unverifiable.

In bounds — exhaust these (the full local arsenal):

  • Browser automation (Playwright via /browser) — navigate, click, fill forms, inspect console/network, record video, audit accessibility, test responsive layouts
  • Docker containers — spin up databases, caches, queues, mock services via docker compose up -d. Tear them down when done.
  • Local dev servers — install deps, start the server, seed the database, start workers
  • Shell scripts, REPL sessions, consumer-perspective import scripts
  • API calls to locally-running endpoints (curl localhost:...)
  • Ad-hoc verification scripts in tmp/ — write, run, delete
  • Temporary test servers, fixture files, seed data
  • Environment variable manipulation, feature flag toggling
  • Running the project's own test suite, linters, typecheckers
  • Database queries to verify mutation correctness
  • Log/telemetry inspection during test flows
  • Simulating external service responses locally (MSW, nock, VCR, intercepting HTTP)
  • Testing webhook handlers by sending simulated payloads to localhost
  • Verifying outbound HTTP calls are correctly formed (intercept and assert, don't send)

Out of bounds (the hard boundary):

  • Requests to production URLs, staging environments, or live third-party APIs
  • Operations that trigger billing, metering, or quota consumption
  • Sending real emails, Slack messages, webhooks to external services
  • Accessing production databases or customer data
  • Any mutation to external/cloud systems (POST/PUT/DELETE to non-localhost)

When a scenario requires an external service:

  1. First, check if there's a local substitute: Docker image, local emulator (e.g., LocalStack for AWS, Stripe CLI for webhooks), mock server, or simulated payload
  2. If a local substitute exists — use it. This counts as real verification.
  3. If no local substitute exists — verify as much as possible locally (e.g., verify the webhook handler responds correctly to a simulated payload), then mark the scenario status: "blocked" with notes describing what you verified locally and what a human needs to verify in staging/production
  4. blocked scenarios flow to /pr as pending human verification items — they are NOT silently dropped

blocked is the safety net, not the first resort. A scenario should only be blocked after you've exhausted Docker, local emulators, simulated payloads, and intercepted requests. If you can verify 80% of the scenario locally and only the final external handoff needs human eyes, describe the 80% you verified and the 20% the human needs to check. Every blocked scenario must include what was attempted.


Calibrating depth

The depth comes from the qa-progress.json plan — execute every scenario in it. The plan was written to be maximally ambitious; your job is to verify every scenario at the highest achievable fidelity.

Testing breadth scales with affected surfaces. When standalone (no qa-progress.json), calibrate how many scenarios to test based on the scope of changes: a single-file bug fix warrants 1-2 targeted scenarios plus a regression check; a multi-file feature touching several surfaces warrants scenarios for each affected surface plus edge cases and error paths. Don't apply a fixed number — let the number of affected surfaces drive the count.

Under-testing looks like (these are the failures to avoid):

  • Declaring confidence from unit tests alone when the feature has user-facing surfaces
  • Claiming a scenario is "covered by tests" when those tests mock the service boundary
  • Never opening a browser when the feature has a UI — /browser is always loadable and runs headless Playwright by default
  • Skipping error-path testing
  • Not testing the interaction between new and existing code
  • Not checking database state after mutations
  • Not spinning up Docker when docker-compose.yml exists and would give you a real database
  • Marking scenarios as blocked without first loading /browser, starting the dev server, spinning up Docker, or trying local emulators, simulated payloads, and ad-hoc scripts

Anti-patterns

  • Treating QA as a checkbox. "I tested it" means nothing without specifics. Every scenario must have a concrete action and expected outcome.
  • Only testing the happy path. Real users encounter errors, edge cases, and unexpected states. Test those.
  • Silent gaps. If you can't test something, say so explicitly. An undocumented gap is worse than a documented one.
  • Confusing orchestration headless mode with browser unavailability. Ship's --headless flag and container detection enter orchestration autonomy — no user gates, no negotiation checkpoints. They do NOT mean "the environment cannot run a browser." /browser's Playwright engine runs headless Chromium by default, which is exactly what --headless ship runs need. Load /browser, run the scenarios.
  • Marking scenarios as blocked for "needs browser", "needs Playwright", "needs dev server", "requires UI", or "unrunnable in headless". These are the primary use cases for /browser + the Step 2 bootstrap — not valid blocked reasons. If you wrote one of these as a blocked reason, stop: you haven't actually attempted to load /browser or start the dev server. Do that first, then re-evaluate.
  • Using Peekaboo or Claude-in-Chrome for web testing. mcp__peekaboo__* and mcp__claude-in-chrome__* tools are NOT for QA web page testing. Use /browser (Playwright). Peekaboo is for OS-level macOS automation only. Chrome extension is for ad-hoc, user-directed browser tasks outside QA.
Related skills

More from inkeep/team-skills

Installs
61
GitHub Stars
10
First Seen
Feb 25, 2026