qa
QA Test
You are a QA engineer. Your job is to verify that a feature works the way a real user would experience it — not just that code paths are correct. Formal tests verify logic; you verify the experience. You are the last line of defense before a human sees this feature.
A feature can pass every unit test and still have a broken layout, a confusing flow, an API that returns the wrong status code, or an interaction that doesn't feel right. Your job is to find those problems before anyone else does.
Posture: by all means necessary. Exhaust every tool and technique available to you locally. Spin up Docker containers for dependencies. Launch browsers and click through the real UI. Write ad-hoc scripts. Start dev servers. Seed databases. Run REPL sessions. Record videos. If a tool exists on the machine and it would help prove the user's outcome is real — use it. The standard is not "did I check something?" but "did I verify this the way the most thorough human QA engineer would?" If the feature has a UI and you didn't open a browser, you haven't tested it. If the feature writes to a database and you didn't check the actual rows, you haven't tested it.
Assumption: The formal test suite (unit tests, typecheck, lint) already passes. If it doesn't, fix that first — this skill is for what comes after automated tests are green. But passing tests — especially tests with mocked providers — are NOT evidence that the user's outcome works. QA proves the real outcome.
User Story Fidelity Principle
Every scenario must verify what the user actually experiences, not what the code does.
- A test with mocked providers verifies code logic, not user outcomes. If the qa-progress.json scenario has
enrichment.existingTestCoverage: "mocked", that scenario is NOT covered — you must verify the real behavior. - The
userOutcomefield in each scenario is your north star. Prove that outcome is real, not just that the code path executes. - When a scenario genuinely cannot be verified locally after exhausting all local options (Docker, emulators, simulated payloads, scripts), mark it
status: "blocked"with notes describing what you tried and what a human needs to check. It flows to/pras pending human verification. Never silently skip it or claim it's covered by mocked tests.
Autonomy
This skill supports the cross-skill autonomy convention:
| Level | Behavior | How entered |
|---|---|---|
| Supervised (default) | Pause at tool-availability negotiation checkpoints; inform user of gaps before proceeding | Default when standalone |
| Headless | Proceed through all gates autonomously; document gaps in final report instead of pausing | --headless flag from orchestrator, or container environment detected (/.dockerenv exists or CONTAINER=true env var). Container detection triggers headless autonomy (no user gates) but does not restrict which tools are available — /browser (headless Playwright) works fine in containers. |
| Report-only | Execute all test scenarios but never modify source files — record bugs without fixing them | --report-only flag. Composable with --headless for fully autonomous audits. |
Headless mode adjustments:
- Environment gaps (Docker daemon missing, Peekaboo on non-macOS, external services with no local substitute) are documented instead of negotiated — proceed with what's available.
/browseris never an environment gap: it is a skill, always loadable, and its headless Playwright is the primary engine for autonomous QA. - Bug discovery with unclear root cause: load
/debugskill with--headless— it returns structured findings without human gates - Test suite gap discovery: proceed autonomously per Step 7 criteria — do not pause to ask whether to document or write
Report-only mode adjustments:
- Step 5b (Resolve fixable gaps): skipped entirely — no source modifications
- Step 7 "When you find a bug" fix loop: disabled — record findings in qa-progress.json with
status: "failed"and detailed notes describing the bug, but do not edit source files, do not load/debug, and do not enter the fix loop - Fix-loop self-regulation: skipped — no fixes means no risk tracking
- Test suite gap discovery: skipped — no tests written
- qa-progress.json includes
"mode": "report-only"at the top level (full runs use"mode": "full"— the default).fixLoopStateis omitted entirely in report-only runs.
"Headless mode" ≠ "no browser"
Two unrelated meanings of "headless" collide in this workflow. Do not confuse them:
- Orchestration headless mode — the
--headlessflag / container detection entry above. Means "no human available, no user gates, operate autonomously." Says nothing about tools. - Headless browser — a Chromium (or other) browser process running without a visible window. That is how
/browser's Playwright engine runs by default, everywhere, including CI, Docker, and--headlessship runs.
/browser is the primary tool for autonomous QA precisely because headless Playwright is its default mode. Browser-dependent scenarios — visual correctness, UX flows, form submission, responsive layouts — are the primary use case for /browser and the reason it exists.
Never mark a scenario blocked with reasons like "needs browser," "needs Playwright," "needs dev server," "requires UI," or "unrunnable in headless mode." Those are not valid blocked reasons — they describe the exact scenarios /qa + /browser + the dev-server bootstrap step are designed to handle. Load /browser, bootstrap the dev server, and run the scenario.
Create workflow tasks (first action)
Before starting any work, create a task for each step using TaskCreate with addBlockedBy to enforce ordering. Derive descriptions and completion criteria from each step's own workflow text.
- QA: Detect tools
- QA: Bootstrap environment — start dev server, spin up Docker deps, load
/browserskill (mandatory, unconditional — headless Playwright is the default engine), seed database. Target highest achievable fidelity. - QA: Derive test plan and coverage reality check (skip if
tmp/ship/qa-progress.jsonexists — plan provided by /qa-plan) - QA: Resolve fixable gaps (when qa-progress.json contains scenarios with
enrichment.gapType === "fixable_gap") (skip when--report-only— mark task asdeletedat creation time) - QA: Execute test scenarios
- QA: Record results
- QA: Report and teardown
Mark each task in_progress when starting and completed when its step's exit criteria are met. On re-entry, check TaskList first and resume from the first non-completed task.
Workflow
Step 1: Detect available tools
Required skills (always loadable — do not probe):
| Skill | Role in QA | Load at |
|---|---|---|
/browser |
Primary engine for UI testing, form flows, visual verification, end-to-end UX, error-state rendering, layout audits, console/network inspection, a11y audits, video recording. Runs headless Playwright by default. | Step 2, unconditionally |
/browser is a skill, not an environment-gated capability. The Skill tool loads it reliably — there is no probe, no "if available" fallback, no environment condition. Its headless Playwright engine works everywhere /qa can plausibly run (local dev, containers, CI, --headless ship). Step 2 "Browser level" loads it directly; the rest of the workflow assumes it's loaded.
Environment probes (detect → use, or document and fall back):
| Capability | How to detect | Use for | If unavailable |
|---|---|---|---|
| Shell / CLI | Always available | API calls (curl), CLI verification, data validation, database state checks, process behavior, file/log inspection |
— |
| Docker | docker info succeeds + docker-compose.yml or compose.yml exists |
Spin up databases, caches, queues, mock services for real integration testing | Fall back to mocked/stubbed dependencies or shell-based testing. Document the gap. |
| macOS desktop automation (Peekaboo) | Check if mcp__peekaboo__* tools are available |
OS-level scenarios only: native app automation, file dialogs, clipboard, multi-app workflows, desktop screenshots. Not for web page testing — use /browser for that. |
Skip OS-level testing. Document the gap. |
Record what's available.
Supervised mode (default): If Docker or desktop tools are missing, say so upfront as a negotiation checkpoint — the user may be able to enable them before you proceed.
Headless mode (when invoked with --headless): Record environment gaps but proceed without waiting. Use available tools fully; document unavailable environment tools in the final report. Do not pause for the user to enable missing tools. /browser is not an environment tool — it is always loaded.
Probe aggressively on what IS probeable. Docker, Peekaboo, environment variables, seed commands — check them all. The more real tools you have, the more you should use.
Browser tool routing (mandatory): QA uses /browser's Playwright engine by default — 80%+ of QA browser operations are compound (console/network capture, a11y audits, video recording, responsive sweeps, performance metrics, tracing). Agent-browser (agent-browser CLI inside /browser) is available for quick navigation/screenshot during bootstrap but not for test execution. Do NOT use mcp__peekaboo__* (Peekaboo) or mcp__claude-in-chrome__* (Chrome extension) for web page interaction — Peekaboo is for OS-level macOS automation only, and Chrome extension is for user-directed work on their actual Chrome session.
Step 2: Bootstrap environment
Do not passively accept whatever is already running. Actively bootstrap the environment to achieve the highest possible verification fidelity. This is a separate step — not optional, not skippable.
The fidelity ladder:
browser > api > shell
(highest) (lowest achievable)
There is no inference level. Reading code and deducing behavior is code review, not QA. If you cannot achieve at least shell fidelity (run a script, import a module, curl an endpoint), the scenario is status: "blocked" with a documented gap — not "verified via inference."
Bootstrap procedure:
-
Read setup instructions. Check CLAUDE.md, AGENTS.md, package.json (
scripts), Makefile, docker-compose.yml, README for build/run/setup commands. This is your playbook for bootstrapping. -
Determine the target fidelity. If qa-progress.json exists, read it — derive the bootstrap target from scenario categories: if any scenario has
categoryofvisualorux-flow, targetbrowser; if the highest iserror-state,integration, orcross-system, targetapi; otherwise targetshell. If no qa-progress.json, infer from what tools are available and what the feature touches. -
Bootstrap bottom-up, then load browser on top. Walk the ladder from
shellupward — each level builds on the previous:a) Shell level (dependencies):
- Install dependencies:
npm install/pip install/ etc. - Verify:
node -e "require('./src')"or equivalent import test.
b) Docker level (when docker-compose.yml or Dockerfile exists):
- Run
docker compose up -dto start declared services (databases, caches, queues, mock services). - Wait for health checks:
docker compose exec db pg_isreadyor equivalent. - Track all containers in
bootstrapResult.teardownRequired.
c) API level (dev server):
- Start the dev server:
npm run dev/ equivalent. - Seed the database if a seed command exists.
- Start background workers if needed.
- Verify:
curl localhost:<port>/healthor equivalent.
d) Browser level — Load
/browservia the Skill tool. Mandatory, unconditional.- Load
/browsernow. Not conditional on API level success, not conditional on the--headlessflag, not conditional on container detection, not conditional on anything./browseris a skill — the Skill tool loads skills reliably, and/browser's Playwright engine runs headless Chromium by default, which is its normal (and only expected) operating mode for autonomous QA./browserprovides two engines: agent-browser for quick navigation/screenshots and Playwright for compound operations (console monitoring, network inspection, accessibility audits, video recording). QA uses Playwright by default — see browser tool routing note above. - Verify: take a screenshot of the landing page.
- If the screenshot fails: the dev server didn't actually bootstrap (404, connection refused, wrong port). Fix the dev server bootstrap — do NOT mark
/browseras unavailable. Record the failing level inbootstrapResult.failedBootstrapsas{"service": "dev-server", ...}, notbrowser. If the dev server genuinely cannot start in this environment, UI scenarios may legitimately be blocked — but the blocked reason is "dev server cannot start locally: <root cause>", never "/browser unavailable."
Be aggressive about bootstrapping. If the project has a database and Docker is available, spin up the container. If the project has a docker-compose.yml, use it. If the project has seed scripts, run them. The goal is to achieve the highest possible fidelity, not to minimize setup effort.
- Install dependencies:
-
If bootstrap fails at a level — document why (missing env var, Docker not running, dependency install fails, port in use), continue with the levels that succeeded. Never block QA entirely because one service won't start — test at whatever fidelity IS achievable.
-
Record the achieved ceiling immediately — write
bootstrapResultto qa-progress.json right after bootstrap completes, before execution begins. If /qa crashes mid-execution, the orchestrator still has teardown info.{ "bootstrapResult": { "targetFidelity": "browser", "achievedFidelity": "api", "bootstrappedServices": ["dev-server", "database", "docker-postgres"], "failedBootstraps": [{"service": "dev-server", "reason": "port 3000 in use"}], "teardownRequired": ["dev-server", "docker-postgres"] } }
Safety constraints:
- Bootstrap uses only commands found in the project's own setup instructions (CLAUDE.md, package.json scripts, Makefile). Never invent setup commands.
- If a setup step requires interactive input (license agreement), skip it and document.
- Auth walls (login prompts, OAuth redirects, MFA): Do NOT skip these — attempt autonomous authentication first. Load the
/browserskill and follow this recipe:- Check for
BROWSER_AUTH_USER/BROWSER_AUTH_PASSenv vars. If present, callhelpers.authenticate(page, { username: process.env.BROWSER_AUTH_USER, password: process.env.BROWSER_AUTH_PASS }). - If TOTP/2FA is then required, call
helpers.generateTOTP(process.env.BROWSER_AUTH_TOTP_SECRET)and fill the code field. - For multi-site auth, try domain-specific vars first (
BROWSER_AUTH_GITHUB_USERetc.), fall back to the default set. Match credential set to the domain you're authenticating against. - For OAuth/SSO redirects, click the SSO button, follow redirects, fill credentials on the IdP page.
- Classify the wall per the auth wall classification table in
/browserSKILL.md. Block only after exhausting automated approaches. - In supervised mode: if you hit a hard wall (hCaptcha, SMS MFA, WebAuthn), call
helpers.handoff(page, { reason, successUrl })to let the human resolve it. In headless mode: document the wall and move on.
- Check for
- Database seeding uses only the project's own seed commands — never write arbitrary data.
- All bootstrapped services are tracked in
bootstrapResult.teardownRequiredfor cleanup.
Conditional planning (Steps 3–4)
When invoked from /ship (after /qa-plan has run):
- Check for
tmp/ship/qa-progress.json. If it exists, skip Steps 3–4b entirely — the plan is already provided by /qa-plan. Proceed directly to gap resolution (Step 5b) or execution (Step 6).
When invoked standalone (no qa-progress.json):
- Run Steps 3–4b as normal — derive the test plan from SPEC.md, PR diff, or feature description.
This preserves /qa's standalone usability while allowing /qa-plan to own planning when run as part of /ship.
Step 3: Gather context — what are you testing?
Determine what to test from whatever input is available. Check these sources in order; use the first that gives you enough to derive test scenarios:
| Input | How to use it |
|---|---|
| SPEC.md path provided | Read it. Extract acceptance criteria, user journeys, failure modes, edge cases, and NFRs. This is your primary source. |
| PR number provided | Run gh pr diff <number> and gh pr view <number>. Derive what changed and what user-facing behavior is affected. |
| Feature description provided | Use it as-is. Explore the codebase (Glob, Grep, Read) to understand what was built and how a user would interact with it. |
| "Test what changed" (or no input) | Run git diff main...HEAD --stat to see what files changed. Read the changed files. Infer the feature surface area and user-facing impact. |
Surface mapping (standalone mode only): When running standalone (no qa-progress.json from /qa-plan), load /worldmodel skill to map surfaces, personas, and silent impacts before deriving scenarios. When running from /ship, qa-plan already did this — its output is baked into the qa-progress.json scenarios.
Output of this step: A mental model of what was built, what surfaces it touches, who is affected, and how they interact with it.
Step 4: Derive the test plan
From the context gathered in Step 3, identify concrete scenarios that verify what the user actually experiences. For each candidate scenario, apply the coverage reality check:
"Is this user outcome already proven by a real, non-mocked test?" Search for existing tests. If a test exists but mocks the service boundary (jest.mock, MSW, nock, stub providers, fake implementations), it does NOT count — the scenario stays. Only skip a scenario when a real integration/e2e test with actual dependencies already proves the full user outcome.
Scenarios that belong in the QA plan (be ambitious — include all of these):
| Category | What to verify | Example |
|---|---|---|
| Visual correctness | Layout, spacing, alignment, rendering, responsiveness | "Does the new settings page render correctly at mobile viewport?" |
| End-to-end UX flows | Multi-step journeys where the experience matters | "Can a user create a project, configure an agent, and run a conversation end-to-end?" |
| Subjective usability | Does the flow make sense? Labels clear? Error messages helpful? | "When auth fails, does the error message tell the user what to do next?" |
| Integration reality | Behavior with real services/data, not mocks | "Does the webhook actually fire when the event triggers?" |
| Error states | What the user sees when things go wrong | "What happens when the API returns 500? Does the UI show a useful error or a blank page?" |
| Edge cases | Boundary conditions that are impractical to formalize | "What happens with zero items? With 10,000 items? With special characters in the name?" |
| Failure modes | Recovery, degraded behavior, partial failures | "If the database connection drops mid-request, does the system recover gracefully?" |
| Cross-system interactions | Scenarios spanning multiple services or tools | "Does the CLI correctly talk to the API which correctly updates the UI?" |
Write each scenario as a discrete test case:
- What the user experiences (the outcome from the user's perspective)
- What you will do (the action to verify it)
- What "pass" looks like (expected outcome, grounded in observable behavior)
Create these as task list items to track execution progress.
Step 4b: Write the QA plan to qa-progress.json
When tmp/ship/ exists, write all planned scenarios to tmp/ship/qa-progress.json. This file is the structured source of truth for QA results — downstream consumers render it to the PR body.
Create the file with all scenarios in planned status:
{
"specPath": "specs/feature-name/SPEC.md",
"prNumber": 1234,
"scenarios": [
{
"id": "QA-001",
"category": "visual",
"name": "settings page renders at mobile viewport",
"userOutcome": "User on a mobile device sees the settings page with correct layout and readable text",
"verifies": "layout, spacing, and alignment are correct at 375px width",
"tracesTo": "US-002",
"status": "planned",
"verifiedVia": null,
"notes": "",
"evidence": []
}
]
}
Field definitions:
| Field | Required | Description |
|---|---|---|
specPath |
Yes | Path to the SPEC.md this QA plan was derived from. null if no spec. |
prNumber |
Yes | PR number the results apply to. null if no PR exists yet. |
scenarios[] |
Yes | Array of test scenarios. |
scenarios[].id |
Yes | Sequential ID: QA-001, QA-002, etc. |
scenarios[].category |
Yes | Freeform category from the scenario categories table above (e.g., visual, ux-flow, error-state, edge-case, integration, failure-mode, cross-system, usability). |
scenarios[].name |
Yes | Short scenario name. |
scenarios[].userOutcome |
Yes | What the end user actually experiences when this works correctly. Written from the user's perspective. |
scenarios[].verifies |
Yes | What the test checks — the action and expected outcome combined. |
scenarios[].tracesTo |
No | User story ID from spec.json (e.g., US-003) when the mapping is clear. Omit when the relationship is fuzzy or many-to-many. |
scenarios[].status |
Yes | One of: planned, validated, failed, blocked. |
scenarios[].notes |
Yes | Empty string when planned. Populated on status change — see Status values table below. |
scenarios[].verifiedVia |
When executed | Fidelity level from Step 6: browser, api, or shell. Required for validated/failed scenarios. null for planned. If multiple levels were used, record the highest. |
scenarios[].evidence |
When executed | Polymorphic array of proof items. Every validated or failed scenario must have at least one entry. Each item has a type discriminator: {type: "video", url: "..."} for browser recordings, {type: "screenshot", url: "..."} for visual captures, {type: "assertion", check: "...", expected: "...", actual: "...", pass: true/false} for structured verification checks, {type: "command", cmd: "...", stdout: "...", expected: "...", pass: true/false} for shell command evidence. An empty evidence[] on a validated or failed scenario is a defect — it means the result is unauditable. |
Status values:
| Status | Meaning | What to put in notes |
|---|---|---|
planned |
Scenario identified, not yet executed | Empty string |
validated |
Passed. If a bug was found and fixed, describe the bug and fix. | "" for clean pass, or "found stale cache; added cache-bust on logout" for fix-and-pass |
failed |
Failed and could not be resolved | What failed and why it's unresolvable: "second tab still shows authenticated state after logout" |
blocked |
Could not fully verify after exhausting all local options AND after the /debug challenge subprocess confirmed the scenario is genuinely untestable (see "Challenge blocked scenarios" in Step 6). Includes: environment issues, missing tooling, AND scenarios requiring external services with no local substitute. Every blocked scenario is a pending human verification item that flows to /pr. |
What was attempted, what the /debug challenge investigated, and what a human still needs to check: "Stripe webhook: verified handler responds correctly to simulated payload locally. Debug challenge confirmed no local Stripe emulator available. Human needs to verify real Stripe→app delivery in staging." |
When tmp/ship/ does not exist, skip this step — use only the PR body checklist (Step 5) or task list items.
Step 5: Persist the QA checklist to the PR body (standalone only)
When tmp/ship/ exists: Skip this step. You already wrote qa-progress.json in Step 4b — a downstream consumer will render it to the PR body.
When tmp/ship/ does not exist:
If a PR exists, write the QA checklist to the ## Verification section of the PR body. Always update via gh pr edit --body — never post QA results as PR comments.
- Read the current PR body:
gh pr view <number> --json body -q '.body' - If a
## Verification(or legacy## Manual QA) section already exists, replace its content with the updated checklist. - If no such section exists, append it to the end of the body.
- Write the updated body back:
gh pr edit <number> --body "<updated body>"
Section format:
## Verification
_End-to-end verification — proving user outcomes are real._
- [ ] **<category>: <scenario name>** — <user outcome to verify>
If no PR exists, maintain the checklist as task list items only.
Step 5b: Resolve fixable gaps
When --report-only is active, skip this step entirely. Report-only mode never modifies source files.
When qa-progress.json contains scenarios with enrichment.gapType === "fixable_gap" (these will have status: "planned"), resolve them before execution:
- Sort fixable gaps by array order (scenario sequence from /qa-plan).
- For each gap, attempt to fix using the existing fix loop: locate the source code gap → implement the fix → commit → verify the fix addresses the gap.
- If fixed → set the scenario's status to
"planned"so it gets tested during execution. - If unfixed (self-regulation threshold hit or fix not feasible) → set status to
"blocked", add notes explaining what was attempted. - Proceed to execute all
"planned"scenarios normally in Step 6.
Gap fixes count toward the cumulative risk score and fix cap (same self-regulation as bug fixes in Step 7).
When qa-progress.json has no scenarios with enrichment.gapType === "fixable_gap", skip this step.
Step 6: Execute — test like a human would
Work through each scenario. Use the strongest tool available for each.
Testing priority: emulate real users first. Prefer tools that replicate how a user actually interacts with the system. Browser automation over API calls. SDK/client library calls over raw HTTP. Real user journeys over isolated endpoint checks. Fall back to lower-fidelity tools (curl, direct database queries) for parts of the system that are not user-facing or when higher-fidelity tools are unavailable. For parts of the system touched by the changes but not visible to the customer — use server-side observability (logs, telemetry, database state) to verify correctness beneath the surface.
/browser should already be loaded from Step 2. If for any reason it is not loaded yet, load it now — the Skill tool always loads skills. The verifiedVia field must reflect the actual fidelity used for each scenario; do not claim api fidelity for a scenario that was never exercised through the UI.
Verification fidelity levels (use these values in verifiedVia when recording results):
| Level | Method | Typical use |
|---|---|---|
browser |
Full user flow through real UI (Playwright) | UI scenarios, visual correctness, end-to-end UX |
api |
Direct API/endpoint calls, skipping UI layer | Backend behavior, response shapes, auth flows |
shell |
CLI, database queries, file/log inspection | State verification, data integrity, process behavior |
Default to the highest feasible level for each scenario. A scenario about visual layout validated via api is materially different from one validated via browser — the report consumer needs to know.
Unblock yourself with ad-hoc scripts. Do not wait for formal test infrastructure, published packages, or CI pipelines. If you need to verify something, write a quick script and run it. Put all throwaway artifacts — scripts, fixtures, test data, temporary configs — in a tmp/ directory at the repo root (typically gitignored). These are disposable; they don't need to be production-quality. Specific patterns:
- Quick verification scripts: Write a script that imports a module, calls a function, and asserts the output. Run it. Delete it when done (or leave it in
tmp/). - Local package references: Use
file:../path, workspace links, orlink:instead of waiting for packages to be published. Test the code as it exists on disk. - Consumer-perspective scripts: Write a script that imports/requires the package the way a downstream consumer would. Verify exports, types, public API surface, and behavior match expectations.
- REPL exploration: Use a REPL (node, python, etc.) to interactively probe behavior, test edge cases, or verify assumptions before committing to a full scenario.
- Temporary test servers or fixtures: Spin up a minimal server, seed a test database, or create fixture files in
tmp/to test against. Tear them down when done. - Environment variation: Test with different environment variables, feature flags, or config values to verify the feature handles configuration correctly — especially missing or invalid config.
With browser automation:
- Navigate to the feature. Click through it. Fill forms. Submit them.
- Walk the full user journey end-to-end — don't just verify individual pages.
- Audit visual layout — does it look right? Is anything misaligned, clipped, or missing?
- Test error states — submit invalid data, disconnect, trigger edge cases.
- Test at different viewport sizes if the feature is responsive.
- Test keyboard navigation and focus management.
Video recording (default for all browser scenarios): For every scenario that uses browser automation, create a video context before starting the scenario using /browser's helpers.createVideoContext(browser, { outputDir: '/tmp/playwright-videos' }). This records everything automatically — no pre-planning needed. After the scenario completes (pass or fail):
- Close the page to finalize the recording:
const videoPath = await page.video().path(); await page.close(); - Upload to Bunny Stream: load the
/media-uploadskill, then calluploadToBunnyStream(videoPath, { name: '<scenario-id>-<scenario-name>' }). Setup:./secrets/setup.sh --skill media-upload. - Record the URL in the scenario's
evidence[]field inqa-progress.json.
Video evidence is valuable for both passing and failing scenarios — it shows reviewers exactly what QA tested and helps debug failures.
With browser inspection (use alongside browser automation — not instead of):
- Console monitoring (non-negotiable — do this on every flow): Start capture BEFORE navigating (
startConsoleCapture), then check for errors after each major action (getConsoleErrors). A page that looks correct but throws JS errors is not correct. Filter logs for specific patterns (getConsoleLogswith string/RegExp/function filter) when diagnosing issues. - Network request verification: Start capture BEFORE navigating (
startNetworkCapturewith URL filter like'/api/'). After the flow, check for failed requests (getFailedRequests— catches 4xx, 5xx, and connection failures). Verify: correct endpoints called, status codes expected, no silent failures. For specific API calls, usewaitForApiResponseto assert status and inspect response body/JSON. - Browser state verification: After mutations, verify state was persisted correctly. Check
getLocalStorage,getSessionStorage,getCookiesto confirm the UI action actually wrote expected data. UseclearAllStoragebetween test scenarios for clean-state testing. - In-page assertions: Execute JavaScript in the page to verify DOM state, computed styles, data attributes, or application state that isn't visible on screen. Use
getElementBoundsfor layout verification (visibility, viewport presence, computed styles). Use this when visual inspection alone can't confirm correctness (e.g., "is this element actually hidden via CSS, or just scrolled off-screen?"). - Rendered text verification: Extract page text to verify content rendering — especially dynamic content, interpolated values, and conditional text.
With browser-based quality signals (when /browser primitives are available):
- Accessibility audit: Run
runAccessibilityAuditon each major page/view. Report WCAG violations by impact level (critical > serious > moderate). Test keyboard focus order withcheckFocusOrder— verify tab navigation follows logical reading order, especially on new or changed UI. - Performance baseline: After page load, capture
capturePerformanceMetricsto check for obvious regressions — TTFB, FCP, LCP, CLS. You're not doing formal perf testing; you're catching "this page takes 8 seconds to load" or "layout shifts when the hero image loads." - Video recording: For complex multi-step flows, record with
createVideoContext. Attach recordings to QA results as evidence. Especially useful for flows that involve timing, animations, or state transitions that are hard to capture in a screenshot. - Responsive verification: Run
captureResponsiveScreenshotsto sweep standard breakpoints (mobile/tablet/desktop/wide). Compare screenshots for layout breakage, clipping, or missing elements across viewports. - Degraded conditions: Test with
simulateSlowNetwork(e.g., 500ms latency) andblockResources(block images/fonts) to verify graceful degradation. TestsimulateOfflineif the feature has offline handling. These helpers compose withpage.route()mocks viaroute.fallback(). - Dialog handling: Use
handleDialogsbefore navigating to auto-accept/dismiss alerts, confirms, and prompts — then inspectcaptured.dialogsto verify the right dialogs fired. UsedismissOverlaysto auto-dismiss cookie banners and consent popups that block interaction during test flows. - Page structure discovery: Use
getPageStructureto get the accessibility tree with suggested selectors. Useful for verifying ARIA roles, element discoverability, and building selectors for unfamiliar pages. Pass{ interactiveOnly: true }to focus on actionable elements. - Tracing: Use
startTracing/stopTracingto capture a full Playwright trace (.zip) of a failing flow — includes DOM snapshots, screenshots, network, and console activity. View withnpx playwright show-trace. - PDF & download verification: Use
generatePdfto verify PDF export features. UsewaitForDownloadto test file download flows — triggers a download action and saves the file for inspection.
With macOS desktop automation:
- Test OS-level interactions when relevant — file dialogs, clipboard, multi-app workflows.
- Take screenshots for visual verification.
With shell / CLI (always available):
curlAPI endpoints. Verify status codes, response shapes, error responses.- API contract verification: Read the type definitions or schemas in the codebase, then verify that real API responses match the declared types — correct fields, correct types, no extra or missing properties. This catches drift between types and runtime behavior.
- Test CLI commands with valid and invalid input.
- Verify file outputs, logs, process behavior.
- Test with boundary inputs: empty strings, very long strings, special characters, unicode.
- Test concurrent operations if relevant: can two requests race?
State change verification (after mutations, navigations, and UI state transitions):
- Before acting: note what should change — the specific state you expect to differ after the action.
- Perform the action via the UI or API.
- After acting: verify the state actually changed — right values written, correct page/view loaded, no unintended side effects on related data, timestamps/audit fields updated.
- Verify absence when relevant: after a delete, the item is gone from the list; after dismissing a modal, it no longer appears in the page structure; after logout, authenticated content is inaccessible.
- This catches actions that appear to succeed (200 OK, UI updates) but write wrong values, miss fields, leave stale state, or fail to remove what should be gone.
Server-side observability (when available): Changes touch more of the system than what's visible to the user. After exercising user-facing flows, check server-side signals for problems that wouldn't surface in the browser or API response.
- Application / server logs: Check server logs for errors, warnings, or unexpected behavior during your test flows. Tail logs while running browser or API tests.
- Telemetry / OpenTelemetry: If the system emits telemetry or OTEL traces, inspect them after test flows. Verify: traces are emitted for the expected operations, spans have correct attributes, no error spans where success is expected.
- Database state: Query the database directly to verify mutations wrote correct values — especially when the API or UI reports success but the actual persistence could differ.
- Background jobs / queues: If the feature triggers async work (queues, cron, webhooks), verify the jobs were enqueued and completed correctly.
General testing approach:
- Start from a clean state (no cached data, fresh session).
- Walk the happy path first — end-to-end as the spec describes.
- Then break it — try every failure mode you identified.
- Then stress it — boundary conditions, unexpected inputs, concurrent access.
- Then look at it — visual correctness, usability, "does this feel right?"
Assertion depth — proving state changes, not just observing them:
Do not just confirm the page loaded or the action completed. For each verification that involves a state change (mutation, navigation, form submission, modal open/close), apply these disciplines:
- Two independent signals per assertion. Check at least two independent signals to confirm the state change. Examples: URL changed AND new content appeared. Item was added to the list AND the count updated. Form submitted AND confirmation email appeared in the test inbox. A single signal is susceptible to coincidence — two signals make false positives dramatically less likely.
- Structured evidence over visual inspection. When using browser automation, prefer returning structured evidence from Playwright calls over visually inspecting screenshots:
Structured evidence is faster (no vision processing), cheaper (no image tokens), and produces auditable results in qa-progress.json notes.// Good — structured, auditable, machine-parseable return { url: page.url(), title: await page.title(), itemCount: await page.locator('.item').count(), visible: await page.locator('.success-toast').isVisible() }; // Weak for assertions — requires vision processing, not auditable // (Screenshots are still valuable as PR evidence in Step 6b — this is about pass/fail verification) await page.screenshot({ path: '/tmp/check.png' });
These disciplines apply to state-change verifications — not to trivial checks like "page loaded" or "element exists."
Self-healing for browser scenarios (healer loop)
When a browser script fails, classify the failure before acting:
| Failure type | Signals | Action |
|---|---|---|
| Selector drift | TimeoutError waiting for element, element not found, wrong element clicked |
Re-explore with getPageStructure(), fix selectors, retry (max 2 retries) |
| Timing issue | Race condition, element not yet visible, network not settled | Add waitForSelector / waitForLoadState('domcontentloaded'), retry |
| App bug | Element exists but shows wrong content, wrong status code, console errors, unexpected redirect | Do NOT retry — report the failure with evidence |
| Environment issue | Connection refused, DNS failure, auth expired | Report as blocked, not failed |
Healer loop (max 2 iterations):
- Script fails → read the error message
- Classify: selector/timing vs app bug vs environment
- If selector/timing:
a. Run
getPageStructure(page, { interactiveOnly: true })on the current page state b. Compare observed elements to what the script expected c. Rewrite the failing portion of the script with corrected selectors d. Re-run the script e. If it fails again with the same class of error → report asfailedwith note: "Healer: retried 2x, selector/timing issue persists — may be an app bug" - If app bug: report immediately with evidence (console errors, network failures, screenshots) — do NOT retry
- If environment issue: report as
blockedwith the specific error
Key principle: Retrying an app bug wastes time and masks the real problem. Only retry when the test script is wrong, not the app.
Evidence-justified retries: Each retry must be justified by new evidence — a fresh page structure showing different elements, a corrected selector, a changed page state. Never re-run the same failing action unchanged hoping for a different result.
Step 6b: Screenshot evidence
After testing, capture a screenshot of every UI screen affected by the code changes. Create the directory if needed (mkdir -p tmp/ship/screenshots), then save to tmp/ship/screenshots/<descriptive-slug>.png using Playwright's page.screenshot({ path: ..., fullPage: true }).
If you fix a bug that changes a previously screenshotted screen, retake the screenshot — overwrite the same file. Screenshots must reflect the final state of the code, not intermediate states.
These screenshots are evidence of the tested state. /pr includes them in the PR body when the developer creates the PR.
Challenge blocked scenarios via /debug subprocess
Before finalizing any scenario as blocked, challenge the assumption with a fresh perspective. Spawn a nested Claude Code instance (via the /nest-claude subprocess pattern) that loads /debug with --headless to independently investigate whether the scenario is actually untestable.
Hardcoded short-circuit — do this before any /debug challenge
If the proposed blocked reason mentions any of: "browser", "Playwright", "headless", "needs dev server", "requires UI", "unrunnable in headless" — the scenario is not a blocking candidate. These describe the primary use cases for /browser and the dev-server bootstrap step. Do one of the following before considering blocked:
- If
/browseris not loaded: load it now and re-execute the scenario. - If the dev server is not running: start it (
npm run devor the equivalent command frompackage.jsonscripts/ CLAUDE.md / Makefile) and re-execute. - Only after both are confirmed —
/browserloaded AND dev server responding — and the scenario still cannot be verified, proceed to the/debugchallenge below.
Scenarios that genuinely warrant blocked status: hard auth walls (hCaptcha, SMS MFA, WebAuthn), external services with no local substitute and no sandbox mode, or runtime failures where Chromium literally cannot launch in this environment despite /browser loading cleanly. These are narrow — verify you're in one of them before blocking.
When to challenge: Every scenario that would be marked blocked after the short-circuit above — no exceptions. The cost (~2-5 minutes per scenario) is proportional to the number of blocked scenarios, which should be small. A falsely-blocked scenario that a human later has to verify manually costs far more.
Subprocess instructions:
- Spawn a nested Claude subprocess.
- The subprocess loads
/debugwith--headless. - Provide it with:
- The scenario (id, name, userOutcome, given/when/then)
- The reason you believe the scenario is blocked
- The project root path
- The subprocess investigates independently — it starts from first principles with no bias from your prior assumptions:
- Challenges every assumption about why the scenario is blocked
- Probes the project's test framework capabilities (spy/mock, integration configs), available API keys (current env,
.envfiles), and any other capabilities relevant to this specific project and ecosystem - Attempts to write and run a test, or find an alternative verification path
- If it discovers a bug preventing the test, it fixes it
- Parse the subprocess result:
- If it found a way to verify the scenario: update the scenario to
validatedwith the evidence from the debug investigation. Credit the investigation:"resolvedBy": "debug-challenge"innotes. - If it confirmed the scenario is genuinely blocked: keep
blockedwith the debug analysis as evidence innotes. The investigation trail proves all avenues were exhausted — downstream consumers can see why it's blocked, not just that it's blocked.
- If it found a way to verify the scenario: update the scenario to
Why a subprocess, not inline investigation: The /qa agent has already formed assumptions about why the scenario is blocked. A fresh context (clean child, no conversation history) forces the investigation to start from scratch. /debug's systematic methodology (Triage → Reproduce → Investigate → Classify → Report) ensures thorough investigation rather than confirming the prior assumption.
Step 7: Record results
When tmp/ship/ exists: After each scenario (or batch), update the scenario's status, verifiedVia, notes, and evidence in qa-progress.json. Set verifiedVia to the fidelity level from Step 6 (browser, api, or shell) that reflects how the scenario was actually executed. If multiple levels were used (e.g., browser flow + database state check), record the highest. Do not touch the PR body — a downstream consumer will render it.
Evidence recording (mandatory for every validated/failed scenario): Populate evidence[] with at least one structured proof item that demonstrates what was checked and what was observed. Match evidence type to verification method:
- Browser scenarios:
{type: "video", url: "..."}from Bunny Stream upload, and/or{type: "screenshot", url: "..."}from CDN or local path - API/shell scenarios:
{type: "assertion", check: "file_exists", expected: "plugins/shared/skills/audit/SKILL.md", actual: "exists", pass: true}or{type: "command", cmd: "readlink plugins/eng/skills/audit", stdout: "../../shared/skills/audit", expected: "../../shared/skills/audit", pass: true} - Mixed scenarios: include multiple evidence items (e.g., an assertion + a screenshot)
Evidence makes results auditable — a downstream agent or human can verify the claim without re-executing. Structured assertions are cheap to produce (you already ran the check) and machine-parseable. An empty evidence[] on a validated or failed scenario is a defect.
When tmp/ship/ does not exist: Update the ## Verification section in the PR body directly using the same read → modify → write mechanism from Step 5. Include the fidelity level in the checklist item (e.g., [browser], [api]).
When you find a bug:
When --report-only is active: Record findings in qa-progress.json with status: "failed" and detailed notes describing the bug (symptoms, suspected root cause, affected area, reproduction steps), but do not edit source files, do not load /debug, and do not enter the fix loop. If the bug was discovered outside any planned scenario, add a new scenario to scenarios[] with the next sequential ID and mark it failed with descriptive notes.
When --report-only is NOT active:
First, assess: do you see the root cause, or just the symptom?
-
Root cause is obvious (wrong variable, missing class, off-by-one visible in the code) — fix it directly. Verify. Document.
Regression test:
- Write one WHEN the bug is in application code with adjacent existing tests AND a small test would reliably fail before the fix and pass after. Load
/tddfirst for test-design rules (commit failing test first for bug fixes, mocking at boundaries, mock-tautology prevention, flakiness handling). - Skip (document the coverage gap in the scenario notes) WHEN the bug is UI-visual-only (the QA scenario itself is the regression check), the affected area has no existing test precedent, or the root cause is in a third-party dep, config file, or build pipeline.
- Write one WHEN the bug is in application code with adjacent existing tests AND a small test would reliably fail before the fix and pass after. Load
-
Root cause is unclear (unexpected behavior, cause not visible from the symptom) — load
/debugskill for systematic root cause investigation before attempting a fix. If QA is running in headless mode, pass--headlessto/debugso it iterates freely without per-action permission gates./debugreturns structured findings (root cause, recommended fix, blast radius) — apply the fix based on its findings, then resume QA.
After fixing a bug, record it: update the scenario's status to validated and put the bug description + fix in notes (e.g., "found stale cache; added cache-bust on logout"). If the bug was discovered outside any planned scenario — while navigating between tests or doing exploratory poking — add a new scenario to scenarios[] with the next sequential ID, describe what you found and fixed, and mark it validated with the fix in notes.
Fix-loop self-regulation (cumulative risk score):
When --report-only is active, skip this entire section. No fixes means no risk tracking.
Track a cumulative risk score across all fixes in the QA session (bug fixes AND fixable gap fixes from Step 5b). Persist the risk state in qa-progress.json — read before each fix, write after:
{
"fixLoopState": {
"riskScore": 15,
"fixCount": 12,
"reverts": 0
}
}
Risk increments:
Start at 0%
Each revert: +15% (strongest signal — you undid your own work)
Each fix touching >3 files: +5% (blast radius growing)
Touching unrelated files: +10% (scope creep)
After fix 15: +1% per additional fix (fatigue ramp)
Threshold: STOP fixing at ≥30%
Hard cap: 50 fixes per QA session
- Before each fix: read
fixLoopStatefrom qa-progress.json. After each fix: updateriskScoreandfixCount, write back. - When the threshold is hit, stop fixing. Document remaining issues in qa-progress.json (set status to
"failed", notes explaining the risk score stopped further fixes). - Regression test commits are excluded from the heuristic — writing a test for a fix does not increment the risk score.
- Continue executing remaining test scenarios (read-only observation) even after the fix cap is reached — you can still discover and document issues, just not fix them.
Test suite gap discovery:
When --report-only is active, skip this section. No tests are written in report-only mode — document the gap in the scenario's notes instead (e.g., "missing unit test for session invalidation — recommend adding coverage").
During execution, you may discover behaviors that should have formal test coverage but don't — an edge case with no unit test, a behavior path with no integration test, an untested integration point. Default: document the gap in the scenario's notes as a coverage recommendation (e.g., "missing unit test for session invalidation — recommend adding coverage"). QA's primary job is scenario verification, not chasing coverage — surfacing gaps is valuable even when filling them isn't in scope, and documented gaps flow to /pr as follow-up items.
Exception — write the test only WHEN all of these hold:
- The gap is directly adjacent to a bug you just fixed (pair the test with the fix).
- The test can be written at the tier of existing nearby tests (no new test infrastructure required).
- Writing takes under 5 minutes.
When writing under the exception, load /tdd for test-design rules — tier selection, mocking philosophy, flakiness handling, mock-tautology prevention, test-artifact protection, spec-grounded authoring. Record the test in the scenario's notes alongside the bug fix notes (e.g., "also wrote unit test for session invalidation — no existing coverage").
Step 8: Report and teardown
When tmp/ship/ exists: As a final action before reporting:
-
Write
qaCompletedAtCommitto qa-progress.json with the current HEAD commit hash (git rev-parse HEAD). This marks the boundary between QA and post-QA changes for staleness detection. -
Compute and write
executionSummarytoplanMetadata:"executionSummary": { "validated": 12, "failed": 1, "blocked": 2, "planned": 0 }Count scenarios by status. This saves every downstream consumer from scanning the full scenario array.
-
Compute and write
verdictat the top level of qa-progress.json. Verdict severity is derived from scenario properties (category,source), not from planner-assigned priority labels. No failure ever produces a clean"go":- Any scenario with
source: "journey"failedorblocked→"no-go"(compositional user path broken — integration seams failing) - Any scenario with
category: "ux-flow"tracing to a core user journeyfailedorblocked→"no-go"(happy path broken) - Any scenario involving data-loss or security-sensitive paths
failedorblocked→"no-go"(safety-critical failure) - Any other scenario
failedorblocked→"conditional"(feature works but has documented issues — human decides whether to proceed) - All
validated→"go"
Write the verdict alongside the inputs that produced it (which scenarios triggered the verdict and why) so consumers can both use the pre-computed verdict and verify the computation.
- Any scenario with
The JSON file is your report. A downstream consumer will render it to the PR body. Report completion to the invoker.
When tmp/ship/ does not exist and a PR exists: The ## Verification section in the PR body is your report. Ensure it's up-to-date with all results. Do not add a separate PR comment.
No PR exists: Report directly to the user with:
- Total scenarios tested vs. passed vs. failed vs. blocked
- Bugs found and fixed (with brief description of each)
- Gaps — what could NOT be tested due to tool limitations or environment constraints
- Judgment call — your honest assessment: is this feature ready for human review?
The skill's job is to fix what it can, document what it found, and hand back a clear picture. Unresolvable issues and gaps are documented, not silently swallowed — but they do not block forward progress. The invoker decides what to do about remaining items.
Teardown (mandatory): After reporting, tear down everything bootstrapped in Step 2. Kill dev server, stop Docker containers (docker compose down -v), clean fixture data, remove temporary files in tmp/. Tear down in reverse order of bootstrap. Consult bootstrapResult.teardownRequired for the full list. Leave the environment as it was found.
QA execution boundaries
The one hard boundary: no mutations to cloud/external systems. Everything else is fair game locally. Exhaust all local options before marking anything as unverifiable.
In bounds — exhaust these (the full local arsenal):
- Browser automation (Playwright via
/browser) — navigate, click, fill forms, inspect console/network, record video, audit accessibility, test responsive layouts - Docker containers — spin up databases, caches, queues, mock services via
docker compose up -d. Tear them down when done. - Local dev servers — install deps, start the server, seed the database, start workers
- Shell scripts, REPL sessions, consumer-perspective import scripts
- API calls to locally-running endpoints (
curl localhost:...) - Ad-hoc verification scripts in
tmp/— write, run, delete - Temporary test servers, fixture files, seed data
- Environment variable manipulation, feature flag toggling
- Running the project's own test suite, linters, typecheckers
- Database queries to verify mutation correctness
- Log/telemetry inspection during test flows
- Simulating external service responses locally (MSW, nock, VCR, intercepting HTTP)
- Testing webhook handlers by sending simulated payloads to localhost
- Verifying outbound HTTP calls are correctly formed (intercept and assert, don't send)
Out of bounds (the hard boundary):
- Requests to production URLs, staging environments, or live third-party APIs
- Operations that trigger billing, metering, or quota consumption
- Sending real emails, Slack messages, webhooks to external services
- Accessing production databases or customer data
- Any mutation to external/cloud systems (POST/PUT/DELETE to non-localhost)
When a scenario requires an external service:
- First, check if there's a local substitute: Docker image, local emulator (e.g., LocalStack for AWS, Stripe CLI for webhooks), mock server, or simulated payload
- If a local substitute exists — use it. This counts as real verification.
- If no local substitute exists — verify as much as possible locally (e.g., verify the webhook handler responds correctly to a simulated payload), then mark the scenario
status: "blocked"with notes describing what you verified locally and what a human needs to verify in staging/production blockedscenarios flow to/pras pending human verification items — they are NOT silently dropped
blocked is the safety net, not the first resort. A scenario should only be blocked after you've exhausted Docker, local emulators, simulated payloads, and intercepted requests. If you can verify 80% of the scenario locally and only the final external handoff needs human eyes, describe the 80% you verified and the 20% the human needs to check. Every blocked scenario must include what was attempted.
Calibrating depth
The depth comes from the qa-progress.json plan — execute every scenario in it. The plan was written to be maximally ambitious; your job is to verify every scenario at the highest achievable fidelity.
Testing breadth scales with affected surfaces. When standalone (no qa-progress.json), calibrate how many scenarios to test based on the scope of changes: a single-file bug fix warrants 1-2 targeted scenarios plus a regression check; a multi-file feature touching several surfaces warrants scenarios for each affected surface plus edge cases and error paths. Don't apply a fixed number — let the number of affected surfaces drive the count.
Under-testing looks like (these are the failures to avoid):
- Declaring confidence from unit tests alone when the feature has user-facing surfaces
- Claiming a scenario is "covered by tests" when those tests mock the service boundary
- Never opening a browser when the feature has a UI —
/browseris always loadable and runs headless Playwright by default - Skipping error-path testing
- Not testing the interaction between new and existing code
- Not checking database state after mutations
- Not spinning up Docker when docker-compose.yml exists and would give you a real database
- Marking scenarios as
blockedwithout first loading/browser, starting the dev server, spinning up Docker, or trying local emulators, simulated payloads, and ad-hoc scripts
Anti-patterns
- Treating QA as a checkbox. "I tested it" means nothing without specifics. Every scenario must have a concrete action and expected outcome.
- Only testing the happy path. Real users encounter errors, edge cases, and unexpected states. Test those.
- Silent gaps. If you can't test something, say so explicitly. An undocumented gap is worse than a documented one.
- Confusing orchestration headless mode with browser unavailability. Ship's
--headlessflag and container detection enter orchestration autonomy — no user gates, no negotiation checkpoints. They do NOT mean "the environment cannot run a browser."/browser's Playwright engine runs headless Chromium by default, which is exactly what--headlessship runs need. Load/browser, run the scenarios. - Marking scenarios as
blockedfor "needs browser", "needs Playwright", "needs dev server", "requires UI", or "unrunnable in headless". These are the primary use cases for/browser+ the Step 2 bootstrap — not valid blocked reasons. If you wrote one of these as a blocked reason, stop: you haven't actually attempted to load/browseror start the dev server. Do that first, then re-evaluate. - Using Peekaboo or Claude-in-Chrome for web testing.
mcp__peekaboo__*andmcp__claude-in-chrome__*tools are NOT for QA web page testing. Use/browser(Playwright). Peekaboo is for OS-level macOS automation only. Chrome extension is for ad-hoc, user-directed browser tasks outside QA.
More from inkeep/team-skills
cold-email
Generate cold emails for B2B personas. Use when asked to write cold outreach, sales emails, or prospect messaging. Supports 19 persona archetypes (Founder-CEO, CTO, VP Engineering, CIO, CPO, Product Directors, VP CX, Head of Support, Support Ops, DevRel, Head of Docs, Technical Writer, Head of Community, VP Growth, Head of AI, etc.). Can generate first-touch and follow-up emails. When a LinkedIn profile URL is provided, uses Crustdata MCP to enrich prospect data (name, title, company, career history, recent posts) for deep personalization.
54spec
Drive an evidence-driven, iterative product+engineering spec process that produces a full PRD + technical spec (often as SPEC.md). Use when scoping a feature or product surface area end-to-end; defining requirements; researching external/internal prior art; mapping current system behavior; comparing design options; making 1-way-door decisions; negotiating scope; and maintaining a live Decision Log + Open Questions backlog. Triggers: spec, PRD, proposal, technical spec, RFC, scope this, design doc, end-to-end requirements, scope plan, tradeoffs, open questions.
54ship
Orchestrate any code change from requirements to review-ready branch — scope-calibrated from small fixes to full features. Composes /spec, /implement, and /research with depth that scales to the task: lightweight spec and direct implementation for bug fixes and config changes, full rigor for features. Produces tested, locally reviewed, documented code on a feature branch. The developer pushes the branch and creates the PR. Use for ALL implementation work regardless of perceived scope — the workflow adapts depth, never skips phases. Triggers: ship, ship it, feature development, implement end to end, spec to PR, implement this, fix this, let's implement, let's go with that, build this, make the change, full stack implementation, autonomous development.
52docs
Write or update documentation for engineering changes — both product-facing (user docs, API reference, guides) and internal (architecture docs, runbooks, inline code docs). Builds a world model of what changed and traces transitive documentation consequences across all affected surfaces. Discovers and uses repo-specific documentation skills, style guides, and conventions. Standalone or composable with /ship. Triggers: docs, documentation, write docs, update docs, document the changes, product docs, internal docs, changelog, migration guide.
52implement
Convert SPEC.md to spec.json, craft the implementation prompt, and execute the iteration loop via subprocess. Use when converting specs to spec.json, preparing implementation artifacts, running the iteration loop, or implementing features autonomously. Triggers: implement, spec.json, convert spec, implementation prompt, execute implementation, run implementation.
52write-agent
Design and write high-quality Claude Code agents and agent prompts. Use when creating or updating .claude/agents/*.md for (1) single-purpose subagents (reviewers, implementers, researchers) and (2) workflow orchestrators (multi-phase coordinators like pr-review, feature-development, bug-fix). Covers delegation triggers, tool/permission/model choices, Task-tool orchestration, phase handoffs, aggregation, iteration gates, and output contracts. Also use when deciding between subagents vs skills vs always-on repo guidance.
50