# autonomous-tests
## Dynamic Context

- Args: $ARGUMENTS
- Branch: !`git branch --show-current`
- Unstaged: !`git diff --stat HEAD 2>/dev/null | tail -5`
- Staged: !`git diff --cached --stat 2>/dev/null | tail -5`
- Commits: !`git log --oneline -5 2>/dev/null`
- Docker: !`docker compose ps 2>/dev/null | head -10 || echo "No docker-compose found"`
- Config: !`test -f .claude/autonomous-tests.json && echo "YES" || echo "NO -- first run"`
- Agent Teams: !`python3 -c "import json;s=json.load(open('$HOME/.claude/settings.json'));print('ENABLED' if s.get('env',{}).get('CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS')=='1' else 'DISABLED')" 2>/dev/null || echo "DISABLED -- settings not found"`
## Role

Project-agnostic autonomous E2E test runner. Exercise features against the live LOCAL stack, verify state at every step, produce documentation, and never touch production.
## Arguments: $ARGUMENTS

| Arg | Meaning |
|---|---|
| (empty) | Default: working-tree (staged + unstaged) with smart doc analysis |
| `staged` | Staged changes only |
| `unstaged` | Unstaged changes only |
| `N` (number) | Last N commits only (e.g., `1` = last commit, `3` = last 3) |
| `working-tree` | Staged + unstaged changes (same as default) |
Smart doc analysis is always active: identify which docs/ files are relevant to the changed code by path, feature name, and cross-references — read only those, never all docs.
Print resolved scope, then proceed without waiting.
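The scope resolution can be sketched as a small mapping. This is illustrative only: the exact git flags below are my assumptions about reasonable diff commands, not commands mandated by this skill.

```python
# Sketch: map the $ARGUMENTS scope to a git command that lists changed files.
# The specific flags are illustrative assumptions, not the skill's required calls.
def resolve_scope(arg: str) -> str:
    if arg in ("", "working-tree"):
        return "git diff --name-only HEAD"           # staged + unstaged
    if arg == "staged":
        return "git diff --name-only --cached"       # staged only
    if arg == "unstaged":
        return "git diff --name-only"                # unstaged only
    if arg.isdigit():                                # last N commits
        return f"git diff --name-only HEAD~{arg} HEAD"
    raise ValueError(f"unknown scope: {arg!r}")
```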
## Phase 0 — Configuration

### Step 0: Prerequisites Check
Read `~/.claude/settings.json` and check two things:

1. Agent teams feature flag: verify `env.CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS` is `"1"`. If missing or not `"1"`, STOP and tell the user: "Agent teams are required for this skill but not enabled. Run: `bash <skill-dir>/scripts/setup-hook.sh`. This enables the `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS` flag and the ExitPlanMode approval hook in your settings." Do not proceed until the flag is confirmed enabled.
2. ExitPlanMode hook (informational): if the `PreToolUse` → `ExitPlanMode` hook is not present, inform the user: "The `ExitPlanMode` approval hook ensures test plans require your approval before execution (even in `dontAsk` mode). This skill includes it as a skill-scoped hook, so it works automatically during `/autonomous-tests` runs. To also enable it globally, the setup script above already handles it." Then continue — do not block on this.
### Step 1: Detect Existing Config

Run `test -f .claude/autonomous-tests.json && echo "CONFIG_EXISTS" || echo "CONFIG_MISSING"` in Bash. Schema reference: `references/config-schema.json`.
If output is `CONFIG_EXISTS` (returning run):

- Read `.claude/autonomous-tests.json`
- Validate config version: check that `version` equals `3` and that the required fields (`project`, `database`, `testing`) exist. If validation fails, warn the user and re-run the first-run setup below instead.
- Verify config trust: compute a SHA-256 hash of the config content (excluding the `_configHash` and `lastRun` fields) by running: `python3 -c "import json,hashlib;d=json.load(open('.claude/autonomous-tests.json'));[d.pop(k,None) for k in ('_configHash','lastRun')];print(hashlib.sha256(json.dumps(d,sort_keys=True).encode()).hexdigest())"`. Then check whether this hash exists in the trust store at `~/.claude/trusted-configs/` (the trust file is named after a hash of the project root: `python3 -c "import hashlib,os;print(hashlib.sha256(os.path.realpath('.').encode()).hexdigest()[:16])"` + `.sha256`). If the trust file is missing or its content doesn't match the computed hash, the config has not been approved by this user: show the full config to the user and ask for explicit confirmation before continuing. If confirmed, write the new hash to the trust store file (`mkdir -p ~/.claude/trusted-configs/` first). This prevents a malicious config committed to a repo from bypassing approval, since the trust store lives outside the repo in the user's home directory.
- Re-scan for new services and update the config if needed
- Get the current UTC time by running `date -u +"%Y-%m-%dT%H:%M:%SZ"` in Bash, then update `lastRun` with that exact value (never guess the time)
- If `userContext` is missing or all its arrays are empty, run the User Context Questionnaire below once, then save the answers to the config
- Skip to Phase 1 — do NOT run the first-run steps below
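Expanded from the one-liners above into a readable sketch (same fields and paths as described; error handling kept minimal):

```python
import hashlib
import json
import os

def config_fingerprint(config: dict) -> str:
    """SHA-256 of the config with the volatile fields removed, as described above."""
    d = dict(config)
    for k in ("_configHash", "lastRun"):
        d.pop(k, None)
    return hashlib.sha256(json.dumps(d, sort_keys=True).encode()).hexdigest()

def trust_file_path(project_root: str) -> str:
    """Trust file named after a short hash of the resolved project root."""
    h = hashlib.sha256(os.path.realpath(project_root).encode()).hexdigest()[:16]
    return os.path.expanduser(f"~/.claude/trusted-configs/{h}.sha256")

def is_trusted(config: dict, project_root: str = ".") -> bool:
    """True only if the stored hash matches the current config content."""
    path = trust_file_path(project_root)
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return f.read().strip() == config_fingerprint(config)
```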
If output is `CONFIG_MISSING` (first run only):

- Auto-extract from CLAUDE.md files + compose files + env files + package manifests
- Detect project topology — set `project.topology` to one of:
  - `single` — one repo, one project
  - `monorepo` — one repo, multiple packages (detected via: workspace configs like `lerna.json`, `nx.json`, `turbo.json`, `pnpm-workspace.yaml`; multiple `package.json` files in subdirectories; or conventional directory structures like `backend/` + `frontend/`, `server/` + `client/`, `api/` + `web/`, `packages/`)
  - `multi-repo` — separate repos that work together as a system (detected via: CLAUDE.md references to other paths, sibling directories with their own `.git`, shared docker-compose networking, cross-repo API URLs like `localhost:3000` called from another project)
- Discover related projects — scan for sibling directories with `.git` or `package.json`, and grep CLAUDE.md and compose files for paths outside the project root. For each candidate found, ask the user: "Is `{path}` part of this system? What is its role?" Populate the `relatedProjects` array with confirmed entries.
- User Context Questionnaire — present all questions at once, accept partial answers:
  - Any known flaky areas or intermittent failures?
  - Test user credentials to use (reference env var names or role names, never raw values)?
  - Any specific testing priorities or focus areas?
  - Any additional notes for the test runner?

  Store answers in the `userContext` section of the config.
- Propose config → STOP and wait for user to confirm → write config
- Stamp config trust: after writing, compute the config hash with `python3 -c "import json,hashlib;d=json.load(open('.claude/autonomous-tests.json'));[d.pop(k,None) for k in ('_configHash','lastRun')];print(hashlib.sha256(json.dumps(d,sort_keys=True).encode()).hexdigest())"` and write the result to the trust store at `~/.claude/trusted-configs/{project-hash}.sha256` (create the directory if needed). This marks the config as user-approved in a location outside the repo that cannot be forged by a committed file.
- If the project CLAUDE.md is under 140 lines and lacks startup instructions, append at most 10 lines.
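The single-vs-monorepo heuristics above can be sketched as follows. This is illustrative only; multi-repo detection (which needs CLAUDE.md and compose-file inspection) is omitted here.

```python
import os

# Markers taken from the detection list above.
WORKSPACE_MARKERS = {"lerna.json", "nx.json", "turbo.json", "pnpm-workspace.yaml"}
CONVENTIONAL_PAIRS = [("backend", "frontend"), ("server", "client"), ("api", "web")]

def detect_topology(root: str) -> str:
    """Heuristic sketch: classify a checkout as 'single' or 'monorepo'."""
    entries = set(os.listdir(root))
    if entries & WORKSPACE_MARKERS or "packages" in entries:
        return "monorepo"
    if any(a in entries and b in entries for a, b in CONVENTIONAL_PAIRS):
        return "monorepo"
    # multiple package.json manifests one level down also indicate a monorepo
    nested = [e for e in entries
              if os.path.isfile(os.path.join(root, e, "package.json"))]
    return "monorepo" if len(nested) > 1 else "single"
```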
## Phase 1 — Safety

ABORT if any production indicators are found in `.env` files: `sk_live_`, `pk_live_`, `*LIVE*SECRET*`, `NODE_ENV=production`, production DB endpoints (RDS, Atlas without dev/stg/test), or non-local API URLs. Show the variable NAME only, never the value. Run the `sandboxCheck` commands from the config. Verify Docker is local.
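As an illustration of the name-only rule, a scan over `.env` lines might look like this (the patterns mirror the list above; a real scan should also cover DB endpoints and non-local URLs):

```python
import re

# Indicator patterns drawn from the safety list above (not exhaustive).
PRODUCTION_PATTERNS = [
    r"sk_live_",
    r"pk_live_",
    r"LIVE.*SECRET",
    r"^NODE_ENV=production$",
]

def scan_env_lines(lines):
    """Return the NAMES of flagged variables -- never their values."""
    flagged = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0]
        if any(re.search(p, line) for p in PRODUCTION_PATTERNS):
            flagged.append(name)  # value is deliberately discarded
    return flagged
```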
## Phase 2 — Service Startup

For each service in the config and each related project with a `startCommand`: health check → if unhealthy, start + poll for 30s → if still unhealthy, STOP for user guidance. Start webhook listeners in the background. Tail logs for errors during execution.
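The startup loop can be sketched with injected callables standing in for the real health check and start command (e.g., wrappers around curl or `docker compose up`); returning False signals the STOP-for-guidance case:

```python
import time

def wait_healthy(check, start, timeout_s=30, interval_s=2):
    """Sketch of the Phase 2 loop: health check -> start -> poll up to 30s.

    `check` returns True when the service is healthy; `start` launches it.
    Returns True once healthy, False to signal STOP for user guidance.
    """
    if check():
        return True           # already up, nothing to do
    start()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False              # still unhealthy after the polling window
```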
## Phase 3 — Autonomous Feature Identification & Discovery
All identification is fully autonomous — derive everything from the code diff and codebase. Never ask the user what to test.
- Get changed files from git based on the scope arguments — include related projects (`relatedProjects[].path`) when tracing cross-project dependencies (e.g., a backend API change that affects webapp pages)
- Read every changed file. For each, build a feature map:
  - API endpoints affected (routes, controllers, handlers)
  - Database operations (queries, writes, schema changes, index usage)
  - External service integrations (webhooks, SDK calls, third-party APIs)
  - Business logic and validation rules
  - Authentication/authorization flows touched
  - Signal/event chains (pub/sub, queues, outbox patterns)
- Trace the full dependency graph: callers → changed code → callees. Follow imports across files and project boundaries to understand the complete blast radius.
- Smart doc analysis: identify docs relevant to the changed code by matching file paths, feature names, endpoint references, and `testing.contextFiles` entries. Scan the `docs/` tree but read only relevant files — never read all docs indiscriminately. Skip purely historical or unrelated docs.
- Produce a Feature Context Document (kept in memory, not written to disk) summarizing: all features touched, all endpoints, all DB collections/tables affected, all external services involved, and all edge cases identified from reading the code (error handlers, validation branches, race conditions, retry logic). This document is cascaded to every agent in Phase 5.
## Phase 4 — Test Plan (Plan Mode)

Enter plan mode (use `/plan`). The plan MUST start with a "Context Reload" section as Step 0 containing:

- An instruction to re-read this skill file (the SKILL.md that launched this session)
- An instruction to read the config: `.claude/autonomous-tests.json`
- An instruction to read the templates: the `references/templates.md` file from this skill
- The resolved scope arguments: `$ARGUMENTS`
- The current branch name and commit range being tested
- Any related project paths involved
- Key findings from Phase 3 (affected modules, endpoints, dependencies)
- The `userContext` from the config (flaky areas, testing priorities, notes)
- The credential assignment plan for agent teams (see Phase 5)

This ensures that when context is cleared after plan approval, the executing agent can fully reconstruct the session state.
Then design test suites covering all of the following categories:

- Happy path — normal expected flows end-to-end
- Invalid inputs & validation — malformed data, missing fields, wrong types, boundary values
- Duplicate/idempotent requests — send the same API call 2-3 times rapidly; verify no duplicate DB records, no double charges, no duplicate side-effects (emails, webhooks, events)
- Error handling — trigger every error branch visible in the diff (network failures, invalid state transitions, auth failures, permission denials)
- Unexpected database changes — verify no orphaned records, no missing references, no unintended field mutations, no index-less slow queries on new fields
- Race conditions & timing — concurrent writes to the same resource, out-of-order webhook delivery, expired tokens mid-flow
- Security — auth bypass attempts, injection inputs, privilege escalation, data leakage between users
- Edge cases from code reading — every `if/else`, `try/catch`, guard clause, and fallback in the changed code should have at least one test targeting it
- Regression — existing unit tests if configured, plus re-verification of any previously broken flows

Each suite needs: name, objective, pre-conditions, steps with expected outcomes, teardown, and explicit verification queries (DB checks, log checks, API response checks). Wait for user approval.
## Phase 5 — Execution (Agent Teams)

Use TeamCreate to create a test team. Spawn general-purpose agents as teammates — one per approved suite. Always use `model: "opus"` when spawning agents (Opus 4.6 has adaptive reasoning/thinking built in — no budget to configure; it thinks as deeply as needed automatically). Coordinate via TaskCreate/TaskUpdate and SendMessage.
Credential sharing — CRITICAL: Assign each agent a distinct test credential from `userContext.testCredentials` to prevent session conflicts (e.g., one agent logging in invalidates another's token). If only one credential exists, run agents sequentially — never in parallel with shared auth. Include the credential assignment in each agent's task description.
Cascading context — CRITICAL: Every agent MUST receive the full Feature Context Document from Phase 3 in its task description. This includes: all features touched, all endpoints, all DB collections affected, all external services involved, and all identified edge cases. Agents need this complete picture to understand cross-feature side-effects (e.g., testing endpoint A may break endpoint B's state).
Anomaly detection: Each agent must actively watch for and report:
- Duplicate records created by repeated API calls
- Unexpected DB field changes outside the tested operation
- Warning/error log entries that appear during test execution
- Slow queries or missing indexes (check `docker logs` and DB explain plans)
- Orphaned or inconsistent references between collections/tables
- Auth tokens or sessions behaving unexpectedly (expired mid-flow, leaked between users)
- Any response field or status code that differs from what the code intends
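A duplicate-record check of the kind listed above can be sketched like this; `call` and `count_records` are hypothetical hooks standing in for the real API call and DB count query:

```python
def check_idempotency(call, count_records, times=3):
    """Sketch: fire the same request several times, then verify the record
    count grew by at most one. `call` repeats one API request; `count_records`
    is a stand-in for the DB verification query."""
    before = count_records()
    for _ in range(times):
        call()
    created = count_records() - before
    return {"created": created, "idempotent": created <= 1}
```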
Execution flow:
- Create tasks for each suite via `TaskCreate` — include: env details from the config, exact test steps, verification queries, teardown instructions, the full Feature Context Document, and the assigned credential
- Assign tasks to agents via `TaskUpdate` with `owner`
- If credentials allow (distinct per agent), agents may run in parallel; otherwise run sequentially — wait for each to complete before starting the next
- Report PASS/FAIL after each suite completes, including any anomalies detected
- After all suites complete, shut down teammates via `SendMessage` with `type: "shutdown_request"`
## Phase 6 — Fix Cycle
- Runtime-fixable (env var, container, stuck job): fix → re-run affected suite → max 3 cycles
- Code bug: document with full context (file, line, expected vs actual) → ask user before proceeding
## Phase 7 — Documentation

Generate docs in the directories from the config (create them if needed). Get the filename timestamp by running `date -u +"%Y-%m-%d-%H-%M-%S"` in Bash (never guess the time). Filename pattern: `{timestamp}_{semantic-name}.md`. Read `references/templates.md` for the exact output structure of each file type before writing.
Generate up to four doc types based on findings:

- `test-results`: Always generated. Full E2E results with pass/fail per suite.
- `pending-fixes`: Generated when code bugs or infrastructure issues are found.
- `pending-guided-tests`: Generated when tests need browser/visual/physical-device interaction.
- `pending-autonomous-tests`: Generated when automatable tests were identified but not run (time/scope/dependency constraints).
On re-runs: if docs exist for this feature + date → append a "Re-run" section instead of duplicating.
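The filename pattern can be sketched as follows (in the skill itself the timestamp must come from `date -u`, never from a guessed clock; the slug rule here is an illustrative assumption):

```python
import re
from datetime import datetime, timezone

def doc_filename(semantic_name, now=None):
    """Build {timestamp}_{semantic-name}.md from a real UTC clock reading."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%Y-%m-%d-%H-%M-%S")
    # slugify the semantic name: lowercase, non-alphanumerics collapsed to "-"
    slug = re.sub(r"[^a-z0-9-]+", "-", semantic_name.lower()).strip("-")
    return f"{ts}_{slug}.md"
```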
## Phase 8 — Cleanup

Remove only test data created during this run (identified by `testDataPrefix` from the config). Never touch pre-existing data. Log every action. Verify cleanup with a final DB query.
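Over an in-memory stand-in for a collection, the cleanup contract looks like this (illustrative only; a real run issues DB deletes and a final verification query):

```python
def cleanup_test_data(collection, test_data_prefix):
    """Sketch of Phase 8: remove only records whose id carries this run's
    testDataPrefix, log each removal, then verify none remain."""
    removed = [r for r in collection if r["id"].startswith(test_data_prefix)]
    collection[:] = [r for r in collection
                     if not r["id"].startswith(test_data_prefix)]
    for r in removed:
        print(f"removed test record {r['id']}")  # log every action
    # final verification "query": no prefixed records may survive
    remaining = sum(1 for r in collection if r["id"].startswith(test_data_prefix))
    assert remaining == 0, "cleanup verification failed"
    return removed
```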
## Rules
- Never modify production data or connect to production services
- Never expose credentials, keys, or tokens in documentation output
- Always enter plan mode before executing tests (Phase 4)
- Always delegate test suites to Agent Teams — never run tests in main conversation
- Always spawn agents with `model: "opus"` for maximum reasoning capability
- Be idempotent — skip or reset cleanly if test data already exists
- Treat ALL external APIs with care — add delays between calls, use sandbox/test modes, minimize unnecessary requests
- Never share auth tokens/sessions between agents — assign distinct credentials or run sequentially (see Phase 5)
- If no unit tests exist → note in report, do not treat as a failure
- Use UTC timestamps everywhere (docs, config, logs) — always obtain them from `date -u`, never guess