autonomous-tests

Dynamic Context

  • Args: $ARGUMENTS
  • Branch: !git branch --show-current
  • Unstaged: !git diff --stat 2>/dev/null | tail -5
  • Staged: !git diff --cached --stat 2>/dev/null | tail -5
  • Commits: !git log --oneline -5 2>/dev/null
  • Docker: !docker compose ps 2>/dev/null | head -10 || echo "No docker-compose found"
  • Config: !test -f .claude/autonomous-tests.json && echo "YES" || echo "NO -- first run"
  • Agent Teams: !python3 -c "import json;s=json.load(open('$HOME/.claude/settings.json'));print('ENABLED' if s.get('env',{}).get('CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS')=='1' else 'DISABLED')" 2>/dev/null || echo "DISABLED -- settings not found"

Role

Project-agnostic autonomous E2E test runner. Exercise features against the live LOCAL stack, verify state at every step, produce documentation, never touch production.

Arguments: $ARGUMENTS

| Arg | Meaning |
|---|---|
| (empty) | Default: working-tree (staged + unstaged) with smart doc analysis |
| staged | Staged changes only |
| unstaged | Unstaged changes only |
| N (number) | Last N commits only (e.g., 1 = last commit, 3 = last 3) |
| working-tree | Staged + unstaged changes (same as default) |

Smart doc analysis is always active: identify which docs/ files are relevant to the changed code by path, feature name, and cross-references — read only those, never all docs.

Print resolved scope, then proceed without waiting.


Phase 0 — Configuration

Step 0: Prerequisites Check

Read ~/.claude/settings.json and check two things:

  1. Agent teams feature flag: verify env.CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS is "1". If missing or not "1", STOP and tell the user:

    Agent teams are required for this skill but not enabled. Run:

        bash <skill-dir>/scripts/setup-hook.sh

    This enables the CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS flag and the ExitPlanMode approval hook in your settings. Do not proceed until the flag is confirmed enabled.

  2. ExitPlanMode hook (informational): if no PreToolUse hook matching ExitPlanMode is present, inform the user:

    The ExitPlanMode approval hook ensures test plans require your approval before execution (even in dontAsk mode). This skill ships it as a skill-scoped hook, so it works automatically during /autonomous-tests runs; the setup script above also enables it globally. Continue either way — do not block on this.
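For reference, a hook of this shape in settings.json would look roughly like the following — a sketch assuming Claude Code's hooks schema; the script path is a placeholder, not the skill's actual file name:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "ExitPlanMode",
        "hooks": [
          { "type": "command", "command": "<skill-dir>/scripts/approve-plan.sh" }
        ]
      }
    ]
  }
}
```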

Step 1: Run test -f .claude/autonomous-tests.json && echo "CONFIG_EXISTS" || echo "CONFIG_MISSING" in Bash.

Schema reference: references/config-schema.json.

If output is CONFIG_EXISTS (returning run):

  1. Read .claude/autonomous-tests.json
  2. Validate config version: check that version equals 3 and that the required fields (project, database, testing) exist. If validation fails, warn the user and re-run the first-run setup below instead.
  3. Verify config trust:
    • Compute a SHA-256 hash of the config content, excluding the _configHash and lastRun fields: python3 -c "import json,hashlib;d=json.load(open('.claude/autonomous-tests.json'));[d.pop(k,None) for k in ('_configHash','lastRun')];print(hashlib.sha256(json.dumps(d,sort_keys=True).encode()).hexdigest())"
    • Check whether this hash exists in the trust store at ~/.claude/trusted-configs/. The trust file is named after a hash of the project root — python3 -c "import hashlib,os;print(hashlib.sha256(os.path.realpath('.').encode()).hexdigest()[:16])" — plus a .sha256 extension.
    • If the trust file is missing or its content doesn't match the computed hash, the config has not been approved by this user: show the full config to the user and ask for explicit confirmation before continuing. If confirmed, write the new hash to the trust store file (mkdir -p ~/.claude/trusted-configs/ first).
    • Rationale: the trust store lives outside the repo in the user's home directory, so a malicious config committed to a repo cannot bypass approval.
  4. Re-scan for new services and update config if needed
  5. Get current UTC time by running date -u +"%Y-%m-%dT%H:%M:%SZ" in Bash, then update lastRun with that exact value (never guess the time)
  6. If userContext is missing or all arrays are empty, run the User Context Questionnaire below once, then save answers to config
  7. Skip to Phase 1 — do NOT run first-run steps below
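The trust check in step 3 can be sketched as small Python helpers equivalent to the one-liners above (same excluded fields, same trust-store layout):

```python
import hashlib
import json
import os

def config_hash(config: dict) -> str:
    # Drop volatile fields so re-runs don't change the hash.
    d = {k: v for k, v in config.items() if k not in ("_configHash", "lastRun")}
    return hashlib.sha256(json.dumps(d, sort_keys=True).encode()).hexdigest()

def trust_file_path(project_root: str = ".") -> str:
    # The trust file name is derived from the project root path, so each
    # project gets its own entry outside the repo.
    key = hashlib.sha256(os.path.realpath(project_root).encode()).hexdigest()[:16]
    return os.path.expanduser(f"~/.claude/trusted-configs/{key}.sha256")

def is_trusted(config: dict, project_root: str = ".") -> bool:
    path = trust_file_path(project_root)
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return f.read().strip() == config_hash(config)
```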

If output is CONFIG_MISSING (first run only):

  1. Auto-extract from CLAUDE.md files + compose files + env files + package manifests
  2. Detect project topology — set project.topology to one of:
    • single — one repo, one project
    • monorepo — one repo, multiple packages (detected via: workspace configs like lerna.json, nx.json, turbo.json, pnpm-workspace.yaml; multiple package.json in subdirs; or conventional directory structures like backend/ + frontend/, server/ + client/, api/ + web/, packages/)
    • multi-repo — separate repos that work together as a system (detected via: CLAUDE.md references to other paths, sibling directories with their own .git, shared docker-compose networking, cross-repo API URLs like localhost:3000 called from another project)
  3. Discover related projects — scan for sibling directories with .git or package.json, grep CLAUDE.md and compose files for paths outside the project root. For each candidate found, ask the user: "Is {path} part of this system? What is its role?" Populate the relatedProjects array with confirmed entries.
  4. User Context Questionnaire — present all questions at once, accept partial answers:
    • Any known flaky areas or intermittent failures?
    • Test user credentials to use (reference env var names or role names, never raw values)?
    • Any specific testing priorities or focus areas?
    • Any additional notes for the test runner?
    Store answers in the userContext section of the config.
  5. Propose config → STOP and wait for user to confirm → write config
  6. Stamp config trust: after writing, compute the config hash with python3 -c "import json,hashlib;d=json.load(open('.claude/autonomous-tests.json'));[d.pop(k,None) for k in ('_configHash','lastRun')];print(hashlib.sha256(json.dumps(d,sort_keys=True).encode()).hexdigest())" and write the result to the trust store at ~/.claude/trusted-configs/{project-hash}.sha256 (create the directory if needed). This marks the config as user-approved in a location outside the repo that cannot be forged by a committed file.
  7. If the project CLAUDE.md is under 140 lines and lacks startup instructions, append at most 10 lines of startup notes.

Phase 1 — Safety

ABORT if any production indicators found in .env files: sk_live_, pk_live_, *LIVE*SECRET*, NODE_ENV=production, production DB endpoints (RDS, Atlas without dev/stg/test), non-local API URLs. Show variable NAME only, never the value. Run sandboxCheck commands from config. Verify Docker is local.
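A minimal sketch of the .env scan — the patterns below are illustrative assumptions drawn from the list above, and the function deliberately reports variable names only, never values:

```python
import re

# Illustrative indicator patterns; a real run should combine the Phase 1
# list with the config's sandboxCheck commands.
PRODUCTION_PATTERNS = [
    r"sk_live_",
    r"pk_live_",
    r"LIVE.*SECRET",
    r"NODE_ENV=production",
    r"\.rds\.amazonaws\.com",
]

def find_production_indicators(env_text: str) -> list[str]:
    """Return the NAMES of offending variables -- never their values."""
    hits = []
    for line in env_text.splitlines():
        line = line.strip()
        if "=" not in line or line.startswith("#"):
            continue
        name = line.partition("=")[0]
        if any(re.search(p, line) for p in PRODUCTION_PATTERNS):
            hits.append(name)
    return hits
```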

Phase 2 — Service Startup

For each service in config and each related project with a startCommand: health check → if unhealthy, start + poll 30s → if still unhealthy, STOP for user guidance. Start webhook listeners in background. Tail logs for errors during execution.
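The health check → start → poll loop can be sketched in POSIX shell; the URL and service name in the usage comment are placeholders, not values from the config:

```shell
#!/bin/sh
# wait_healthy CMD TIMEOUT_SECS -- run CMD once per second until it
# succeeds or the timeout elapses; returns non-zero on timeout.
wait_healthy() {
  _cmd="$1"; _timeout="$2"; _i=0
  while [ "$_i" -lt "$_timeout" ]; do
    if sh -c "$_cmd" >/dev/null 2>&1; then return 0; fi
    sleep 1
    _i=$((_i + 1))
  done
  return 1
}

# Hypothetical usage:
# curl -sf http://localhost:3000/health >/dev/null \
#   || { docker compose up -d api; wait_healthy "curl -sf http://localhost:3000/health" 30; } \
#   || echo "STOP: service still unhealthy -- ask the user"
```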

Phase 3 — Autonomous Feature Identification & Discovery

All identification is fully autonomous — derive everything from the code diff and codebase. Never ask the user what to test.

  1. Get changed files from git based on scope arguments — include related projects (relatedProjects[].path) when tracing cross-project dependencies (e.g., backend API change that affects webapp pages)
  2. Read every changed file. For each, build a feature map:
    • API endpoints affected (routes, controllers, handlers)
    • Database operations (queries, writes, schema changes, index usage)
    • External service integrations (webhooks, SDK calls, third-party APIs)
    • Business logic and validation rules
    • Authentication/authorization flows touched
    • Signal/event chains (pub/sub, queues, outbox patterns)
  3. Trace the full dependency graph: callers → changed code → callees. Follow imports across files and project boundaries to understand the complete blast radius.
  4. Smart doc analysis: identify docs relevant to the changed code by matching file paths, feature names, endpoint references, and testing.contextFiles entries. Scan the docs/ tree but read only relevant files — never read all docs indiscriminately. Skip purely historical or unrelated docs.
  5. Produce a Feature Context Document (kept in memory, not written to disk) summarizing: all features touched, all endpoints, all DB collections/tables affected, all external services involved, and all edge cases identified from reading the code (error handlers, validation branches, race conditions, retry logic). This document is cascaded to every agent in Phase 5.
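Step 1's scope resolution maps directly onto git commands; a sketch of that mapping (the unknown-scope branch is a defensive assumption, not specified above):

```shell
#!/bin/sh
# scope_files ARG -- list the changed files implied by the scope argument.
scope_files() {
  case "$1" in
    staged)          git diff --cached --name-only ;;
    unstaged)        git diff --name-only ;;
    ""|working-tree) git diff --name-only HEAD ;;            # staged + unstaged
    *[!0-9]*)        echo "unknown scope: $1" >&2; return 1 ;;
    *)               git diff --name-only "HEAD~$1..HEAD" ;; # last N commits
  esac
}
```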

Phase 4 — Test Plan (Plan Mode)

Enter plan mode (Use /plan). The plan MUST start with a "Context Reload" section as Step 0 containing:

  • Instruction to re-read this skill file (the SKILL.md that launched this session)
  • Instruction to read the config: .claude/autonomous-tests.json
  • Instruction to read the templates: the references/templates.md file from this skill
  • The resolved scope arguments: $ARGUMENTS
  • The current branch name and commit range being tested
  • Any related project paths involved
  • Key findings from Phase 3 (affected modules, endpoints, dependencies)
  • The userContext from config (flaky areas, testing priorities, notes)
  • Credential assignment plan for agent teams (see Phase 5)

This ensures that when context is cleared after plan approval, the executing agent can fully reconstruct the session state.

Then design test suites covering all of the following categories:

  1. Happy path — normal expected flows end-to-end
  2. Invalid inputs & validation — malformed data, missing fields, wrong types, boundary values
  3. Duplicate/idempotent requests — send the same API call 2-3 times rapidly, verify no duplicate DB records, no double charges, no duplicate side-effects (emails, webhooks, events)
  4. Error handling — trigger every error branch visible in the diff (network failures, invalid state transitions, auth failures, permission denials)
  5. Unexpected database changes — verify no orphaned records, no missing references, no unintended field mutations, no index-less slow queries on new fields
  6. Race conditions & timing — concurrent writes to same resource, out-of-order webhook delivery, expired tokens mid-flow
  7. Security — auth bypass attempts, injection inputs, privilege escalation, data leakage between users
  8. Edge cases from code reading — every if/else, try/catch, guard clause, and fallback in the changed code should have at least one test targeting it
  9. Regression — existing unit tests if configured, plus re-verify any previously broken flows

Each suite needs: name, objective, pre-conditions, steps with expected outcomes, teardown, and explicit verification queries (DB checks, log checks, API response checks). Wait for user approval.
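Category 3 (duplicate/idempotent requests) has a simple generic shape; in this sketch, send_request and get_records are hypothetical per-project adapters (the API client and the verification query), and the idempotency_key field is an assumed record attribute:

```python
from collections import Counter

def check_idempotency(send_request, get_records, attempts=3):
    """Fire the same request several times, then verify no duplicate
    records appeared in the verification query's results."""
    responses = [send_request() for _ in range(attempts)]
    keys = Counter(r["idempotency_key"] for r in get_records())
    duplicates = sorted(k for k, n in keys.items() if n > 1)
    return {"responses": responses, "duplicates": duplicates, "passed": not duplicates}
```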

Phase 5 — Execution (Agent Teams)

Use TeamCreate to create a test team. Spawn general-purpose Agents as teammates — one per approved suite. Always use model: "opus" when spawning agents (Opus 4.6 has adaptive reasoning/thinking built-in — no budget to configure, it thinks as deeply as needed automatically). Coordinate via TaskCreate/TaskUpdate and SendMessage.

Credential sharing — CRITICAL: Assign each agent a distinct test credential from userContext.testCredentials to prevent session conflicts (e.g., one agent logging in invalidates another's token). If only one credential exists, run agents sequentially — never in parallel with shared auth. Include the credential assignment in each agent's task description.

Cascading context — CRITICAL: Every agent MUST receive the full Feature Context Document from Phase 3 in its task description. This includes: all features touched, all endpoints, all DB collections affected, all external services involved, and all identified edge cases. Agents need this complete picture to understand cross-feature side-effects (e.g., testing endpoint A may break endpoint B's state).

Anomaly detection: Each agent must actively watch for and report:

  • Duplicate records created by repeated API calls
  • Unexpected DB field changes outside the tested operation
  • Warning/error log entries that appear during test execution
  • Slow queries or missing indexes (check docker logs and DB explain plans)
  • Orphaned or inconsistent references between collections/tables
  • Auth tokens or sessions behaving unexpectedly (expired mid-flow, leaked between users)
  • Any response field or status code that differs from what the code intends

Execution flow:

  1. Create tasks for each suite via TaskCreate — include: env details from config, exact test steps, verification queries, teardown instructions, the full Feature Context Document, and assigned credential
  2. Assign tasks to agents via TaskUpdate with owner
  3. If credentials allow (distinct per agent), agents may run in parallel; otherwise run sequentially — wait for each to complete before starting the next
  4. Report PASS/FAIL after each suite completes, including any anomalies detected
  5. After all suites complete, shut down teammates via SendMessage with type: "shutdown_request"

Phase 6 — Fix Cycle

  • Runtime-fixable (env var, container, stuck job): fix → re-run affected suite → max 3 cycles
  • Code bug: document with full context (file, line, expected vs actual) → ask user before proceeding

Phase 7 — Documentation

Generate docs in dirs from config (create dirs if needed). Get filename timestamp by running date -u +"%Y-%m-%d-%H-%M-%S" in Bash (never guess the time). Filename pattern: {timestamp}_{semantic-name}.md. Read references/templates.md for the exact output structure of each file type before writing.
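The filename rule above, as a two-line shell sketch — the semantic name is a placeholder, only the timestamp format comes from this skill:

```shell
#!/bin/sh
# Build a doc filename from a real UTC timestamp (never guessed).
ts="$(date -u +"%Y-%m-%d-%H-%M-%S")"
doc="${ts}_webhook-retry-flow.md"
echo "$doc"
```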

Generate up to four doc types based on findings:

  • test-results: Always generated. Full E2E results with pass/fail per suite.
  • pending-fixes: Generated when code bugs or infrastructure issues are found.
  • pending-guided-tests: Generated when tests need browser/visual/physical-device interaction.
  • pending-autonomous-tests: Generated when automatable tests were identified but not run (time/scope/dependency constraints).

On re-runs: if docs exist for this feature + date → append a "Re-run" section instead of duplicating.

Phase 8 — Cleanup

Remove only test data created during this run (identified by testDataPrefix from config). Never touch pre-existing data. Log every action. Verify cleanup with a final DB query.
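A prefix-scoped cleanup with final verification might look like this sketch (SQLite stands in for the project database, and the name column carrying the prefix is an assumption — the real column and prefix come from the config's testDataPrefix):

```python
import sqlite3

def cleanup_test_data(conn, table, prefix):
    """Delete only rows carrying this run's test-data prefix, then
    verify with a final query that none remain."""
    cur = conn.execute(f"DELETE FROM {table} WHERE name LIKE ?", (prefix + "%",))
    conn.commit()
    remaining = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE name LIKE ?", (prefix + "%",)
    ).fetchone()[0]
    return {"deleted": cur.rowcount, "remaining": remaining}
```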


Rules

  • Never modify production data or connect to production services
  • Never expose credentials, keys, or tokens in documentation output
  • Always enter plan mode before executing tests (Phase 4)
  • Always delegate test suites to Agent Teams — never run tests in main conversation
  • Always spawn agents with model: "opus" for maximum reasoning capability
  • Be idempotent — skip or reset cleanly if test data already exists
  • Treat ALL external APIs with care — add delays between calls, use sandbox/test modes, minimize unnecessary requests
  • Never share auth tokens/sessions between agents — assign distinct credentials or run sequentially (see Phase 5)
  • If no unit tests exist → note in report, do not treat as a failure
  • Use UTC timestamps everywhere (docs, config, logs) — always obtain from date -u, never guess