autonomous-tests-swarm
Dynamic Context
- Args: $ARGUMENTS
- Branch: !`git branch --show-current`
- Unstaged: !`git diff --stat HEAD 2>/dev/null | tail -5`
- Staged: !`git diff --cached --stat 2>/dev/null | tail -5`
- Commits: !`git log --oneline -5 2>/dev/null`
- Docker: !`docker compose ps 2>/dev/null | head -10 || echo "No docker-compose found"`
- Docker Context: !`docker context show 2>/dev/null || echo "unknown"`
- Config: !`test -f .claude/autonomous-tests.json && echo "YES" || echo "NO -- first run"`
- Swarm Config: !`python3 -c "import json;c=json.load(open('.claude/autonomous-tests.json'));print('YES' if 'swarm' in c else 'NO -- needs setup')" 2>/dev/null || echo "NO -- config missing"`
Role
Project-agnostic autonomous test runner with parallel execution via isolated Docker stacks. Analyzes code changes, auto-detects project capabilities, generates comprehensive test plans covering integration tests (curl-based API testing) and E2E tests (browser-based user flows), executes integration suites in PARALLEL (each agent with its own Docker stack on unique ports) and E2E suites SEQUENTIALLY, and documents findings for the test-fix-retest cycle.
Test Taxonomy
This skill generates three types of tests. Read ../autonomous-tests/references/test-taxonomy.md for full definitions:
- Integration Tests: API-level via `curl`. Security-focused request/response analysis. Classification: `integration/api`.
- E2E Tests: Frontend-to-backend flows via `agent-browser`/Playwright + chrome-devtools-mcp. Classification: `e2e/webapp`, `e2e/mobile`.
- Regression Tests: Unit tests (`testing.unitTestCommand`) run ONCE at the end. Classification: `regression/unit`.
Read ../autonomous-tests/references/security-checklist.md for the 17-item security observation checklist applied to each suite.
Orchestrator Protocol
The main agent is the Orchestrator. It coordinates phases but NEVER executes operational work.
Orchestrator MUST delegate to agents:
- Bash commands (capabilities scan, health checks, port scanning, cleanup)
- Source code reading (only agents read application source)
- File generation (docs, reports)
- Test execution, fix application, verification
Orchestrator MAY directly:
- Read config, SKILL.md, and reference files
- Run `date -u` for timestamps, `test -f` for file checks
- Enter/exit plan mode
- Use AskUserQuestion for user interaction
- Use Agent() to spawn subagents for delegation
- Compile summaries from agent reports
- Make phase-gating decisions (proceed/stop/abort)
Parallel spawning: For integration suites, Orchestrator spawns MULTIPLE background subagents (run_in_background: true) up to maxAgents concurrent. E2E suites remain strictly sequential.
Reporting hierarchy: Agent -> Orchestrator -> Plan
Task Tracking Protocol
The orchestrator uses TaskCreate and TaskUpdate to track phase-level progress. This provides visible progress indicators and a deterministic execution sequence for the post-reset orchestrator.
When to create tasks: Immediately after plan approval (start of Phase 4), create ALL phase tasks at once. This gives a complete checklist that survives context pressure.
Task lifecycle: pending -> in_progress (when phase starts) -> completed (when phase finishes). If a phase triggers STOP/ABORT, update the task description with the reason before halting.
Task naming: Phase {N}: {Phase Name} (e.g., Phase 4.1: Service Restoration). For guided happy-path, each scenario gets its own subtask: Phase 4.6.{M}: {Scenario Name}.
Dependency chaining: Each task blocks the next so the orchestrator processes them in order.
Arguments: $ARGUMENTS
| Arg | Meaning |
|---|---|
| (empty) | Default: working-tree (staged + unstaged) with smart doc analysis |
| `staged` | Staged changes only |
| `unstaged` | Unstaged changes only |
| `N` (number) | Last N commits (e.g., `1` = last commit) |
| `working-tree` | Staged + unstaged (same as default) |
| `file:<path>` | `.md` doc as additional test context. Combinable. |
| `rescan` | Force capabilities re-scan. Combinable. |
| `guided` | User augmentation mode; bypasses git diff. Alone: prompts for doc or description. |
| `guided "desc"` | Description-based: happy-path workflows only, user performs actions. |
| `guided file:<path>` | Doc-based: happy-path workflows only, user performs actions. |
Arguments are space-separated and combinable (e.g., `staged file:docs/feature.md rescan`). Each `file:` path is validated as an existing `.md` file relative to the project root.
Guided mode — user augmentation (NOT automation):
- Doc-based (`guided file:<path>` or pick from `docs/` / `_autonomous/pending-guided-tests/`): happy-path workflows only.
- Description-based (`guided "description"` or describe when prompted): happy-path workflows only.
User performs all actions on their real device/browser. Claude provides step-by-step instructions and verifies results via DB queries/API/logs. Only happy-path workflows in guided mode. Categories 2-8 handled exclusively in autonomous mode — NEVER in guided session. No agent-browser, no Playwright — guided mode never loads or uses browser automation tools.
guided alone prompts via AskUserQuestion to pick a doc or describe a feature. Combinable with rescan but NOT with staged/unstaged/N/working-tree (git-scope args incompatible — guided bypasses git diff).
Smart doc analysis always active in standard mode: match docs/ files to changed code by path, feature name, cross-references — read only relevant docs.
Print resolved scope, then proceed without waiting.
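The argument rules above can be sketched as a small resolver. This is only an illustration under assumptions: `parse_arguments` and the keys of the returned dict are hypothetical names, the existence check for `file:` paths is left out (only the `.md` suffix is checked), and a one-word guided description that happens to equal a keyword such as `staged` would be misread.

```python
import shlex

GIT_SCOPES = {"staged", "unstaged", "working-tree"}

def parse_arguments(raw):
    """Resolve $ARGUMENTS into a scope dict (hypothetical helper, not part of the skill)."""
    scope = {"git_scope": None, "commits": None, "files": [],
             "rescan": False, "guided": False, "description": None}
    for tok in shlex.split(raw):
        if tok == "guided":
            scope["guided"] = True
        elif tok == "rescan":
            scope["rescan"] = True
        elif tok.startswith("file:"):
            path = tok[len("file:"):]
            if not path.endswith(".md"):
                raise ValueError(f"file: expects a .md path, got {path!r}")
            scope["files"].append(path)
        elif tok in GIT_SCOPES:
            scope["git_scope"] = tok
        elif tok.isdigit():
            scope["commits"] = int(tok)  # N = last N commits
        elif scope["guided"] and scope["description"] is None:
            scope["description"] = tok  # guided "desc"
        else:
            raise ValueError(f"unrecognized argument: {tok!r}")
    # guided bypasses git diff, so git-scope args are incompatible with it
    if scope["guided"] and (scope["git_scope"] or scope["commits"] is not None):
        raise ValueError("guided cannot be combined with git-scope arguments")
    # empty scope defaults to working-tree (staged + unstaged)
    if not scope["guided"] and scope["git_scope"] is None and scope["commits"] is None:
        scope["git_scope"] = "working-tree"
    return scope
```

For example, `parse_arguments("staged file:docs/feature.md rescan")` resolves to the staged scope with one doc and a forced re-scan, while `parse_arguments("guided staged")` raises the combinability error.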
Phase 0 — Bootstrap
Config hash method: python3 -c "import json,hashlib;d=json.load(open('.claude/autonomous-tests.json'));[d.pop(k,None) for k in ('_configHash','lastRun','capabilities')];print(hashlib.sha256(json.dumps(d,sort_keys=True).encode()).hexdigest())" — referenced throughout as "Config hash method".
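The one-liner above is dense, so here is the same computation unrolled into a function for readability. It is a sketch that mirrors the one-liner's steps; the function name `config_hash` and the constant `VOLATILE_KEYS` are illustrative names, not part of the skill.

```python
import hashlib
import json

# Keys excluded from the hash, as in the one-liner above
VOLATILE_KEYS = ("_configHash", "lastRun", "capabilities")

def config_hash(config):
    """Return the SHA-256 hex digest of the config without its volatile keys."""
    stable = {k: v for k, v in config.items() if k not in VOLATILE_KEYS}
    # sort_keys=True makes the digest independent of key order
    canonical = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Because `lastRun` and `capabilities` are dropped first, updating the timestamp or refreshing the capabilities cache does not invalidate a previously trusted config.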
Step 0: Prerequisites Check — Read ~/.claude/settings.json:
- ExitPlanMode hook (informational): if missing -> inform user the skill-scoped hook handles it automatically; global setup available via the script. Continue.
- AskUserQuestion hook (informational): same as above. Continue.
Step 1: Capabilities Scan — Triggers: rescan arg, capabilities missing, or lastScanned older than rescanThresholdDays (default 7 days). If none -> use cache.
Spawn Explore agent (subagent_type: "Explore", thoroughness: "medium") to perform:
- Docker MCP Discovery: `mcp-find` for MCPs matching service names and generic queries. Record `name`, `description`, `mode`; `safe: true` only for known sandbox MCPs. If unavailable -> empty array.
- Frontend Testing: `which agent-browser`, `which playwright` / `npx playwright --version` -> set `frontendTesting` booleans.
- Chrome DevTools MCP Detection: Run `mcp-find` for chrome-devtools; scan `~/.claude.json` and `~/.claude/settings.json` for `mcpServers` containing `chrome-devtools`. Store in `capabilities.frontendTesting.chromeDevtools`.
- Project Type Detection: Read `../autonomous-tests/references/project-type-detection.md`. Scan `package.json` files in project root and `relatedProjects[]` paths. Store in `project.frontendType` and `relatedProjects[].frontendType`.
- External Service CLI Detection: Load `references/external-services-catalog.json`. Scan CLAUDE.md files for `claudeMdKeywords`. Per match: run `detectionCommand` -> if available, run `modeDetection.command` -> pattern-match -> populate `allowedOperations`/`prohibitedFlags` -> merge into `externalServices[]`.
Agent reports back. Orchestrator writes to capabilities with lastScanned = UTC time (date -u).
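The three re-scan triggers in Step 1 can be sketched as a single predicate. The layout is assumed rather than taken from the schema: `lastScanned` is read from inside `capabilities`, `rescanThresholdDays` from the top level of the config, and the timestamp format is taken to match the one used for `lastRun`. `needs_rescan` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

def needs_rescan(config, args, now=None):
    """True if any Step 1 trigger fires: rescan arg, missing capabilities, stale scan."""
    if "rescan" in args:
        return True
    caps = config.get("capabilities")
    if not caps or "lastScanned" not in caps:
        return True
    threshold = timedelta(days=config.get("rescanThresholdDays", 7))
    # Assumed format: same UTC layout as lastRun, e.g. 2024-01-08T00:00:00Z
    last = datetime.strptime(caps["lastScanned"], "%Y-%m-%dT%H:%M:%SZ")
    last = last.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - last > threshold
```

If the predicate returns `False`, the cached capabilities are used as-is.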
Step 1.5: Tool Inventory — ALWAYS runs (no caching — tools change between sessions):
- Orchestrator directly (no agent spawn):
  1. Skills: Extract available skills from system-reminder context
  2. Agents: Extract available agent types from Agent tool description
- Delegate to Explore agent (combine with Step 1 if triggered, or spawn separately):
  3. MCP servers: Run `mcp-find` + scan `~/.claude/settings.json` for `mcpServers`. Include chrome-devtools-mcp in inventory if available.
  4. CLIs: External service detection + probe common tools (`which curl`, `which jq`, `which ngrok`, `which uvx`)
- Compile Tool Inventory: Per-phase recommendations (Safety: health-check CLIs; Discovery: Explore+Grep/Glob; Plan: skills/agents; Execution: service MCPs, CLI fallbacks, browser tools, DB tools, chrome-devtools-mcp)
CLAUDE.md deep scan (Phase 0 + Phase 2): find . -maxdepth 3 -name "CLAUDE.md" -type f + ~/.claude/CLAUDE.md + .claude/CLAUDE.md. Cache list for capabilities scan, auto-extract, Phase 2 enrichment, Feature Context Document.
Step 2: Config Check — test -f .claude/autonomous-tests.json && echo "CONFIG_EXISTS" || echo "CONFIG_MISSING". Schema: references/config-schema-swarm.json (extends base ../autonomous-tests/references/config-schema.json).
If CONFIG_EXISTS (returning run):
- Read config.
- Version validation: require `version: 6` + fields `project`, `database`, `testing`, `swarm`. v5->v6: add `frontendType: "none"`, `chromeDevtools: false`, `e2eUrl: null`, `browserPreference: "agent-browser"`, add `logFile: null` to services, add `logCommand: null` to relatedProjects, add `frontendIndicators`, bump to 6. v4->v5->v6: chain migrations. <4 or missing fields -> warn, re-run first-run.
  - Missing `database.seedStrategy` -> default `"autonomous"`, inform user.
  - Missing `documentation.fixResults` -> add `"docs/_autonomous/fix-results"`.
  - Missing `swarm` -> run Swarm Configuration Questionnaire.
- Config trust: Compute hash using Config hash method. Check trust store `~/.claude/trusted-configs/{project-hash}.sha256`. Mismatch -> show config (redact `testCredentials` values as `"********"`) -> `AskUserQuestion` for approval -> write hash.
- Testing priorities: Show `userContext.testingPriorities`. `AskUserQuestion`: "Pain points or priorities?" with "None" option. Update config.
- Re-scan services: Delegate to Explore agent. Update config if needed.
- `date -u +"%Y-%m-%dT%H:%M:%SZ"` -> update `lastRun`.
- Empty `userContext` -> run questionnaire, save.
- Re-stamp trust: if config modified -> recompute hash, write to trust store.
- Skip to Phase 1.
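The config trust and re-stamp steps in the list above amount to comparing a stored digest with a freshly computed one. A minimal sketch, assuming the trust file holds only the hex digest; how `{project-hash}` is derived is not specified here, so it is passed in as an opaque string. `is_trusted` and `stamp_trust` are hypothetical names.

```python
from pathlib import Path

def is_trusted(current_hash, project_hash, store):
    """True when the stored digest for this project matches the current config hash."""
    record = Path(store) / f"{project_hash}.sha256"
    return record.is_file() and record.read_text().strip() == current_hash

def stamp_trust(current_hash, project_hash, store):
    """Write (or overwrite) the trusted digest after the user approves the config."""
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    (store / f"{project_hash}.sha256").write_text(current_hash + "\n")
```

In the flow described above, a `False` result leads to showing the redacted config and asking for approval, and `stamp_trust` runs only after that approval or after the skill itself modifies the config.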
If CONFIG_MISSING (first run):
Spawn Explore agent for auto-extraction:
- Auto-extract from CLAUDE.md files + compose + env + package manifests. Detect migration/cleanup/seed commands. Detect DB type (MongoDB vs SQL). Both found -> ask user.
- Topology: `single` | `monorepo` | `multi-repo`.
- Related projects: scan sibling dirs, grep for external paths -> ask user per candidate -> populate `relatedProjects`.
Agent reports. Orchestrator proceeds:
4. Capabilities scan — delegate (Step 1).
5. Seeding strategy via AskUserQuestion: autonomous (recommended) or command.
6. Swarm Configuration Questionnaire via AskUserQuestion: maxAgents (default 3), portRangeStart (default 9000), dockerContext (auto-detected), containerPrefix, compose vs raw-docker mode. Validate Docker context exists and port range is available.
7. User Context Questionnaire: flaky areas? credentials (env var/role names only)? priorities? notes? -> store in userContext.
8. Propose config -> STOP for confirmation -> write.
9. Stamp trust: compute hash -> write to trust store.
10. If CLAUDE.md < 140 lines and lacks startup instructions -> append max 10 lines.
Phase 1 — Safety, Environment & Log Monitoring
Objective: Verify the environment is safe, start services, reserve ports, and begin per-agent log capture.
Spawn ONE general-purpose subagent (foreground) to perform:
- Production scan: `.env` files for `productionIndicators`, `*LIVE*SECRET*`, `NODE_ENV=production`, production DB endpoints, non-local URLs. Show variable NAME only.
- Run `sandboxCheck` commands from config.
- Verify Docker is local. Validate `swarm.dockerContext` exists: `docker context inspect {dockerContext}`.
- Port reservation: Verify port range `portRangeStart` to `portRangeStart + (maxAgents * portStep)` is free. `lsof -i :{port}` or `ss -tlnp` per port. Conflict -> warn + suggest alternative range.
- Related project safety scan: For each `relatedProjects[]` with a `path`, scan `.env` files for production indicators. Any production indicator -> ABORT.
- Service startup: per service in config + related projects with `startCommand`: health check -> healthy: `already-running`; unhealthy: start + poll 5s/30s -> `started-this-run` or report failure.
- Start webhook listeners: for each `externalServices[]` with `webhookListener != null`, run the command as a background process and record PID.
- Log monitoring setup: After services are healthy, start log capture per `../autonomous-tests/references/log-monitoring-protocol.md` (Swarm Variant):
  - Per-agent log paths at `/tmp/autonomous-swarm-{sessionId}/agent-{N}/logs/`
  - Create base directory structure for all agents
  - Record log file paths and PIDs in Service Readiness Report
- Service Readiness Report: per service: name, URL/port, health status, health check endpoint, source, log file path, log capture PID, assigned port range per agent.
Agent reports: safety assessment + Service Readiness Report with log paths. Gates: ABORT if production. STOP if unhealthy.
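The port reservation step can be sketched as follows. Several things here are assumptions: agent `N` is given the block starting at `portRangeStart + N * portStep`, the overall range is treated as half-open, and a bind attempt on `127.0.0.1` stands in for the `lsof`/`ss` checks named above. All function names are hypothetical.

```python
import socket

def agent_port_ranges(port_range_start, max_agents, port_step):
    """Map agent index -> its block of ports (assumed layout: one portStep-sized block each)."""
    return {n: range(port_range_start + n * port_step,
                     port_range_start + (n + 1) * port_step)
            for n in range(max_agents)}

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; a bind failure means something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True

def find_conflicts(port_range_start, max_agents, port_step):
    """Return every port in the swarm range that is currently in use."""
    end = port_range_start + max_agents * port_step
    return [p for p in range(port_range_start, end) if not port_is_free(p)]
```

An empty list from `find_conflicts` corresponds to the "range is free" outcome; any entries would trigger the warning and the suggestion of an alternative range.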
Phase 2 — Discovery
Fully autonomous — derive from code diff, codebase, or guided source. Never ask what to test.
Delegation: ONE Explore agent (subagent_type: "Explore", thoroughness: "medium").
Standard mode
- Changed files from git (scope args); include `relatedProjects[].path` for cross-project deps.
- If `file:<path>` -> read `.md`, extract features/criteria/endpoints/edge cases.
- Spawn Explore agent with: changed files, file reference content, `relatedProjects[]`, `testing.contextFiles`, CLAUDE.md paths, `documentation.*` paths. Agent performs:
  - Feature map: API endpoints, DB ops, external services, business logic, auth flows, signal/event chains
  - Dependency graph: callers -> changed code -> callees, cross-file/project imports
  - Smart doc analysis: match paths/features against `docs/`, scan `_autonomous/` (Summary + Issues Found only), fix completion scan
  - Edge case inventory: error handlers, validation branches, race conditions, retry logic
  - Cross-project seed map: For each `relatedProjects[]`, trace which collections/tables are read by E2E flows
  - Test flow classification: Classify each scenario as `integration/api`, `e2e/webapp`, `e2e/mobile`, `guided/webapp`, `guided/mobile`. Project type influences: if `frontendType` is `webapp` -> `e2e/webapp`; if `mobile` -> `e2e/mobile`; if `api-only` -> only `integration/api`.
  - Security checklist mapping: Read `../autonomous-tests/references/security-checklist.md`. Map each of 17 items to YES/NO/PARTIAL per discovered feature.
  - Data seeding analysis: Read `../autonomous-tests/references/data-seeding-protocol.md`. Analyze DB schema and models. Return seed plan with table names, field names, example values, create/cleanup commands.
  - Related project log commands: Discover log commands per `relatedProjects[]`.
- Receive structured report.
Guided mode (user augmentation)
Validate first: guided + staged/unstaged/N/working-tree -> STOP with combinability error.
- Resolve source: `guided file:<path>` -> doc-based | `guided "desc"` -> description-based | `guided` alone -> `AskUserQuestion`.
- Spawn Explore agent with same inputs. Agent performs: deep feature analysis + same feature map/dependency/doc analysis/edge cases. Also identifies: DB seed requirements, external service setup needs, prerequisite state per happy-path workflow.
- Receive report. Orchestrator extracts only happy-path workflows — discard security, edge case, validation findings.
Regression Scope Analysis (conditional — after Explore report)
Check for re-test indicators in documentation.fixResults path (default: docs/_autonomous/fix-results/) — look for Ready for Re-test: YES, Status: RESOLVED + Verification: PASS. If found -> compile Targeted Regression Context Document (fix manifest, 1-hop impact zone, original test IDs, blast radius check >60% -> full scope).
Feature Context Document (standard/guided modes — skipped in regression mode)
Compile from Explore report (do NOT re-read files). Contents: features, endpoints, DB collections/tables, cross-project seed map, test flow classifications, security checklist applicability map, data seeding plan, log file paths from Service Readiness Report, related project log commands, external services, edge cases, test history, capabilities, swarm port mappings and Docker stack configuration. Guided mode adds Mode + Source at top. Cascaded to every Phase 4 agent.
Post-Discovery Prompts (standard mode only — skip if guided or regression)
MUST execute BEFORE entering plan mode — guided happy-path inclusion must be decided before the plan is written.
Single AskUserQuestion:
- Guided Happy-Path — After all autonomous tests and regression, generate guided test plan where user performs actions while agent verifies logs and DB per step. Happy-path only. Include? (yes/no)
Parse response into guidedHappyPathApproved boolean.
Phase 3 — Plan (Plan Mode)
Enter plan mode (/plan). Plan starts with:
Step 0 — Context Reload (for post-approval reconstruction):
- Re-read: SKILL.md, config, `../autonomous-tests/references/templates.md`
- Scope: `$ARGUMENTS`, branch, commit range
- Findings: Phase 2 discoveries (modules, endpoints, dependencies, test flow classifications)
- User context: flaky areas, priorities, notes
- Service Readiness Report from Phase 1 (including log file paths, port assignments)
- Swarm config: `maxAgents`, `portRangeStart`, `portStep`, `dockerContext`, compose paths
- If regression mode: fix manifest, 1-hop impact zone, original test IDs, Targeted Regression Context Document
- If guided: type, source, full guided test list with per-test seed requirements
- If guided happy-path approved: happy-path workflows with seed requirements, user instructions, verification queries
Tool loading gate: If plan includes e2e/webapp or e2e/mobile suites AND capabilities.frontendTesting has available tools, list tools and prompt user via AskUserQuestion before plan approval. Declined tools excluded from plan. Guided mode: NEVER include browser automation tools — skip this gate.
Self-containment mandate — the plan MUST embed directly (not reference "above" or prior phases):
- All test suites with full details (name, objective, pre-conditions, steps, expected outcomes, teardown, verification)
- Feature Context Document (condensed but complete)
- Service Readiness Report from Phase 1 (including log file paths, capture PIDs, port assignments)
- Per-suite agent spawn instructions with resolved values, port mappings, and Docker stack config
- Config paths: `documentation.*`, `database.connectionCommand`, `testing.unitTestCommand`, `testDataPrefix`
- Credential role names from `testCredentials`
- If guided: per-test DB seed commands, user-facing instructions, verification queries
- Seed schema discovery mandate (embedded verbatim) per `../autonomous-tests/references/data-seeding-protocol.md`
- If guided happy-path approved: Guided Happy-Path Decision block
- Documentation checklist (always: output directories, template path, filename convention, doc types this run produces)
- Tool Inventory from Phase 0
- DB Consistency Check Protocol from `../autonomous-tests/references/db-consistency-protocol.md`
- Security checklist applicability map (which of the 17 items apply to which features)
- Explicit data seeding plan (tables, fields, values, curl/DB commands)
- Log file paths from Service Readiness Report
- Chrome DevTools protocol (if available) from `../autonomous-tests/references/chrome-devtools-protocol.md`
- Service startup commands for all project + relatedProject services
- Execution protocols from `references/execution-protocols-swarm.md`: embed relevant protocols verbatim
- Suite agent tasks from `references/swarm-agent-tasks.md`: embed lifecycle tasks verbatim
- Task Tracking Block: create all phase tasks at execution start (embed verbatim):
  IMMEDIATELY after plan approval and context reload, create tasks:
  - TaskCreate("Phase 4.1: Service Restoration", "Re-establish services, log captures, per-agent directories post-reset")
  - TaskCreate("Phase 4.2: Setup", "Read source files, compile Feature Context Documents, generate per-agent specs")
  - TaskCreate("Phase 4.3: Integration Suites (Parallel)", "Execute integration suites in PARALLEL via background subagents")
  - TaskCreate("Phase 4.4: E2E Suites", "Execute E2E suites SEQUENTIALLY against shared local stack") (skip if frontendType == api-only)
  - TaskCreate("Phase 4.5: Regression", "Run unit tests via testing.unitTestCommand")
  - TaskCreate("Phase 4.6: Guided Happy-Path", "User-augmented verification per scenario") (skip if not approved)
  - For each happy-path scenario [M]: TaskCreate("Phase 4.6.[M]: [Scenario Name]", "[scenario description]")
  - TaskCreate("Phase 5: Results & Docs", "Audit merge, fix cycles, documentation, Docker cleanup")
  - Chain: 4.1 -> 4.2 -> 4.3 -> 4.4 -> 4.5 -> 4.6 -> 5 (sequential via addBlockedBy)
  - Before each phase: TaskUpdate(id, status: "in_progress")
  - After each phase: TaskUpdate(id, status: "completed")
- Phase Orchestration Protocol from `references/execution-protocols-swarm.md`: embed verbatim. Contains concrete Agent() spawn templates for every phase including parallel integration spawning.
Test Plan Structure
## Test Plan
### Integration Test Suites (curl-based) — PARALLEL EXECUTION
Categories 1-8 per ../autonomous-tests/references/test-taxonomy.md. Each suite includes:
- Explicit curl commands with expected responses (remapped ports per agent)
- Security checklist items applicable to this suite (YES/PARTIAL only)
- Data seeding commands (what to create before, what to clean after)
- Per-agent log file paths
- Docker stack assignment (swarm-{N}, ports, compose path)
### E2E Test Suites (browser-based) — SEQUENTIAL EXECUTION
Only if frontendType != api-only. Runs against shared local stack (NOT swarm-isolated).
For webapp: agent-browser flows with chrome-devtools-mcp observations
For mobile: guided steps with verification commands
### Regression (unit tests — runs LAST)
Single testing.unitTestCommand execution after all suites complete.
Regression mode scoping: When Targeted Regression Context Document is present:
- Suite 1 "Fix Verification": one test per fixed item — re-execute original failure scenario
- Suite 2 "Impact Zone" (conditional): tests for 1-hop callers/callees
- No other suites. Execution protocol unchanged.
Pre-approval validation: Before presenting the plan, verify all self-containment items are present. Missing items -> add before prompting.
Wait for user approval.
Phase 4 — Execution (Subagents)
First: Create all phase tasks per the Task Tracking Block embedded in the plan. Chain dependencies.
Then: Follow the Phase Orchestration Protocol embedded in the plan for concrete Agent() spawn templates. Each phase: TaskUpdate -> spawn subagent(s) with detailed context -> receive report(s) -> TaskUpdate completed.
Integration suites use parallel background subagents per the protocol. E2E/regression use sequential foreground subagents. Guided happy-path runs in the main conversation. Each parallel agent follows the lifecycle in references/swarm-agent-tasks.md.
1. Service Restoration Agent (fg, FIRST)
Context reset kills background processes. Re-establish services and log captures:
- Run `healthCheck` per service: healthy -> `verified-post-reset`; unhealthy -> `startCommand` + poll 5s/30s
- Related projects: same check via `relatedProjects[]`
- Start webhook listeners
- Re-start log captures per `../autonomous-tests/references/log-monitoring-protocol.md` (Swarm Variant) for all services
- Recreate per-agent directories at `/tmp/autonomous-swarm-{sessionId}/agent-{N}/logs/`
- Gate: any `failed-post-reset` -> STOP
- Return updated Service Readiness Report with new log file paths and PIDs
2. Setup Agent (fg)
Read source files, compile Feature Context Documents, read CLAUDE.md files. Generate per-agent specs (swarm-{N}, remapped ports, compose paths). Freeze capabilities snapshot for distribution. Proceeds after completion.
3. Integration Suite Agents (bg, PARALLEL — up to maxAgents concurrent)
Each agent spawned with run_in_background: true. Each receives:
- Pre-generated spec (swarm-{N}, ports, compose path)
- Frozen capabilities snapshot
- Feature Context Document with remapped curl commands
- Security checklist subset (YES/PARTIAL items only)
- Data seeding instructions per `../autonomous-tests/references/data-seeding-protocol.md`
- Per-agent log paths at `/tmp/autonomous-swarm-{sessionId}/agent-{N}/logs/`
- DB consistency protocol, credential role name, Tool Inventory subset
Each agent executes lifecycle tasks a-l from references/swarm-agent-tasks.md within its isolated Docker stack. Failure -> redistribute to replacement background subagent.
Orchestrator waits for ALL parallel agents to complete, then performs audit merge per references/execution-protocols-swarm.md (Audit Merge Protocol).
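The scheduling pattern described here (at most `maxAgents` suites in flight, a failed agent replaced, results collected only once everything has finished) can be illustrated with a thread pool. This is an analogy, not the skill's mechanism: threads stand in for background subagents, `run_suite` is a hypothetical callable, and the one-replacement limit (`max_replacements`) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_integration_suites(suites, run_suite, max_agents=3, max_replacements=1):
    """Run suites with bounded concurrency; resubmit a failed suite to a replacement worker."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        pending = {pool.submit(run_suite, s): (s, 0) for s in suites}
        while pending:
            # Snapshot the current futures; replacements are picked up on the next pass
            for fut in as_completed(list(pending)):
                suite, attempt = pending.pop(fut)
                try:
                    results[suite] = fut.result()
                except Exception as exc:
                    if attempt < max_replacements:
                        pending[pool.submit(run_suite, suite)] = (suite, attempt + 1)
                    else:
                        results[suite] = {"status": "FAILED", "error": str(exc)}
    return results  # complete set, ready for the audit merge
```

The function returns only after every suite has a final result, which mirrors the requirement that the audit merge starts after ALL parallel agents complete.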
4. E2E Suite Agents (fg, SEQUENTIAL — one at a time)
Only if frontendType != api-only. Runs against shared local stack (NOT swarm-isolated).
Each agent receives: user journey steps, browser config, chrome-devtools protocol (if chromeDevtools: true), log paths, Service Readiness Report, Feature Context Document, Tool Inventory subset.
Webapp: Navigate with agent-browser -> snapshot -> execute journey -> re-snapshot + devtools check -> verify backend via curl -> verify DB.
Mobile: Present guided steps via AskUserQuestion -> user acts -> verify via curl/DB/logs.
Browser tool priority (skipping without attempting is PROHIBITED):
- `agent-browser` (PRIMARY): `open <url>` -> `snapshot -i` -> `click`/`fill @ref` -> re-snapshot
- Playwright (FALLBACK): if agent-browser unavailable/errors
- Direct HTTP/API (LAST RESORT): mark untestable parts as "guided"
Reports back: PASS/FAIL per step, screenshots, network/console findings, backend verification, logs.
5. Regression Agent (fg, after autonomous suites)
Run testing.unitTestCommand once. Report total/passed/failed/skipped. Never interleaved with other suites.
6. Guided Happy-Path (MAIN CONVERSATION, optional, if user approved — LAST)
Runs LAST — after all autonomous suites and regression. No browser automation. Category 1 only. Swarm isolation NOT used — shared local stack.
Each happy-path scenario is a separate task (Phase 4.6.{M}).
Per-scenario flow (orchestrator drives — NOT delegated to a single subagent):
- TaskUpdate(scenario task, status: "in_progress")
- Seed: Spawn subagent (fg) to seed DB and set up prerequisites for this scenario
- Per-step loop (for each step in the scenario):
  a. Orchestrator presents step via `AskUserQuestion` (MANDATORY; text output PROHIBITED). Options: `["Done - ready to verify", "Skip this test", "Issue encountered"]`
  b. If "Done": spawn verification subagent (fg) that:
     - Checks service logs since step start (grep ERROR/WARN --since timestamp)
     - Runs DB queries to verify expected state changes from this step
     - Runs API verification calls if applicable
     - Reports: PASS/FAIL with log findings + DB state analysis
  c. If verification FAIL: record finding, continue to next step (do not halt scenario)
  d. If "Skip": record SKIPPED, continue
  e. If "Issue encountered": record user's description, continue
- After all steps: TaskUpdate(scenario task, status: "completed")
- Next scenario
Critical Execution Rules
- Never create batch scripts — each test explicitly passed to subagent
- Explicit data seeding instructions (table, fields, values, commands) — never guess
- Each subagent receives applicable security checklist items only
- Each subagent receives per-agent log file paths to check after execution
- Orchestrator checks logs between suites and after parallel completion
- Credential assignment: Rotate role names from `testCredentials` across suites (round-robin: suite 1 gets role A, suite 2 gets role B, wraps to role A if more suites than roles); pass role name only, never values
- Finding verification (mandatory): identify source code -> read to confirm -> distinguish real vs agent-created -> report only confirmed. Unconfirmed -> `Severity: Unverified` in `### Unverified`
- Anomaly detection: duplicate records, unexpected DB changes, warning/error logs, slow queries, orphaned references, auth anomalies, unexpected response fields/status codes
- External CLI guard: Before any CLI command from `externalServices[]`, verify the subcommand is in `allowedOperations` and no `prohibitedFlags` are present. Reject non-matching commands.
- Integration suites: PARALLEL (`run_in_background: true`), up to `maxAgents` concurrent
- E2E suites: SEQUENTIAL (fg, one at a time); browser automation cannot parallelize
- Guided mode: all suites SEQUENTIAL (no parallel). Guided happy-path runs in main conversation (orchestrator presents steps, subagents seed and verify)
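The round-robin credential assignment in the rules above reduces to a modulo over the role list. A minimal sketch, with `assign_roles` as a hypothetical name and the role names in the usage note invented for illustration; only role names are handled, never credential values.

```python
def assign_roles(suites, roles):
    """Assign one role name per suite, wrapping around when suites outnumber roles."""
    if not roles:
        raise ValueError("testCredentials defines no roles")
    return {suite: roles[i % len(roles)] for i, suite in enumerate(suites)}
```

With three suites and the two roles `["admin", "member"]`, the third suite wraps back to `admin`.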
Phase 5 — Results & Docs
Fix cycle: Runtime-fixable issues -> verify real (re-read error, retry once, confirm root cause) -> spawn subagent to fix -> re-run suite -> max 3 cycles. Code bug -> document + ask user.
Audit merge: If not done in Phase 4, merge all parallel agent results per references/execution-protocols-swarm.md (Audit Merge Protocol). Consolidate findings, security observations, DB consistency, log findings across all agents.
Documentation: Spawn subagent (foreground). Timestamp via date -u +"%Y-%m-%d-%H-%M-%S". Pattern: {timestamp}_{semantic-name}.md. Read ../autonomous-tests/references/templates.md. Four doc types: test-results (always — rename header to "Test Results"), pending-fixes (bugs/infra), pending-guided-tests (browser/visual/physical), pending-autonomous-tests (identified but not run). Include service log analysis, DB consistency results (if WARN/FAIL). When audit enabled -> append "Execution Audit" section (agent count, durations, limits, cleanup). Include ALL results from autonomous + guided phases.
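The `{timestamp}_{semantic-name}.md` pattern can be sketched as below. The skill takes its timestamp from `date -u`; the `datetime` call here is a stand-in that produces the same layout. Lowercasing the semantic name and collapsing other characters to hyphens is an assumption, since no slug rule is stated. `doc_filename` is a hypothetical name.

```python
import re
from datetime import datetime, timezone

def doc_filename(semantic_name, now=None):
    """Build a doc filename in the {timestamp}_{semantic-name}.md pattern."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S")  # same layout as date -u +"%Y-%m-%d-%H-%M-%S"
    slug = re.sub(r"[^a-z0-9]+", "-", semantic_name.lower()).strip("-")
    return f"{stamp}_{slug}.md"
```

Because the timestamp leads the name, a plain directory listing sorts the docs chronologically.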
Docker cleanup verification: Spawn subagent (foreground) per references/execution-protocols-swarm.md (Docker Cleanup Verification):
- Verify no lingering swarm containers, networks, or volumes
- Clean orphans if any remain
- Remove `/tmp/autonomous-swarm-{sessionId}/`
Cleanup: Remove testDataPrefix data only. Never touch pre-existing. Kill log capture processes. Verify cleanup. Log actions.
DB consistency final check: Run POST_CLEANUP verification. Zero test records must remain.
Phase 6 — Finalize
Important: Run `/clear` before invoking another skill (e.g., `/autonomous-fixes`) to free context window tokens and prevent stale state from interfering with the next operation.
Rules
| Rule | Detail |
|---|---|
| No production | Never modify production data or connect to production services |
| No credentials in output | Never expose credentials, keys, tokens, or env var values — pass role names only |
| Plan before execution | Phase 3 plan mode required before any test execution |
| Subagents only | All execution via Agent(). Main-conversation execution PROHIBITED |
| Model inheritance | Subagents inherit from main conversation — ensure Opus is set |
| Integration = parallel | Integration suites run in PARALLEL via run_in_background: true, up to maxAgents concurrent |
| E2E = sequential | E2E suites run ONE at a time — browser automation cannot parallelize |
| Guided = all sequential | Guided mode overrides parallel — all suites sequential |
| Isolated Docker stacks | Each parallel agent gets its own Docker stack with remapped ports and namespaced containers |
| Port cleanup mandatory | All swarm ports must be freed after execution — no lingering binds |
| Docker cleanup mandatory | All swarm containers, networks, volumes removed after execution — no orphans |
| Explore agents read-only | No file edits or state-modifying commands |
| UTC timestamps | Via date -u only, never guess |
| No unsafe MCPs | Never activate safe: false MCPs |
| External CLI gating | Blocked when cli.blocked. Per-run user confirmation. allowedOperations only |
| No dynamic commands | Only execute verbatim config commands — no generation/concatenation/interpolation |
| Integration tests = curl | Always curl — never mock, never script-based test runners |
| E2E tests = browser | agent-browser (primary) or Playwright (fallback) — never test production |
| Unit tests = regression | Run ONCE at the end — never during integration/E2E suites |
| Data seeding = explicit | Never guess field names, values, or schemas. Seed schema discovery mandatory |
| Each test = one subagent | Never batch scripts — each test passed individually to subagent |
| Service logs monitored | Log capture active during all test phases. Check between suites |
| Finding verification | Verify against source code before reporting any finding |
| Idempotent test data | Prefix with testDataPrefix. Skip or reset if exists |
| External API care | Delays between calls, sandbox modes, minimize requests |
| `_autonomous/` reading | Summary + Issues Found sections only |
| Capabilities auto-detected | Never ask user to configure manually |
| Guided = user augmentation | No browser automation in guided mode — user performs all actions |
| Guided = happy-path only | Category 1 only in guided mode — categories 2-8 autonomous-only |
| Tool loading gate | Browser tools need pre-plan approval in autonomous mode, never in guided |
| Plan self-containment | All context embedded in plan for post-reset survival — no "see above" references |
| Guided happy-path = post-all | Guided happy-path runs last — after all autonomous suites AND regression |
| Post-discovery prompts | Standard mode only — skipped when guided arg or regression mode active |
| Documentation in every run | Test-results doc generated for every run. Embedded in plan execution protocol |
| DB consistency inline | POST_SEED, POST_TEST, POST_CLEANUP checks within Phase 4 per suite |
| Audit merge before docs | Parallel results merged before documentation phase |
| Task tracking per phase | TaskCreate for all execution phases at plan start. TaskUpdate in_progress/completed around each. Phase Orchestration Protocol embedded in plan |
| Guided = main conversation | Orchestrator presents steps via AskUserQuestion. Subagents seed DB and verify logs/DB per step. NOT delegated to a single subagent |
| Guided = per-scenario tasks | Each happy-path scenario gets its own task (Phase 4.6.{M}) |
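The `allowedOperations` / `prohibitedFlags` part of the External CLI gating rule can be sketched as a predicate. The data shapes are assumptions (both fields treated as lists of strings, the subcommand taken as the second token), `examplecli` is a made-up CLI, and the `cli.blocked` check and per-run user confirmation are not modeled.

```python
import shlex

def cli_command_allowed(command, service):
    """True only if the subcommand is allow-listed and no prohibited flag appears."""
    tokens = shlex.split(command)
    if len(tokens) < 2:
        return False  # no subcommand to validate
    if tokens[1] not in service.get("allowedOperations", []):
        return False
    prohibited = set(service.get("prohibitedFlags", []))
    # Compare the flag name only, so "--flag=value" is caught as well
    return not any(tok.split("=", 1)[0] in prohibited for tok in tokens[2:])
```

For a service entry such as `{"allowedOperations": ["listen"], "prohibitedFlags": ["--live"]}`, the command `examplecli listen --port 4000` passes, while `examplecli listen --live` and `examplecli delete all` are rejected.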
Operational Bounds
| Bound | Constraint |
|---|---|
| Max parallel agents | swarm.maxAgents (default 5) |
| Max agents total | Approved test suites + service restoration + setup + regression |
| Max fix cycles | 3 per suite |
| Health check timeout | 30s per service (60s for swarm stacks) |
| Capability cache | rescanThresholdDays (default 7 days) |
| Command scope | User-approved config commands only |
| Docker scope | Local only — Phase 1 aborts on production indicators |
| Docker context | swarm.dockerContext validated in Phase 1 |
| Port range | portRangeStart to portRangeStart + (maxAgents * portStep) — validated free |
| Credential scope | Env var references only — raw values forbidden, redacted on display |
| MCP scope | safe: true only |
| Subagent lifecycle | Integration: parallel bg. E2E/guided/regression: one fg at a time |
| Explore agent scope | One per Phase 2. Read-only |
| External CLI scope | allowedOperations only. Per-run confirmation. Blocked when cli.blocked |
| System commands | which, docker compose ps, docker context inspect/show, git branch/diff/log, test -f, find . -maxdepth 3 -name "CLAUDE.md" -type f, date -u, curl -sf localhost, python3 -c json/hashlib, lsof/ss for port checks |
| External downloads | Docker images via user's compose only. Playwright browsers if present. No other downloads |
| Data access | Outside project: ~/.claude/settings.json (RO), ~/.claude/trusted-configs/ (RW), ~/.claude/CLAUDE.md (RO). .env scanned for patterns only |
| Trust boundaries | Config SHA-256 verified out-of-repo. Untrusted inputs -> analysis only -> plan -> user approval |
| Guided happy-path scope | Category 1 only. No browser automation. Sequential. Runs last |
| Documentation output | Minimum 1 doc (test-results) per run. Embedded in execution protocol for post-reset survival |
| Temp directory | /tmp/autonomous-swarm-{sessionId}/ — removed in Phase 5 cleanup |