# Fleet (dag-fleet)
A skill for running parallel `claude -p` or `codex exec` workers in tmux with budgets and DAG dependencies. Supports both Claude and Codex providers, set per fleet or per worker. The operator owns all kill / steer / redirect decisions — there is no auto-restart, no auto-verify, no babysitter loop.
## When to use this skill (and when NOT to)
FIRST: prefer Claude Code's built-in Agent tool when any of these are true. It's simpler, faster, and avoids the fleet machinery entirely.
- The work fits inside the current conversation
- All sub-agents will finish in under 10 minutes
- You'll synthesize the results in the same session
- You don't need budget caps or dependency ordering
Reach for THIS skill only when ≥1 of these is true:
- Persistence: the run will outlive the parent `claude` process (e.g. multi-hour fleet, user closes laptop)
- Per-worker budgets: you need `max_budget_usd: N` enforced per worker
- DAG dependencies: worker D must wait for A, B, C to finish first
- Mixed models per worker: Sonnet researchers + Haiku validators in the same fleet
- Tmux pane visibility: the user wants to attach to individual workers and watch them stream
If none of those apply, stop reading this skill and use the Agent tool.
## What this skill is NOT
- Not an auto-recovery system. If a worker fails or hangs, the operator decides what to do.
- Not a babysitter. There is no `orchestrate.sh`, no stuck-detection that kills, no mid-flight steering.
- Not for "spawn 3 quick lookups in parallel" — that's the Agent tool's job.
## Available scripts
| Script | When to call | Args |
|---|---|---|
| `launch.sh` | Start a new fleet from a `fleet.json` you generated | `<fleet-root>` |
| `status.sh` | Show what's running, what's done, live cost, last message per worker | `<fleet-name-or-root> [-v] [--watch] [--json]` |
| `kill.sh` | Stop one worker or the entire fleet (operator's hard stop) | `<fleet-name-or-root> <worker-id>\|all [--force]` |
| `relaunch-worker.sh` | After editing one worker's `prompt.md`, re-run just that worker | `<fleet-name-or-root> <worker-id>` |
| `report.sh` | Generate a markdown summary when the fleet is done | `<fleet-root>` |
| `view.sh` | Capture a single worker's tmux pane content | `<fleet-name-or-root> <worker-id>` |
| `feed.sh` | Stream a unified event feed across all workers | `<fleet-name-or-root> [--agent <id>]` |
Utilities (in `lib/`):

| Utility | Purpose | Usage |
|---|---|---|
| `dag-viz.py` | Visualize fleet DAG structure (ASCII or mermaid) | `python3 ${CLAUDE_SKILL_DIR}/lib/dag-viz.py <fleet.json> [--mermaid]` |
All scripts accept either an absolute fleet-root path or a fleet name (resolved via `~/.claude/fleet-registry.json`, populated automatically by `launch.sh`).
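That name-or-path resolution can be pictured as the following sketch. The flat `{fleet_name: fleet_root}` registry layout is an assumption for illustration — the real file written by `launch.sh` may store more metadata per fleet.

```python
from pathlib import Path

def resolve_fleet_root(name_or_root: str, registry: dict[str, str]) -> Path:
    """Resolve a script argument to an absolute fleet root.

    `registry` stands in for the parsed ~/.claude/fleet-registry.json;
    a flat {fleet_name: fleet_root} layout is assumed here, which may
    differ from what launch.sh actually writes.
    """
    path = Path(name_or_root)
    if path.is_absolute():
        return path  # already a fleet-root path
    if name_or_root in registry:
        return Path(registry[name_or_root])
    raise KeyError(f"unknown fleet name: {name_or_root}")
```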
## Launch procedure (MUST follow exactly)
When the user asks you to launch a fleet:
1. Set `FLEET_ROOT` to the user's specified directory. Default to cwd if unspecified. Use absolute paths only. `mkdir -p $FLEET_ROOT/workers`
2. Generate `$FLEET_ROOT/fleet.json` — see `references/fleet-json-schema.md` for the full schema. Required top-level fields: `fleet_name`, `config`, `workers[]`. Each worker needs `id`, `type`, `task`, `model`, `max_turns`, `max_budget_usd`. Use `depends_on: [...]` for DAG ordering.
3. For each worker, create `$FLEET_ROOT/workers/{id}/prompt.md`. The prompt MUST include this line verbatim: `Save ALL output files to $FLEET_ROOT/workers/{id}/output/ — use absolute paths.` (Substitute the real fleet root and worker id.)
4. Run: `bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh $FLEET_ROOT`
5. Do NOT write your own tmux/claude commands. `launch.sh` handles topo sort, tmux session creation, per-worker spawning, budgets, and the registry.
6. ALWAYS tell the user the exact status command so they can monitor manually: `bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-name-or-root>`. This is mandatory after every launch. The user must be able to check status without asking you.
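Following that procedure, a minimal `fleet.json` for two researchers feeding one synthesizer might look like this sketch. The ids, tasks, models, and dollar amounts are illustrative, not prescriptive — see `references/fleet-json-schema.md` for the authoritative schema.

```json
{
  "fleet_name": "demo-research",
  "config": { "max_concurrent": 2, "max_budget_fleet": 5.0 },
  "workers": [
    { "id": "researcher-01", "type": "research", "task": "Survey source A",
      "model": "sonnet", "max_turns": 30, "max_budget_usd": 1.0 },
    { "id": "researcher-02", "type": "research", "task": "Survey source B",
      "model": "sonnet", "max_turns": 30, "max_budget_usd": 1.0 },
    { "id": "synthesizer", "type": "write", "task": "Merge findings into synthesis.md",
      "model": "sonnet", "max_turns": 20, "max_budget_usd": 1.0,
      "depends_on": ["researcher-01", "researcher-02"] }
  ]
}
```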
## Re-running ONE worker (the addendum workflow)
The user has a finished fleet and wants to add 1-2 sources / change one worker's instructions:
1. Edit `$FLEET_ROOT/workers/{id}/prompt.md` (add the new sources / instructions).
2. Run `bash ${CLAUDE_SKILL_DIR}/scripts/relaunch-worker.sh <fleet-name> {id}`.
3. The worker's old `session.jsonl` is rotated to `.bak`, a fresh tmux window spawns, and other workers are untouched.
4. The fleet's tmux session must still exist. If it has been killed, the user must `launch.sh --force-relaunch` the whole fleet — `relaunch-worker.sh` only works against a live fleet session.
If the user wants to re-run multiple workers, do it one at a time. There is no batch re-run; that's intentional.
## Killing
There are two operator-initiated kill paths and no automatic kills:
- `kill.sh <fleet> <worker-id>` — kill one worker and sweep its subprocess descendants. Use this when you've decided a single worker is going down the wrong path.
- `kill.sh <fleet> all --force` — tear down the entire fleet: kill all tmux windows, sweep every orphan subprocess, mark workers KILLED, and unregister from the registry.
There is no `steer.sh` and no mid-flight redirection. The intentional workflow for "I want this worker to take a different direction" is: kill it with `kill.sh`, edit its `prompt.md`, then run `relaunch-worker.sh`. Three steps, fully under operator control.
## Worker types
The `type` field on each worker controls the `--disallowed-tools` set passed to `claude`. Pick one:
- `read-only` — disallows: Bash, Edit, Write, Agent, WebFetch, WebSearch. Cannot write files. Only use for pure analysis where output is captured from assistant messages in `session.jsonl`.
- `write` — disallows: Bash, Agent, WebFetch, WebSearch. Use for synthesizers and any worker that writes output files.
- `code-run` — disallows: Agent, WebFetch, WebSearch. The typical default for build/test workers.
- `research` — disallows: Bash, Edit, Agent (web access enabled). Use for researchers, not `read-only`.
- `reviewer` — disallows: Bash, Edit, Agent, WebFetch, WebSearch. Has Read + Write only. Use for reviewers that write verdict files.
- `orchestrator` — disallows: Agent, WebFetch, WebSearch, Edit.
WARNING: `read-only` cannot write files. If a worker needs to save output (`findings.md`, `synthesis.md`, etc.), use `write`, `research`, `reviewer`, or `code-run`. Setting a synthesizer to `read-only` will burn its entire budget trying to find a Write tool.
See `references/worker-types.md` for the full permission matrix.
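The list above amounts to a small lookup table, sketched below for clarity. This is a reconstruction from this section, not the authoritative matrix — that lives in `references/worker-types.md`.

```python
# Reconstruction of the worker-type -> --disallowed-tools mapping described above.
# Authoritative source: references/worker-types.md.
DISALLOWED = {
    "read-only":    ["Bash", "Edit", "Write", "Agent", "WebFetch", "WebSearch"],
    "write":        ["Bash", "Agent", "WebFetch", "WebSearch"],
    "code-run":     ["Agent", "WebFetch", "WebSearch"],
    "research":     ["Bash", "Edit", "Agent"],
    "reviewer":     ["Bash", "Edit", "Agent", "WebFetch", "WebSearch"],
    "orchestrator": ["Agent", "WebFetch", "WebSearch", "Edit"],
}

def can_write_files(worker_type: str) -> bool:
    """True when the type leaves the Write tool enabled (see the WARNING above)."""
    return "Write" not in DISALLOWED[worker_type]
```

This makes the WARNING mechanical: any output-producing worker must use a type for which `can_write_files` is true.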
## Provider support (Claude + Codex)
Workers can run on either `claude` (the default) or `codex` (OpenAI Codex CLI). Set the provider at fleet level or per worker:
```json
{
  "config": {
    "provider": "codex",
    "model": "gpt-5.4",
    "reasoning_effort": "medium"
  },
  "workers": [
    { "id": "researcher", "type": "research", "provider": "codex", "model": "gpt-5.4", "reasoning_effort": "medium" },
    { "id": "writer", "type": "write", "provider": "claude", "model": "sonnet" }
  ]
}
```
### Codex-specific fields
| Field | Values | Default | Scope |
|---|---|---|---|
| `provider` | `"claude"` \| `"codex"` | `"claude"` | `config` + per-worker |
| `reasoning_effort` | `"low"` \| `"medium"` \| `"high"` | (none) | `config` + per-worker, codex only |
### Codex model aliases
| Model | Use case |
|---|---|
| `gpt-5.4` | Flagship — strongest reasoning, recommended default |
| `gpt-5.4-mini` | Fast/cheap — validators, simple tasks |
| `gpt-5.3-codex` | Coding-focused (migrating to gpt-5.4) |
### Codex limitations vs Claude
- No `--max-budget-usd` — codex has no per-worker budget cap. Fleet-level cost tracking still works (estimated from token counts).
- No `--fallback-model` — codex has no automatic model fallback.
- No per-tool disabling — codex uses sandbox modes (`read-only`, `workspace-write`) instead of `--disallowed-tools`. Worker types are mapped automatically.
- Web search — research workers get `-c 'web_search="live"'` automatically.
- All output workers need `workspace-write` — the codex `read-only` sandbox blocks ALL file writes, including output.
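Given those constraints, the automatic type-to-sandbox mapping plausibly reduces to the sketch below. The real rules live in `launch.sh`; this is only an illustration of the decision, not its implementation.

```python
def codex_sandbox(worker_type: str) -> str:
    """Sketch of the automatic worker-type -> codex sandbox mapping.

    Only pure-analysis read-only workers can take the read-only sandbox;
    every type that writes output files needs workspace-write, because the
    codex read-only sandbox blocks ALL file writes.
    """
    return "read-only" if worker_type == "read-only" else "workspace-write"

def wants_live_web_search(worker_type: str) -> bool:
    """Research workers get -c 'web_search="live"' added automatically."""
    return worker_type == "research"
```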
## DAG dependencies
```json
{
  "id": "synthesizer",
  "depends_on": ["researcher-01", "researcher-02"]
}
```
`launch.sh` uses the shared `lib/dag.sh` primitives (Kahn's BFS-layered topo sort) to order workers, and waits for dependencies to emit a terminal result event before starting dependents. Cycles are detected before any tmux state is created — the launch exits 2 with `CYCLE:a,b,...` on stderr. Workers within a layer run in parallel up to `max_concurrent`. Use `dag-viz.py` to preview the DAG structure before launch.
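The layering behavior can be sketched in Python. This is a re-implementation for illustration only — the real primitive is the shell code in `lib/dag.sh`.

```python
def topo_layers(workers):
    """Kahn's BFS-layered topological sort (illustrative sketch of lib/dag.sh).

    `workers` is a list of dicts with "id" and optional "depends_on".
    Returns a list of layers; workers within a layer can run in parallel.
    Raises ValueError naming the remaining ids when a cycle exists.
    """
    deps = {w["id"]: set(w.get("depends_on", [])) for w in workers}
    layers = []
    while deps:
        # A worker is ready when all of its dependencies have already run.
        ready = sorted(wid for wid, d in deps.items() if not d)
        if not ready:  # every remaining worker waits on another -> cycle
            raise ValueError("CYCLE:" + ",".join(sorted(deps)))
        layers.append(ready)
        for wid in ready:
            del deps[wid]
        for d in deps.values():
            d.difference_update(ready)
    return layers
```

For the synthesizer example above this yields two layers: both researchers first, then the synthesizer.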
## Budgets
- `worker.max_budget_usd: N` — per-worker hard cap, passed to `claude --max-budget-usd`
- `config.max_budget_fleet: N` — total fleet cap; `launch.sh` stops launching new workers once it is exceeded (already-running workers are not killed — the cap means "no new spending")
## STRICT RULES
- ALWAYS use the scripts above for EVERY operation. Never write your own tmux / claude commands.
- NEVER use the `--bare` flag with `claude` — it causes auth failures.
- Fleet root = user's directory. Default to cwd. ALL fleet files go inside `$FLEET_ROOT`.
- Worker output paths must be absolute: `$FLEET_ROOT/workers/{id}/output/`. Tell the worker this in its `prompt.md`.
- `launch.sh` is the only way to start workers. `relaunch-worker.sh` is the only way to selectively re-run one. There is no other path.
- Operator owns kill and direction changes. Do not auto-kill, do not auto-restart, do not auto-redirect. If a worker is misbehaving, surface it to the user and let them decide.
- Do NOT invent missing scripts. If you find yourself wanting `steer.sh`, `verify.sh`, `add-worker.sh`, or `orchestrate.sh` — they were intentionally removed. Use the operator-owned workflow above instead.
## Rationalizations to reject
| Agent says | Rebuttal |
|---|---|
| "The task is small enough that I can write the tmux commands myself" | The skill exists to prevent the 15 things you'll forget (unset `CLAUDECODE`, `--disallowed-tools`, session naming, registry, topo sort). Use `launch.sh`. |
| "I'll use relaunch-worker.sh to restart all stuck workers at once" | One at a time, intentional. Batch restart is how experiment 001 burned $20 — cache rebuilds on every worker compounded. |
| "The worker seems stuck — I should kill and restart it" | Long thinking blocks look like hangs. Check `status.sh` or `view.sh` first. Only the operator kills workers. |
| "I should add a verify step after each worker finishes" | There is no verify step. The operator reads output and decides. Auto-verify was removed after it caused more harm than the failures it caught. |
| "I'll just add --bare to speed things up" | `--bare` causes auth failures. Never use it. This is a STRICT RULE. |
## When to give up on this skill
If the user asks for behavior that requires auto-recovery, mid-flight steering, or per-worker validation loops, tell them this skill no longer does those things by design. Suggest:
- For auto-recovery → they should run a watcher script themselves and call `kill.sh` + `relaunch-worker.sh` from it
- For mid-flight steering → kill + edit `prompt.md` + relaunch-worker
- For validation → they read the output files themselves and decide
The skill's surface area was deliberately reduced after experiments where automated behavior caused more harm than the failures it was trying to recover from.
$ARGUMENTS