long-run
Long Run Harness
Orchestrates multi-day execution of complex tasks through a milestone pipeline. Each milestone passes through plan-crafting → run-plan → review-work with checkpoints between milestones for recovery from interruptions.
Core Principle
Long-running execution must be resumable, auditable, and fail-safe. Every state transition is persisted to disk before the next action begins. If execution stops for any reason — rate limit, crash, user pause, context loss — it can resume from the last checkpoint without repeating completed work.
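One way to honour "persisted to disk before the next action begins" is to write the state file atomically, so a crash mid-write never leaves a truncated `state.md`. The sketch below is an illustration only: the helper name `write_state_atomically` is hypothetical, and the skill itself does not prescribe how the file is written.

```python
import os
import tempfile
from pathlib import Path


def write_state_atomically(state_path: Path, content: str) -> None:
    """Write content so that a crash never leaves a half-written state file.

    Hypothetical helper: the harness does not specify a write mechanism.
    """
    state_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, state_path)  # swap the new file into place
    except BaseException:
        # Remove the temp file on any failure, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this pattern a reader of `state.md` sees either the previous state or the new one, never a partial write, which is what makes resuming from the last recorded transition safe.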
Hard Gates
- Milestones must exist before execution. Either from the `milestone-planning` skill or user-provided. Never generate milestones inline during execution.
- State file must be updated before and after every milestone. No in-memory-only state. If it's not on disk, it didn't happen.
- Each milestone must complete the full pipeline. plan-crafting → run-plan → review-work. No shortcuts. No skipping review-work "because it looked fine."
- Failed milestones block dependents. If M2 depends on M1 and M1 fails review, M2 does not start. Period.
- User confirmation required at gate points. Before starting a new milestone phase (planning, execution, review), check if the user wants to continue, pause, or abort.
- Never modify completed milestones. Once a milestone passes review-work, its files are locked. If a later milestone needs changes to earlier work, that is a new milestone.
- Checkpoint after every milestone completion. Write a checkpoint file recording what was done, test results, and review verdict before proceeding.
When To Use
- After `milestone-planning` has produced a milestone DAG
- When the user says "long run", "start long run", "execute milestones", or "run all milestones"
- When resuming a previously paused long run session
When NOT To Use
- When milestones don't exist yet (use `milestone-planning` first)
- When there's only one milestone (use plan-crafting + run-plan directly)
- For quick tasks that don't warrant multi-phase execution
Input
- Harness state directory path, e.g., `docs/engineering-discipline/harness/<session-slug>/`
- The directory must contain `state.md` and `milestones/*.md` files
If no state directory exists, ask the user if they want to run milestone-planning first.
Process
Phase 1: Load and Validate State
- Read `state.md` from the harness directory
- Read all milestone files from `milestones/`
- Validate:
  - All milestones referenced in state.md have corresponding files
  - Dependency DAG is valid (no cycles, topological sort possible)
  - No milestone is in an invalid state (e.g., "executing" without a plan file)
- Determine current position:
  - Which milestones are completed?
  - Which milestones are ready to start (all dependencies met)?
  - Is this a fresh start or a resume?
- Present status to the user:
## Long Run Status: [Session Name]
**Progress:** N/M milestones completed
**Current phase:** [planning M3 | executing M3 | reviewing M3 | ready to start M3]
**Next up:** [M3, M4 (parallel)]
Completed: M1 ✓, M2 ✓
In progress: M3 (executing)
Pending: M4, M5
- Ask user to confirm: continue, pause, or abort.
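The DAG validation and "ready to start" steps in Phase 1 can be sketched with Python's standard `graphlib` module. This is a minimal illustration, assuming the milestone files have already been parsed into a mapping from milestone id to its set of dependency ids. That mapping and the function names `validate_dag` and `ready_milestones` are hypothetical, since the skill does not define a parsing format.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(dependencies: dict[str, set[str]]) -> list[str]:
    """Return milestone ids in a valid execution order, or raise ValueError."""
    known = set(dependencies)
    for milestone, deps in dependencies.items():
        missing = deps - known
        if missing:
            raise ValueError(
                f"{milestone} depends on unknown milestones: {sorted(missing)}"
            )
    try:
        # static_order() yields each milestone after all of its dependencies.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # graphlib stores the detected cycle as the second element of args.
        raise ValueError(f"dependency cycle detected: {exc.args[1]}") from exc


def ready_milestones(
    dependencies: dict[str, set[str]], status: dict[str, str]
) -> list[str]:
    """Pending milestones whose dependencies are all completed."""
    return sorted(
        m
        for m, deps in dependencies.items()
        if status[m] == "pending" and all(status[d] == "completed" for d in deps)
    )
```

For example, with M2 and M3 both depending only on a completed M1, `ready_milestones` returns both, which corresponds to the "Next up: [M3, M4 (parallel)]" line in the status report.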
Phase 2: Milestone Execution Loop
For each milestone in topological order:
┌─────────────────────────────────────┐
│ Milestone Pipeline │
│ │
│ ┌──────────┐ ┌─────────┐ │
│ │ Plan │───→│ Run │ │
│ │ Crafting │ │ Plan │ │
│ └──────────┘ └────┬────┘ │
│ │ │
│ ┌────▼────┐ │
│ │ Review │ │
│ │ Work │ │
│ └────┬────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ PASS? │ │
│ │ Yes → checkpoint│ │
│ │ No → retry │ │
│ └─────────────────┘ │
└─────────────────────────────────────┘
Step 2-1: Gate Check
Before starting a milestone:
- Verify all dependency milestones have status `completed`
- Verify no file conflicts with in-progress parallel milestones
- Update state.md: set milestone status to `planning`
- Update execution log with timestamp
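The two verification steps of the gate check can be expressed as a pure function that returns the reasons a milestone may not start. The sketch below assumes a hypothetical in-memory `Milestone` record with `status`, `depends_on`, and `files` fields. The skill stores this information in `state.md` and the milestone files but does not define such a type.

```python
from dataclasses import dataclass, field


@dataclass
class Milestone:
    # Hypothetical record; field names are illustrative, not from the skill.
    id: str
    status: str = "pending"
    depends_on: set[str] = field(default_factory=set)
    files: set[str] = field(default_factory=set)  # files in this milestone's scope


# Statuses that mean a milestone is currently being worked on.
IN_PROGRESS = {"planning", "executing", "validating"}


def gate_check(candidate: Milestone, all_milestones: dict[str, Milestone]) -> list[str]:
    """Return the reasons the milestone may not start (empty list = gate passes)."""
    problems = []
    for dep in sorted(candidate.depends_on):
        dep_status = all_milestones[dep].status
        if dep_status != "completed":
            problems.append(f"dependency {dep} is {dep_status}, not completed")
    for other in all_milestones.values():
        if other.id == candidate.id or other.status not in IN_PROGRESS:
            continue
        overlap = candidate.files & other.files
        if overlap:
            problems.append(f"file conflict with {other.id}: {sorted(overlap)}")
    return problems
```

Returning reasons rather than a boolean keeps the result usable for the execution log and for the report shown to the user when the gate does not pass.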
Step 2-2: Plan Crafting Phase
- Compose a Context Brief from the milestone definition:
  - Goal → from milestone file
  - Scope → files affected from milestone file
  - Success Criteria → from milestone file
  - Constraints → inherited from the parent problem + completed milestone context
  - Completed milestone context contract: From each completed predecessor, include ONLY:
    - Files created/modified (from checkpoint's "Files Changed" list)
    - Interface contracts established (function signatures, API shapes, type definitions)
    - Success criteria that were verified as met
    - Do NOT include: execution logs, review documents, worker/validator output, or full checkpoint contents
  - Note: Context Briefs composed from milestone definitions omit the Complexity Assessment section, since routing has already been determined by the milestone-planning phase. The brief goes directly to plan-crafting without re-routing.
- Invoke the `plan-crafting` skill pattern:
  - Create a plan document at `docs/engineering-discipline/plans/YYYY-MM-DD-<milestone-name>.md`
  - The plan must satisfy all milestone success criteria
  - The plan must not modify files outside the milestone's scope
- Update state.md: record plan file path for this milestone
- User gate: Present the plan and ask for approval before execution
Step 2-3: Run Plan Phase
- Update state.md: set milestone status to `executing`, increment `Attempts` counter by 1
- Execute the plan using the `run-plan` skill pattern:
  - Worker-validator loop for each task
  - Parallel execution for independent tasks
  - Information-isolated validators
- If run-plan reports failure after 3 retries on any task:
  - Update state.md: set milestone status to `failed`
  - Record failure details in execution log
  - Stop and report to user. Do not proceed to dependent milestones.
- If all tasks complete: proceed to review phase
Step 2-4: Review Work Phase
- Update state.md: set milestone status to `validating`
- Invoke the `review-work` skill pattern:
  - Information-isolated review against the plan document
  - Binary PASS/FAIL verdict
- If PASS:
  - Update state.md: set milestone status to `completed`
  - Write checkpoint file (see Checkpoint Format below)
  - Update execution log
  - Proceed to next milestone
- If FAIL:
  - Record review findings in execution log
  - Retry decision (based on `Attempts` counter in state.md, which persists across crashes):
    - If Attempts == 1: return to Step 2-3 with review feedback (re-execute same plan)
    - If Attempts == 2: return to Step 2-2 (re-plan with review feedback as constraint)
    - If Attempts >= 3: set status to `failed`, stop, report to user
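The retry decision after a failed review depends only on the persisted `Attempts` counter, so it can be written as a small pure function. The names `NextStep` and `next_step_after_failed_review` below are hypothetical. The branching itself follows the three cases listed above.

```python
from enum import Enum


class NextStep(Enum):
    RE_EXECUTE = "return to Step 2-3 with review feedback"
    RE_PLAN = "return to Step 2-2 with review feedback as constraint"
    FAIL = "set status to failed, stop, report to user"


def next_step_after_failed_review(attempts: int) -> NextStep:
    """Map the Attempts counter read from state.md to the next action."""
    if attempts < 1:
        # Step 2-3 increments Attempts before execution, so a review
        # can never run with a counter below 1.
        raise ValueError("Attempts must be at least 1 when a review has run")
    if attempts == 1:
        return NextStep.RE_EXECUTE
    if attempts == 2:
        return NextStep.RE_PLAN
    return NextStep.FAIL
```

Because the function reads nothing but the counter, a resumed session reaches the same decision as the interrupted one would have.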
Step 2-5: Cross-Milestone Integration Check
After a milestone passes review-work but before writing the checkpoint, verify that the milestone's output integrates correctly with all previously completed milestones:
- Run the project's highest-level verification (from state.md's Verification Strategy or rediscover using plan-crafting's Verification Discovery order)
- Check cross-milestone interfaces: If the completed milestone defines or consumes interfaces from predecessor milestones, verify they are compatible (function signatures match, API contracts hold, types align)
If integration check passes: Proceed to checkpoint.
If integration check fails — Cross-Milestone Failure Response:
The milestone passed its own review-work (internal correctness) but breaks integration with other milestones. This is a boundary problem.
- Diagnose (attempt 1):
  - Read the failure output
  - Identify which interface boundary or interaction is broken
  - Determine if the fix belongs to the current milestone or requires a corrective milestone
  - If fixable within current milestone scope: dispatch a targeted fix worker → re-run review-work → re-run integration check
  - If the fix is outside current milestone scope: proceed to escalation
- Diagnose (attempt 2):
  - If the first fix didn't resolve it, re-analyze
  - Apply a second targeted fix
  - Re-run integration check
- Escalate to user (after 2 failed attempts):
  - Report: which milestones are involved, what integration boundary failed, what fixes were tried
  - Options: add corrective milestone, rollback to checkpoint, accept and continue (user acknowledges the integration gap)
  - Log the user's decision in state.md execution log
Step 2-6: Checkpoint
After a milestone passes review:
Write checkpoints/M<N>-checkpoint.md:
# Checkpoint: M<N> — [Milestone Name]
**Completed:** YYYY-MM-DD HH:MM
**Duration:** [time from planning start to review pass]
**Attempts:** [number of plan-execute-review cycles]
## Plan File
`docs/engineering-discipline/plans/YYYY-MM-DD-<name>.md`
## Review File
`docs/engineering-discipline/reviews/YYYY-MM-DD-<name>-review.md`
## Test Results
[Full test suite status at checkpoint time]
## Files Changed
[List of files created/modified in this milestone]
## State After Milestone
[Brief description of system state — what works now that didn't before]
Phase 3: Parallel Milestone Execution
When multiple milestones have all dependencies satisfied and no file conflicts:
- Identify parallelizable milestone group
- Run plan-crafting for ALL parallel milestones first (sequentially — plans are lightweight)
- Present ALL plans together for batch approval: "Milestones M3 and M4 can run in parallel. Here are both plans. Approve each individually."
- User approves or rejects each plan independently. Only approved milestones proceed to execution. Rejected milestones return to Step 2-2 while approved ones execute.
- If all approved, dispatch each milestone's pipeline concurrently:
  - Each milestone runs run-plan → review-work (plan already approved in step 3)
  - Each runs in a worktree (`isolation: "worktree"`) to prevent file conflicts
  - After both complete and pass review, merge worktrees back
  - If either fails: handle independently (the other can continue if no dependency)
Worktree merge protocol:
- Both milestones pass review in their respective worktrees
- Check for file conflicts between worktree changes
- If no conflicts: merge sequentially (M_lower first, then M_higher)
- If conflicts detected: stop, report to user, request manual resolution
- After merge: run full test suite on merged result
- If tests fail: stop, report to user
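The conflict check and merge ordering in the protocol above can be sketched as follows. This covers only the decision logic, not the merge itself. It assumes the set of changed files per worktree has already been collected, and that milestone ids have the form `M<N>`, as in the checkpoint file names. The function names are hypothetical.

```python
from itertools import combinations


def milestone_number(milestone_id: str) -> int:
    """Extract N from an id of the assumed form 'M<N>'."""
    return int(milestone_id.lstrip("M"))


def plan_merge(changed_files: dict[str, set[str]]) -> list[str]:
    """Return the merge order (lowest milestone first) or raise on overlap."""
    ordered = sorted(changed_files, key=milestone_number)
    conflicts = {}
    for a, b in combinations(ordered, 2):
        overlap = changed_files[a] & changed_files[b]
        if overlap:
            conflicts[(a, b)] = sorted(overlap)
    if conflicts:
        # The protocol says to stop and ask the user, so nothing is merged here.
        raise RuntimeError(
            f"worktree conflicts, manual resolution required: {conflicts}"
        )
    return ordered
```

Sorting by the numeric part rather than the raw string matters once a session has ten or more milestones, since a plain string sort would place `M10` before `M3`.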
Phase 4: Completion
After all milestones are completed (including the Integration Verification Milestone from milestone-planning):
- Update state.md: set overall status to `completing`
- Final E2E Gate: Run the project's highest-level verification one final time on the fully integrated codebase
- Run full test suite for regression check
- If Final E2E Gate fails:
  - Diagnose: identify which milestone's output is the likely cause
  - Create a corrective milestone via Mid-Execution Correction procedure
  - Execute corrective milestone through the full pipeline (plan-crafting → run-plan → review-work)
  - Re-run E2E Gate after correction
  - If 2 corrective attempts fail: escalate to user with full diagnosis
- If Final E2E Gate passes: Update state.md: set overall status to `completed`
- Generate completion summary:
# Long Run Complete: [Session Name]
**Started:** YYYY-MM-DD
**Completed:** YYYY-MM-DD
**Total milestones:** N
**Total attempts:** [sum of all milestone attempts]
## Milestone Summary
| Milestone | Status | Attempts | Duration |
|-----------|--------|----------|----------|
| M1: [name] | ✓ completed | 1 | 2h |
| M2: [name] | ✓ completed | 2 | 4h |
| ...
## Final Test Suite
[PASS/FAIL — N passed, M failed]
## Files Changed (Total)
[Aggregated list across all milestones]
- Present to user and suggest `simplify` for a final code quality pass
Recovery Protocol
When resuming a paused or interrupted session:
- Read state.md to determine last known state
- For each milestone, determine recovery action:
| Last Status | Recovery Action |
|---|---|
| `pending` | Start normally |
| `planning` | Restart plan-crafting (plan file may be incomplete) |
| `executing` | Check run-plan progress; resume or restart |
| `validating` | Restart review-work (review may be incomplete) |
| `completed` | Skip (already checkpointed) |
| `failed` | Present failure to user; ask whether to retry or skip (see Skip Rules below) |
| `skipped` | Skip (user previously chose to skip this milestone) |
- For `executing` milestones: check if tasks in the plan have checkboxes marked. Resume from the first unchecked task.
- Read the `Attempts` counter from state.md to determine retry budget remaining. Do not reset the counter on resume; it persists across crashes to prevent infinite retry loops.
- Present recovery plan to user before proceeding.
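The recovery table and the "first unchecked task" rule can be sketched as a lookup plus a small parser. This is an illustration under assumptions: the action strings are shorthand for the table rows, and the plan is assumed to use Markdown task-list checkboxes (`- [ ]` and `- [x]`), which the skill implies but does not spell out. The names `RECOVERY_ACTIONS`, `recovery_plan`, and `first_unchecked_task` are hypothetical.

```python
import re
from typing import Optional

# Shorthand for the recovery table; keys are the status values from state.md.
RECOVERY_ACTIONS = {
    "pending": "start normally",
    "planning": "restart plan-crafting",
    "executing": "check run-plan progress; resume or restart",
    "validating": "restart review-work",
    "completed": "skip",
    "failed": "ask user: retry or skip",
    "skipped": "skip",
}

# Assumed checkbox syntax: "- [ ] task" (open) or "- [x] task" (done).
TASK_LINE = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")


def recovery_plan(statuses: dict[str, str]) -> dict[str, str]:
    """Map each milestone's last status to its recovery action."""
    unknown = {m: s for m, s in statuses.items() if s not in RECOVERY_ACTIONS}
    if unknown:
        raise ValueError(f"milestones with unrecognized status: {unknown}")
    return {m: RECOVERY_ACTIONS[s] for m, s in statuses.items()}


def first_unchecked_task(plan_text: str) -> Optional[str]:
    """Return the text of the first unchecked task, or None if all are done."""
    for line in plan_text.splitlines():
        match = TASK_LINE.match(line)
        if match and match.group(1) == " ":
            return match.group(2).strip()
    return None
```

Rejecting unrecognized statuses up front mirrors the Phase 1 validation rule that no milestone may be in an invalid state when a session resumes.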
Mid-Execution Correction
If execution reveals that a completed milestone's output is incorrect or a new milestone is needed:
- Pause execution — do not continue with dependent milestones
- Log the discovery in state.md execution log: what was found, which milestone triggered the discovery
- User decision required: present the situation and options:
  - Add corrective milestone: Create a new milestone definition (the user writes the goal and success criteria, or re-run milestone-planning for just the new scope). Insert it into the DAG with appropriate dependencies. Resume execution from the new milestone.
  - Re-plan from a checkpoint: Roll back to a completed milestone's checkpoint, mark subsequent milestones as `pending`, reset their `Attempts` to 0, and restart from that point.
  - Abort: Set overall status to `failed` and stop.
- New milestones follow the same pipeline — plan-crafting → run-plan → review-work. No shortcuts even for "quick fixes."
- Completed milestones are never modified (Hard Gate #6 still applies). The corrective milestone produces new files or overwrites with a full plan cycle.
Skip Rules
When a user chooses to skip a failed milestone:
- Set milestone status to `skipped` in state.md
- Log the skip event with user's reason in execution log
- Dependents of a skipped milestone are also blocked by default, same as `failed`. The DAG contract is: dependents run only after prerequisites are `completed`.
- The user may explicitly unblock a dependent by acknowledging the missing prerequisite: "Proceed with M4 despite M2 being skipped." Log this override in the execution log.
- If the user unblocks a dependent, add a note to that milestone's Context Brief during plan-crafting: "Prerequisite M2 was skipped. The following outputs are missing: [list from M2's success criteria]."
Skipped milestones cannot be un-skipped. If the user wants to attempt the milestone later, create a new milestone with the same goal.
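The blocking rule for skipped and failed milestones, including the explicit user override, can be sketched as a fixed-point computation over the DAG. The data shapes are assumptions: `dependencies` maps each milestone to its prerequisites, `status` holds the values from state.md, and `overrides` is a hypothetical set of `(dependent, prerequisite)` pairs the user has acknowledged.

```python
BLOCKING = {"failed", "skipped"}


def blocked_milestones(
    dependencies: dict[str, set[str]],
    status: dict[str, str],
    overrides: frozenset[tuple[str, str]] = frozenset(),
) -> set[str]:
    """Pending milestones that cannot start because of a failed/skipped prerequisite.

    Blocking propagates: a dependent of a blocked milestone is also blocked.
    """
    blocked: set[str] = set()
    changed = True
    while changed:  # repeat until no new milestone becomes blocked
        changed = False
        for milestone, deps in dependencies.items():
            if milestone in blocked or status[milestone] != "pending":
                continue
            for dep in deps:
                dep_blocks = (
                    status[dep] in BLOCKING and (milestone, dep) not in overrides
                )
                if dep_blocks or dep in blocked:
                    blocked.add(milestone)
                    changed = True
                    break
    return blocked
```

In this sketch an override applies to one dependent only. Acknowledging "Proceed with M4 despite M2 being skipped" unblocks M4, and anything downstream of M4 then waits on M4 in the ordinary way rather than being blocked by M2.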
Duration Guard
If a single milestone's total active time (from planning start to review completion) becomes excessive:
- Soft limit: If a milestone has been in `planning` or `executing` status for what appears to be a disproportionately large share of the overall work, pause and report to user: "Milestone M3 has been in progress for an extended period. Continue, re-scope, or abort?"
- Hard limit on attempts: The 3-attempt limit (F1) bounds retry loops. But if even a single attempt's plan-crafting generates more than 15 tasks, pause and report: "This milestone's plan has N tasks. It may be too large for a single milestone. Consider splitting."
- Purpose: Prevent a single runaway milestone from consuming the entire execution budget or running indefinitely on flaky tests.
Context Window Management
Long-running sessions will hit context window limits. Claude Code automatically compresses old messages (context collapse). The harness must be designed to survive this:
- Never rely on conversation memory for state. All state lives in `state.md` and milestone files on disk. If the context is compressed, the harness re-reads state files, so no information is lost.
- Each milestone is a fresh context boundary. When starting a new milestone's plan-crafting, the worker subagent starts with a clean context. It receives only the milestone definition and completed predecessor context (see F8 contract), not the full conversation history.
- Checkpoint files are the source of truth. If context is lost mid-milestone, recovery reads the checkpoint files, not compressed conversation summaries.
- Avoid accumulating large inline state. Do not build up a running summary of all milestones in the conversation. Instead, reference state.md and checkpoint files by path.
Rate Limit Handling
Long-running sessions will encounter rate limits. Claude Code has built-in retry with exponential backoff (up to 10 retries, 5-minute max backoff). The harness should work with this, not against it:
- Let Claude Code handle transient rate limits. Short 429/529 errors are retried automatically with backoff. Do not preemptively save state on every API error.
- Save state on persistent rate limits. If a rate limit persists beyond the automatic retry window (you'll see repeated "rate limit" messages), record current state to disk immediately.
- Log the rate limit event in execution log with timestamp.
- Report to user: "Rate limit hit. State saved. Resume with `long-run` when ready."
- Do NOT add manual retry loops on top of Claude Code's built-in retry; this causes retry amplification.
- Background agent bail: Claude Code's background agents (like reviewer subagents) bail immediately on 529 overload errors instead of retrying. This is why Phase 2.5 reviewer failure handling exists — reviewer failures are often transient rate limits, not permanent errors.
Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| Generating milestones inline instead of using milestone-planning | Milestones lack adversarial review; poor decomposition |
| Skipping review-work for "simple" milestones | Undetected defects compound across milestones |
| Continuing after a milestone fails | Dependent milestones build on broken foundation |
| Not updating state.md between phases | Crash loses progress; cannot resume |
| Modifying completed milestone files | Breaks checkpoint invariant; invalidates reviews |
| Running parallel milestones without worktree isolation | File conflicts corrupt both milestones |
| Auto-retrying on rate limit | Wastes quota; user may prefer to wait |
| Skipping user gates between milestones | User loses control of multi-day execution |
| Merging worktrees without conflict check | Silent data loss if files overlap |
| Skipping cross-milestone integration check | Milestones pass independently but break each other at boundaries |
| Retrying E2E failures indefinitely without user escalation | 2-attempt limit exists to avoid budget waste on misdiagnosed problems |
Minimal Checklist
- State directory exists with valid state.md and milestone files
- Dependency DAG validated (no cycles)
- Current position determined (fresh start or resume)
- User confirmed continuation at session start
- Each milestone goes through plan-crafting → run-plan → review-work
- State.md updated before and after every phase transition
- Checkpoint written after every successful milestone
- Failed milestones block dependents
- Parallel milestones use worktree isolation
- Cross-milestone integration check passes after each milestone
- Final E2E Gate passes at completion
- Full test suite passes at completion
Transition
After long run completion:
- For final code quality pass → `simplify` skill
- If issues found in completion testing → `systematic-debugging` skill
- If user wants to extend with more milestones → `milestone-planning` skill
This skill itself does not invoke the next skill. It reports completion and lets the user decide the next step.