recursive-benchmark
Use this skill to run a fair benchmark that compares the same coding agent with recursive-mode off and recursive-mode on.
The benchmark should use the same project requirements, the same model family, and the same success criteria for both arms. The recursive-on arm should additionally start from a bootstrapped recursive-mode scaffold, a run-local 00-requirements.md, and a command-style prompt that explicitly tells the agent to use the bootstrapped recursive control-plane files as the recursive-mode skill before implementing the run end to end.
For fairness, the recursive-off arm should receive controller guidance only in the chat prompt, never as requirement, rubric, or prompt documents placed inside its repo or benchmark workspace.
Current maintained benchmark runners are Codex CLI, Kimi CLI, and OpenCode CLI. For OpenCode, prefer provider-qualified model ids and use the dedicated CLI binary rather than the desktop wrapper.
Primary Use Case
Use recursive-benchmark when the user wants to:
- compare recursive-mode against a non-recursive baseline
- measure whether recursive-mode improves implementation quality, reliability, or completion rate
- run disposable benchmark repos in temp folders
- capture build/test/preview outcomes, timings, issues, and final scores
- capture and report screenshot artifacts produced during the benchmark
- generate a markdown benchmark report or dashboard
- choose among packaged easy, medium, hard, and xhard benchmark scenarios
- optionally run the paired arms in parallel when the runtime/provider can tolerate it; unstable runners should fall back to sequential execution and record that downgrade in the report
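The parallel-with-fallback behavior above can be sketched as follows. This is a minimal illustration, not the packaged harness: `run_arm` is a hypothetical callable that executes one benchmark arm and returns its result, and the downgrade flag is what the report would later surface.

```python
import concurrent.futures


def run_arms(run_arm, arms=("recursive-off", "recursive-on"), parallel=True):
    """Run both benchmark arms, falling back to sequential execution.

    `run_arm` is a hypothetical callable that executes one arm and returns
    its result dict. Any failure in parallel mode triggers a sequential
    re-run, and the downgrade is recorded for the final report.
    """
    results = {}
    downgraded = False
    if parallel:
        try:
            with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
                futures = {arm: pool.submit(run_arm, arm) for arm in arms}
                results = {arm: f.result() for arm, f in futures.items()}
        except Exception:
            results, downgraded = {}, True
    if not parallel or downgraded:
        # Sequential path: either requested up front or a recorded downgrade.
        results = {arm: run_arm(arm) for arm in arms}
    return {"results": results, "parallel_downgraded": downgraded}
```
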
Benchmark Contract
For each benchmark run:
- Create paired disposable repos for `recursive-off` and `recursive-on`.
- Give both repos the same benchmark project requirements.
- Bootstrap recursive-mode only in the recursive-on repo and place the benchmark requirements in a run-local, recursive-compliant `00-requirements.md`.
- Prompt the recursive-on arm to read `/.recursive/RECURSIVE.md`, the bridge docs, the router config files, and the run requirements before implementing the run.
- Record the runner, provider family, model string, and timeout budget.
- Execute the selected agent runtime non-interactively for both arms.
- Run a mandatory controller-side judge review for every completed arm, preferring `gpt-5.4` and falling back to a fresh instance of the benchmarked model when needed.
- Capture logs, durations, issues, screenshot artifacts, live progress artifacts, and evaluation outcomes, including whether the recursive-on arm produced the expected run artifacts through `08-memory-impact.md`, passed controller-side recursive run lint, and required an isolated product snapshot for Rust build/test/preview evaluation.
- Keep repo-local benchmark workspaces such as `.benchmark-workspaces/` ignored when the harness runs inside the packaged repo.
- Produce a final markdown report that compares the two arms side by side, including a combined benchmark score that blends heuristic rubric coverage with the mandatory judge metric.
- Surface whether recursive-on completed the recursive artifact set, whether it passed controller-side recursive lint, and whether it used an isolated worktree or stayed in the repo root.
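The paired-repo setup at the top of the contract can be sketched like this. It is an illustrative stub, assuming a temp-folder layout and a placeholder scaffold; the real bootstrap writes the full recursive control-plane files, not a one-line `RECURSIVE.md`.

```python
import pathlib
import tempfile


def create_paired_repos(requirements_md: str):
    """Create disposable paired benchmark repos under a temp folder.

    Both arms receive identical project requirements. Only the
    recursive-on arm gets a run-local 00-requirements.md alongside a
    placeholder recursive-mode scaffold (the real bootstrap is richer).
    """
    root = pathlib.Path(tempfile.mkdtemp(prefix="recursive-benchmark-"))
    repos = {}
    for arm in ("recursive-off", "recursive-on"):
        repo = root / arm
        repo.mkdir(parents=True)
        # Identical spec for both arms keeps the comparison fair.
        (repo / "REQUIREMENTS.md").write_text(requirements_md)
        if arm == "recursive-on":
            scaffold = repo / ".recursive"
            scaffold.mkdir()
            (scaffold / "RECURSIVE.md").write_text("# bootstrapped scaffold\n")
            (repo / "00-requirements.md").write_text(requirements_md)
        repos[arm] = repo
    return repos
```
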
Packaged Scenario Tiers
- `local-first-planner` - easy
- `team-capacity-board` - medium
- `release-readiness-dashboard` - hard
- `scientific-calculator-rust` - xhard
All packaged scenarios should:
- keep all state browser-local
- avoid external database or server dependencies
- support local browser preview from a temp folder
- produce output suitable for later screenshot validation
Current packaged stacks:
- React + TypeScript + Vite for easy/medium/hard
- Rust + WebAssembly with Trunk for xhard
Logging Requirements
The benchmark should preserve:
- raw agent stdout/stderr or JSON event logs when available
- per-phase timing data
- per-arm live progress files
- build/test/preview results
- screenshot paths and image embeds when screenshots exist
- timeout or failure reasons
- benchmark repo paths and report paths
- token or usage data only when the underlying CLI exposes it
Both benchmark arms should also ask the coding agent to maintain a simple in-repo benchmark activity log.
If the controller provides hints during the benchmark, the arm should record them in benchmark/hints.md so the report can apply any configured hint penalty.
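A hint penalty along these lines could be applied as follows. Both the one-hint-per-line reading of `benchmark/hints.md` and the 5% per-hint rate are assumptions; the actual policy comes from the benchmark configuration.

```python
def apply_hint_penalty(base_score: float, hint_count: int,
                       penalty_per_hint: float = 0.05) -> float:
    """Scale an arm's score down by a configured per-hint penalty.

    `hint_count` would be derived from benchmark/hints.md (e.g. one hint
    per non-empty line); the 0.05 default rate is illustrative only.
    """
    penalty = min(1.0, hint_count * penalty_per_hint)  # cap at 100%
    return round(base_score * (1.0 - penalty), 4)
```
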
Output
The benchmark should produce a final report that includes:
- benchmark scenario name
- provider/runtime and model
- recursive-off vs recursive-on comparison
- total duration and timeout status
- build/test/preview outcomes
- screenshot galleries for both arms when screenshots exist
- separated runner health vs product outcome
- heuristic score breakdown
- mandatory code-review judge metric and reviewer identity
- combined benchmark score that weights heuristic coverage and judge review together
- recursive-on worktree isolation status and recorded worktree location
- artifact paths for live progress inspection
- notable issues or gaps
- links or relative paths to logs and generated artifacts
- timestamp fallback evidence when agent logging is incomplete
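The combined benchmark score in the report blends the heuristic rubric with the judge metric; a minimal blend, assuming both inputs are normalized to 0-100 and using an illustrative 60/40 weighting (not the packaged rubric's actual value), could look like:

```python
def combined_score(heuristic: float, judge: float,
                   heuristic_weight: float = 0.6) -> float:
    """Blend heuristic rubric coverage with the mandatory judge metric.

    Both inputs are assumed normalized to 0-100; the 60/40 split is an
    assumption for illustration, not the packaged weighting.
    """
    if not 0.0 <= heuristic_weight <= 1.0:
        raise ValueError("heuristic_weight must be in [0, 1]")
    return round(heuristic * heuristic_weight
                 + judge * (1.0 - heuristic_weight), 2)
```
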
Fairness Rules
- Keep the project spec identical between both arms.
- Do not silently give one arm different acceptance criteria.
- Record when a metric is unavailable instead of faking it.
- Keep the benchmark disposable; do not contaminate this reusable repo with run residue.
- Use the same timeout budget and scoring rubric for both arms.
- If one arm receives hints, record them and reflect the configured penalty in the final scoring.
Boundaries
- This skill is for benchmark setup, execution, and reporting.
- It does not replace the recursive-mode workflow spec itself.
- It should not use hidden benchmark-specific criteria that are absent from the packaged rubric.
- It should not require external services such as a database server.
When the recursive-on arm uses delegated audit, review, or other external model help, the benchmark prompt should require it to re-read /.recursive/config/recursive-router.json and /.recursive/config/recursive-router-discovered.json immediately before choosing the delegated CLI/model.
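The re-read step could be sketched as below. The JSON shape (a `"routes"` mapping from role to CLI/model) is an assumption for illustration; the discovered file, when present, overrides entries from the base router config.

```python
import json
import pathlib


def choose_delegated_routes(repo_root):
    """Re-read both router configs immediately before delegating.

    Assumes each file holds a {"routes": {role: cli_or_model}} mapping;
    recursive-router-discovered.json overrides recursive-router.json.
    """
    base = pathlib.Path(repo_root) / ".recursive" / "config"
    routes = {}
    for name in ("recursive-router.json", "recursive-router-discovered.json"):
        path = base / name
        if path.exists():
            routes.update(json.loads(path.read_text()).get("routes", {}))
    return routes
```
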
References
- ./references/patterns.md
- /references/benchmarks/README.md
- /references/benchmarks/local-first-planner/README.md
- /references/benchmarks/local-first-planner/00-requirements.md
- /references/benchmarks/local-first-planner/scoring-rubric.md
- ./scripts/run-recursive-benchmark.py