recursive-benchmark
Use this skill to run a fair benchmark that compares the same coding agent with recursive-mode off and recursive-mode on.
The benchmark should use the same project requirements, the same model family, and the same success criteria for both arms. The recursive-on arm should additionally start from a bootstrapped recursive-mode scaffold, a run-local 00-requirements.md, and a command-style prompt that explicitly tells the agent to use the bootstrapped recursive control-plane files as the recursive-mode skill before implementing the run end to end.
For fairness, the recursive-off arm should receive controller guidance only in the chat prompt, never as requirement, rubric, or prompt documents placed inside its repo or benchmark workspace.
Current maintained benchmark runners are Codex CLI, Kimi CLI, and OpenCode CLI. For OpenCode, prefer provider-qualified model ids and use the dedicated CLI binary rather than the desktop wrapper.
Primary Use Case
Use recursive-benchmark when the user wants to:
- compare recursive-mode against a non-recursive baseline
- measure whether recursive-mode improves implementation quality, reliability, or completion rate
- run disposable benchmark repos in temp folders
- capture build/test/preview outcomes, timings, issues, and final scores
- capture and report screenshot artifacts produced during the benchmark
- generate a markdown benchmark report or dashboard
- choose among packaged easy, medium, hard, and xhard benchmark scenarios
- optionally run the paired arms in parallel when the runtime/provider can tolerate it; unstable runners should fall back to sequential execution and record that downgrade in the report
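The parallel-with-fallback behavior above can be sketched as follows. This is a minimal illustration, not the packaged harness: `run_arm` is a hypothetical callable that executes one benchmark arm and returns its result, and the downgrade flag is what the report would later surface.

```python
import concurrent.futures


def run_arms(run_arm, arms=("recursive-off", "recursive-on"), parallel=True):
    """Run both benchmark arms, falling back to sequential execution.

    `run_arm` is a hypothetical callable that executes one arm and returns
    its result dict. Any failure in parallel mode triggers a sequential
    re-run, and the downgrade is recorded for the final report.
    """
    results = {}
    downgraded = False
    if parallel:
        try:
            with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
                futures = {arm: pool.submit(run_arm, arm) for arm in arms}
                results = {arm: f.result() for arm, f in futures.items()}
        except Exception:
            results, downgraded = {}, True
    if not parallel or downgraded:
        # Sequential path: either requested up front or a recorded downgrade.
        results = {arm: run_arm(arm) for arm in arms}
    return {"results": results, "parallel_downgraded": downgraded}
```
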
Benchmark Contract
For each benchmark run:
- Create paired disposable repos for `recursive-off` and `recursive-on`.
- Give both repos the same benchmark project requirements.
- Bootstrap recursive-mode only in the recursive-on repo and place the benchmark requirements in a run-local, recursive-compliant `00-requirements.md`.
- Prompt the recursive-on arm to read `/.recursive/RECURSIVE.md`, the bridge docs, the router config files, and the run requirements before implementing the run.
- Record the runner, provider family, model string, and timeout budget.
- Execute the selected agent runtime non-interactively for both arms.
- Run a mandatory controller-side judge review for every completed arm, preferring `gpt-5.4` and falling back to a fresh instance of the benchmarked model when needed.
- Capture logs, durations, issues, screenshot artifacts, live progress artifacts, and evaluation outcomes, including whether the recursive-on arm produced the expected run artifacts through `08-memory-impact.md`, passed controller-side recursive run lint, and required an isolated product snapshot for Rust build/test/preview evaluation.
- Keep repo-local benchmark workspaces such as `.benchmark-workspaces/` ignored when the harness runs inside the packaged repo.
- Produce a final markdown report that compares the two arms side by side, including a combined benchmark score that blends heuristic rubric coverage with the mandatory judge metric.
- Surface whether recursive-on completed the recursive artifact set, whether it passed controller-side recursive lint, and whether it used an isolated worktree or stayed in the repo root.
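The paired-repo setup at the top of the contract can be sketched like this. It is an illustrative stub, assuming a temp-folder layout and a placeholder scaffold; the real bootstrap writes the full recursive control-plane files, not a one-line `RECURSIVE.md`.

```python
import pathlib
import tempfile


def create_paired_repos(requirements_md: str):
    """Create disposable paired benchmark repos under a temp folder.

    Both arms receive identical project requirements. Only the
    recursive-on arm gets a run-local 00-requirements.md alongside a
    placeholder recursive-mode scaffold (the real bootstrap is richer).
    """
    root = pathlib.Path(tempfile.mkdtemp(prefix="recursive-benchmark-"))
    repos = {}
    for arm in ("recursive-off", "recursive-on"):
        repo = root / arm
        repo.mkdir(parents=True)
        # Identical spec for both arms keeps the comparison fair.
        (repo / "REQUIREMENTS.md").write_text(requirements_md)
        if arm == "recursive-on":
            scaffold = repo / ".recursive"
            scaffold.mkdir()
            (scaffold / "RECURSIVE.md").write_text("# bootstrapped scaffold\n")
            (repo / "00-requirements.md").write_text(requirements_md)
        repos[arm] = repo
    return repos
```
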
Packaged Scenario Tiers
- `local-first-planner` - easy
- `team-capacity-board` - medium
- `release-readiness-dashboard` - hard
- `scientific-calculator-rust` - xhard
All packaged scenarios should:
- keep all state browser-local
- avoid external database or server dependencies
- support local browser preview from a temp folder
- produce output suitable for later screenshot validation
Current packaged stacks:
- React + TypeScript + Vite for easy/medium/hard
- Rust + WebAssembly with Trunk for xhard
Logging Requirements
The benchmark should preserve:
- raw agent stdout/stderr or JSON event logs when available
- per-phase timing data
- per-arm live progress files
- build/test/preview results
- screenshot paths and image embeds when screenshots exist
- timeout or failure reasons
- benchmark repo paths and report paths
- token or usage data only when the underlying CLI exposes it
Both benchmark arms should also ask the coding agent to maintain a simple in-repo benchmark activity log.
If the controller provides hints during the benchmark, the arm should record them in benchmark/hints.md so the report can apply any configured hint penalty.
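A hint penalty along these lines could be applied as follows. Both the one-hint-per-line reading of `benchmark/hints.md` and the 5% per-hint rate are assumptions; the actual policy comes from the benchmark configuration.

```python
def apply_hint_penalty(base_score: float, hint_count: int,
                       penalty_per_hint: float = 0.05) -> float:
    """Scale an arm's score down by a configured per-hint penalty.

    `hint_count` would be derived from benchmark/hints.md (e.g. one hint
    per non-empty line); the 0.05 default rate is illustrative only.
    """
    penalty = min(1.0, hint_count * penalty_per_hint)  # cap at 100%
    return round(base_score * (1.0 - penalty), 4)
```
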
Output
The benchmark should produce a final report that includes:
- benchmark scenario name
- provider/runtime and model
- recursive-off vs recursive-on comparison
- total duration and timeout status
- build/test/preview outcomes
- screenshot galleries for both arms when screenshots exist
- separated runner health vs product outcome
- heuristic score breakdown
- mandatory code-review judge metric and reviewer identity
- combined benchmark score that weights heuristic coverage and judge review together
- recursive-on worktree isolation status and recorded worktree location
- artifact paths for live progress inspection
- notable issues or gaps
- links or relative paths to logs and generated artifacts
- timestamp fallback evidence when agent logging is incomplete
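The combined benchmark score in the report blends the heuristic rubric with the judge metric; a minimal blend, assuming both inputs are normalized to 0-100 and using an illustrative 60/40 weighting (not the packaged rubric's actual value), could look like:

```python
def combined_score(heuristic: float, judge: float,
                   heuristic_weight: float = 0.6) -> float:
    """Blend heuristic rubric coverage with the mandatory judge metric.

    Both inputs are assumed normalized to 0-100; the 60/40 split is an
    assumption for illustration, not the packaged weighting.
    """
    if not 0.0 <= heuristic_weight <= 1.0:
        raise ValueError("heuristic_weight must be in [0, 1]")
    return round(heuristic * heuristic_weight
                 + judge * (1.0 - heuristic_weight), 2)
```
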
Fairness Rules
- Keep the project spec identical between both arms.
- Do not silently give one arm different acceptance criteria.
- Record when a metric is unavailable instead of faking it.
- Keep the benchmark disposable; do not contaminate this reusable repo with run residue.
- Use the same timeout budget and scoring rubric for both arms.
- If one arm receives hints, record them and reflect the configured penalty in the final scoring.
Boundaries
- This skill is for benchmark setup, execution, and reporting.
- It does not replace the recursive-mode workflow spec itself.
- It should not use hidden benchmark-specific criteria that are absent from the packaged rubric.
- It should not require external services such as a database server.
When the recursive-on arm uses delegated audit, review, or other external model help, the benchmark prompt should require it to re-read /.recursive/config/recursive-router.json and /.recursive/config/recursive-router-discovered.json immediately before choosing the delegated CLI/model.
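The re-read step could be sketched as below. The JSON shape (a `"routes"` mapping from role to CLI/model) is an assumption for illustration; the discovered file, when present, overrides entries from the base router config.

```python
import json
import pathlib


def choose_delegated_routes(repo_root):
    """Re-read both router configs immediately before delegating.

    Assumes each file holds a {"routes": {role: cli_or_model}} mapping;
    recursive-router-discovered.json overrides recursive-router.json.
    """
    base = pathlib.Path(repo_root) / ".recursive" / "config"
    routes = {}
    for name in ("recursive-router.json", "recursive-router-discovered.json"):
        path = base / name
        if path.exists():
            routes.update(json.loads(path.read_text()).get("routes", {}))
    return routes
```
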
References
- ./references/patterns.md
- /references/benchmarks/README.md
- /references/benchmarks/local-first-planner/README.md
- /references/benchmarks/local-first-planner/00-requirements.md
- /references/benchmarks/local-first-planner/scoring-rubric.md
- ./scripts/run-recursive-benchmark.py