ln-840-benchmark-compare
Paths: File paths (`shared/`, `references/`) are relative to the skills repo root. Locate this SKILL.md directory and go up one level for the repo root.
Benchmark Compare
Type: L3 Worker
Category: 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B. It does not, by itself, prove best-in-class against external alternatives.
Input / Output
| Direction | Content |
|---|---|
| Input | Repo checkout containing mcp/hex-line-mcp/, optional references/goals.md, optional references/expectations.json |
| Output | Comparison report in skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md plus machine-readable benchmark summary artifact |
Prerequisites
- `claude --version` succeeds
- `git` succeeds
- `mcp/hex-line-mcp/server.mjs` exists
- `mcp/hex-line-mcp/hook.mjs` exists
- `skills-catalog/ln-840-benchmark-compare/references/goals.md` exists
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json` exists
- `skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json` exists
Quick Run
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh \
[skills-catalog/ln-840-benchmark-compare/references/goals.md] \
[skills-catalog/ln-840-benchmark-compare/references/expectations.json]
Optional extra session profile:
EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh
Monitor Integration (Claude Code 2.1.98+)
MANDATORY READ: Load shared/references/monitor_integration_pattern.md
Stream benchmark progress:
Monitor(command="bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh 2>&1 | grep --line-buffered -E 'scenario|PASS|FAIL|error|session'", timeout_ms=3600000, description="benchmark run")
Fallback: Bash(run_in_background=true).
The runner handles:
- syntax preflight
- SessionStart preflight
- scenario extraction from `goals.md`
- isolated worktrees per scenario/session
- per-scenario diffs
- final comparison report
Current scope:
- built-in Claude session
- Claude plus `hex-line`
- optional third Claude-compatible session profile through `EXTRA_SESSION_*` environment variables
External baseline note:
- use the same `goals.md` and `expectations.json`
- do not rewrite scenarios to fit the external tool
- do not make "top tool" claims from the internal A/B alone
- the optional third session profile is only valid when it can emit the same `stream-json` log shape and diff artifacts
Workflow
Phase 1: Define The Canonical Suite
Use one canonical pair owned by this skill:
- `skills-catalog/ln-840-benchmark-compare/references/goals.md`
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json`
Rules:
- The suite must be a balanced mix of common engineering scenarios.
- Do not design the suite to favor `hex-line`.
- Every scenario in `goals.md` must have a matching entry in `expectations.json`.
- `expectations.json` is the source of truth for correctness.
- The same pair must be reused unchanged for any future external baseline.
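The matching rule between `goals.md` and `expectations.json` can be checked mechanically before any scenario runs. The sketch below is illustrative only: `checkSuiteCoverage` is a hypothetical helper, not part of the runner, and it assumes the scenario ids have already been extracted and that the expectations are an array of objects carrying an `id` field. The orphan check goes one step beyond the rule as stated.

```javascript
// Hypothetical helper: compares scenario ids from goals.md with expectation entries.
// Assumption: `expectations` is an array of objects that each carry an `id` field.
function checkSuiteCoverage(scenarioIds, expectations) {
  const expectationIds = new Set(expectations.map((entry) => entry.id));
  const goalIds = new Set(scenarioIds);
  return {
    // Scenarios in goals.md that have no correctness contract.
    missingExpectations: scenarioIds.filter((id) => !expectationIds.has(id)),
    // Expectation entries that no scenario refers to (extra check, not required by the rules).
    orphanExpectations: [...expectationIds].filter((id) => !goalIds.has(id)),
  };
}

// Made-up scenario ids, for illustration only.
const coverage = checkSuiteCoverage(
  ["rename-symbol", "fix-off-by-one"],
  [{ id: "rename-symbol" }, { id: "stale-entry" }],
);
console.log(coverage);
```

A non-empty `missingExpectations` list would mean the suite violates the rules above and should not be run.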
Supported expectation fields per scenario:
| Field | Meaning |
|---|---|
| `id` | Scenario identifier used in result filenames |
| `expectedChangedFiles` | Files that must change |
| `forbiddenChangedFiles` | Files that must not change |
| `requiredDiffPatterns` | Regex patterns required in the saved diff |
| `forbiddenDiffPatterns` | Regex patterns that must not appear in the diff |
| `requiredResultPatterns` | Regex patterns required in the final assistant result text |
| `requiredCommands` | Regex patterns that must match at least one Bash command |
| `exactChangedFiles` | If true, no extra changed files are allowed |
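To make the field semantics concrete, here is a minimal sketch of how one expectation entry could be applied to one session's artifacts. It is not the logic of `parse-results.mjs`: the function name `evaluateExpectation`, the shape of the `run` object (`changedFiles`, `diffText`, `resultText`, `bashCommands`), and the choice to compile patterns without regex flags are all assumptions made for illustration.

```javascript
// Hypothetical evaluator for one expectation entry against one session's artifacts.
// Assumption: patterns are compiled with no regex flags; the real parser may differ.
function evaluateExpectation(expectation, run) {
  const failures = [];
  const changed = new Set(run.changedFiles);
  const matches = (pattern, text) => new RegExp(pattern).test(text);

  for (const file of expectation.expectedChangedFiles ?? []) {
    if (!changed.has(file)) failures.push(`expected change missing: ${file}`);
  }
  for (const file of expectation.forbiddenChangedFiles ?? []) {
    if (changed.has(file)) failures.push(`forbidden file changed: ${file}`);
  }
  if (expectation.exactChangedFiles) {
    // With exactChangedFiles, anything outside expectedChangedFiles is a failure.
    const allowed = new Set(expectation.expectedChangedFiles ?? []);
    for (const file of changed) {
      if (!allowed.has(file)) failures.push(`unexpected extra change: ${file}`);
    }
  }
  for (const pattern of expectation.requiredDiffPatterns ?? []) {
    if (!matches(pattern, run.diffText)) failures.push(`diff lacks /${pattern}/`);
  }
  for (const pattern of expectation.forbiddenDiffPatterns ?? []) {
    if (matches(pattern, run.diffText)) failures.push(`diff contains forbidden /${pattern}/`);
  }
  for (const pattern of expectation.requiredResultPatterns ?? []) {
    if (!matches(pattern, run.resultText)) failures.push(`result lacks /${pattern}/`);
  }
  for (const pattern of expectation.requiredCommands ?? []) {
    // At least one Bash command must match each required pattern.
    if (!run.bashCommands.some((cmd) => matches(pattern, cmd))) {
      failures.push(`no Bash command matches /${pattern}/`);
    }
  }
  return { id: expectation.id, passed: failures.length === 0, failures };
}

// Made-up scenario and artifacts, for illustration only.
const verdict = evaluateExpectation(
  {
    id: "rename-symbol",
    expectedChangedFiles: ["src/util.js"],
    exactChangedFiles: true,
    requiredDiffPatterns: ["^\\+.*formatName"],
    requiredCommands: ["npm test"],
  },
  {
    changedFiles: ["src/util.js", "README.md"],
    diffText: "+export function formatName() {}",
    resultText: "Renamed the helper.",
    bashCommands: ["npm test"],
  },
);
console.log(verdict.passed, verdict.failures);
```

In this example the diff and command checks succeed, but the extra `README.md` change violates `exactChangedFiles`, so the scenario would not pass.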
Phase 2: Preflight
The runner must pass:
- `node --check server.mjs`
- `node --check hook.mjs`
- `node --check extract-scenarios.mjs`
- `node --check parse-results.mjs`
- SessionStart smoke check from `hook.mjs`
If preflight fails, the benchmark is invalid and must stop before scenarios run.
Phase 3: Execute Per Scenario
For each `##` scenario heading in `goals.md`:
- generate a standalone prompt file
- create two clean worktrees from the same commit
- run built-in Claude session
- run hex-line Claude session
- save `.jsonl` logs and `.diff.txt` artifacts
- remove both worktrees
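The worktree steps above can be sketched as a plan of `git worktree` commands. This is an assumption-laden illustration, not the contents of `run-benchmark.sh`: `planWorktrees`, the session ids, the commit value, and the directory layout are hypothetical. Only the `git worktree add --detach` and `git worktree remove --force` invocations are standard git.

```javascript
// Hypothetical planner: returns the git commands for one scenario, it does not run them.
function planWorktrees(scenarioId, commit, sessionIds, baseDir) {
  return sessionIds.map((sessionId) => {
    const path = `${baseDir}/${scenarioId}-${sessionId}`;
    return {
      sessionId,
      path,
      // --detach checks out the commit without creating a branch,
      // so every session starts from the same commit.
      add: ["git", "worktree", "add", "--detach", path, commit],
      // --force also removes a worktree that has uncommitted changes.
      remove: ["git", "worktree", "remove", "--force", path],
    };
  });
}

// Made-up scenario id, commit, and directory, for illustration only.
const plan = planWorktrees("rename-symbol", "abc1234", ["builtin", "hexline"], "/tmp/bench");
for (const step of plan) {
  console.log(step.add.join(" "));
}
```

Running the `remove` command before `add` is one way to clear a worktree left behind by a crashed run.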
Built-in session:
- no MCP
- hooks disabled
Hex-line session:
- resolved MCP config pointing to `server.mjs`
- `outputStyle: "hex-line"`
- `PreToolUse` hook through `hook.mjs`
Phase 4: Parse Results
parse-results.mjs evaluates each scenario for both sessions.
Scenario pass requires:
- valid run
- successful session completion
- changed files match expectations
- diff patterns match expectations
- result text patterns match expectations
- required commands were actually executed
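The pass criteria combine as a conjunction, with run validity checked first so that setup problems are not counted as correctness failures. The sketch below is illustrative: `scenarioVerdict`, the `checks` field names, and the `"INVALID"` label are assumptions, not names taken from `parse-results.mjs`.

```javascript
// Hypothetical verdict combiner for one scenario in one session.
function scenarioVerdict(checks) {
  // An invalid run is a setup/adoption problem, so it is reported separately.
  if (!checks.validRun) return "INVALID";
  const correctness = [
    checks.sessionCompleted,
    checks.changedFilesMatch,
    checks.diffPatternsMatch,
    checks.resultPatternsMatch,
    checks.requiredCommandsRan,
  ];
  // Every remaining criterion must hold for a pass.
  return correctness.every(Boolean) ? "PASS" : "FAIL";
}

const allGood = {
  validRun: true,
  sessionCompleted: true,
  changedFilesMatch: true,
  diffPatternsMatch: true,
  resultPatternsMatch: true,
  requiredCommandsRan: true,
};
console.log(scenarioVerdict(allGood));
console.log(scenarioVerdict({ ...allGood, requiredCommandsRan: false }));
console.log(scenarioVerdict({ ...allGood, validRun: false }));
```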
Phase 5: Read The Report
The final report has these sections:
- Scenario Outcomes
- Activation
- Time
- Cost
- Tokens
- Tool Totals
- Validity
Interpretation rules:
- `invalid run` means setup/adoption failure, not product performance
- scenario `FAIL` means correctness contract was not met
- activation is part of product quality for `hex-line`, not external noise
hex-line, not external noise - this report is necessary for internal A/B evaluation, but not sufficient for best-alternative claims
Report Contract
skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md must answer:
- Did each scenario complete correctly?
- Did `hex-line` activate cleanly without discovery drift?
- What changed in wall time, API time, cost, output tokens, and total tool calls?
- Was the run valid?
Do not treat raw time/cost as sufficient without scenario correctness.
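The "what changed" question comes down to per-metric deltas between the two sessions. A minimal sketch follows, assuming each session's totals are available as a flat object of numbers. The helper `compareMetrics` and the metric key names are hypothetical, and the numbers are made up for illustration, not benchmark results.

```javascript
// Hypothetical delta calculator between a baseline session and a candidate session.
function compareMetrics(baseline, candidate) {
  const rows = {};
  for (const key of Object.keys(baseline)) {
    const delta = candidate[key] - baseline[key];
    rows[key] = {
      baseline: baseline[key],
      candidate: candidate[key],
      delta,
      // Percent change is undefined when the baseline is zero.
      percent: baseline[key] === 0 ? null : (delta / baseline[key]) * 100,
    };
  }
  return rows;
}

// Illustrative inputs only.
const deltas = compareMetrics(
  { wallTimeMs: 120000, toolCalls: 40 },
  { wallTimeMs: 90000, toolCalls: 30 },
);
console.log(deltas.wallTimeMs.delta, deltas.wallTimeMs.percent); // -30000 -25
```

Deltas like these are only meaningful for scenarios where both sessions met the correctness contract.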
External Baseline Policy
- This skill owns the canonical suite, not a universal leaderboard.
- If maintainers compare `hex-line` against external alternatives, they must reuse the same `goals.md`, `expectations.json`, and diff-based evaluation rules.
- External runs may use different harnesses, but they must preserve the same task text, starting commit, and correctness contract.
- If an external tool cannot satisfy the contract format, record that as a harness limitation instead of rewriting the suite to accommodate it.
- A report that only covers built-in Claude vs `hex-line` must say so explicitly.
Runtime Contract
MANDATORY READ: Load shared/references/benchmark_worker_runtime_contract.md, shared/references/coordinator_summary_contract.md
Runtime CLI:
node shared/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node shared/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node shared/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node shared/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default
Required state fields:
- `report_ready`
- `summary_recorded`
- `final_result`
- `self_check_passed`
Domain checkpoints:
- `PHASE_0_CONFIG`
- `PHASE_1_PREFLIGHT`
- `PHASE_2_LOAD_SUITE`
- `PHASE_3_RUN_SCENARIOS`
- `PHASE_4_PARSE_RESULTS`
- `PHASE_5_WRITE_REPORT`
- `PHASE_6_WRITE_SUMMARY`
- `PHASE_7_SELF_CHECK`
Guard rules:
- do not advance without checkpointing the current phase
- do not complete before `benchmark-worker` summary is recorded
- do not complete before self-check passes
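The guard rules amount to two small predicates over the run state. The sketch below is not the runtime's implementation: `nextPhaseAllowed` and `canComplete` are hypothetical names, and it assumes `summary_recorded` and `self_check_passed` are stored as booleans, which the shared runtime contract may define differently.

```javascript
// Checkpoint names as listed in this skill's domain checkpoints.
const PHASES = [
  "PHASE_0_CONFIG",
  "PHASE_1_PREFLIGHT",
  "PHASE_2_LOAD_SUITE",
  "PHASE_3_RUN_SCENARIOS",
  "PHASE_4_PARSE_RESULTS",
  "PHASE_5_WRITE_REPORT",
  "PHASE_6_WRITE_SUMMARY",
  "PHASE_7_SELF_CHECK",
];

// Hypothetical guard: a phase may start only when every earlier phase is checkpointed.
function nextPhaseAllowed(checkpointed, nextPhase) {
  const index = PHASES.indexOf(nextPhase);
  if (index === -1) return false; // unknown phase name
  return PHASES.slice(0, index).every((phase) => checkpointed.includes(phase));
}

// Hypothetical guard: completion needs a recorded summary and a passed self-check.
// Assumption: both state fields are booleans.
function canComplete(state) {
  return state.summary_recorded === true && state.self_check_passed === true;
}

console.log(nextPhaseAllowed(["PHASE_0_CONFIG"], "PHASE_1_PREFLIGHT")); // true
console.log(nextPhaseAllowed(["PHASE_0_CONFIG"], "PHASE_3_RUN_SCENARIOS")); // false
console.log(canComplete({ summary_recorded: true, self_check_passed: false })); // false
```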
Runtime Coordination
- Managed runs may pass a deterministic `runId` and an exact `summaryArtifactPath`.
- Standalone runs are supported. If both are omitted, the runtime creates a standalone run and writes the summary to the default artifact path for the `benchmark-worker` family.
Runtime Summary Artifact
MANDATORY READ: Load shared/references/coordinator_summary_contract.md
Emit a benchmark-worker summary envelope after the comparison report is written.
Managed mode:
- write to the exact `summaryArtifactPath`
Standalone mode:
- write `.hex-skills/runtime-artifacts/runs/{run_id}/benchmark-worker/ln-840-benchmark-compare--{identifier}.json`
Recommended payload:
- `scenarios_total`
- `scenarios_passed`
- `scenarios_failed`
- `activation_valid`
- `validity_verdict`
- `report_path`
- `warnings`
- `metrics`
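As an illustration of the standalone path template and the recommended payload fields, the sketch below builds both from per-scenario verdicts. The helper names, the verdict labels, and the value types (for example `validity_verdict` as the strings `"valid"` and `"invalid"`) are assumptions. The envelope around the payload is defined by the coordinator summary contract and is not reproduced here.

```javascript
// Fills in the standalone path template given in this section.
function standaloneSummaryPath(runId, identifier) {
  return `.hex-skills/runtime-artifacts/runs/${runId}/benchmark-worker/ln-840-benchmark-compare--${identifier}.json`;
}

// Hypothetical payload builder; field names come from the recommended list,
// value shapes are assumptions.
function buildSummaryPayload(results, options) {
  const count = (verdict) => results.filter((r) => r.verdict === verdict).length;
  return {
    scenarios_total: results.length,
    scenarios_passed: count("PASS"),
    scenarios_failed: count("FAIL"),
    activation_valid: options.activationValid,
    // Assumption: any invalid run makes the whole benchmark run invalid.
    validity_verdict: count("INVALID") === 0 ? "valid" : "invalid",
    report_path: options.reportPath,
    warnings: options.warnings ?? [],
    metrics: options.metrics ?? {},
  };
}

// Made-up run id and results, for illustration only.
const artifactPath = standaloneSummaryPath("run-001", "suite-default");
const payload = buildSummaryPayload(
  [{ verdict: "PASS" }, { verdict: "FAIL" }, { verdict: "PASS" }],
  { activationValid: true, reportPath: "results/example-comparison.md" },
);
console.log(artifactPath);
console.log(payload);
```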
Known Pitfalls
| Pitfall | Solution |
|---|---|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into `ToolSearch` before `hex-line` use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
| External comparison uses edited scenarios or relaxed expectations | Treat the comparison as invalid |
Definition of Done
- `goals.md` defines the canonical balanced suite
- `expectations.json` fully describes scenario correctness
- Runner passes syntax and SessionStart preflight
- Each scenario runs in two clean worktrees from the same commit
- Parser evaluates activation and scenario correctness from logs plus diffs
- Final report is saved to `skills-catalog/ln-840-benchmark-compare/results/`
- `benchmark-worker` summary artifact is written to the managed or standalone runtime path
- Temporary worktrees are removed
- Report states clearly whether it is internal A/B only or includes additional external baselines
Version: 2.0.0
Last Updated: 2026-03-24