Staged Evaluation
A key optimization from HyperAgents: don't waste compute evaluating obviously broken mutations. Run a cheap quick check first, and only invest in full evaluation for promising candidates.
The Problem
Full evaluation is expensive:
- Running a full test suite takes minutes
- LLM-as-judge evaluations cost tokens
- Benchmark suites can take hours
- Most mutations (especially early ones) produce broken or worse code
The Solution: Two-Phase Evaluation
Phase 1: Staged Evaluation (Quick Check)
Run on a small sample to detect obvious failures:
- Use 10% of the full sample set, or a fixed small number (e.g., 10 items)
- Timeout aggressively (1/10th of full timeout)
- Score the results
Decision rule:
- Score is 0 or null → FAIL FAST, skip full evaluation
- Score is non-zero → PROCEED to full evaluation
Phase 2: Full Evaluation
Run on the complete sample set:
- Use all available evaluation items
- Full timeout allowance
- Generate comprehensive report with per-item details
Implementation
def evaluate_generation(genid, domain):
    # Phase 1: staged quick check on a small sample
    staged_samples = get_staged_sample_count(domain)  # e.g., 10
    staged_score = run_evaluation(genid, domain, samples=staged_samples)
    if staged_score is None or staged_score <= 0:
        mark_generation_as_failed(genid)
        return  # FAIL FAST: skip the expensive full evaluation

    # Phase 2: full evaluation on the complete sample set
    full_score = run_evaluation(genid, domain, samples=-1)  # -1 = all samples
    store_score(genid, domain, full_score)
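Called from the evolve loop, the early return means a broken mutation costs only the staged pass. A minimal usage sketch (the driver loop and the pending_generations name are illustrative, not part of the HyperAgents API):

# Hypothetical driver: only generations that survive the staged check
# ever reach the expensive full evaluation inside evaluate_generation.
for genid in pending_generations:
    evaluate_generation(genid, domain="tests")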
Score Adjustment
When comparing a staged-only generation against fully-evaluated ones, adjust the score:
adjusted_score = staged_score * (staged_samples / full_samples)
This prevents a generation that only passed 10 easy items from appearing better than one that passed 80 out of 100.
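As a concrete sketch of the adjustment (the function name is illustrative; the formula itself is the one above):

def adjust_staged_score(staged_score, staged_samples, full_samples):
    # Scale a staged-only score by the fraction of items actually evaluated.
    # e.g., 0.80 on 10 of 100 items -> 0.80 * (10 / 100) = 0.08
    return staged_score * (staged_samples / full_samples)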
Configuration
In .hyperagents/config.json:
{
  "staged_eval": {
    "enabled": true,
    "threshold": 0,
    "samples": {
      "tests": 10,
      "lint": 5,
      "review": 3,
      "benchmark": 5
    },
    "fractions": {
      "tests": 0.1,
      "lint": 0.1,
      "review": 0.1,
      "benchmark": 0.1
    }
  }
}
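A minimal sketch of how a harness might read these settings, extending the get_staged_sample_count helper with a total_samples argument for the fraction fallback. The precedence of samples over fractions is an assumption; only the keys come from the config above:

import json

def get_staged_sample_count(domain, total_samples):
    with open(".hyperagents/config.json") as f:
        cfg = json.load(f)["staged_eval"]
    # Assumed precedence: the fixed per-domain count wins; otherwise
    # fall back to a fraction of the full sample set.
    if domain in cfg["samples"]:
        return cfg["samples"][domain]
    return max(1, int(total_samples * cfg["fractions"].get(domain, 0.1)))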
When to Skip Staged Evaluation
Use the --skip-staged flag when:
- You're confident the mutation is valid (e.g., minor prompt tweak)
- The full evaluation is already fast (< 30 seconds)
- You need accurate scores for every generation (research/analysis)
- You're running a final evaluation of the best generation
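In code, the flag simply gates Phase 1. A sketch under the assumption that evaluate_generation grows a skip_staged parameter (the parameter name is illustrative):

def evaluate_generation(genid, domain, skip_staged=False):
    if not skip_staged:
        # Phase 1 as above; bypassed entirely when --skip-staged is set.
        staged_score = run_evaluation(genid, domain,
                                      samples=get_staged_sample_count(domain))
        if staged_score is None or staged_score <= 0:
            mark_generation_as_failed(genid)
            return
    store_score(genid, domain, run_evaluation(genid, domain, samples=-1))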
Examples
These scenarios illustrate when this skill activates and what it does.
Scenario 1: Staged eval catches a broken mutation early
Trigger: The evolve loop runs staged evaluation on generation 3, which has a syntax error introduced by the meta-agent.
Action: The staged eval runs 10 out of 100 test items. All 10 fail because the code does not parse, producing a score of 0. The skill marks generation 3 as failed (valid_parent: false) and skips the full 100-item evaluation, saving approximately 90% of the compute cost for this generation. The loop proceeds immediately to generation 4.
Scenario 2: User skips staged eval for a known-safe change
Trigger: User runs /hyperagents:evolve --skip-staged because they are making a minor prompt wording adjustment and want accurate full-evaluation scores.
Action: The skill bypasses the two-phase pattern and runs the full evaluation directly. No score adjustment is applied since the full sample set is used. This is appropriate when the mutation is low-risk and the user prefers precision over speed.
Scenario 3: Understanding score adjustment after staged-only evaluation
Trigger: User asks "Why does generation 6 show a score of 0.08 when its staged eval was 0.80?"
Action: The skill explains the score adjustment formula: adjusted_score = 0.80 * (10 / 100) = 0.08. Because generation 6 only passed staged evaluation (the full eval was not run, possibly due to a loop interruption), its raw score of 0.80 is scaled down by the staged-to-full sample ratio. This prevents a generation evaluated on 10 easy items from outcompeting one that passed 80 out of 100 items in a full run.
Cost Savings
Typical savings in a 20-generation evolution run:
- Without staged eval: 20 full evaluations
- With staged eval: ~5 full evaluations (75% fewer full runs)
- Most mutations fail early, especially in the first few generations
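The arithmetic, as a back-of-envelope check (the counts are the illustrative figures above, not measured data):

full_runs_without = 20   # every generation gets a full evaluation
full_runs_with = 5       # ~15 of 20 mutations fail the staged check early
savings = 1 - full_runs_with / full_runs_without
print(f"{savings:.0%} fewer full evaluations")  # -> 75%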
More from zpankz/hyperagents
fitness evaluation framework
Domain-agnostic fitness evaluation for evolved code generations. Defines evaluation harness interfaces, scoring contracts, and multi-domain aggregation. Triggers when evaluating code quality, running benchmarks, or scoring agent outputs.
parent selection strategies
Evolutionary parent selection algorithms for choosing which generation to mutate next. Implements random, best, score-proportional, and novelty-aware selection. Triggers when selecting parents, managing exploration/exploitation tradeoffs, or configuring evolution strategy.
domain evaluation harness
Create and configure domain-specific evaluation harnesses for the HyperAgents evolution loop. Defines how tasks are loaded, agents are invoked, predictions are collected, and scores are computed. Triggers when setting up evaluation domains or creating custom fitness functions.
self-referential self-improvement
Apply HyperAgents' self-referential improvement pattern to any code artifact. Triggers when Claude is asked to 'improve', 'optimize', 'evolve', or 'self-improve' code, agents, skills, or prompts. Also triggers on repeated failures as an automatic recovery strategy.
evolutionary archive management
Manage the HyperAgents evolutionary archive — an append-only log of all code generations with fitness scores, lineage tracking, and diff storage. Triggers when working with .hyperagents/ directory, archive.jsonl files, or generation metadata.