auto-benchmark
Auto Benchmark
A continuous, autonomous benchmarking system that monitors the competitive landscape, extracts insights from the latest research, proposes and runs improvement experiments, and keeps your solution ranked #1 — so engineers and researchers can focus on building product rather than running benchmarks manually.
System Overview
The system operates as a closed loop that runs on a schedule (daily, weekly, or on trigger):
┌─────────────────────────────────────────────────────────────────┐
│ CONTINUOUS LOOP │
│ │
│ [1] Monitor [2] Ingest [3] Hypothesize │
│ Competitors → Research → Improvements │
│ & Leaderboards Papers from Gap │
│ ↑ ↓ │
│ [6] Report ← [5] Promote ← [4] Experiment │
│ Stakeholders Winners Autonomously │
└─────────────────────────────────────────────────────────────────┘
Each iteration of the loop answers one question: "What can we do right now to move from our current rank to #1?"
Phase 1 — Competitive Landscape Setup
Do this once at system initialization; update the registry whenever new competitors emerge.
1.1 Define the Competitive Registry
Store a competitive_registry.yaml that is the system's single source of truth:
domain: memory # e.g., memory, retrieval, reasoning, vision
target_leaderboards:
- name: MemoryBench
url: https://...
scrape_method: html_table # or api, rss, manual
primary_metric: accuracy
higher_is_better: true
- name: LongContextEval
url: https://...
scrape_method: api
primary_metric: f1_score
higher_is_better: true
competitors:
- name: CompetitorA
latest_score: 0.847
source: MemoryBench
last_updated: 2026-02-01
- name: CompetitorB
latest_score: 0.831
source: MemoryBench
last_updated: 2026-02-05
our_solution:
name: OurSystem
current_scores:
MemoryBench: 0.823
LongContextEval: 0.791
promotion_threshold: 0.005 # minimum improvement over current score to promote
1.2 Define Victory Conditions
State explicitly what "#1" means for each leaderboard:
- Absolute rank: we must be the top entry by primary metric.
- Relative gap: our score must exceed the current #1 by at least N points to account for leaderboard update lag.
- Sustained: we must hold the position for at least K consecutive evaluation cycles.
1.3 Set the Monitoring Schedule
schedule:
leaderboard_scrape: "0 6 * * *" # daily at 6am
research_ingest: "0 7 * * 1" # weekly on Monday
experiment_sweep: "0 8 * * *" # daily at 8am
report_digest: "0 9 * * 1" # weekly digest on Monday
Phase 2 — Competitive Monitoring
On each scheduled run, update the competitive landscape before doing anything else.
2.1 Scrape Leaderboards
For each leaderboard in the registry:
- Fetch the current rankings.
- Compare against the stored snapshot from the previous run.
- Detect changes: new entrants, score improvements, rank shifts.
- Update
competitive_registry.yamlwith the latest scores.
Emit a competitive delta report on any change:
[ALERT] CompetitorA improved on MemoryBench: 0.847 → 0.861 (+0.014)
[ALERT] New entrant: StartupX at 0.855 — now ranked #2, ahead of us
[STATUS] Our rank: #3 | Gap to #1: -0.038
2.2 Compute the Gap Matrix
Produce a ranked table after every scrape:
| Rank | System | MemoryBench | LongContextEval | Δ to Our Score |
|------|--------------|-------------|-----------------|----------------|
| #1 | CompetitorA | 0.861 | 0.812 | -0.038 |
| #2 | StartupX | 0.855 | 0.798 | -0.032 |
| #3 | OurSystem | 0.823 | 0.791 | — |
| #4 | CompetitorB | 0.801 | 0.764 | +0.022 |
The gap to #1 on each leaderboard is the primary input to Phase 3.
2.3 Trigger Logic
- If our rank drops (competitor overtook us): immediately trigger Phase 4 with urgent priority.
- If gap to #1 widens beyond a threshold: trigger a research ingest cycle.
- If we are already #1 by the victory margin: run a maintenance sweep to defend the position.
Phase 3 — Research Ingestion Pipeline
Continuously pull the latest research and translate it into actionable improvement candidates.
3.1 Paper Sources to Monitor
Configure sources in research_config.yaml:
research_sources:
arxiv_queries:
- "memory augmented neural networks"
- "long context transformers 2026"
- "retrieval augmented generation benchmark"
venues:
- ICLR 2026
- NeurIPS 2025
- ICML 2026
competitor_blogs:
- https://competitor-a.ai/research
citation_tracking:
- track papers that cite our core method
3.2 Paper Processing Protocol
For each new paper found:
- Relevance screen: does it benchmark on our target leaderboard(s) or report SOTA on our domain?
- Technique extraction: identify every architectural change, training trick, data strategy, or inference optimization the paper claims is responsible for improvement.
- Applicability assessment: for each technique, answer:
- Can it be applied to our architecture without a full redesign?
- Estimated implementation effort: Low / Medium / High
- Estimated score impact (based on paper's reported gains): Low (<1%) / Medium (1–3%) / High (>3%)
- Prioritize by: (estimated impact) × (1 / effort).
- Log to
research_log.md:
## [2026-02-10] Paper: "HyperMemory: Hierarchical State Spaces for Long-Context Recall"
**Source:** arXiv:2602.XXXXX
**Relevant leaderboard:** MemoryBench
**Reported gain:** +4.2% on MemoryBench vs prior SOTA
**Techniques extracted:**
- Hierarchical state compression (effort: Medium, impact: High) ← PRIORITIZED
- Cosine decay + warmup schedule (effort: Low, impact: Low)
- Synthetic data augmentation for long-range dependencies (effort: High, impact: Medium)
**Action:** Generate hypothesis for hierarchical state compression — schedule experiment.
3.3 Competitor Architecture Reverse-Engineering
When a competitor publishes a technical report or open-sources code:
- Extract every architectural and training detail.
- Diff against our current architecture.
- Flag differences as candidate experiments.
Phase 4 — Improvement Hypothesis Generation
Translate the competitive gap and research findings into a ranked experiment queue.
4.1 Hypothesis Format
Each hypothesis must state:
hypothesis:
id: H-042
title: "Hierarchical state compression reduces long-context forgetting"
claim: "Applying hierarchical state compression to our memory module will
improve MemoryBench accuracy from 0.823 to ≥0.850"
source: arXiv:2602.XXXXX + gap analysis (gap to #1 = 0.038)
target_leaderboards: [MemoryBench]
implementation:
change: "Replace flat KV cache with 3-level hierarchical compression"
effort: Medium
estimated_gain: +0.027
priority_score: 8.1 # (estimated_gain / effort_score) × urgency_multiplier
status: queued
4.2 Experiment Queue Management
Maintain a experiment_queue.yaml ranked by priority_score:
- Urgent (rank drop detected): run immediately, single seed first for a fast signal.
- Normal: run in scheduled daily sweep.
- Low priority: run only when compute budget allows.
Limit the queue to the top 10 active hypotheses. Archive superseded ones.
4.3 Architecture Tweak Proposals
For tweaks that don't come from papers (e.g., hyperparameter tuning):
- Use the gap size to determine search strategy:
- Gap < 1%: targeted fine-grained search (LR, weight decay).
- Gap 1–5%: component-level ablation (attention heads, layer depth, context window).
- Gap > 5%: architectural change required — escalate to research team with a written brief.
Phase 5 — Automated Experiment Execution
5.1 Directory Structure
experiments/
├── queue/ # pending hypotheses (YAML files)
├── active/ # currently running
├── results/
│ └── <hypothesis_id>/
│ ├── config.yaml
│ ├── metrics.json
│ ├── train.log
│ └── eval_on_leaderboard.json
├── promoted/ # configs promoted to production
└── archived/ # failed or superseded experiments
5.2 Runner Behavior
The automated runner:
- Picks the highest-priority item from the queue.
- Runs a fast validation (1 seed, subset of data) to catch implementation bugs. If it fails, move to
archived/with failure notes. Do not waste full compute on broken configs. - If fast validation passes, run the full experiment (3+ seeds, full eval suite).
- Write results to
results/<hypothesis_id>/metrics.json. - Compare against the current production score.
- Move to either
promoted/orarchived/based on Phase 5 promotion logic. - Start the next item in the queue.
5.3 Reproducibility Requirements
- All configs stored as YAML — no hardcoded values in scripts.
- Exact dependency versions pinned (
requirements.lock). - Fixed seeds applied to all RNG sources.
- Deterministic evaluation (disable dropout at eval, fixed batch order).
5.4 Failure Handling
- Failed run (crash/OOM): retry once, then archive with full error log.
- Result below baseline: archive immediately with gap analysis notes.
- Inconclusive result (mean improvement < threshold, high std): schedule an extended run with more seeds before deciding.
Phase 6 — Promotion Decision
A new configuration is promoted to production only when all of the following are true:
| Criterion | Requirement |
|---|---|
| Primary metric improvement | ≥ promotion_threshold above current production score |
| Statistical significance | p < 0.05 on paired t-test vs production config |
| No regression on secondary metrics | Latency within 10%, memory within 15% |
| Reproducibility | Consistent across ≥ 3 seeds (std < 0.5% of mean) |
| Leaderboard projection | If promoted, would we reach or exceed #1? |
If promotion is approved:
- Write the new config to
promoted/with a timestamp. - Update
competitive_registry.yaml→our_solution.current_scores. - Trigger a leaderboard submission if the target supports API submission.
- Log the promotion event in
CHANGELOG.md.
If rejected, write a clear rejection note explaining which criterion failed.
Phase 7 — Continuous Monitoring and Defense
Once at #1, the system switches to defense mode:
- Run the full eval suite on the production config on every scheduled cycle.
- Alert immediately if our score regresses (environment drift, data drift, dependency update).
- Watch for new leaderboard entrants within 5% of our score — pre-emptively queue experiments.
- Re-enter attack mode the moment any competitor overtakes us.
Phase 8 — Stakeholder Reporting
Produce two report types automatically:
8.1 Weekly Digest (for leadership / product)
# Benchmark Digest — Week of YYYY-MM-DD
## Competitive Position
- MemoryBench: #1 ✅ (our score: 0.871 | gap to #2: +0.010)
- LongContextEval: #2 ⚠️ (our score: 0.812 | gap to #1: -0.006)
## What Changed This Week
- Promoted H-042 (hierarchical compression): +0.031 on MemoryBench
- CompetitorA improved LongContextEval to 0.818 — now ahead of us
## Next Actions (Automated)
- Experiment H-047 (synthetic data aug) queued for LongContextEval — est. gain +0.009
- Research ingest scheduled for Monday
## Experiments Run This Week
- 4 experiments completed | 3 failed fast validation | 1 promoted
8.2 Technical Log (for engineers / researchers)
Full structured log at TECHNICAL_LOG.md:
- Every experiment run with config, metrics, and outcome.
- Every paper ingested with extracted techniques and disposition.
- Every promotion and rejection with full rationale.
- Current experiment queue with priority scores.
Engineers should be able to review the log in under 5 minutes and understand exactly what the system did and why.
Principles
- The system runs itself. Engineers should only intervene for High-effort architectural changes that exceed automated scope — not for routine benchmark runs.
- Competitive rank is the north star. Every experiment exists to close the gap to #1, not to satisfy academic curiosity.
- Speed over perfection on urgency. When a competitor overtakes us, run a fast 1-seed signal first. A rough answer in hours beats a perfect answer in days.
- Never tune on the test set. If the leaderboard uses a hidden test set, never use it for development decisions.
- Transparency over automation comfort. Every automated decision (promote / reject / archive) must be logged with a reason a human can audit.
- Research is a feed, not a manual task. Papers are inputs to a pipeline, not homework for researchers.
- Failed experiments have value. Log what didn't work and why — this prevents re-running the same dead ends.
Quick-Start Checklist
One-time setup:
-
competitive_registry.yamlpopulated with leaderboards and competitors - Victory condition defined per leaderboard
-
research_config.yamlconfigured with paper sources and domain queries - Scheduler configured (cron / CI pipeline / Airflow / GitHub Actions)
- Runner script deployed and tested on one hypothesis end-to-end
Each automated cycle (verify the system does this):
- Leaderboards scraped and gap matrix updated
- Competitive delta report generated
- New papers screened and techniques extracted
- Experiment queue re-ranked
- Highest-priority experiment run
- Promotion decision made and logged
- Weekly digest sent to stakeholders
Escalate to humans only when:
- Gap to #1 > 5% and no queued hypothesis is projected to close it
- A competitor's architecture cannot be reverse-engineered from public sources
- A promoted config causes a latency or cost regression beyond automated thresholds