baseline-selection-audit
Baseline Selection Audit
Turn a claim, method, draft experiment plan, or literature map into a reviewer-proof baseline set and fairness ledger.
Use this skill when:
- experiments are being planned and the right baselines are unclear
- a literature review found competitors but they have not been converted into comparisons
- a paper may be missing SOTA, direct competitors, classics, ablation baselines, or control baselines
- a reviewer might complain about unfair tuning, scale, data, compute, protocol, or metric differences
- a rebuttal or revision needs to decide which additional baseline experiment is worth running
- the user needs to justify why a baseline is excluded as not comparable
Do not use this skill for citation metadata checks. Use citation-audit for BibTeX and LaTeX correctness. Use citation-coverage-audit when the primary question is missing references rather than missing comparisons.
Pair this skill with:
literature-review-sprintbefore this skill when the competing paper map is incompletealgorithm-design-plannerwhen the closest baseline changes the method designexperiment-design-plannerafter this skill to turn selected baselines into a concrete experiment matrixrun-experimentonly after baseline scope, fairness rules, and stop conditions are clearresult-diagnosiswhen baseline results are surprising, unstable, or stronger than the proposed methodpaper-evidence-boardwhen baseline risks must be linked to paper claims, figures, and sectionsresearch-project-memorywhen baseline decisions, risks, and actions should persist across sessions
Skill Directory Layout
<installed-skill-dir>/
├── SKILL.md
└── references/
├── baseline-taxonomy.md
├── fairness-ledger.md
├── memory-writeback.md
├── report-template.md
└── reviewer-risk.md
Progressive Loading
- Always read
references/baseline-taxonomy.md,references/fairness-ledger.md, andreferences/reviewer-risk.md. - Read
references/report-template.mdbefore writing the final audit. - Read
references/memory-writeback.mdwhen the project hasmemory/, component.agent/folders, or the user asks for persistent project memory. - If the baseline set depends on current SOTA, recent concurrent work, or venue expectations, verify with current sources through web search, OpenReview, proceedings, arXiv, PMLR, ACL Anthology, CVF, DBLP, Semantic Scholar, or user-provided papers.
- If current verification is unavailable, mark baseline status as provisional and identify the missing search needed before final experiment planning.
Core Principles
- Baselines exist to defend a claim, not to decorate a table.
- Separate closest conceptual competitor, strongest empirical baseline, standard benchmark baseline, ablation baseline, and control baseline.
- A baseline can be missing for citation purposes, comparison purposes, or both. Name which one.
- Fairness must cover data, model size, compute, tuning, metric, protocol, code availability, and reporting.
- Do not ask the user to run every possible baseline. Rank by reviewer impact and decision value.
- Excluding a baseline requires a defensible reason and often a citation or limitation statement.
- A strong baseline beating the method is project information, not merely an experiment failure.
- The output must hand off directly to
experiment-design-planner.
Step 1 - Recover Claim and Comparison Surface
Collect:
- paper claim or experiment claim
- proposed method and closest baseline, if known
- target task, dataset, benchmark, metric, and protocol
- target venue or community expectations
- existing results, draft tables, or planned experiments
- literature-review outputs, if available
- code availability and compute budget
- project memory IDs such as
CLM-###,EVD-###,RSK-###, orACT-###
Rewrite the claim into:
We need to show that [method] improves [property] over [comparison set] under [task/protocol], without the result being explained by [confound].
If this cannot be written, route to research-idea-validator, algorithm-design-planner, or paper-evidence-board.
Step 2 - Build Candidate Baseline Pool
Use:
- literature review outputs
- cited related work
- benchmark leaderboards or official baselines
- recent accepted papers at the target venue
- code repositories or model releases
- reviewer comments, if this is rebuttal mode
Classify each candidate using references/baseline-taxonomy.md.
The pool should include:
- direct competitor
- strongest current method
- standard benchmark baseline
- classic baseline
- previous version or nearest ablation of the user's method
- no-method or trivial control baseline
- oracle, upper bound, or diagnostic baseline when appropriate
- resource-matched baseline
- domain-specific baseline expected by the venue
Step 3 - Assign Baseline Requirement Level
For each candidate, assign exactly one:
must-have: paper is hard to defend without itshould-have: materially improves reviewer confidence, but omission may be defensibleoptional: useful context, low acceptance impactnot-comparable: related but unfair or invalid as a direct comparisoncitation-only: should be discussed/cited but does not need an experiment
Every must-have baseline needs an owner, experiment form, fairness constraints, and fallback if impossible.
Every not-comparable baseline needs a reason:
- different task or data
- incompatible metric
- unavailable code and reproduction too expensive
- different resource regime
- uses extra supervision or data
- no public details sufficient for faithful reproduction
- evaluates a different claim
Step 4 - Audit Fairness
Read references/fairness-ledger.md.
For each must-have and should-have baseline, check:
- same data split and preprocessing
- same training data and extra-data policy
- comparable model size or explicit scale control
- comparable compute or explicit compute-normalized metric
- comparable tuning budget
- comparable evaluation metric and decoding/sampling protocol
- correct official code or faithful reimplementation
- enough seeds, confidence intervals, or variance reporting
- same reporting unit: tokens, examples, images, FLOPs, wall-clock, NFE, parameters, or memory
If fairness cannot be achieved, decide whether to:
- change claim
- add a matched subset comparison
- run a smaller diagnostic comparison
- mark baseline as citation-only with clear limitation
- defer to rebuttal risk
Step 5 - Forecast Reviewer Attacks
Read references/reviewer-risk.md.
For each missing, weak, or unfair baseline, write the likely reviewer objection:
Reviewer could say: [attack].
Severity: fatal / major / medium / minor
Mitigation: run / cite / justify / narrow claim / move to appendix / accept risk
Prioritize by acceptance impact:
- fatal novelty or comparison threat
- required benchmark/SOTA omission
- unfair tuning or compute
- weak ablation baseline
- unclear protocol
- missing control
Step 6 - Produce Experiment Handoff
For experiment-design-planner, output:
- selected baselines and requirement levels
- exact comparison table rows
- fairness ledger fields to log
- metrics and protocol constraints
- ablation/control baselines
- stop conditions
- expected reviewer question each baseline answers
- fallback plan if a baseline is impossible
If compute is limited, propose a staged plan:
- minimal reviewer-proof set
- high-impact optional additions
- appendix or deferred baselines
Step 7 - Write the Baseline Audit Report
Read references/report-template.md.
If saving to a project and no path is given, use:
docs/experiments/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:
docs/reports/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
The report must include:
- claim under audit
- candidate baseline pool
- requirement-level table
- fairness ledger
- reviewer attack forecast
- selected experiment matrix handoff
- baselines excluded and why
- memory update section
Step 8 - Write Back to Project Memory
Read references/memory-writeback.md when memory exists.
Update the smallest useful set of entries:
memory/risk-board.md: missing, unfair, unavailable, or not-comparable baseline risksmemory/evidence-board.md: planned baseline comparisons and ablationsmemory/action-board.md: implementation, run, citation, or justification actionsmemory/claim-board.md: claims narrowed by baseline feasibilitymemory/decision-log.md: durable decisions to include, exclude, or stage baselines- worktree
.agent/worktree-status.md: baseline implementation purpose and exit condition paper/.agent/: table/section implications when a draft exists
Use certainty labels:
verifiedfor baselines checked against primary sources or official codeuser-statedfor constraints supplied by the userinferredfor reviewer risks and fairness judgmentsunverifiedfor candidates not yet checked
Final Sanity Check
Before finalizing:
- every paper claim has at least one direct comparison or control
- closest conceptual competitor and strongest empirical baseline are not conflated
must-havebaselines are explicit- excluded baselines have defensible reasons
- fairness constraints are concrete enough to run
- reviewer attacks are written in reviewer language
- the output can feed directly into
experiment-design-planner - project memory is updated when present
More from a-green-hand-jack/ml-research-skills
project-init
Initialize an ML research project control root. Use for paper/code/slides repos, shared memory, GitHub Project alignment, agent guidance, worktree policy, and lifecycle handoffs.
37project-sync
Sync verified code-side experiment results into paper memory. Use when logs, reports, run docs, or user-confirmed metrics should become paper-facing evidence.
36add-git-tag
Create annotated Git milestone tags. Use when completing a phase, releasing a version, or marking a research checkpoint.
36update-docs
Refresh project documentation after code changes. Use after implementing features, changing behavior, or preparing a milestone commit.
36init-latex-project
Initialize a LaTeX academic paper project. Use for new conference or journal papers needing templates, macros, venue preambles, and writing guidance.
36new-workspace
Create Git branches or worktrees for research code and paper versions. Use for experiments, baselines, rebuttal fixes, arXiv/camera-ready branches, and worktree memory.
36