Experiment Design Planner
Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.
Use this skill when:
- a user is about to run a new experiment or ablation
- a paper claim needs evidence
- baselines, metrics, controls, or datasets are unclear
- the user is changing too many variables at once
- cluster/compute time should not be wasted on ambiguous runs
- reviewer-proof evidence is needed before submission
Pair this skill with:
- research-project-memory when the experiment plan should become project-level evidence, risk, and action memory
- run-experiment after the design is ready to execute
- experiment-report-writer after results exist
- paper-reviewer-simulator to stress-test whether the evidence will satisfy reviewers
- baseline-selection-audit before finalizing the experiment matrix when baseline choice, fairness, or reviewer-proof comparisons need deeper review
- figure-results-review after plotted or tabulated results exist and need claim-support review
Skill Directory Layout
<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── ablation-matrix.md
    ├── evidence-standards.md
    ├── metrics-and-controls.md
    └── report-template.md
Progressive Loading
- Always read references/evidence-standards.md and references/metrics-and-controls.md.
- Read references/ablation-matrix.md when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
- Use references/report-template.md when saving or returning a substantial experiment plan.
- If the target repo has memory/, update planned evidence, experiment families, risks, and actions using research-project-memory conventions.
- If the experiment depends on current baselines, benchmarks, or leaderboard conventions, verify current sources with web search.
Core Principles
- Start from the claim, not the command line.
- State the hypothesis before running experiments.
- Use a baseline before introducing a new method.
- Change one variable at a time unless the experiment is explicitly factorial.
- Define controls and nuisance variables before interpreting results.
- Make negative results useful by defining falsification and fallback decisions.
- Design the table or figure before running the experiment.
- Stop conditions matter: decide what result is enough to move on.
Step 1 - Define the Claim and Question
Extract:
- paper or project claim
- research question
- target audience: internal debugging, advisor update, paper evidence, rebuttal, benchmark claim
- expected output: Markdown plan, LaTeX experiment section outline, run matrix, or saved file
- experiment mode:
  - single: one controlled comparison
  - ablation: component or variable isolation
  - benchmark: compare methods across datasets/tasks
  - theory: empirical support for a theoretical prediction
  - diagnostic: understand a failure mode or surprising result
Rewrite vague goals into testable questions:
Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
Step 2 - State Hypotheses
Write:
- primary hypothesis
- alternative explanations
- expected metric direction and rough effect size
- falsification condition
- decision rule
If the user cannot state a falsification condition, the experiment is not ready.
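The hypothesis fields above can be pinned down as a small pre-registration record before anything is launched. This is a minimal sketch, not part of the skill itself; all class and field names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Pre-registration record for a single experiment hypothesis."""
    primary: str             # the claim under test
    alternatives: list       # competing explanations to rule out
    expected_direction: str  # e.g. "metric M increases"
    rough_effect_size: str   # e.g. "+2 points over baseline B"
    falsification: str       # result that would refute the claim
    decision_rule: str       # what to do given each possible outcome

    def is_ready(self) -> bool:
        # No stated falsification condition means the experiment is not ready.
        return bool(self.falsification.strip())


h = Hypothesis(
    primary="Component X improves metric M over baseline B",
    alternatives=["extra parameters", "longer training budget"],
    expected_direction="M increases",
    rough_effect_size="+2 points on D1/D2",
    falsification="M(X) - M(B) <= 0 across 3 seeds",
    decision_rule="if falsified, run a capacity-matched ablation",
)
assert h.is_ready()
```

Writing the record first makes the Step 2 checklist auditable: a blank falsification field is caught before compute is spent.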
Step 3 - Define Evidence Standard
Read references/evidence-standards.md.
Decide what evidence is needed:
- one table, one curve, one ablation, one qualitative example, one theorem-aligned diagnostic, or a benchmark suite
- number of datasets/tasks
- number of seeds or repeats
- required baselines
- acceptable variance
- whether statistical testing or confidence intervals are needed
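When seed counts and confidence intervals are part of the evidence standard, it helps to fix the aggregation up front. A sketch using a plain normal-approximation interval (the scores are hypothetical; swap in bootstrap or t-intervals if the standard requires them):

```python
import statistics


def mean_with_ci(scores, z=1.96):
    """Mean and normal-approximation 95% CI half-width across seeds."""
    m = statistics.mean(scores)
    if len(scores) < 2:
        return m, float("inf")  # one seed gives no variance estimate
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return m, half


baseline = [71.2, 70.8, 71.5]  # hypothetical metric across 3 seeds
method = [73.1, 72.6, 73.4]
mb, hb = mean_with_ci(baseline)
mm, hm = mean_with_ci(method)
# Non-overlapping intervals are a crude but reviewer-legible check.
print(f"baseline {mb:.2f} ± {hb:.2f} | method {mm:.2f} ± {hm:.2f}")
```

The single-seed case returning an infinite half-width encodes the rule that one run is never evidence of a stable effect.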
- whether results must support a paper claim or only guide next steps
Step 4 - Choose Baselines and Controls
Identify:
- primary baseline
- strongest prior method or current SOTA, if relevant
- simple baseline
- ablation baseline
- oracle or upper bound, if useful
- controlled variables
- nuisance variables
If no baseline exists, make the first experiment a baseline-establishment experiment.
Step 5 - Choose Metrics and Logging
Read references/metrics-and-controls.md.
For each metric, specify:
- definition
- direction
- aggregation
- split
- variance reporting
- failure interpretation
- why it answers the question
Define required logging:
- command
- config path
- git commit
- dataset version
- seed
- hyperparameters
- hardware/runtime
- metrics
- artifacts: tables, figures, checkpoints, logs
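The required logging fields above can be frozen into one reproducibility record per run before it starts. A minimal sketch, assuming a git checkout; the function and field names are illustrative:

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def run_record(command, config_path, seed, hyperparams, dataset_version):
    """Capture what is needed to reproduce a run, before launching it."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "UNKNOWN"  # not inside a git checkout
    return {
        "command": command,
        "config": config_path,
        "git_commit": commit,
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparams": hyperparams,
        "python": sys.version.split()[0],
        "started_at": datetime.now(timezone.utc).isoformat(),
        "metrics": {},    # filled in after the run
        "artifacts": [],  # tables, figures, checkpoints, logs
    }


rec = run_record("python train.py", "configs/ablation_x.yaml",
                 seed=0, hyperparams={"lr": 3e-4}, dataset_version="d1-v2")
print(json.dumps(rec, indent=2))
```

Saving this record next to the run's artifacts is usually enough for another person to rerun the experiment bit-for-bit.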
Step 6 - Build Run Matrix
Read references/ablation-matrix.md when there is more than one run.
Create a run table with:
- run ID
- changed variable
- fixed controls
- dataset/split
- metric
- seed/repeats
- expected result
- status
- output path
Split experiments if a run changes more than one conceptual variable.
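The run table and the one-variable-at-a-time rule can be checked mechanically if each run lists only the variable it changes relative to a shared control config. A sketch with illustrative variable names:

```python
CONTROLS = {"lr": 3e-4, "batch": 64, "dataset": "D1", "seed": 0}

RUNS = [
    {"id": "R1-baseline", "change": {}},                   # baseline B
    {"id": "R2-component-x", "change": {"model": "B+X"}},  # isolate component
    {"id": "R3-lr-sweep", "change": {"lr": 1e-4}},
]


def validate(runs):
    """Flag runs that change more than one conceptual variable."""
    bad = [r["id"] for r in runs if len(r["change"]) > 1]
    if bad:
        raise ValueError(f"split these runs, each changes >1 variable: {bad}")


def materialize(run):
    """Controls stay fixed; only the run's declared change is applied."""
    return {**CONTROLS, **run["change"]}


validate(RUNS)
assert materialize(RUNS[2])["lr"] == 1e-4
assert materialize(RUNS[2])["batch"] == 64  # control unchanged
```

A run that needs two entries in its change dict is exactly the case the text says to split into separate experiments.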
Step 7 - Define Stop Conditions and Next Decisions
Write:
- what result is sufficient to support the claim
- what result falsifies or weakens the claim
- what result triggers another ablation
- what result means stop and write/report
- compute budget ceiling
- deadline constraints
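The stop conditions and decision rules above can be written down as an explicit mapping from outcome to next action, so the decision is made before results exist. A sketch; all thresholds are hypothetical and should come from the Step 3 evidence standard:

```python
def next_action(delta, ci_half, budget_left_gpu_h):
    """Map an observed effect (method minus baseline) to a pre-registered
    decision. delta is the mean improvement; ci_half the CI half-width."""
    if budget_left_gpu_h <= 0:
        return "stop: compute budget ceiling reached, report what exists"
    if delta - ci_half > 1.0:
        return "stop: claim supported, write it up"
    if delta + ci_half < 0.0:
        return "stop: claim falsified, record the negative result"
    return "continue: effect ambiguous, run the next planned ablation"


assert next_action(2.0, 0.5, 10).startswith("stop: claim supported")
assert next_action(-1.0, 0.4, 10).startswith("stop: claim falsified")
assert next_action(0.3, 0.6, 10).startswith("continue")
```

Note that the falsified branch still produces a recorded result, which is what makes negative outcomes useful rather than wasted.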
Step 8 - Reviewer Risk Check
Before finalizing, ask:
- Would a reviewer complain that the baseline is weak?
- Is the comparison fair?
- Are seeds/repeats enough?
- Does the experiment isolate the claimed mechanism?
- Are metrics aligned with the claim?
- Is there a confounder that could explain the result?
- Would a negative result still teach something?
If the answer exposes a major weakness, update the design before execution.
Step 9 - Write the Experiment Plan
Use references/report-template.md.
If saving to a project and no path is given, use:
docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:
docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md
The final plan should be runnable by run-experiment and later reportable by experiment-report-writer.
Step 10 - Write Back to Project Memory
If the project uses research-project-memory, update:
- memory/evidence-board.md: planned EVD-### items and EXP-### experiment families
- memory/provenance-board.md: planned source classes, expected CSV/report outputs, and aggregation requirements when known
- memory/claim-board.md: linked claims, marking planned, evidence-needed, or provisional claims honestly
- memory/risk-board.md: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
- memory/action-board.md: runnable next actions, including which experiment to launch first
- memory/handoff-board.md: create a ready handoff to run-experiment when the plan is runnable
- memory/phase-dashboard.md: update the active experiment-design or evidence-production gate
- relevant worktree .agent/worktree-status.md: experiment purpose and exit condition if a branch/worktree is involved
Use planned status for experiments that have not run. Do not record expected outcomes as observed evidence.
Final Sanity Check
Before finalizing:
- claim and hypothesis are explicit
- baseline is defined
- independent variable is isolated
- controls and nuisance variables are listed
- metrics are tied to the question
- run matrix is concrete
- logging requirements are sufficient for reproduction
- stop condition and decision rule are explicit
- reviewer risks are stated
- project memory is updated when the repo has memory/