Experiment Design Checklist
Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.
The Core Principle
Before running ANY experiment, you should be able to answer:
- What specific claim will this experiment support or refute?
- What would convince a skeptical reviewer?
- What could go wrong that would invalidate the results?
Process
Step 1: State the Hypothesis Precisely
Convert your research question into falsifiable predictions:
Template:
If [intervention/method], then [measurable outcome], because [mechanism].
Examples:
- "If we add auxiliary contrastive loss, then downstream task accuracy increases by >2%, because representations become more separable."
- "If we use learned positional encodings, then performance on sequences >4096 tokens improves, because the model can extrapolate beyond training length."
Null hypothesis: What does "no effect" look like? This is what you're trying to reject.
Step 2: Identify Variables
Independent Variables (what you manipulate):
| Variable | Levels | Rationale |
|---|---|---|
| [Var 1] | [Level A, B, C] | [Why these levels] |
Dependent Variables (what you measure):
| Metric | How Measured | Why This Metric |
|---|---|---|
| [Metric 1] | [Procedure] | [Justification] |
Control Variables (what you hold constant):
| Variable | Fixed Value | Why Fixed |
|---|---|---|
| [Var 1] | [Value] | [Prevents confound X] |
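One way to keep these three categories from drifting mid-project is to encode them in the run configuration itself. A minimal sketch in Python; every field name here is a hypothetical placeholder, not any framework's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical config that makes variable roles explicit."""
    # Independent variable: the one thing this experiment manipulates.
    aux_loss_weight: float

    # Control variables: held fixed across ALL runs to prevent confounds.
    learning_rate: float = 3e-4
    batch_size: int = 256
    training_steps: int = 100_000

# Dependent variables are measured, not configured -- but naming them up
# front keeps the primary metric from being chosen post-hoc.
DEPENDENT_METRICS = ("downstream_accuracy", "val_loss")

# One config per level of the independent variable; everything else fixed.
sweep = [ExperimentConfig(aux_loss_weight=w) for w in (0.0, 0.1, 0.5)]
```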
Step 3: Choose Baselines
Every experiment needs comparisons. No result is meaningful in isolation.
Baseline Hierarchy:
1. Random/Trivial Baseline
   - What does random chance achieve?
   - Sanity check that the task isn't trivial
2. Simple Baseline
   - Simplest reasonable approach
   - Often embarrassingly effective
3. Standard Baseline
   - Well-known method from the literature
   - Apples-to-apples comparison
4. State-of-the-Art Baseline
   - Current best published result
   - Only if you're claiming SOTA
5. Ablated Self
   - Your method minus key components
   - Shows each component contributes
For each baseline, document:
- Source (paper, implementation)
- Hyperparameters used
- Whether you re-ran or used reported numbers
- Any modifications made
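A lightweight way to enforce this documentation is a baseline registry kept next to the code. The sketch below is illustrative; the names, sources, and hyperparameters are all placeholders:

```python
# Hypothetical baseline registry; every entry here is a placeholder.
BASELINES = {
    "standard_transformer": {
        "source": "original paper + official implementation",
        "hyperparameters": {"lr": 1e-4, "layers": 6},
        "numbers": "re-ran",      # "re-ran" vs. "reported" (copied from paper)
        "modifications": "none",
    },
    "simple_bow": {
        "source": "our implementation",
        "hyperparameters": {"lr": 1e-3},
        "numbers": "re-ran",
        "modifications": "added early stopping",
    },
}
```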
Step 4: Design Ablations
Ablations answer: "Is each component necessary?"
Ablation Template:
| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---|---|---|---|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |
Good ablations are:
- Surgical (one change at a time)
- Interpretable (clear what was changed)
- Informative (result tells you something)
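One way to keep ablations surgical is to express each variant as a one-field diff against the full configuration. A minimal sketch; the component names and the `launch_run` hook are hypothetical:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Config:
    """Minimal stand-in config; field names are hypothetical."""
    use_aux_loss: bool = True
    use_component_b: bool = True

full = Config()

# Each variant differs from the full model in exactly one field, so any
# performance change is attributable to that single change.
ablations = {
    "full_model": full,
    "wo_aux_loss": dataclasses.replace(full, use_aux_loss=False),
    "wo_component_b": dataclasses.replace(full, use_component_b=False),
}

for name, cfg in ablations.items():
    print(name, cfg)  # in practice: launch_run(name, cfg)
```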
Step 5: Address Confounds
Things that could explain your results OTHER than your hypothesis:
Common Confounds:
| Confound | How to Check | How to Control |
|---|---|---|
| Hyperparameter tuning advantage | Compare tuning budgets across methods | Same tuning budget for all; report the procedure |
| Compute advantage | Matched FLOPs/params | Report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
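Of these, data leakage is the cheapest to check mechanically. A minimal sketch that flags exact duplicates between splits; real pipelines also need near-duplicate detection, which this deliberately does not attempt:

```python
import hashlib

def exact_overlap(train_texts, test_texts):
    """Return test items whose normalized text also appears in train.

    Catches only exact duplicates (after lowercasing and whitespace
    normalization); near-duplicates need fuzzier checks.
    """
    def key(s):
        return hashlib.sha256(" ".join(s.lower().split()).encode()).hexdigest()

    train_keys = {key(s) for s in train_texts}
    return [s for s in test_texts if key(s) in train_keys]

leaks = exact_overlap(train_texts=["a b", "c d"], test_texts=["A  b", "e f"])
print(len(leaks), "exact duplicates found")  # -> 1 (case/whitespace-normalized)
```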
Step 6: Statistical Rigor
Sample Size:
- How many random seeds? (Minimum: 3, better: 5+)
- How many data splits? (If applicable)
- Power analysis: can you detect the expected effect size? (see the sketch below)
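A quick power calculation with statsmodels, assuming a two-sample t-test setting; the effect size is Cohen's d and the numbers are illustrative:

```python
from statsmodels.stats.power import TTestIndPower

# How many runs per method to detect a given effect at 80% power?
# effect_size is Cohen's d: (difference in means) / (pooled std across seeds).
n = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"~{n:.0f} runs per method")  # even a large effect (d=0.8) needs ~26
```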
What to Report:
- Mean ± standard deviation (or standard error)
- Confidence intervals where appropriate
- Statistical significance tests if claiming "better"
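A minimal reporting sketch with numpy and scipy; the per-seed accuracies are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies from 5 seeds of the same method.
accs = np.array([0.812, 0.824, 0.809, 0.831, 0.818])

mean = accs.mean()
std = accs.std(ddof=1)            # sample standard deviation
sem = std / np.sqrt(len(accs))    # standard error of the mean
# 95% confidence interval from the t-distribution (appropriate for small n).
lo, hi = stats.t.interval(0.95, df=len(accs) - 1, loc=mean, scale=sem)

print(f"{mean:.3f} ± {std:.3f} (95% CI [{lo:.3f}, {hi:.3f}], n={len(accs)})")
```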
Appropriate Tests:
| Comparison | Test | Assumptions |
|---|---|---|
| Two methods, normal data | t-test | Normality, equal variance |
| Two methods, unknown distribution | Mann-Whitney U | Ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality |
| Multiple methods, unknown distribution | Kruskal-Wallis | Ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Same test instances |
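All of these are available in scipy.stats. A hedged sketch comparing two methods; the numbers are illustrative, and Welch's t-test is used to drop the equal-variance assumption:

```python
from scipy import stats

# Hypothetical per-seed accuracies for two methods on the same seeds/splits.
ours     = [0.824, 0.831, 0.818, 0.829, 0.826]
baseline = [0.809, 0.815, 0.811, 0.819, 0.812]

# Unpaired, normality assumed: Welch's t-test (no equal-variance assumption).
_, p_t = stats.ttest_ind(ours, baseline, equal_var=False)

# Unpaired, unknown distribution: Mann-Whitney U.
_, p_u = stats.mannwhitneyu(ours, baseline, alternative="two-sided")

# Paired (same test instances per seed): Wilcoxon signed-rank.
_, p_w = stats.wilcoxon(ours, baseline)

print(f"Welch p={p_t:.4f}, Mann-Whitney p={p_u:.4f}, Wilcoxon p={p_w:.4f}")
```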
Avoid:
- p-hacking (running experiments until the result is significant)
- Uncorrected multiple comparisons (apply Bonferroni or a similar correction; see the sketch below)
- Reporting only favorable metrics
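For the multiple-comparison point specifically, statsmodels provides the correction; the p-values below are made up:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.012, 0.034, 0.049]  # hypothetical: one p-value per comparison
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(p_adj)   # [0.036 0.102 0.147] -- only the first survives correction
print(reject)  # [ True False False]
```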
Step 7: Compute Budget
Before running, estimate:
| Component | Estimate | Notes |
|---|---|---|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| Total | T GPU-hours | Buffer: 1.5-2x |
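The arithmetic is simple enough to script before launch; every number below is a placeholder to be replaced with measured estimates:

```python
# Illustrative placeholders -- substitute your own measured estimates.
single_run_hours = 8    # GPU-hours for one training run
search_runs      = 20   # hyperparameter search runs
baseline_runs    = 3    # baselines re-run under the same budget
ablation_runs    = 4    # ablation variants
seeds            = 5

core = (1 + search_runs + baseline_runs + ablation_runs) * single_run_hours
total = core * seeds    # "M seeds × above", as in the table
budgeted = total * 1.5  # 1.5-2x buffer for failures and reruns

print(f"core={core} GPU-h, all seeds={total} GPU-h, budgeted={budgeted:.0f} GPU-h")
```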
Go/No-Go Decision: Is this feasible with available resources?
Step 8: Pre-Registration (Optional but Recommended)
Write down BEFORE running:
- Exact hypotheses
- Primary metrics (not chosen post-hoc)
- Analysis plan
- What would constitute "success"
This prevents unconsciously moving the goalposts.
Output: Experiment Design Document
# Experiment Design: [Title]
## Hypothesis
[Precise statement]
## Variables
### Independent
[Table]
### Dependent
[Table]
### Controls
[Table]
## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]
## Ablations
[Table]
## Confound Mitigation
[Table]
## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]
## Compute Budget
[Table with total estimate]
## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]
## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]
## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
Red Flags in Experiment Design
🚩 "We'll figure out the metrics later" 🚩 "One run should be enough" 🚩 "We don't need baselines, it's obviously better" 🚩 "Let's just see what happens" 🚩 "We can always run more if it's not significant" 🚩 No compute estimate before starting 🚩 Vague success criteria