# A/B Testing Statistics

## Framework

### IRON LAW: Calculate Sample Size BEFORE Running the Test

Running a test without knowing the required sample size leads to one of two
failure modes: stopping too early (inflated false positives) or running too
long (wasted traffic and time).

Required inputs: baseline conversion rate, minimum detectable effect (MDE),
significance level (α), power (1-β). Calculate BEFORE starting.

### Sample Size Formula (Proportions)

n per group ≈ (Z_{α/2} + Z_β)² × [p₁(1-p₁) + p₂(1-p₂)] / (p₁ - p₂)²

Quick reference (α=0.05, power=0.8):

| Baseline Rate | MDE (relative) | N per Group |
|---------------|----------------|-------------|
| 5%            | 10% (→ 5.5%)   | ~31,000     |
| 5%            | 20% (→ 6.0%)   | ~8,200      |
| 10%           | 10% (→ 11%)    | ~15,000     |
| 10%           | 20% (→ 12%)    | ~4,000      |
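
As a sanity check, here is a minimal Python sketch of the formula above (the function name and defaults are my own, not part of this skill); it reproduces the table rows within rounding:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(baseline, mde_rel, alpha=0.05, power=0.80):
    """Approximate N per group for a two-sided test of proportions."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)        # MDE is relative to baseline
    z_alpha = norm.ppf(1 - alpha / 2)    # Z_{alpha/2}, two-sided
    z_beta = norm.ppf(power)             # Z_beta
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

print(n_per_group(0.05, 0.10))   # 31,231 (5% baseline, 10% relative MDE)
print(n_per_group(0.10, 0.20))   # 3,839  (10% baseline, 20% relative MDE)
```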

## Testing Approaches

| Approach | How It Works | Best When |
|----------|--------------|-----------|
| Frequentist (fixed-horizon) | Set the sample size, run to completion, then analyze | Standard practice; well understood |
| Bayesian | Update beliefs with data; compute probability of improvement | You want probability statements ("90% chance B is better") |
| Sequential testing | Check results at intervals with adjusted thresholds | You need to stop early on a clear winner, or to limit downside risk |
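
A minimal sketch of the Bayesian row, assuming Beta(1, 1) priors (the helper name and example counts are illustrative, not prescribed by this skill). It estimates P(B beats A) by Monte Carlo over the two posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return float((post_b > post_a).mean())

# e.g. 520/10,000 conversions in control vs 570/10,000 in treatment
print(prob_b_beats_a(520, 10_000, 570, 10_000))   # ≈ 0.94
```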

## Experiment Design Checklist

  1. Hypothesis: What do you expect to happen and why?
  2. Primary metric: ONE key metric (conversion, revenue, retention)
  3. Guardrail metrics: Metrics that must NOT degrade (page load time, error rate)
  4. Randomization unit: User, session, or device?
  5. Sample size: Calculated from baseline, MDE, α, power
  6. Duration: Account for weekly cycles (minimum 1-2 full weeks; see the sketch after this list)
  7. Stopping rules: Pre-defined — do NOT peek and stop early without correction
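
A small sketch for items 5-6, turning the required N per group into a test duration rounded up to full weeks (the traffic figure is a made-up assumption):

```python
from math import ceil

def duration_days(n_per_group, groups=2, daily_traffic=3_000, min_weeks=2):
    """Days needed to collect the sample, rounded up to full weeks."""
    days = ceil(n_per_group * groups / daily_traffic)
    weeks = max(ceil(days / 7), min_weeks)   # respect the weekly-cycle minimum
    return weeks * 7

print(duration_days(15_000))   # 30,000 users at 3,000/day -> 14 days
```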

## Analysis Steps

  1. Check randomization balance (are groups comparable on pre-treatment metrics?)
  2. Calculate observed difference and confidence interval
  3. Run a significance test (z-test for proportions, t-test for continuous metrics; see the sketch after this list)
  4. Check guardrail metrics
  5. Interpret with practical significance in mind
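
A sketch of steps 2-3 for a proportion metric, with the formulas written out by hand (inputs are hypothetical). It follows the common textbook convention: pooled standard error for the test statistic, unpooled for the confidence interval:

```python
from math import sqrt
from scipy.stats import norm

def two_prop_ztest(x1, n1, x2, n2, alpha=0.05):
    """Two-sided z-test for proportions plus an unpooled CI on the difference."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se_pooled
    p_value = 2 * norm.sf(abs(z))
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    half = norm.ppf(1 - alpha / 2) * se
    return p2 - p1, (p2 - p1 - half, p2 - p1 + half), p_value

diff, ci, p = two_prop_ztest(500, 10_000, 570, 10_000)
print(f"diff={diff:+.4f}  CI95=({ci[0]:+.4f}, {ci[1]:+.4f})  p={p:.3f}")
# diff=+0.0070  CI95=(+0.0008, +0.0132)  p=0.028
```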

## Output Format

```markdown
# A/B Test Design: {Experiment Name}

## Hypothesis
- H₀: {no difference}
- H₁: {expected improvement}
- Primary metric: {metric}
- MDE: {X% relative}

## Sample Size
- Baseline rate: {X%}
- Required N per group: {N}
- Estimated duration: {days/weeks}

## Results (post-test)
| Metric | Control | Treatment | Diff | CI (95%) | p-value |
|--------|---------|-----------|------|----------|---------|
| {primary} | X% | X% | +X% | [X, X] | {value} |

## Decision
{Ship / Don't ship / Extend test} — {rationale}
```

## Gotchas

- Peeking inflates false positives: Checking results daily and stopping the first time p < 0.05 can produce a 30%+ false-positive rate. Use sequential testing methods if you need to peek (see the simulation after this list).
- Novelty effect: New features may show a lift that fades as users get used to them. Run tests long enough (2+ weeks) to stabilize.
- Simpson's paradox: An overall positive result can be negative in every subgroup (or vice versa). Segment by key dimensions.
- Network effects / interference: If treatment users interact with control users (social features, marketplaces), independence is violated. Use cluster randomization.
- The statistical significance threshold is arbitrary: α=0.05 is convention, not truth. For high-stakes decisions (pricing, major UX changes), consider α=0.01.
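
A quick simulation of the peeking problem: an A/A test (no true difference) checked with a two-proportion z-test once per day. All parameters are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def peeking_false_positive_rate(days=20, users_per_day=1_000, p=0.05,
                                sims=2_000):
    """Fraction of null A/A tests declared 'significant' with daily peeks."""
    hits = 0
    for _ in range(sims):
        ca = cb = na = nb = 0
        for _ in range(days):
            ca += rng.binomial(users_per_day, p); na += users_per_day
            cb += rng.binomial(users_per_day, p); nb += users_per_day
            pool = (ca + cb) / (na + nb)
            se = np.sqrt(pool * (1 - pool) * (1 / na + 1 / nb))
            z = (cb / nb - ca / na) / se
            if 2 * norm.sf(abs(z)) < 0.05:   # stop at first "significant" peek
                hits += 1
                break
    return hits / sims

print(peeking_false_positive_rate())   # well above 0.05; around 0.2 here
```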

## References

- For Bayesian A/B testing methodology, see references/bayesian-ab.md
- For the multi-armed bandit approach, see references/bandits.md