ab-test-setup
A/B Test Setup - Experimentation Design & Analysis
Category: Product Team
Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing
Overview
A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.
The Experiment Lifecycle
1. HYPOTHESIZE → 2. DESIGN → 3. CALCULATE → 4. IMPLEMENT
     ↑                                           │
     │                                           ▼
7. ITERATE ← 6. DOCUMENT ← 5. ANALYZE ← [Run to completion]
Step 1: Hypothesis Formulation
The Hypothesis Template
Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].
We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.
Good vs Bad Hypotheses
| Quality | Hypothesis | Problem |
|---|---|---|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |
Hypothesis Sources (Where to Find Test Ideas)
| Source | What to Look For | Example |
|---|---|---|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps/session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |
Step 2: Test Design
Test Types
| Type | Variants | Traffic Need | Best For |
|---|---|---|---|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |
Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.
What to Test (By Impact)
| Category | High Impact | Medium Impact | Low Impact |
|---|---|---|---|
| Copy | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| Design | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| UX | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| Pricing | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| Social Proof | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |
Metric Selection
Every test needs three types of metrics:
Primary Metric (1 only)
- The single metric that determines success
- Directly tied to the hypothesis
- Must be measurable within the test duration
- Examples: signup rate, click-through rate, purchase rate
Secondary Metrics (2-3)
- Explain why the primary metric moved
- Provide context for decision-making
- Examples: time on page, scroll depth, feature adoption rate
Guardrail Metrics (1-3)
- Things that must NOT get worse
- Stop the test if significantly negative
- Examples: error rate, support ticket volume, page load time, refund rate
Step 3: Sample Size Calculation
Quick Reference Table
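Minimum visitors PER VARIANT needed (95% confidence, 80% power, two-sided test). Lift is relative to the baseline rate, so a 10% lift on a 5% baseline means 5.5%. Values are approximate: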
| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|---|---|---|---|---|---|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
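For baselines or lifts not in the table, the numbers can be approximated with the standard normal-approximation formula for comparing two proportions. The sketch below is one common form of that formula, not necessarily the one used to build the table, so expect results in the same ballpark rather than exact matches. The function name and parameters are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate visitors per variant for a two-sided, two-proportion test."""
    if not 0 < baseline_rate < 1 or relative_lift <= 0:
        raise ValueError("baseline_rate must be in (0, 1) and relative_lift > 0")
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate we hope the variant reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline, 10% relative lift (compare with the 5% row, 10% Lift column)
print(sample_size_per_variant(0.05, 0.10))
```

Halving the detectable lift roughly quadruples the required sample, which is why testing bolder changes is the main lever for low-traffic pages.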
Duration Calculation
Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page
Minimum duration: 7 days (to capture day-of-week effects)
Maximum recommended: 6 weeks (beyond this, external factors contaminate results)
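The duration formula and the 7-day / 6-week limits can be combined into a quick feasibility check. This is a minimal sketch; the function names and the daily traffic figure in the example are hypothetical.

```python
from math import ceil

MIN_DAYS = 7       # capture day-of-week effects
MAX_DAYS = 6 * 7   # beyond 6 weeks, external factors contaminate results


def estimate_duration_days(
    sample_size_per_variant: int, num_variants: int, daily_traffic: int
) -> int:
    """Days needed to reach the planned sample, never less than MIN_DAYS."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    raw_days = ceil(sample_size_per_variant * num_variants / daily_traffic)
    return max(raw_days, MIN_DAYS)


def is_feasible(duration_days: int) -> bool:
    """True when the test fits inside the recommended maximum duration."""
    return duration_days <= MAX_DAYS


# 29,500 per variant (5% baseline, 10% lift), 2 variants, 3,000 visitors/day
days = estimate_duration_days(29_500, 2, 3_000)
print(days, is_feasible(days))  # 20 True
```

If `is_feasible` returns False, raise the minimum detectable effect or pick a higher-traffic page rather than running past six weeks.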
What If You Don't Have Enough Traffic?
| Situation | Solution |
|---|---|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |
Step 4: Implementation
Client-Side Implementation
JavaScript modifies the page after initial render.
Pros: Quick to implement, no deploy needed
Cons: Can cause flicker (flash of original content), blocked by ad blockers
Tools: PostHog, Optimizely, VWO (Google Optimize was discontinued in September 2023)
Anti-flicker pattern:
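<!-- Add to <head> before any rendering -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>
  document.documentElement.classList.add('ab-test-hide');
  // Safety net: reveal the page even if the test script never runs (e.g. blocked).
  // The 2000 ms timeout is an assumed value; tune it for your pages.
  setTimeout(function () { document.documentElement.classList.remove('ab-test-hide'); }, 2000);
</script>
// In your test script (runs after variant assignment):
document.documentElement.classList.remove('ab-test-hide');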
Server-Side Implementation
Variant determined before page renders. No flicker, no client-side dependency.
Pros: No flicker, not blocked by ad blockers, works for logged-in features
Cons: Requires engineering work, deploy needed
Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags
Basic feature flag pattern:
import hashlib

# Server-side variant assignment
def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures same user always sees same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    # First 8 hex characters -> integer -> bucket in the range 0-99
    bucket = int(hash_value[:8], 16) % 100
    # Buckets 0-49 get control, 50-99 get the variant (50/50 split)
    if bucket < 50:
        return "control"
    else:
        return "variant"
Traffic Allocation
| Strategy | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |
Critical rules:
- Users must see the same variant on every visit (sticky assignment by user ID or cookie)
- Allocation must be balanced across time of day and day of week
- Never change allocation mid-test
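The same deterministic-hash idea extends to the uneven splits in the table. A minimal sketch, assuming a hypothetical `assign_variant` helper with a `control_share` parameter, and using SHA-256 for bucketing (the hash is only used to spread users evenly, not for security):

```python
import hashlib


def assign_variant(user_id: str, experiment: str, control_share: int = 50) -> str:
    """Sticky assignment: control_share percent of users get control."""
    if not 0 <= control_share <= 100:
        raise ValueError("control_share must be between 0 and 100")
    digest = hashlib.sha256(f"{user_id}:{experiment}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0-99
    return "control" if bucket < control_share else "variant"


# Conservative 90/10 split for a risky change (experiment name is made up)
print(assign_variant("user-123", "pricing-page-v2", control_share=90))
```

Including the experiment name in the hash input keeps buckets independent across experiments, so a user who lands in the variant of one test is not systematically placed in the variant of the next.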
Step 5: Running the Test
Pre-Launch Checklist
- Hypothesis documented with primary metric and minimum detectable effect
- Sample size calculated, expected duration estimated
- Both variants implemented and QA'd on all device types
- Tracking verified (events fire correctly for both variants)
- No other tests running on the same page/feature
- Stakeholders informed of test duration and "no peeking" rule
- External factor calendar checked (no major launches, holidays, press)
During the Test
DO:
- Monitor for technical errors (variant not rendering, tracking broken)
- Check daily that the traffic split matches the planned allocation
- Document any external events that might affect results
DO NOT:
- Look at results before reaching sample size ("peeking problem")
- Make changes to either variant
- Add traffic from new sources mid-test
- Stop the test early because one variant "looks like it's winning"
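The traffic-split check in the DO list looks only at visitor counts, not at outcome metrics, so it does not count as peeking. One way to run it is a chi-square goodness-of-fit test for sample ratio mismatch. This is a sketch for a two-variant test; the function name is hypothetical, and the 0.001 alert threshold in the example is an assumed convention, not a rule from this document.

```python
from math import erfc, sqrt


def srm_p_value(
    control_count: int, variant_count: int, expected_control_share: float = 0.5
) -> float:
    """P-value that the observed split is consistent with the planned allocation."""
    total = control_count + variant_count
    if total <= 0 or not 0 < expected_control_share < 1:
        raise ValueError("need visitors and an expected share strictly between 0 and 1")
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (control_count - expected_control) ** 2 / expected_control
        + (variant_count - expected_variant) ** 2 / expected_variant
    )
    # With one degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2))
    return erfc(sqrt(chi2 / 2))


# Planned 50/50, observed 5,000 vs 5,400: a very small p-value signals a broken split
p = srm_p_value(5_000, 5_400)
print("investigate assignment or tracking" if p < 0.001 else "split looks fine")
```

A failed check usually points to an implementation problem (redirect losses, bot filtering, broken tracking in one variant), so fix the cause before trusting any metric from the test.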
The Peeking Problem (Critical)
Looking at results before reaching the planned sample size and stopping because one variant looks better can push the false positive rate to 25-40% when results are checked repeatedly (vs the intended 5%). The more often you check, the worse it gets.
Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.
Solutions:
- Pre-commit to sample size and do not check results until reached
- If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
- Set calendar reminder for expected completion date -- that is when you look
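A small simulation makes the problem concrete. The sketch below models A/A tests (no true difference) under a normal approximation with equally spaced looks, and compares deciding once at the end against stopping at the first significant look. The function name, number of looks, and simulation count are all illustrative assumptions.

```python
import random
from math import sqrt


def false_positive_rate(
    num_looks: int,
    peek: bool,
    simulations: int = 5000,
    z_critical: float = 1.96,
    seed: int = 0,
) -> float:
    """Share of simulated A/A tests (no true difference) declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for look in range(1, num_looks + 1):
            # Each look adds one equal-sized batch of data; with no true effect
            # the batch contributes a standard normal increment to the running sum.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(look)
            is_final_look = look == num_looks
            if abs(z) > z_critical and (peek or is_final_look):
                false_positives += 1
                break
    return false_positives / simulations


print("decide once at the end:    ", false_positive_rate(num_looks=20, peek=False))
print("stop at first significance:", false_positive_rate(num_looks=20, peek=True))
```

The first number should sit near the intended 5%, while the second should come out several times higher, and it keeps growing as `num_looks` increases.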
Step 6: Analysis
Analysis Checklist
- Did we reach planned sample size? If not, results are preliminary only.
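- Is it statistically significant? p < 0.05 means a difference at least this large would show up less than 5% of the time if the variants truly performed the same. It is not a 95% probability that the variant is better.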
- What's the confidence interval? Tells you the range of likely true effect.
- Is the effect size meaningful? A 0.1% lift that's "significant" may not be worth implementing.
- Are secondary metrics consistent? Do they support the primary result?
- Any guardrail violations? Did anything get worse?
- Segment analysis: Different results for mobile vs desktop? New vs returning?
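The significance, confidence interval, and effect size items in the checklist can be computed with a two-proportion z-test. This is a minimal sketch using only the standard library; `ABResult` and `two_proportion_test` are hypothetical names, and the counts in the example are made up.

```python
from math import sqrt
from statistics import NormalDist
from typing import NamedTuple


class ABResult(NamedTuple):
    relative_lift: float  # (variant rate - control rate) / control rate
    p_value: float        # two-sided
    ci_low: float         # confidence interval for the absolute difference
    ci_high: float


def two_proportion_test(
    control_conversions: int,
    control_visitors: int,
    variant_conversions: int,
    variant_visitors: int,
    confidence: float = 0.95,
) -> ABResult:
    p_c = control_conversions / control_visitors
    p_v = variant_conversions / variant_visitors
    diff = p_v - p_c
    pooled = (control_conversions + variant_conversions) / (
        control_visitors + variant_visitors
    )
    if pooled in (0.0, 1.0) or p_c == 0.0:
        raise ValueError("need some conversions and some non-conversions")
    # Pooled standard error for the hypothesis test (assumes no difference)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / control_visitors + 1 / variant_visitors))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Unpooled standard error for the confidence interval
    se = sqrt(p_c * (1 - p_c) / control_visitors + p_v * (1 - p_v) / variant_visitors)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ABResult(diff / p_c, p_value, diff - z_crit * se, diff + z_crit * se)


# Hypothetical counts: control 500/10,000 vs variant 575/10,000
result = two_proportion_test(500, 10_000, 575, 10_000)
print(f"lift {result.relative_lift:.1%}, p = {result.p_value:.3f}, "
      f"CI for absolute difference [{result.ci_low:.4f}, {result.ci_high:.4f}]")
```

Report the interval alongside the p-value: an interval that barely excludes zero suggests the true lift could be much smaller than the observed one.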
Interpreting Results
| Result | Primary Metric | Confidence | Action |
|---|---|---|---|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect at the size tested for) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |
Common Analysis Mistakes
| Mistake | Consequence | Prevention |
|---|---|---|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Results skewed by analyzing only users who completed the flow | Include all users from assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
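For the multiple comparisons row, the Bonferroni correction divides the significance threshold by the number of variant-vs-control comparisons. A minimal sketch, with a hypothetical function name and made-up p-values:

```python
def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Return the variants whose p-value clears the Bonferroni-adjusted threshold."""
    if not p_values:
        return []
    adjusted_alpha = alpha / len(p_values)  # one comparison per variant vs control
    return [name for name, p in p_values.items() if p < adjusted_alpha]


# Hypothetical p-values for three variants, each compared against control
results = {"variant_b": 0.030, "variant_c": 0.004, "variant_d": 0.200}
print(bonferroni_significant(results))  # ['variant_c'] (threshold is 0.05 / 3)
```

Note that `variant_b` would have passed an uncorrected 0.05 threshold. Bonferroni is conservative, which is one more reason to keep the number of variants small.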
Step 7: Documentation
Every test must be documented, regardless of outcome.
Test Documentation Template
EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]
HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].
VARIANTS:
- Control: [description]
- Variant: [description + screenshot]
METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]
RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]
DECISION: [Ship variant / Keep control / Iterate]
LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]
Experiment Prioritization Framework
ICE Scoring
| Factor | Question (score each 1-10) | Example of a 10 |
|---|---|---|
| Impact | How much will this move the metric? | Big change to primary KPI = 10 |
| Confidence | How sure are we it will work? | Strong data supporting hypothesis = 10 |
| Ease | How easy is it to implement and measure? | Can ship in a day = 10 |
ICE Score = (Impact + Confidence + Ease) / 3
Rank all test ideas by ICE score. Run highest first.
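Scoring and ranking can be sketched in a few lines. The class and function names are illustrative, and the individual factor scores are hypothetical values chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ExperimentIdea:
    hypothesis: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def ice_score(self) -> float:
        """Average of the three factors, rounded to one decimal place."""
        return round((self.impact + self.confidence + self.ease) / 3, 1)


def rank_by_ice(ideas: list[ExperimentIdea]) -> list[ExperimentIdea]:
    """Highest ICE score first."""
    return sorted(ideas, key=lambda idea: idea.ice_score, reverse=True)


# Factor scores below are hypothetical
backlog = [
    ExperimentIdea("Shorter onboarding increases activation", 8, 6, 6),
    ExperimentIdea("Larger CTA increases signups", 9, 8, 8),
    ExperimentIdea("Social proof on pricing increases conversion", 7, 7, 7),
]
for idea in rank_by_ice(backlog):
    print(idea.ice_score, idea.hypothesis)
```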
Test Backlog Template
| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|---|---|---|---|---|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |
Proactive Triggers
- Someone debates between two design options: propose an A/B test instead of arguing from opinion
- Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
- Pricing page changes discussed: always test pricing changes with guardrail metrics
- Post-launch of any feature: propose follow-up experiment to optimize
- "Let's just try it and see": redirect to structured hypothesis before implementation
Related Skills
| Skill | Use When |
|---|---|
| analytics-tracking | Setting up event tracking that feeds experiment metrics |
| campaign-analytics | Folding experiment results into broader attribution |
| launch-strategy | Testing within a product launch sequence |
| prompt-engineer-toolkit | A/B testing AI prompts in production |