ab-test-setup

SKILL.md

A/B Test Setup - Experimentation Design & Analysis

Category: Product Team Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing

Overview

A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.


The Experiment Lifecycle

1. HYPOTHESIZE  →  2. DESIGN  →  3. CALCULATE  →  4. IMPLEMENT
       ↑                                                    │
       │                                                    ▼
7. ITERATE  ←  6. DOCUMENT  ←  5. ANALYZE  ←  [Run to completion]

Step 1: Hypothesis Formulation

The Hypothesis Template

Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].

We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.

Good vs Bad Hypotheses

Quality Hypothesis Problem
Bad "Changing the button color might increase clicks" No data basis, no target, no measurement plan
Mediocre "A green button will get more clicks than blue" No "why", no target size, no guardrails
Good "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." Data-backed, specific change, measurable outcome, defined audience, guardrail

Hypothesis Sources (Where to Find Test Ideas)

Source What to Look For Example
Analytics data Drop-off points, low-performing pages "80% of users drop off at step 3 of onboarding"
User research Confusion, frustration, unmet needs "Users don't understand what the product does from the homepage"
Heatmaps/session recordings Ignored elements, rage clicks "Nobody scrolls past the fold on pricing page"
Support tickets Recurring complaints, feature confusion "Users constantly ask how to invite team members"
Competitor analysis Different approaches to same problem "Competitor uses a wizard; we use a form"
Sales objections Common reasons prospects don't convert "Prospects want to see pricing before signing up"

Step 2: Test Design

Test Types

Type Variants Traffic Need Best For
A/B 2 (control + 1 variant) Moderate Single change validation
A/B/n 3+ variants High Comparing multiple approaches
Multivariate (MVT) Combinations of changes Very high Optimizing multiple elements
Split URL Different pages Moderate Major redesigns
Bandit Dynamic allocation Low-moderate Revenue optimization

Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.

What to Test (By Impact)

Category High Impact Medium Impact Low Impact
Copy Headline/value prop, CTA text Body copy, social proof Microcopy, labels
Design Page layout, above-fold content Visual hierarchy, imagery Color, font size
UX Number of steps, form fields Button placement, navigation Animations, transitions
Pricing Price point, plan names Feature packaging, anchoring Billing frequency display
Social Proof Testimonials vs none, logos Testimonial format, placement Testimonial count

Metric Selection

Every test needs three types of metrics:

Primary Metric (1 only)

  • The single metric that determines success
  • Directly tied to the hypothesis
  • Must be measurable within the test duration
  • Examples: signup rate, click-through rate, purchase rate

Secondary Metrics (2-3)

  • Explain why the primary metric moved
  • Provide context for decision-making
  • Examples: time on page, scroll depth, feature adoption rate

Guardrail Metrics (1-3)

  • Things that must NOT get worse
  • Stop the test if significantly negative
  • Examples: error rate, support ticket volume, page load time, refund rate

Step 3: Sample Size Calculation

Quick Reference Table

Minimum visitors PER VARIANT needed (95% confidence, 80% power):

Baseline Rate 5% Lift 10% Lift 15% Lift 20% Lift 50% Lift
1% 620,000 156,000 70,000 39,000 6,400
2% 305,000 77,000 34,000 19,500 3,200
3% 200,000 51,000 23,000 12,800 2,100
5% 116,000 29,500 13,200 7,500 1,250
10% 54,000 13,800 6,200 3,500 600
20% 24,000 6,200 2,800 1,600 280
50% 6,100 1,600 720 410 75

Duration Calculation

Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page

Minimum duration: 7 days (to capture day-of-week effects) Maximum recommended: 6 weeks (beyond this, external factors contaminate results)

What If You Don't Have Enough Traffic?

Situation Solution
Need 100K visitors, get 5K/week Increase minimum detectable effect (test bolder changes)
Very low traffic (<1K/week) Use qualitative testing (user testing, surveys) instead
Medium traffic (5-20K/week) Run for 4-6 weeks, test big changes only
High traffic (50K+/week) You can test subtle changes, run multiple tests

Step 4: Implementation

Client-Side Implementation

JavaScript modifies the page after initial render.

Pros: Quick to implement, no deploy needed Cons: Can cause flicker (flash of original content), blocked by ad blockers Tools: PostHog, Optimizely, VWO, Google Optimize

Anti-flicker pattern:

// Add to <head> before any rendering
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>document.documentElement.classList.add('ab-test-hide');</script>

// In your test script (runs after variant assignment):
document.documentElement.classList.remove('ab-test-hide');

Server-Side Implementation

Variant determined before page renders. No flicker, no client-side dependency.

Pros: No flicker, not blocked by ad blockers, works for logged-in features Cons: Requires engineering work, deploy needed Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags

Basic feature flag pattern:

# Server-side variant assignment
def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures same user always sees same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100

    if bucket < 50:
        return "control"
    else:
        return "variant"

Traffic Allocation

Strategy Split When to Use
Standard 50/50 Default. Maximum statistical power.
Conservative 90/10 or 80/20 Risky changes, revenue-impacting tests
Ramped Start 95/5, increase to 50/50 New infrastructure, technical risk

Critical rules:

  • Users must see the same variant on every visit (sticky assignment by user ID or cookie)
  • Allocation must be balanced across time of day and day of week
  • Never change allocation mid-test

Step 5: Running the Test

Pre-Launch Checklist

  • Hypothesis documented with primary metric and minimum detectable effect
  • Sample size calculated, expected duration estimated
  • Both variants implemented and QA'd on all device types
  • Tracking verified (events fire correctly for both variants)
  • No other tests running on the same page/feature
  • Stakeholders informed of test duration and "no peeking" rule
  • External factor calendar checked (no major launches, holidays, press)

During the Test

DO:

  • Monitor for technical errors (variant not rendering, tracking broken)
  • Check that traffic split is balanced daily
  • Document any external events that might affect results

DO NOT:

  • Look at results before reaching sample size ("peeking problem")
  • Make changes to either variant
  • Add traffic from new sources mid-test
  • Stop the test early because one variant "looks like it's winning"

The Peeking Problem (Critical)

Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a 25-40% false positive rate (vs the intended 5%).

Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.

Solutions:

  1. Pre-commit to sample size and do not check results until reached
  2. If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
  3. Set calendar reminder for expected completion date -- that is when you look

Step 6: Analysis

Analysis Checklist

  1. Did we reach planned sample size? If not, results are preliminary only.
  2. Is it statistically significant? p < 0.05 = 95% confidence the difference is real.
  3. What's the confidence interval? Tells you the range of likely true effect.
  4. Is the effect size meaningful? A 0.1% lift that's "significant" may not be worth implementing.
  5. Are secondary metrics consistent? Do they support the primary result?
  6. Any guardrail violations? Did anything get worse?
  7. Segment analysis: Different results for mobile vs desktop? New vs returning?

Interpreting Results

Result Primary Metric Confidence Action
Clear winner Variant +15%, p < 0.01 High Implement variant
Modest winner Variant +5%, p < 0.05 Medium Implement if easy, else run longer
Flat < 2% difference, p > 0.20 High (no effect) Keep control, test something bolder
Loser Variant -10%, p < 0.05 High Keep control, investigate why
Inconclusive 5% difference, p = 0.08 Low Need more traffic or bolder test
Mixed signals Primary up, guardrail down Investigate Dig into segments, do not ship blindly

Common Analysis Mistakes

Mistake Consequence Prevention
Stopping at first significance 25-40% false positive rate Commit to sample size
Cherry-picking segments Finding "winners" that don't replicate Pre-register segments of interest
Ignoring confidence intervals Overestimating effect size Always report CI alongside p-value
Multiple comparisons Inflated Type I error Bonferroni correction for A/B/n
Survivorship bias Only analyzing users who completed flow Include all users from assignment point
Simpson's paradox Aggregate hides segment reversal Always check key segments

Step 7: Documentation

Every test must be documented, regardless of outcome.

Test Documentation Template

EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]

HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].

VARIANTS:
- Control: [description]
- Variant: [description + screenshot]

METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]

RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]

DECISION: [Ship variant / Keep control / Iterate]

LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]

Experiment Prioritization Framework

ICE Scoring

Factor Score (1-10) Question
Impact How much will this move the metric? Big change to primary KPI = 10
Confidence How sure are we it will work? Strong data supporting hypothesis = 10
Ease How easy is it to implement and measure? Can ship in a day = 10

ICE Score = (Impact + Confidence + Ease) / 3

Rank all test ideas by ICE score. Run highest first.

Test Backlog Template

# Hypothesis Primary Metric ICE Est. Duration Status
1 Larger CTA increases signups Signup rate 8.3 2 weeks Ready
2 Social proof on pricing increases conversion Plan selection rate 7.0 3 weeks Needs design
3 Shorter onboarding increases activation Feature activation 6.7 4 weeks In backlog

Proactive Triggers

  • Someone debates between two design options: propose an A/B test instead of opinionating
  • Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
  • Pricing page changes discussed: always test pricing changes with guardrail metrics
  • Post-launch of any feature: propose follow-up experiment to optimize
  • "Let's just try it and see": redirect to structured hypothesis before implementation

Related Skills

Skill Use When
analytics-tracking Setting up event tracking that feeds experiment metrics
campaign-analytics Folding experiment results into broader attribution
launch-strategy Testing within a product launch sequence
prompt-engineer-toolkit A/B testing AI prompts in production
Weekly Installs
6
GitHub Stars
38
First Seen
6 days ago
Installed on
opencode6
gemini-cli6
github-copilot6
amp6
cline6
codex6