juma-ab-test-plan
A/B Test Plan
Overview
Creates rigorous, data-driven A/B test plans that go beyond "let's try a different button color." Each test plan includes a clear hypothesis, statistical requirements (sample size, duration), variant design, success criteria, and a results interpretation template — plus client communication templates for explaining outcomes to non-technical stakeholders.
When to Use
- CRO audit has identified testable improvement opportunities
- Client wants to validate a design or messaging change before full rollout
- Campaign performance suggests landing page or creative optimization is needed
- Quarterly planning includes a testing roadmap
- Client asks "should we change X?" — the answer is "let's test it"
Prerequisites
- juma-client-context — business goals, current conversion rates, traffic volumes
- juma-cro-audit (recommended) — provides prioritized test opportunities
- juma-analytics-setup (recommended) — ensures conversion tracking is in place
Process
Step 1: Identify Test Opportunity
From CRO audit findings, campaign data, or client requests:
- What's the problem? — Current state and why it's suboptimal
- What's the evidence? — Data supporting the need for a test (analytics, heatmaps, user feedback)
- What's the potential impact? — Estimated conversion or revenue improvement
- Is it testable? — Sufficient traffic, isolatable variable, measurable outcome
Step 2: Formulate Hypothesis
Use the structured format:
If we [specific change], then [specific metric] will [specific direction by specific amount] because [rationale based on evidence or behavioral principle].
Examples:
- "If we shorten the lead form from 8 fields to 4 fields, then form completion rate will increase by 15% because reducing friction lowers abandonment (Hick's Law)."
- "If we add customer testimonials above the fold on the pricing page, then demo request rate will increase by 10% because social proof reduces purchase anxiety."
Bad hypotheses to avoid:
- "If we make the button green, conversions will increase" (no rationale)
- "If we redesign the page, everything will improve" (not specific)
Step 3: Calculate Sample Size
Determine how many visitors/users are needed per variant:
Inputs required:
- Baseline conversion rate: Current rate for the metric being tested
- Minimum Detectable Effect (MDE): Smallest improvement worth detecting (typically 10-20% relative)
- Statistical significance level: Usually α = 0.05 (i.e., 95% confidence)
- Statistical power: Usually 80% (β = 0.20)
- Number of variants: Control + 1 or more treatments
See sample-size-reference.md for the complete sample size reference table, test duration calculation formula, and guidance on what to do when sample size is too large.
Minimum test duration: 1 full business cycle (typically 1-2 weeks) regardless of sample size to account for day-of-week effects.
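A quick way to sanity-check the reference table is a standard two-proportion power calculation. The sketch below uses Python's statsmodels; the baseline rate, MDE, and daily traffic values are placeholders, not client data.

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05           # current conversion rate (placeholder)
mde_relative = 0.15       # minimum detectable effect: 15% relative lift
treatment = baseline * (1 + mde_relative)

# Cohen's h effect size for two proportions, then solve for n per variant
effect_size = proportion_effectsize(baseline, treatment)
n_per_variant = math.ceil(
    NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0)
)

daily_traffic = 2_000     # visitors/day entering the test (placeholder)
num_variants = 2          # control + one treatment
duration_days = math.ceil(n_per_variant * num_variants / daily_traffic)

print(f"Required sample size per variant: {n_per_variant}")
print(f"Estimated duration: {duration_days} days")
```

Even if the estimate comes back shorter, hold the test for at least one full business cycle as noted above.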
Step 4: Design Variants
For each test variant, document:
- Control (A): Current experience — describe exactly what it is
- Treatment (B): Changed experience — describe exactly what changes
- Isolation: Only ONE variable should differ between control and treatment
- Visual mockup: Screenshot or wireframe of each variant
- Implementation notes: What needs to change technically (HTML, CSS, JS, backend)
For multivariate tests (MVT):
- Document all combinations
- Ensure traffic supports the larger sample size requirement
- Note that MVT requires significantly more traffic than A/B
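To make that traffic requirement concrete, the hypothetical sketch below enumerates every MVT combination and multiplies by the per-variant sample size from Step 3; the factor names and numbers are illustrative only.

```python
from itertools import product

# Hypothetical factors: each additional factor multiplies the variant count
factors = {
    "headline": ["current", "benefit-led"],
    "cta_color": ["blue", "green"],
    "form_length": ["8 fields", "4 fields"],
}

combinations = list(product(*factors.values()))
n_per_variant = 4_700      # example value from the Step 3 calculation

print(f"Variants to test: {len(combinations)}")                        # 2 x 2 x 2 = 8
print(f"Total visitors required: {len(combinations) * n_per_variant}")
```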
Step 5: Define Traffic Allocation
- Split ratio: Typically 50/50 for two variants
- Ramp-up plan: Start at 10/90 for 24 hours to catch errors, then scale to 50/50
- Targeting: All visitors or specific segments?
- Exclusions: Returning visitors who saw variant A shouldn't see variant B
- Cookie/session duration: How long does variant assignment persist?
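One common way to satisfy the persistence and exclusion requirements is deterministic, hash-based bucketing: the same visitor always lands in the same variant without any stored state. This is an illustrative sketch, not the assignment logic of any particular testing tool; `user_id` and `experiment_id` are assumed identifiers.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a visitor into 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# The same inputs always return the same variant, across sessions and devices
print(assign_variant("visitor-123", "pricing-testimonials-q3"))
```

Note that changing `treatment_share` mid-test moves visitors whose bucket falls in the shifted range, so finalize the split before the measurement window starts.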
Step 6: Define Success Metrics
| Metric Type | Description | Example |
|---|---|---|
| Primary metric | The one metric that determines the winner | Form completion rate |
| Secondary metrics | Supporting metrics that provide context | Time on page, scroll depth |
| Guardrail metrics | Metrics that must NOT degrade | Revenue per visitor, bounce rate |
- A test is only a winner if the primary metric improves AND guardrail metrics hold
- Document the minimum acceptable threshold for each metric
Step 7: Segmentation Plan
Define segments to analyze post-test:
- Device: Desktop vs mobile vs tablet
- Traffic source: Organic vs paid vs direct vs referral
- Geography: By region if relevant
- User type: New vs returning visitors
- Time: Weekday vs weekend
Note: Segment analysis is exploratory — don't use it to cherry-pick a winner.
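In practice the segment breakdown is just a grouped conversion-rate table built from the per-visitor test export. A minimal sketch with pandas, assuming hypothetical column names (variant, device, converted):

```python
import pandas as pd

# Hypothetical per-visitor export from the testing tool / analytics
df = pd.DataFrame({
    "variant":   ["control", "treatment", "control", "treatment"],
    "device":    ["desktop", "desktop", "mobile", "mobile"],
    "converted": [1, 1, 0, 1],
})

# Conversion rate and sample count per segment and variant -- context only,
# not grounds for declaring a segment-specific winner
segment_view = (
    df.groupby(["device", "variant"])["converted"]
      .agg(conversion_rate="mean", visitors="count")
)
print(segment_view)
```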
Step 8: Test Execution Checklist
- Test tool configured (Google Optimize, VWO, Optimizely, etc.)
- QA on all devices and browsers
- Analytics goals tracking correctly for both variants
- No flicker or layout shift on page load
- Test documented in testing log/roadmap
- Stakeholders informed of test start date and expected duration
- Monitoring plan set (daily check for errors, not for "peeking" at results)
Step 9: Results Interpretation Framework
When the test reaches required sample size:
- Statistical significance: Is p-value < 0.05?
- Practical significance: Is the actual improvement meaningful for the business?
- Confidence interval: What's the range of likely true effect?
- Segment consistency: Does the result hold across key segments?
- Guardrail check: Did any guardrail metrics degrade?
Decision framework:
- Significant + practically meaningful + guardrails hold → Implement winner
- Significant but practically small → Consider implementation cost vs benefit
- Not significant → Inconclusive. Consider: was MDE realistic? Was test duration sufficient?
- Guardrails degraded → Do not implement, even if primary metric improved
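Once the test closes, the significance and confidence-interval checks map to a standard two-proportion z-test. A sketch with statsmodels, using hypothetical counts that should be replaced with the real results (guardrail metrics get the same check against their minimum acceptable thresholds):

```python
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Hypothetical results at full sample size: [control, treatment]
conversions = [430, 490]
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
ci_low, ci_high = confint_proportions_2indep(
    conversions[1], visitors[1],   # treatment
    conversions[0], visitors[0],   # control
    compare="diff",
)

print(f"p-value: {p_value:.4f}")
print(f"95% CI for absolute lift (treatment - control): [{ci_low:.4f}, {ci_high:.4f}]")
```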
Step 10: Client Communication
Prepare results communication in non-technical language. See results-communication-templates.md for the complete win, loss, and inconclusive templates with examples and communication best practices.
Output Format
# A/B Test Plan: [Test Name]
**Client**: [Client Name] | **Date**: [Date] | **Test ID**: [ID]
## Hypothesis
If we [specific change], then [metric] will [direction] by [amount] because [rationale].
## Test Parameters
| Parameter | Value |
|-----------|-------|
| Test page/element | [URL or element] |
| Primary metric | [Metric] |
| Baseline rate | [X%] |
| Minimum Detectable Effect | [X% relative] |
| Required sample size (per variant) | [N] |
| Estimated daily traffic | [N] |
| Estimated test duration | [N days/weeks] |
| Statistical significance target | 95% |
| Statistical power | 80% |
## Variants
### Control (A): Current Experience
[Description + screenshot/mockup]
### Treatment (B): Changed Experience
[Description + screenshot/mockup]
**Isolated variable**: [What exactly changes between A and B]
## Success Criteria
| Metric | Type | Target | Min. Acceptable |
|--------|------|--------|-----------------|
| [Primary metric] | Primary | +[X%] | +[Y%] |
| [Secondary metric] | Secondary | Directional improvement | — |
| [Guardrail metric] | Guardrail | No degradation | >[X%] |
## Traffic Allocation
- Split: [50/50]
- Targeting: [All visitors / Specific segments]
- Ramp-up: [10/90 for 24h, then 50/50]
## Segmentation Analysis Plan
- [ ] Device type
- [ ] Traffic source
- [ ] New vs returning
- [ ] [Other relevant segments]
## Implementation Checklist
- [ ] Test configured in [tool]
- [ ] QA complete (desktop, mobile, tablet)
- [ ] Conversion tracking verified
- [ ] Stakeholders notified
- [ ] Start date: [Date]
- [ ] Expected end date: [Date]
## Results Template
[Pre-formatted table for recording results when test concludes]
| Metric | Control | Treatment | Difference | Significance |
|--------|---------|-----------|------------|--------------|
| [Primary] | — | — | — | — |
| [Secondary] | — | — | — | — |
| [Guardrail] | — | — | — | — |
Common Mistakes
- Peeking at results — Checking daily and stopping early when results look good inflates false positive rates. Wait for full sample size
- Testing too many variables — Multivariate tests need exponentially more traffic. Start with simple A/B tests
- Unrealistic MDE — Hoping to detect a 2% improvement with 500 visitors/day means a test lasting months. Be realistic about detectable effects
- No guardrail metrics — A test that increases signups but tanks revenue is not a win. Always define guardrails
- Ignoring seasonal effects — Don't run tests across major holidays or events. Results will be confounded
- Declaring a winner without significance — "Variant B has a higher number" is not the same as "Variant B is statistically significantly better"
- Not documenting learnings — Every test, win or lose, generates knowledge. Maintain a testing log
Related Skills
- juma-cro-audit — Generates the prioritized test opportunities this skill operationalizes
- juma-campaign-plan — Campaign-level experiments use this test planning framework
- juma-analytics-setup — Conversion tracking must be configured before testing
- juma-reporting — Test results feed into monthly performance reports
- juma-paid-media-plan — Ad creative testing follows this same methodology