ab-test-planner
A/B Test Planner Skill
Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.
Required Inputs
Ask the user for these if not provided:
- What is being tested (feature, UI change, copy, pricing, onboarding step)
- Hypothesis (or ask to help formulate one)
- Primary metric (conversion rate, click-through, completion rate, etc.)
- Baseline rate and minimum detectable effect (MDE)
- Daily eligible users (to calculate duration)
Experiment Design Checklist
Before running any test, confirm:
- Clear hypothesis with predicted direction
- Single primary metric (plus up to 2 guardrail metrics)
- Minimum detectable effect (MDE) defined
- Sample size calculated
- Test duration estimated
- Segment isolated (no overlap with other running tests)
- Rollback plan defined
Hypothesis Template
"We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."
Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.
Sample Size Calculator Logic
Use this formula (provide the output, not the formula, to the user):
- Baseline conversion rate: Current rate of primary metric
- MDE: Smallest change worth detecting (recommend 10–20% relative lift for most features)
- Statistical power: 80% (standard)
- Significance level: 95% (p < 0.05)
For common scenarios, provide pre-calculated estimates:
| Baseline Rate | MDE (Relative) | Required Sample per Variant |
|---|---|---|
| 5% | 20% | ~19,000 |
| 10% | 15% | ~14,000 |
| 20% | 10% | ~15,000 |
| 40% | 10% | ~9,500 |
| 60% | 5% | ~42,000 |
Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."
Test Duration Guidance
Minimum: 2 full weeks (to capture weekly seasonality) Maximum: 4 weeks (novelty effect distorts results beyond this)
Duration = Required sample ÷ (Daily traffic × % exposed)
Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).
Output Format
A/B Test Plan — [Test Name] — [Date]
Hypothesis:
[Filled hypothesis template]
Variants:
- Control (A): [Current experience]
- Treatment (B): [Changed experience — be specific]
Primary Metric: [Metric name + how measured] Guardrail Metrics: [Metrics that must not degrade]
Target Segment: [Who sees the test — % of traffic, user type] Traffic Split: [50/50 recommended unless ramp-up needed]
Sample Size Required: ~[N] users per variant Estimated Duration: [X] weeks (based on [Y] daily eligible users) Significance Threshold: 95% confidence, 80% power
Exclusions: [Any user segments to exclude and why]
Rollback Trigger: If [guardrail metric] degrades by [X%], stop the test immediately.
Results Interpretation Guide:
- ✅ Ship if: Treatment shows [X%]+ lift on primary metric at 95% confidence AND guardrail metrics are stable
- 🔄 Iterate if: Direction is positive but not significant — consider extending or redesigning
- ❌ Reject if: No lift or negative direction at significance
- ⚠️ Inconclusive: Do not ship. Do not call it a win.
Guidelines
- Always recommend against peeking at results before the test reaches planned sample size — explain p-hacking risk
- If user wants to test multiple variants, explain the multiple comparisons problem and recommend a Bonferroni correction or a Bayesian approach
- If traffic is very low (<1,000 users/day), recommend qualitative alternatives: moderated testing, 5-second tests, or user interviews
- Never approve a test with no guardrail metrics — always protect revenue, retention, or core engagement
Quality Checks
- Hypothesis is directional (predicts a specific direction and magnitude, not "let's see")
- Primary metric is singular (guardrail metrics are secondary)
- Sample size is calculated from actual MDE and baseline (not guessed)
- Test duration accounts for weekly seasonality (minimum 2 weeks)
- Guardrail metrics are defined (at least one to protect revenue or core engagement)
- Rollback trigger is specified with a concrete threshold
More from mohitagw15856/pm-claude-skills
user-research-synthesis
Analyze and synthesize user research findings into structured, actionable insights. Use when given user research data, interview transcripts, survey results, or user feedback that needs to be analyzed and summarised. Produces a themed synthesis with prevalence data, supporting quotes, pain points analysis, feature request prioritisation, and recommended next steps.
26prd-template
Create a Product Requirements Document following proven PM template structure. Use when asked to write a PRD, product spec, feature specification, or requirements document for a new feature or product. Produces a complete PRD with problem statement, user stories, functional requirements, technical considerations, and success metrics.
20stakeholder-update
Create executive stakeholder updates following proven communication frameworks. Use when the user needs to create a status update, progress report, executive summary, or communication for leadership, stakeholders, or executives.
19competitive-analysis
Analyze competitors and create competitive landscape documentation with feature matrices, positioning maps, and strategic recommendations. Use when asked to analyze competitors, create competitive analysis, compare features with competitors, build a competitive landscape, track competitive positioning, or prepare sales battlecard inputs. Produces structured competitor profiles, feature comparison matrix, win/loss analysis, and prioritised strategic recommendations.
18meeting-notes
Structure and format meeting notes following PM best practices. Use when asked to create meeting notes, format discussion notes, capture action items, or document decisions from any meeting type. Produces structured notes with decisions, action items (owner + deadline), open questions, and next steps.
17executive-summary
Write an executive summary for any document, report, or proposal. Use when asked to write an executive summary, management summary, briefing paper, or one-pager for senior stakeholders. Produces a structured summary that busy executives can read in under 3 minutes and act on.
15