ab-test-analysis
A/B Test Analysis
When to use
- An experiment has finished and the team needs a ship / no-ship recommendation
- Results look directionally positive but the team is unsure if they're statistically significant
- A test has been running for weeks without a clear winner and someone needs to decide whether to continue
- A new experiment needs sample-size planning before launch
- Results are disputed and need a rigorous, documented analysis
Process
- Confirm test design — verify the hypothesis, the control and treatment definitions, the randomisation unit (user/session/device), the primary metric, any guardrail metrics, and the target split ratio.
- Check for sample ratio mismatch (SRM) — run a chi-square test on the actual vs. expected split. If SRM is detected, stop and investigate the randomisation pipeline before interpreting results. Use scripts/ab_test_analyzer.py --check-srm (a minimal sketch of this check appears after this list).
- Calculate per-variant metrics — compute the rate (or mean) and its 95% confidence interval for the primary metric in each variant. Document the absolute and relative difference.
- Run the significance test — execute a two-proportion z-test (for rates) or Welch's t-test (for means). Record the test statistic, p-value, and 95% CI for the effect. Use references/statistical_tests_reference.md if unsure which test applies (a z-test sketch also appears after this list).
- Check guardrail metrics — run the same significance test for each guardrail metric. A significant degradation on any guardrail is a blocker regardless of the primary metric result.
- Produce the recommendation — synthesise the SRM result, power, significance, and guardrail checks into a clear ship / no-ship / extend decision. Quantify the expected business impact if shipped. Record in assets/ab_test_report_template.md.
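For reference, a minimal sketch of the SRM check in Python, assuming per-variant assignment counts are already aggregated and a 50/50 target split; the counts and the 0.001 alarm threshold are illustrative and not taken from the bundled script.

```python
# SRM check: chi-square goodness-of-fit of observed vs. expected assignment counts.
# Counts, split ratio, and alarm threshold below are illustrative assumptions.
from scipy import stats

observed = [50_341, 49_530]        # users assigned to control and treatment (hypothetical)
target_split = [0.5, 0.5]          # expected split ratio from the test plan
expected = [sum(observed) * p for p in target_split]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
if p_value < 0.001:                # a common SRM alarm threshold
    print(f"SRM detected (chi2={chi2:.1f}, p={p_value:.2e}); "
          "investigate the randomisation pipeline before reading results")
else:
    print(f"No SRM detected (p={p_value:.3f})")
```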
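Similarly, a sketch of the two-proportion z-test with a 95% confidence interval for a rate metric; all counts are hypothetical placeholders. For a mean-valued metric, Welch's t-test (scipy.stats.ttest_ind with equal_var=False) would take the place of the z-test, per references/statistical_tests_reference.md.

```python
# Two-proportion z-test for a rate metric, plus a 95% CI for the absolute lift.
# Per-variant user and conversion counts are hypothetical placeholders.
from math import sqrt

from scipy import stats

users = {"control": 50_341, "treatment": 49_530}
conversions = {"control": 5_012, "treatment": 5_205}

p_c = conversions["control"] / users["control"]
p_t = conversions["treatment"] / users["treatment"]

# z-statistic uses the pooled rate under the null hypothesis of no difference
pooled = sum(conversions.values()) / sum(users.values())
se_pooled = sqrt(pooled * (1 - pooled) * (1 / users["control"] + 1 / users["treatment"]))
z = (p_t - p_c) / se_pooled
p_value = 2 * stats.norm.sf(abs(z))        # two-sided p-value

# 95% CI for the absolute difference uses the unpooled standard error
se_diff = sqrt(p_c * (1 - p_c) / users["control"] + p_t * (1 - p_t) / users["treatment"])
lift = p_t - p_c
ci_low, ci_high = lift - 1.96 * se_diff, lift + 1.96 * se_diff

print(f"absolute lift {lift:+.4f} (relative {100 * lift / p_c:+.1f}%), "
      f"z={z:.2f}, p={p_value:.4f}, 95% CI [{ci_low:+.4f}, {ci_high:+.4f}]")
```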
Inputs the skill needs
- Test plan or hypothesis document (variant definitions, randomisation unit, primary metric)
- Data with, at a minimum: user_id, variant assignment, and the primary metric outcome
- Optional: guardrail metric values per user, daily aggregate data for temporal validity checks
- Target split ratio (e.g., 50/50)
- Minimum detectable effect or business threshold for "worth shipping"
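For the sample-size planning case mentioned under "When to use", a hedged sketch of turning the minimum detectable effect into a per-variant sample size with statsmodels; the baseline rate, relative MDE, power, and alpha below are assumptions to be replaced with values from the test plan.

```python
# Pre-launch sample-size planning for a rate metric with a 50/50 split.
# Baseline rate, MDE, power, and alpha are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10              # current conversion rate (assumption)
mde_relative = 0.05               # smallest lift worth shipping: +5% relative (assumption)
target_rate = baseline_rate * (1 + mde_relative)

effect_size = proportion_effectsize(target_rate, baseline_rate)   # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                   # two-sided significance level
    power=0.80,                   # desired statistical power
    ratio=1.0,                    # equal allocation (50/50 split)
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} users needed per variant")
```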
Output
- scripts/ab_test_analyzer.py — runs the SRM check, significance test, power analysis, and guardrail checks from a CSV or summary-stats input
- references/statistical_tests_reference.md — which test to use and when
- references/ab_test_design_guide.md — SRM causes, power planning, peeking and multiple testing
- assets/ab_test_report_template.md — structured report: design, results, checks, recommendation, expected impact