Trustworthy Experiments
What It Is
Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.
The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"
Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:
- 66-92% of experiments fail to improve the target metric
- 8% of experiments have invalid results due to sample ratio mismatch alone
- When the base success rate is 8%, a result that is significant at p < 0.05 still carries about a 26% risk of being a false positive
This framework helps you avoid the common traps that make experiment results untrustworthy.
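The 26% figure above follows from Bayes' rule. The sketch below reproduces it under two assumptions that the text does not state: 80% statistical power, and a two-sided test at alpha = 0.05 in which only the improvement tail (alpha / 2 = 0.025) counts as a win. The function name `false_positive_risk` is illustrative.

```python
def false_positive_risk(prior_success: float,
                        alpha: float = 0.05,
                        power: float = 0.80,
                        two_sided: bool = True) -> float:
    """Probability that a significant 'win' is really a null effect.

    prior_success: share of past experiments that truly improved the metric.
    alpha, power:  assumed test settings (not given in the source text).
    """
    # In a two-sided test, only half of alpha lands in the "improvement" tail.
    tail_alpha = alpha / 2 if two_sided else alpha
    false_wins = tail_alpha * (1 - prior_success)  # null ideas that look like wins
    true_wins = power * prior_success              # real wins that get detected
    return false_wins / (false_wins + true_wins)


risk = false_positive_risk(prior_success=0.08)
print(f"{risk:.1%}")  # prints 26.4%
```

With a 50% base success rate the same calculation gives about 3%, which is why a low historical win rate calls for more caution with each individual "win".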
Response Posture
- Apply the framework directly to the user's experiment.
- Never mention the repository, skills, SKILL.md, patterns, or references.
- Do not run tools or read files; answer from the framework.
- Avoid process/meta commentary; respond as an experimentation lead.
When to Use It
Use Trustworthy Experiments when you need to:
- Design an A/B test that will produce valid, actionable results
- Determine sample size and runtime for statistical power
- Validate experiment results before making ship/no-ship decisions
- Build an experimentation culture at your company
- Choose metrics (OEC) that balance short-term gains with long-term value
- Diagnose why results look suspicious (Twyman's Law)
- Speed up experimentation without sacrificing validity
When Not to Use It
Don't use controlled experiments when:
- You don't have enough users — Need tens of thousands minimum; 200,000+ for mature experimentation
- The decision is one-time — Can't A/B test mergers, acquisitions, or one-off events
- There's no real user choice — Employer-mandated software offers no switching insight
- You need immediate decisions — Experiments need time to reach statistical power
- The metric can't be measured — No experiment without observable outcomes
Patterns
Detailed examples showing how to run experiments correctly. Each pattern shows a common mistake and the correct approach.
Critical (get these wrong and you've wasted your time)
| Pattern | What It Teaches |
|---|---|
| peeking-at-results | Don't check P-values daily — let experiments run to completion |
| sample-ratio-mismatch | If your 50/50 split is off, your results are invalid |
| underpowered-tests | Too few users = meaningless results, even if "significant" |
| wrong-success-metric | Optimizing the wrong metric can hurt your business |
| twymans-law | If results look too good to be true, they probably are |
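As an illustration of the sample-ratio-mismatch pattern, a 50/50 split can be checked with a chi-square goodness-of-fit test. This is a minimal sketch using only the standard library; the function name, the example counts, and the 0.001 alert threshold are assumptions chosen for illustration, not values taken from the patterns themselves.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test on a 2-bucket split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level; strict, to avoid false alarms

p = srm_p_value(control_users=50_000, treatment_users=51_200)
if p < SRM_THRESHOLD:
    print(f"SRM detected (p = {p:.6f}): do not trust the results")
else:
    print(f"No SRM detected (p = {p:.3f})")
```

A 1,200-user gap on roughly 100,000 users looks small, yet it is very unlikely under a true 50/50 assignment, so the metric comparison should be set aside until the cause is found.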
High Impact
| Pattern | What It Teaches |
|---|---|
| novelty-effects | Initial lifts often fade — run experiments long enough |
| survivorship-bias | Analyzing only users who stayed skews your results |
| multiple-comparisons | Testing many metrics inflates false positive rate |
| guardrail-metrics | Always monitor what you might be hurting |
| big-redesigns-fail | Ship incrementally — 80% of big bets lose |
| flat-is-not-ship | No significant result means don't ship, not "good enough" |
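The multiple-comparisons pattern can be made concrete with a little arithmetic. The sketch below assumes the metrics are independent and uses the Bonferroni correction, which is one common and conservative choice; the source does not say which correction the pattern recommends, and the function names are illustrative.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance threshold that keeps the overall rate near alpha."""
    return alpha / num_metrics


print(f"{family_wise_error_rate(20):.0%}")  # prints 64%
print(bonferroni_threshold(20))
```

With 20 metrics each tested at 0.05, a spurious "significant" result somewhere is more likely than not, which is the reason to pre-register a small set of decision metrics or tighten the per-metric threshold.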
Medium Impact
| Pattern | What It Teaches |
|---|---|
| institutional-memory | Document learnings or repeat the same mistakes |
| external-validity | Results may not generalize to other contexts |
| variance-reduction | Techniques to get results faster without losing validity |
Deep Dives
Read only when you need extra detail.
- references/trustworthy-experiments-playbook.md: Expanded framework detail, checklists, and examples.
- references/experiment-plan-template.md: Fill-in-the-blanks plan to design and run an A/B test.
Scripts
Optional utilities (no external deps):
- scripts/sample_size.py: Estimate required sample size for a two-variant conversion test.
- scripts/srm_check.py: Check sample ratio mismatch (SRM) for a 2-bucket split.
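The contents of these scripts are not shown here. As an independent sketch of the kind of estimate a sample size utility might produce, the following uses the standard normal-approximation formula for comparing two proportions. The function name, parameters, and defaults (two-sided alpha of 0.05, 80% power) are assumptions, and it needs Python 3.8+ for `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH variant of a two-variant conversion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)      # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical inputs: 5% baseline conversion, hoping to detect a 5% relative lift.
print(sample_size_per_variant(baseline_rate=0.05, relative_lift=0.05))
```

For these hypothetical inputs the estimate comes out at roughly 122,000 users per variant, which illustrates why small lifts on low baseline rates require large amounts of traffic.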
Resources
Book:
- Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu — The definitive guide. All proceeds go to charity.
Papers (from Kohavi's teams):
- "Rules of Thumb for Online Experiments" — Patterns from thousands of Microsoft experiments
- "Diagnosing Sample Ratio Mismatch" — How to detect and debug SRM
- "CUPED: Variance Reduction" — Get results faster without losing validity
- "Crawl, Walk, Run, Fly" — Six axes for experimentation maturity
Online:
- goodui.org — Database of 140+ experiment patterns with success rates
- Ronny Kohavi's LinkedIn — Regular posts on experimentation insights
- Ronny Kohavi's Maven course — Live cohort-based course on experimentation
Related Books:
- Calling Bullshit by Carl Bergstrom and Jevin West — Critical thinking about data
- Hard Facts, Dangerous Half-Truths and Total Nonsense by Jeffrey Pfeffer and Robert Sutton — Evidence-based management