matchup-win-probability-sim
Matchup Win Probability Simulator
Example
Scenario: Yahoo MLB 10-category H2H matchup, Week 8, threshold = 6 of 10. Categories: R, HR, RBI, SB, OBP (hitting); K, ERA, WHIP, QS, SV (pitching). ERA and WHIP are inverse (lower is better).
Inputs (remaining-week projections for both teams; mean = expected output, stddev = uncertainty):
| Cat | Our mean | Our stddev | Opp mean | Opp stddev | Inverse? |
|---|---|---|---|---|---|
| R | 42 | 8 | 38 | 7 | no |
| HR | 12 | 3.5 | 14 | 4 | no |
| RBI | 40 | 9 | 41 | 8 | no |
| SB | 6 | 2.5 | 4 | 2 | no |
| OBP | 0.335 | 0.015 | 0.328 | 0.014 | no |
| K | 55 | 10 | 50 | 9 | no |
| ERA | 3.85 | 0.45 | 4.10 | 0.50 | yes |
| WHIP | 1.22 | 0.08 | 1.28 | 0.09 | yes |
| QS | 4 | 1.5 | 3 | 1.4 | no |
| SV | 2 | 1.2 | 5 | 1.5 | no |
Run both modes with random_seed=42, cat_win_threshold=6, n_simulations=10000:
Monte Carlo output:
- `matchup_win_probability` = 0.612
- `expected_cats_won` = 6.18
- `variance_estimate` = 2.42 (variance of cats-won count)
- `per_cat_win_probability`: R 0.64, HR 0.35, RBI 0.47, SB 0.72, OBP 0.64, K 0.64, ERA 0.65, WHIP 0.68, QS 0.69, SV 0.08
Poisson-binomial output (for comparison, uses the same per-cat win probs as inputs to the PB recurrence):
- `matchup_win_probability` = 0.605
- `expected_cats_won` = 6.18 (exact: sum of per-cat probs)
- `variance_estimate` = 2.11 (variance of sum of independent Bernoullis)
Interpretation: we are a modest favorite (~61%). SV is a hard-punt cat (8% win probability). HR is contested, but we lean toward losing it (35%). The most defensible pushes are SB, QS, WHIP, and ERA, followed by R, OBP, and K, which sit level at 0.64. Downstream, mlb-lineup-optimizer uses win_prob = 0.612 to classify us as a favorite → damp variance.
Workflow
Copy this checklist and track progress:
Matchup Win Probability Simulation Progress:
- [ ] Step 1: Validate inputs and cat_list coverage
- [ ] Step 2: Choose sim_mode (monte_carlo or poisson_binomial)
- [ ] Step 3: Apply inverse-cat handling
- [ ] Step 4: Run simulation with seeded RNG
- [ ] Step 5: Compute per-cat and overall win probabilities
- [ ] Step 6: Emit outputs with variance and optional sim_trace
Step 1: Validate inputs
Confirm every cat in cat_list has an entry in both our_per_cat_projection and opp_per_cat_projection, each with {mean, stddev}. Confirm cat_win_threshold <= len(cat_list). Confirm cat_inverse_list ⊆ cat_list. See resources/template.md.
- Every cat in `cat_list` has both sides' projections
- `stddev > 0` for all cats (zero stddev blocks correct simulation; use a small floor if unknown)
- `cat_win_threshold` in `[1, len(cat_list)]`
- All `cat_inverse_list` entries are valid cat names
- If `sim_mode == "monte_carlo"`, `n_simulations >= 1000` (10k default; 100k for tight confidence)
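The checks above can be sketched as a single function that collects problems instead of failing on the first one. This is a minimal sketch, not the skill's actual validator: `validate_inputs` is a hypothetical name, and the argument names simply mirror the input fields listed in this document.

```python
def validate_inputs(cat_list, our_proj, opp_proj, cat_win_threshold,
                    cat_inverse_list, sim_mode="monte_carlo",
                    n_simulations=10000):
    """Return a list of problem strings; an empty list means the inputs pass."""
    problems = []
    for cat in cat_list:
        for side, proj in (("our", our_proj), ("opp", opp_proj)):
            entry = proj.get(cat)
            if entry is None or "mean" not in entry or "stddev" not in entry:
                problems.append(f"{side} projection for {cat} lacks mean/stddev")
            elif entry["stddev"] <= 0:
                problems.append(f"{side} stddev for {cat} must be > 0")
    if not 1 <= cat_win_threshold <= len(cat_list):
        problems.append("cat_win_threshold outside [1, len(cat_list)]")
    unknown = sorted(set(cat_inverse_list) - set(cat_list))
    if unknown:
        problems.append(f"cat_inverse_list has unknown cats: {unknown}")
    if sim_mode == "monte_carlo" and n_simulations < 1000:
        problems.append("n_simulations should be >= 1000 in monte_carlo mode")
    return problems
```

Returning every problem at once lets the caller fix a malformed payload in one pass.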
Step 2: Choose sim_mode
Default to monte_carlo for operational decisions (full distribution of cats-won). Use poisson_binomial when you need a deterministic, sub-millisecond answer and are willing to assume per-cat independence. See resources/methodology.md.
- `monte_carlo`: when you need `sim_trace` for audit, when distributions are non-normal, or when per-cat correlations are passed
- `poisson_binomial`: when you need a fast closed-form approximation, or when calling this skill inside an inner optimization loop
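The selection rules above reduce to a few boolean checks. The helper below is a hypothetical sketch (neither `choose_sim_mode` nor its flag names come from the skill); it only encodes the criteria stated in this step.

```python
def choose_sim_mode(need_sim_trace=False, non_normal=False,
                    has_correlation=False, inner_loop=False):
    """Pick a sim_mode from the criteria listed in Step 2."""
    # Anything that needs per-sim draws forces Monte Carlo.
    if need_sim_trace or non_normal or has_correlation:
        return "monte_carlo"
    # Thousands of calls inside an optimizer favor the closed form.
    if inner_loop:
        return "poisson_binomial"
    # Default for operational decisions.
    return "monte_carlo"
```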
Step 3: Apply inverse-cat handling
For every cat in cat_inverse_list, flip the comparison: our team wins the cat when our draw is less than the opponent's draw (ERA 3.50 beats ERA 4.20). The cleanest implementation is to negate the margin (our_draw - opp_draw) → (opp_draw - our_draw) for inverse cats before counting the win. See resources/methodology.md.
- Identify inverse cats from `cat_inverse_list`
- Negate margin (or flip comparison) per cat
- Verify the per-cat win prob for a known-inverse cat aligns with intuition
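A minimal sketch of the margin negation, using the ERA figures from this step as a sanity check. `signed_margin` is a hypothetical helper name.

```python
def signed_margin(cat, our_draw, opp_draw, cat_inverse_list):
    """Margin oriented so that a positive value always means we win the cat."""
    margin = our_draw - opp_draw
    if cat in cat_inverse_list:
        margin = -margin  # lower is better, so flip the sign
    return margin


inverse = ["ERA", "WHIP"]
era_margin = signed_margin("ERA", 3.50, 4.20, inverse)  # positive: ERA 3.50 beats 4.20
runs_margin = signed_margin("R", 42, 38, inverse)       # positive: more runs wins
```

Keeping the inputs in natural units and flipping only the margin means the same `margin > 0` test works for every cat.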
Step 4: Run simulation
With random_seed set, draw paired outcomes (one per team per cat per sim) from the configured distribution (normal default) and score each sim.
- Seed the RNG deterministically from `random_seed`
- For each of `n_simulations`:
  - Draw `our_draw[cat] ~ Normal(our_mean, our_stddev)` for every cat
  - Draw `opp_draw[cat] ~ Normal(opp_mean, opp_stddev)` for every cat
  - For each cat, compute margin (negated for inverse)
  - Count `cats_won = sum(margin > 0)` (tie-break rule: exact tie counts as 0.5 or 0; see guardrail #4)
  - Record `matchup_win = (cats_won >= cat_win_threshold)`
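The Monte Carlo loop above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the skill's implementation: `simulate_matchup` is a hypothetical name, every cat is drawn from an independent normal, and the exact sequence of RNG calls (and therefore the seeded result) may differ from the real skill.

```python
import random


def simulate_matchup(cat_list, our_proj, opp_proj, cat_inverse_list,
                     cat_win_threshold, n_simulations=10000,
                     random_seed=None, tie_rule="half"):
    """Return (matchup_win_probability, cats_won_per_sim)."""
    rng = random.Random(random_seed)  # deterministic when a seed is given
    inverse = set(cat_inverse_list)
    tie_credit = 0.5 if tie_rule == "half" else 0.0
    cats_won_per_sim = []
    matchup_wins = 0
    for _ in range(n_simulations):
        our = {c: rng.gauss(our_proj[c]["mean"], our_proj[c]["stddev"])
               for c in cat_list}
        opp = {c: rng.gauss(opp_proj[c]["mean"], opp_proj[c]["stddev"])
               for c in cat_list}
        cats_won = 0.0
        for c in cat_list:
            margin = our[c] - opp[c]
            if c in inverse:
                margin = -margin
            if margin > 0:
                cats_won += 1.0
            elif margin == 0:
                cats_won += tie_credit
        cats_won_per_sim.append(cats_won)
        if cats_won >= cat_win_threshold:
            matchup_wins += 1
    return matchup_wins / n_simulations, cats_won_per_sim
```

With continuous normal draws an exact tie is vanishingly rare, so `tie_rule` matters mainly when a caller swaps in discrete distributions.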
For Poisson-binomial mode instead: compute per-cat win prob p_i = Φ(margin_mean_i / combined_stddev_i) analytically (negate margin for inverse cats), then apply the PB recurrence to get P(sum >= threshold).
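A compact sketch of the Poisson-binomial path: per-cat probabilities from the normal CDF, then the recurrence over cats. The function names are hypothetical; `math.erf` supplies Φ, and the cats are assumed independent as the mode requires.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, Φ(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def per_cat_win_prob(our, opp, inverse=False):
    """Φ(margin_mean / combined_stddev), with the margin negated for inverse cats."""
    margin_mean = our["mean"] - opp["mean"]
    if inverse:
        margin_mean = -margin_mean
    combined_stddev = math.sqrt(our["stddev"] ** 2 + opp["stddev"] ** 2)
    return normal_cdf(margin_mean / combined_stddev)


def poisson_binomial_pmf(probs):
    """pmf[k] = P(exactly k wins) for independent Bernoulli(p_i) cats."""
    pmf = [1.0]  # zero cats processed: P(0 wins) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # lose this cat
            nxt[k + 1] += mass * p       # win this cat
        pmf = nxt
    return pmf


def pb_matchup_win_probability(probs, cat_win_threshold):
    """P(cats won >= threshold)."""
    return sum(poisson_binomial_pmf(probs)[cat_win_threshold:])
```

The recurrence costs O(n²) in the number of cats, which is negligible for 9 or 10 categories.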
Step 5: Compute outputs
- `matchup_win_probability` = mean of `matchup_win` across sims (MC) or closed-form PB result.
- `per_cat_win_probability[cat]` = mean of `margin > 0` per cat (MC) or `Φ(...)` per cat (PB).
- `expected_cats_won` = mean of `cats_won` (MC) or `Σ p_i` (PB).
- `variance_estimate` = variance of `cats_won` across sims (MC) or `Σ p_i(1-p_i)` (PB, since cats are modeled as independent Bernoullis).
See resources/methodology.md for derivation.
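For the Monte Carlo case, the four outputs can be computed from the per-sim win flags in a few lines. This sketch assumes a hypothetical layout, `cat_won[cat]`, holding one entry per sim (1.0 win, 0.5 tie, 0.0 loss), and uses the population variance; whether the skill uses population or sample variance is not stated here, and the two are nearly identical at 10k sims.

```python
from statistics import fmean, pvariance


def aggregate_outputs(cat_won, cat_list, cat_win_threshold):
    """Collapse per-sim win flags into the Step 5 outputs."""
    n_sims = len(cat_won[cat_list[0]])
    cats_won = [sum(cat_won[cat][i] for cat in cat_list) for i in range(n_sims)]
    matchup_won = [1.0 if total >= cat_win_threshold else 0.0
                   for total in cats_won]
    return {
        "matchup_win_probability": fmean(matchup_won),
        "per_cat_win_probability": {cat: fmean(cat_won[cat]) for cat in cat_list},
        "expected_cats_won": fmean(cats_won),
        "variance_estimate": pvariance(cats_won),  # variance of cats-won count
    }
```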
Step 6: Emit outputs and audit trace
Return the full output dict. If return_sim_trace=true, include the first 100 sims as sim_trace for caller-side audit (showing per-cat draws and the final cats_won vector).
- All outputs present: `matchup_win_probability`, `per_cat_win_probability`, `expected_cats_won`, `variance_estimate`
- Optional: `sim_trace` (first 100 sims only, to keep the payload small)
- Cite the `random_seed` used so the caller can reproduce
- Validate using resources/evaluators/rubric_matchup_win_probability_sim.json. Minimum standard: average score 3.5 or above.
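One way to assemble the return value is shown below. `build_output` and `max_trace` are hypothetical names; the keys follow the output list in this document, and the trace is truncated to the first 100 sims.

```python
def build_output(stats, sim_mode, n_simulations, random_seed, tie_rule,
                 sim_trace=None, return_sim_trace=False, max_trace=100):
    """Assemble the output dict, attaching audit metadata and an optional trace."""
    required = ("matchup_win_probability", "per_cat_win_probability",
                "expected_cats_won", "variance_estimate")
    missing = [key for key in required if key not in stats]
    if missing:
        raise KeyError(f"missing outputs: {missing}")
    output = {key: stats[key] for key in required}
    output["meta"] = {
        "sim_mode": sim_mode,
        "n_simulations": n_simulations,
        "random_seed": random_seed,  # lets the caller reproduce the run
        "tie_rule": tie_rule,
    }
    if return_sim_trace and sim_trace is not None:
        output["sim_trace"] = sim_trace[:max_trace]  # keep the payload small
    return output
```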
Common Patterns
Pattern 1: MLB Yahoo 5x5 (10 cats)
- cat_list: `[R, HR, RBI, SB, OBP, K, ERA, WHIP, QS, SV]`
- cat_win_threshold: 6
- cat_inverse_list: `[ERA, WHIP]`
- Ratio cats needing volume weighting: OBP (weight by PA), ERA (weight by IP), WHIP (weight by IP). The caller should pre-compute a stddev that reflects volume: a half-week with 20 IP has larger ratio variance than a full week with 55 IP.
- Typical runtime: 10k sims < 200ms in plain Python
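The volume effect on ratio cats follows from `stddev_ratio ≈ σ_per_obs / sqrt(n_obs)`. The sketch below applies that formula; `ratio_stddev` and `bernoulli_sigma` are hypothetical helpers, and treating each plate appearance as an independent on-base/not-on-base trial is an illustrative assumption, not something the skill prescribes.

```python
import math


def ratio_stddev(sigma_per_obs, n_obs):
    """Approximate stddev of a ratio cat built from n_obs observations."""
    return sigma_per_obs / math.sqrt(n_obs)


def bernoulli_sigma(p):
    """Per-observation stddev if each observation is a 0/1 trial (assumption)."""
    return math.sqrt(p * (1.0 - p))


# Same per-inning spread, different volume: the half-week is noisier.
sigma_per_ip = 1.0  # placeholder value for illustration only
half_week = ratio_stddev(sigma_per_ip, 20)
full_week = ratio_stddev(sigma_per_ip, 55)

# OBP example under the per-PA Bernoulli assumption.
obp_stddev = ratio_stddev(bernoulli_sigma(0.330), 250)
```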
Pattern 2: NBA 9-cat H2H
- cat_list: `[PTS, REB, AST, STL, BLK, 3PM, FG%, FT%, TO]`
- cat_win_threshold: 5
- cat_inverse_list: `[TO]` (turnovers: lower is better)
- Ratio cats: FG%, FT% (volume-weight by FGA, FTA)
- Special: NBA has higher cat-to-cat correlation than MLB (team usage patterns link PTS-AST-3PM); pass a correlation matrix if available
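When a correlation matrix is available, correlated draws can be generated by multiplying independent standard normals by a Cholesky factor. The sketch below shows this for one team's draws using only the standard library. `cholesky` and `correlated_draws` are hypothetical names, the 0.6 correlation is a made-up illustration, and the factorization as written needs a strictly positive-definite matrix (a singular one would divide by zero).

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L @ L^T == matrix (positive-definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def correlated_draws(means, stddevs, lower, rng):
    """One correlated normal draw per cat, given a precomputed Cholesky factor."""
    z = [rng.gauss(0.0, 1.0) for _ in means]
    mixed = [sum(lower[i][k] * z[k] for k in range(i + 1))
             for i in range(len(means))]
    return [m + s * c for m, s, c in zip(means, stddevs, mixed)]


# Illustrative two-cat case (e.g., PTS and AST) with an assumed 0.6 correlation.
factor = cholesky([[1.0, 0.6], [0.6, 1.0]])
pts, ast = correlated_draws([110.0, 25.0], [12.0, 4.0], factor, random.Random(42))
```

Computing the factor once and reusing it across sims avoids repeating the O(n³) factorization inside the simulation loop.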
Pattern 3: NHL 10-cat or similar
- cat_list example: `[G, A, +/-, PIM, PPP, SOG, W, GAA, SV%, SO]`
- cat_win_threshold: 6
- cat_inverse_list: `[GAA]` (goals against average)
- Ratio cats: SV% (volume-weight by SA), GAA (volume-weight by games played)
Pattern 4: Mid-week live matchup (partial week elapsed)
- Caller should pass `{mean, stddev}` reflecting remaining-week output plus current running total. That is, `mean = running_total + expected_remaining`; `stddev` shrinks as less time remains.
- `cat_position` from `mlb-category-state-analyzer` feeds directly: locked-in cats should have near-zero stddev and a mean far outside the opponent's distribution.
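For counting cats, the mid-week projection can be built as below. `live_counting_projection` is a hypothetical helper. Scaling the full-week stddev by `sqrt(fraction_remaining)` is an assumption (it treats daily output as independent with equal variance); the source only says that stddev shrinks as less time remains. Ratio cats need a volume-weighted recombination instead and are not covered by this sketch.

```python
import math


def live_counting_projection(running_total, expected_remaining,
                             full_week_stddev, fraction_remaining,
                             stddev_floor=1e-6):
    """Build {mean, stddev} for a counting cat part-way through the week."""
    if not 0.0 <= fraction_remaining <= 1.0:
        raise ValueError("fraction_remaining must be in [0, 1]")
    # Only the remaining portion of the week is still uncertain.
    stddev = full_week_stddev * math.sqrt(fraction_remaining)
    return {
        "mean": running_total + expected_remaining,
        "stddev": max(stddev, stddev_floor),  # keep stddev > 0 for validation
    }
```

The floor keeps a fully locked-in cat from violating the `stddev > 0` input check.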
Guardrails
1. **Reproducibility requires a seed.** Without `random_seed`, two calls with identical inputs will return slightly different probabilities (Monte Carlo error). For audit logs and unit tests, always pass a seed. The Poisson-binomial mode is deterministic regardless.
2. **Monte Carlo standard error.** With `n_simulations = N` and true probability `p`, the standard error is `sqrt(p(1-p)/N)`. For `N=10000` and `p=0.5`, SE ≈ 0.005. If the caller needs 3-decimal precision, use `N >= 100000` (SE ≈ 0.0016 at `p=0.5`).
3. **Inverse cats: negate the margin, not the mean.** A common bug is to negate `mean` at input time, which causes the Poisson-binomial `Φ` computation to flip sign but breaks the stddev interpretation. Preferred: keep inputs in their natural units (ERA = 3.85 stays 3.85) and negate the computed margin at comparison time. See resources/methodology.md.
4. **Tie-break convention must be stated.** If `our_draw == opp_draw` for a cat in a given sim, the convention is either `cats_won += 0.5` for both sides (H2H Cats "ties count as half wins" style) or `cats_won += 0` (strict majority). Default: `tie_rule = "half"` to match Yahoo H2H behavior. Document which rule is in effect.
5. **Normal distribution assumption breaks for extreme counting cats.** Saves and home runs are low-count discrete quantities; a normal approximation puts non-trivial mass on negative values. For low-mean counting cats (`mean < 5`), the caller can specify `distribution_family = "poisson"` per cat. Monte Carlo handles this directly; Poisson-binomial does not, because it needs a `Φ`-based per-cat probability, so in that mode compute the probability from the normal approximation to the Poisson with a continuity correction.
6. **Ratio cats need volume weighting.** OBP, ERA, WHIP are ratios (weighted aggregates over PAs or IP). The stddev of the ratio depends on the volume of observations: few PAs → wide stddev. The caller is responsible for supplying a volume-adjusted stddev (see methodology for the formula `stddev_ratio ≈ σ_per_obs / sqrt(n_obs)`). This skill treats the supplied stddev as truth.
7. **Independence assumption is a simplification.** OBP and R are correlated (on-base runners generate runs). The default Monte Carlo assumes independence across cats. If the caller passes a `cat_correlation_matrix` (positive semi-definite, dimension equal to `len(cat_list)`), Monte Carlo uses it via Cholesky decomposition of the combined covariance. Poisson-binomial cannot accept correlation (the whole point of PB is independent Bernoullis).
8. **Threshold must match the league format.** `cat_win_threshold = 6` for 10-cat MLB (strict majority), `5` for 9-cat NBA, etc. Passing the wrong threshold silently produces a plausible-looking but wrong `matchup_win_probability`. Always confirm the league's tie-break rules for the overall matchup too (some leagues count ties as half-wins in the aggregate count).
9. **Don't aggregate across distinct matchups.** A single-call output answers "this week vs this opponent." Combining weekly win probabilities into a season-long playoff probability is a downstream caller's job (`mlb-playoff-planner`).
10. **Document the variance estimate's meaning.** `variance_estimate` is the variance of the cats-won count (range `0..len(cat_list)`). It is NOT the variance of `matchup_win_probability`; that quantity is the squared MC standard error, `p(1-p)/N` with `N = n_simulations`. Both are useful; label them clearly if returning both.
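The standard-error guardrail is easy to check numerically. The two helpers below are hypothetical names for the formula `sqrt(p(1-p)/N)` and its inversion to find how many sims a target SE requires.

```python
import math


def mc_standard_error(p, n_simulations):
    """Standard error of a Monte Carlo probability estimate."""
    return math.sqrt(p * (1.0 - p) / n_simulations)


def sims_needed(p, target_se):
    """Smallest n_simulations whose standard error is at most target_se."""
    return math.ceil(p * (1.0 - p) / target_se ** 2)


se_default = mc_standard_error(0.5, 10_000)   # about 0.005
se_large = mc_standard_error(0.5, 100_000)    # about 0.0016
```

Because `p = 0.5` maximizes `p(1-p)`, sizing `n_simulations` at `p = 0.5` gives a conservative bound for any matchup.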
Quick Reference
Core formulas:

    Monte Carlo (per sim):
      For each cat c:
        our_draw[c] ~ Normal(our_mean[c], our_stddev[c])
        opp_draw[c] ~ Normal(opp_mean[c], opp_stddev[c])
        margin[c] = our_draw[c] - opp_draw[c]
        if c in cat_inverse_list: margin[c] *= -1
        cat_won[c] = (margin[c] > 0)   # or 0.5 if exact tie and tie_rule="half"
      cats_won = sum(cat_won across cats)
      matchup_won = (cats_won >= cat_win_threshold)

    Monte Carlo aggregate (over N sims):
      matchup_win_probability = mean(matchup_won)
      per_cat_win_probability[c] = mean(cat_won[c])
      expected_cats_won = mean(cats_won)
      variance_estimate = var(cats_won)

    Poisson-Binomial (closed form):
      combined_stddev[c] = sqrt(our_stddev[c]^2 + opp_stddev[c]^2)
      margin_mean[c] = our_mean[c] - opp_mean[c]   # negated for inverse cats
      per_cat_win_prob[c] = Φ(margin_mean[c] / combined_stddev[c])
      P(exactly k of N wins) via PB recurrence:
        P_0(0) = 1
        P_i(k) = P_{i-1}(k) * (1 - p_i) + P_{i-1}(k-1) * p_i
      matchup_win_probability = Σ_{k >= threshold} P_N(k)
      expected_cats_won = Σ p_i
      variance_estimate = Σ p_i (1 - p_i)

When to use which mode:
| Need | Use |
|---|---|
| Full distribution of cats-won, audit trace | monte_carlo |
| Sub-millisecond, deterministic, exact PB result | poisson_binomial |
| Per-cat correlations (e.g., OBP-R) | monte_carlo (with correlation matrix) |
| Low-mean counting cats (SV, HR for a short week) | monte_carlo with distribution_family="poisson" |
| Inside an inner optimization loop (thousands of calls) | poisson_binomial |
| Default for weekly strategy | monte_carlo with n_simulations=10000 |
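For the low-mean counting cats row above, a Poisson draw can replace the normal draw inside the Monte Carlo loop. Python's `random` module has no Poisson sampler, so this sketch uses Knuth's multiplication method, which is adequate for the small means in question. `poisson_draw` and `draw_for_cat` are hypothetical names, and the Poisson branch deliberately ignores the supplied stddev because a Poisson's variance equals its mean.

```python
import math
import random


def poisson_draw(mean, rng):
    """Sample a Poisson(mean) count via Knuth's method (suited to small means)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_for_cat(projection, family, rng):
    """Draw one outcome for a cat from its configured distribution family."""
    if family == "poisson":
        return poisson_draw(projection["mean"], rng)  # stddev is not used here
    return rng.gauss(projection["mean"], projection["stddev"])


rng = random.Random(42)
sv_draw = draw_for_cat({"mean": 2, "stddev": 1.2}, "poisson", rng)
```

Unlike the normal draw, the Poisson draw is a non-negative integer, so exact ties become common and the `tie_rule` setting starts to matter.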
Inputs required:
- `our_per_cat_projection`: `dict[cat, {mean: float, stddev: float}]`
- `opp_per_cat_projection`: `dict[cat, {mean: float, stddev: float}]`
- `cat_list`: `list[str]` (category names in canonical order)
- `cat_win_threshold`: `int` (6 for 10-cat, 5 for 9-cat)
- `cat_inverse_list`: `list[str]` (cats where lower is better)
- `n_simulations`: `int` (default 10000)
- `random_seed`: `int` (optional but recommended)
- `sim_mode`: `"monte_carlo"` (default) or `"poisson_binomial"`
- `tie_rule`: `"half"` (default) or `"strict"`
- `cat_correlation_matrix`: optional `float[N][N]`
- `distribution_family`: optional per-cat `dict[cat, "normal"|"poisson"]`
- `return_sim_trace`: `bool` (default false)
Outputs produced:
- `matchup_win_probability`: `float` in `[0, 1]`
- `per_cat_win_probability`: `dict[cat, float]`
- `expected_cats_won`: `float`
- `variance_estimate`: `float` (variance of cats-won count)
- `sim_trace`: optional `list[dict]` (first 100 sims; present only if `return_sim_trace=true`)
- `meta`: `{sim_mode, n_simulations, random_seed, tie_rule}` for audit
Key resources:
- resources/template.md: Input schema, output schema, worked MLB 5x5 example with both modes
- resources/methodology.md: Monte Carlo formalization, Poisson-binomial recurrence, variance derivation, inverse & ratio cat handling, mode-selection criteria
- resources/evaluators/rubric_matchup_win_probability_sim.json: 10 criteria for input-spec correctness, MC accuracy, PB accuracy, inverse-cat handling, ratio-cat volume weighting, threshold application, reproducibility, output completeness, variance estimation, citations
Upstream callers (examples):
- `mlb-category-state-analyzer`: passes remaining-week projections per cat (principles #1, #5, #6 in `frameworks/game-theory-principles.md`)
- Any fantasy `*-category-state-analyzer` equivalent for NBA/NHL
- Any caller that has per-cat `{mean, stddev}` and wants matchup-level win probability