skills/lyndonkl/claude/matchup-win-probability-sim

matchup-win-probability-sim

Installation
SKILL.md

Matchup Win Probability Simulator

Table of Contents

Example

Scenario: Yahoo MLB 10-category H2H matchup, Week 8, threshold = 6 of 10. Categories: R, HR, RBI, SB, OBP (hitting); K, ERA, WHIP, QS, SV (pitching). ERA and WHIP are inverse (lower is better).

Inputs (remaining-week projections for both teams; mean = expected output, stddev = uncertainty):

Cat Our mean Our stddev Opp mean Opp stddev Inverse?
R 42 8 38 7 no
HR 12 3.5 14 4 no
RBI 40 9 41 8 no
SB 6 2.5 4 2 no
OBP 0.335 0.015 0.328 0.014 no
K 55 10 50 9 no
ERA 3.85 0.45 4.10 0.50 yes
WHIP 1.22 0.08 1.28 0.09 yes
QS 4 1.5 3 1.4 no
SV 2 1.2 5 1.5 no

Run both modes with random_seed=42, cat_win_threshold=6, n_simulations=10000:

Monte Carlo output:

  • matchup_win_probability = 0.612
  • expected_cats_won = 6.18
  • variance_estimate = 2.42 (variance of cats-won count)
  • per_cat_win_probability: R 0.64, HR 0.35, RBI 0.47, SB 0.72, OBP 0.64, K 0.64, ERA 0.65, WHIP 0.68, QS 0.69, SV 0.08

Poisson-binomial output (for comparison, uses the same per-cat win probs as inputs to the PB recurrence):

  • matchup_win_probability = 0.605
  • expected_cats_won = 6.18 (exact — sum of per-cat probs)
  • variance_estimate = 2.11 (variance of sum of independent Bernoullis)

Interpretation: we are a modest favorite (~61%). SV is a hard-punt cat (8% win). HR is contested but we lean losing. The six most defensible pushes are SB, QS, WHIP, ERA, R, K (and OBP). Downstream mlb-lineup-optimizer uses win_prob = 0.612 to classify us as a favorite → damp variance.

Workflow

Copy this checklist and track progress:

Matchup Win Probability Simulation Progress:
- [ ] Step 1: Validate inputs and cat_list coverage
- [ ] Step 2: Choose sim_mode (monte_carlo or poisson_binomial)
- [ ] Step 3: Apply inverse-cat handling
- [ ] Step 4: Run simulation with seeded RNG
- [ ] Step 5: Compute per-cat and overall win probabilities
- [ ] Step 6: Emit outputs with variance and optional sim_trace

Step 1: Validate inputs

Confirm every cat in cat_list has an entry in both our_per_cat_projection and opp_per_cat_projection, each with {mean, stddev}. Confirm cat_win_threshold <= len(cat_list). Confirm cat_inverse_list ⊆ cat_list. See resources/template.md.

  • Every cat in cat_list has both sides' projections
  • stddev > 0 for all cats (zero stddev blocks correct simulation; use small floor if unknown)
  • cat_win_threshold in [1, len(cat_list)]
  • All cat_inverse_list entries are valid cat names
  • If sim_mode == "monte_carlo", n_simulations >= 1000 (10k default; 100k for tight confidence)

Step 2: Choose sim_mode

Default to monte_carlo for operational decisions (full distribution of cats-won). Use poisson_binomial when you need a deterministic, sub-millisecond answer and are willing to assume per-cat independence. See resources/methodology.md.

  • monte_carlo: when you need sim_trace for audit, when distributions are non-normal, or when per-cat correlations are passed
  • poisson_binomial: when you need a fast closed-form approximation, or when calling this skill inside an inner optimization loop

Step 3: Apply inverse-cat handling

For every cat in cat_inverse_list, flip the comparison: our team wins the cat when our draw is less than the opponent's draw (ERA 3.50 beats ERA 4.20). The cleanest implementation is to negate the margin (our_draw - opp_draw) → (opp_draw - our_draw) for inverse cats before counting the win. See resources/methodology.md.

  • Identify inverse cats from cat_inverse_list
  • Negate margin (or flip comparison) per cat
  • Verify the per-cat win prob for a known-inverse cat aligns with intuition

Step 4: Run simulation

With random_seed set, draw paired outcomes (one per team per cat per sim) from the configured distribution (normal default) and score each sim.

  • Seed the RNG deterministically from random_seed
  • For each of n_simulations:
    • Draw our_draw[cat] ~ Normal(our_mean, our_stddev) for every cat
    • Draw opp_draw[cat] ~ Normal(opp_mean, opp_stddev) for every cat
    • For each cat, compute margin (negated for inverse)
    • Count cats_won = sum(margin > 0) (tie-break rule: exact tie counts as 0.5 or 0; see guardrail #4)
    • Record matchup_win = (cats_won >= cat_win_threshold)

For Poisson-binomial mode instead: compute per-cat win prob p_i = Φ(margin_mean_i / combined_stddev_i) analytically (negate margin for inverse cats), then apply the PB recurrence to get P(sum >= threshold).

Step 5: Compute outputs

  • matchup_win_probability = mean of matchup_win across sims (MC) or closed-form PB result.
  • per_cat_win_probability[cat] = mean of margin > 0 per cat (MC) or Φ(...) per cat (PB).
  • expected_cats_won = mean of cats_won (MC) or Σ p_i (PB).
  • variance_estimate = variance of cats_won across sims (MC) or Σ p_i(1-p_i) (PB, since cats are modeled as independent Bernoullis).

See resources/methodology.md for derivation.

Step 6: Emit outputs and audit trace

Return the full output dict. If return_sim_trace=true, include the first 100 sims as sim_trace for caller-side audit (showing per-cat draws and the final cats_won vector).

  • All outputs present: matchup_win_probability, per_cat_win_probability, expected_cats_won, variance_estimate
  • Optional: sim_trace (first 100 sims only — keep payload small)
  • Cite random_seed used so the caller can reproduce
  • Validate using resources/evaluators/rubric_matchup_win_probability_sim.json. Minimum standard: average score 3.5 or above.

Common Patterns

Pattern 1: MLB Yahoo 5x5 (10 cats)

  • cat_list: [R, HR, RBI, SB, OBP, K, ERA, WHIP, QS, SV]
  • cat_win_threshold: 6
  • cat_inverse_list: [ERA, WHIP]
  • Ratio cats needing volume weighting: OBP (weight by PA), ERA (weight by IP), WHIP (weight by IP). Caller should pre-compute stddev that reflects volume — a half-week with 20 IP has larger ratio variance than a full week with 55 IP.
  • Typical runtime: 10k sims < 200ms in plain Python

Pattern 2: NBA 9-cat H2H

  • cat_list: [PTS, REB, AST, STL, BLK, 3PM, FG%, FT%, TO]
  • cat_win_threshold: 5
  • cat_inverse_list: [TO] (turnovers — lower is better)
  • Ratio cats: FG%, FT% (volume-weight by FGA, FTA)
  • Special: NBA has higher cat-to-cat correlation than MLB (team usage patterns link PTS-AST-3PM); pass correlation matrix if available

Pattern 3: NHL 10-cat or similar

  • cat_list example: [G, A, +/-, PIM, PPP, SOG, W, GAA, SV%, SO]
  • cat_win_threshold: 6
  • cat_inverse_list: [GAA] (goals against average)
  • Ratio cats: SV% (volume-weight by SA), GAA (volume-weight by games played)

Pattern 4: Mid-week live matchup (partial week elapsed)

  • Caller should pass {mean, stddev} reflecting remaining-week output plus current running total. That is, mean = running_total + expected_remaining; stddev shrinks as less time remains.
  • cat_position from mlb-category-state-analyzer feeds directly: locked-in cats should have near-zero stddev and a mean far outside the opponent's distribution.

Guardrails

  1. Reproducibility requires a seed. Without random_seed, two calls with identical inputs will return slightly different probabilities (Monte Carlo error). For audit logs and unit tests, always pass a seed. The Poisson-binomial mode is deterministic regardless.

  2. Monte Carlo standard error. With n_simulations = N and true probability p, the standard error is sqrt(p(1-p)/N). For N=10000 and p=0.5, SE ≈ 0.005. If the caller needs 3-decimal precision, use N >= 100000.

  3. Inverse cats: negate the margin, not the mean. A common bug is to negate mean at input time, which causes the Poisson-binomial Φ computation to flip sign but breaks the stddev interpretation. Preferred: keep inputs in their natural units (ERA = 3.85 stays 3.85) and negate the computed margin at comparison time. See resources/methodology.md.

  4. Tie-break convention must be stated. If our_draw == opp_draw for a cat in a given sim, the convention is cats_won += 0.5 for both sides (H2H Cats "ties count as half wins" style) OR cats_won += 0 (strict majority). Default: tie_rule = "half" to match Yahoo H2H behavior. Document which rule is in effect.

  5. Normal distribution assumption breaks for extreme counting cats. Saves and Home Runs are low-count discrete quantities; a normal approximation puts non-trivial mass on negative values. For low-mean counting cats (mean < 5), the caller can specify distribution_family = "poisson" per cat; Monte Carlo handles this, Poisson-binomial does not (since PB needs a Φ-based per-cat prob — compute it from the Poisson-normal approximation with continuity correction).

  6. Ratio cats need volume weighting. OBP, ERA, WHIP are ratios (weighted aggregates over PAs or IP). The stddev of the ratio depends on the volume of observations: few PAs → wide stddev. The caller is responsible for supplying a volume-adjusted stddev (see methodology for the formula stddev_ratio ≈ σ_per_obs / sqrt(n_obs)). This skill treats the supplied stddev as truth.

  7. Independence assumption is a simplification. OBP and R are correlated (on-base runners generate runs). The default Monte Carlo assumes independence across cats. If the caller passes a cat_correlation_matrix (positive semi-definite, dimension equal to len(cat_list)), Monte Carlo uses it via Cholesky decomposition of the combined covariance. Poisson-binomial cannot accept correlation (the whole point of PB is independent Bernoullis).

  8. Threshold must match the league format. cat_win_threshold = 6 for 10-cat MLB (strict majority), 5 for 9-cat NBA, etc. Passing the wrong threshold silently produces a meaningful but wrong matchup_win_probability. Always confirm the league's tie-break rules for the overall matchup too (some leagues award ties for half-wins in the aggregate count).

  9. Don't aggregate across distinct matchups. A single-call output answers "this week vs this opponent." Weighting a season-long playoff-probability from weekly win probs is a downstream caller's job (mlb-playoff-planner).

  10. Document the variance estimate's meaning. variance_estimate is the variance of the cats-won count (range 0..N). It is NOT the variance of matchup_win_probability. The latter is the MC standard-error variance p(1-p)/N. Both are useful; label them clearly if returning both.

Quick Reference

Core formulas:

Monte Carlo (per sim):
  For each cat c:
    our_draw[c]  ~ Normal(our_mean[c],  our_stddev[c])
    opp_draw[c]  ~ Normal(opp_mean[c],  opp_stddev[c])
    margin[c]    = our_draw[c] - opp_draw[c]
    if c in cat_inverse_list: margin[c] *= -1
    cat_won[c]   = (margin[c] > 0)     # or 0.5 if exact tie and tie_rule="half"
  cats_won       = sum(cat_won across cats)
  matchup_won    = (cats_won >= cat_win_threshold)

Monte Carlo aggregate (over N sims):
  matchup_win_probability    = mean(matchup_won)
  per_cat_win_probability[c] = mean(cat_won[c])
  expected_cats_won          = mean(cats_won)
  variance_estimate          = var(cats_won)

Poisson-Binomial (closed form):
  combined_stddev[c] = sqrt(our_stddev[c]^2 + opp_stddev[c]^2)
  margin_mean[c]     = our_mean[c] - opp_mean[c]    # negated for inverse cats
  per_cat_win_prob[c] = Φ(margin_mean[c] / combined_stddev[c])
  P(exactly k of N wins) via PB recurrence:
    P_0(0) = 1
    P_i(k) = P_{i-1}(k) * (1 - p_i) + P_{i-1}(k-1) * p_i
  matchup_win_probability = Σ_{k >= threshold} P_N(k)
  expected_cats_won       = Σ p_i
  variance_estimate       = Σ p_i (1 - p_i)

When to use which mode:

Need Use
Full distribution of cats-won, audit trace monte_carlo
Sub-millisecond, deterministic, exact PB result poisson_binomial
Per-cat correlations (e.g., OBP-R) monte_carlo (with correlation matrix)
Low-mean counting cats (SV, HR for a short week) monte_carlo with distribution_family="poisson"
Inside an inner optimization loop (thousands of calls) poisson_binomial
Default for weekly strategy monte_carlo with n_simulations=10000

Inputs required:

  • our_per_cat_projection: dict[cat, {mean: float, stddev: float}]
  • opp_per_cat_projection: dict[cat, {mean: float, stddev: float}]
  • cat_list: list[str] — category names in canonical order
  • cat_win_threshold: int (6 for 10-cat, 5 for 9-cat)
  • cat_inverse_list: list[str] — cats where lower is better
  • n_simulations: int (default 10000)
  • random_seed: int (optional but recommended)
  • sim_mode: "monte_carlo" (default) or "poisson_binomial"
  • tie_rule: "half" (default) or "strict"
  • cat_correlation_matrix: optional float[N][N]
  • distribution_family: optional per-cat dict[cat, "normal"|"poisson"]
  • return_sim_trace: bool (default false)

Outputs produced:

  • matchup_win_probability: float in [0, 1]
  • per_cat_win_probability: dict[cat, float]
  • expected_cats_won: float
  • variance_estimate: float (variance of cats-won count)
  • sim_trace: optional list[dict] (first 100 sims; present only if return_sim_trace=true)
  • meta: {sim_mode, n_simulations, random_seed, tie_rule} for audit

Key resources:

Upstream callers (examples):

  • mlb-category-state-analyzer — passes remaining-week projections per cat (principles #1, #5, #6 in frameworks/game-theory-principles.md)
  • Any fantasy *-category-state-analyzer equivalent for NBA/NHL
  • Any caller that has per-cat {mean, stddev} and wants matchup-level win probability
Weekly Installs
8
Repository
lyndonkl/claude
GitHub Stars
85
First Seen
1 day ago