Statistical Modeling for Biomedical Data Analysis

Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.

Features

  • Linear Regression - OLS for continuous outcomes with diagnostic tests
  • Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
  • Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
  • Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
  • ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
  • Model Diagnostics - Assumption checking, fit statistics, residual analysis
  • Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.

When to Use

Apply this skill when the user asks questions such as:

  • "What is the odds ratio of X associated with Y?"
  • "What is the hazard ratio for treatment?"
  • "Fit a linear regression of Y on X1, X2, X3"
  • "Perform ordinal logistic regression for severity outcome"
  • "What is the Kaplan-Meier survival estimate at time T?"
  • "What is the percentage reduction in odds ratio after adjusting for confounders?"
  • "Run a mixed-effects model with random intercepts"
  • "Compute the interaction term between A and B"
  • "What is the F-statistic from ANOVA comparing groups?"
  • "Test if gene/miRNA expression differs across cell types"

Model Selection Decision Tree

START: What type of outcome variable?
|
+-- CONTINUOUS (height, blood pressure, score)
|   +-- Independent observations -> Linear Regression (OLS)
|   +-- Repeated measures -> Mixed-Effects Model (LMM)
|
+-- COUNT (number of events, read counts)
|   +-- Poisson regression; Negative Binomial if overdispersed
|
+-- BINARY (yes/no, disease/healthy)
|   +-- Independent observations -> Logistic Regression
|   +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
|   +-- Rare events -> Firth logistic regression
|
+-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
|   +-- Ordinal Logistic Regression (Proportional Odds)
|
+-- MULTINOMIAL (>2 unordered categories)
|   +-- Multinomial Logistic Regression
|
+-- TIME-TO-EVENT (survival time + censoring)
    +-- Regression -> Cox Proportional Hazards
    +-- Survival curves -> Kaplan-Meier

Workflow

Phase 0: Data Validation

Goal: Load data, identify variable types, check for missing values.

CRITICAL: Identify the Outcome Variable First

Before any analysis, verify what you're actually predicting:

  1. Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
  2. Examine available columns - List all columns in the dataset
  3. Match question to data - Find the column that matches the described outcome
  4. Verify outcome exists - Don't create outcome variables from predictors

Common mistake: the question mentions "obesity", so the outcome is assumed to be BMI >= 30 and derived from the BMI column, which is circular if BMI is also a predictor. Always check the data columns first: print(df.columns.tolist())

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

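# Rough variable typing; nunique() ignores missing values
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and not pd.api.types.is_numeric_dtype(df[col]):
        print(f"{col}: categorical ({n_unique} levels)")
    elif pd.api.types.is_numeric_dtype(df[col]):
        # Covers all numeric dtypes, not only float64/int64
        print(f"{col}: continuous (mean={df[col].mean():.2f})")
    else:
        print(f"{col}: other ({n_unique} unique values)")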
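The outcome-identification steps above can be sketched as a small helper that lists columns whose names match keywords taken from the question. The function name `find_outcome_candidates`, the keyword list, and the toy data are hypothetical and only illustrate the idea; a match is a candidate to inspect, not a confirmed outcome.

```python
import pandas as pd


def find_outcome_candidates(df: pd.DataFrame, keywords: list[str]) -> list[str]:
    """Return column names containing any keyword (case-insensitive substring match)."""
    lowered = [k.lower() for k in keywords]
    return [c for c in df.columns if any(k in str(c).lower() for k in lowered)]


# Toy data, made up for illustration
toy = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "bmi": [22.5, 31.0, 27.3],
    "obesity_status": [0, 1, 0],
})

candidates = find_outcome_candidates(toy, ["obesity", "obese"])
print(candidates)
```

If the list is empty, that is a cue to re-read the question and the column names rather than to derive an outcome from a predictor.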

Phase 1: Model Fitting

Goal: Fit appropriate model based on outcome type.

Use the decision tree above to select model type, then refer to the appropriate reference file for detailed code:

  • Linear Regression: references/linear_models.md
  • Logistic Regression (binary): references/logistic_regression.md
  • Ordinal Logistic: references/ordinal_logistic.md
  • Cox Proportional Hazards: references/cox_regression.md
  • ANOVA / Statistical Tests: anova_and_tests.md

Quick reference for key models:

import statsmodels.formula.api as smf
import numpy as np

# Linear regression
model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()

# Logistic regression (odds ratios)
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
ors = np.exp(model.params)
ci = np.exp(model.conf_int())

# Cox proportional hazards
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
hr = cph.hazard_ratios_['treatment']

Phase 1b: ANOVA for Multi-Feature Data

When the data contain multiple features (genes, miRNAs, metabolites), fit a separate ANOVA for each feature rather than a single aggregate ANOVA across all features. This is the most common pattern in genomics.

See anova_and_tests.md for the full decision tree, both methods, and worked examples.

Default for gene expression data: Per-feature ANOVA (Method B).
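As a rough illustration of the per-feature approach, the sketch below runs a one-way ANOVA for each column of a simulated expression matrix and then applies a Benjamini-Hochberg correction. It assumes `scipy.stats.f_oneway` and statsmodels' `multipletests`; the gene names, group labels, and data are invented, and this is not necessarily identical to the method documented in anova_and_tests.md.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Simulated data: 18 samples (6 per cell type) x 4 genes
groups = np.repeat(["A", "B", "C"], 6)
expr = pd.DataFrame(
    rng.normal(5, 1, size=(18, 4)),
    columns=["gene1", "gene2", "gene3", "gene4"],
)
expr.loc[groups == "C", "gene1"] += 5  # make gene1 differ in group C

# One ANOVA per feature, one F-statistic and p-value per feature
rows = []
for feature in expr.columns:
    samples = [expr.loc[groups == g, feature].to_numpy() for g in np.unique(groups)]
    f_stat, p_val = stats.f_oneway(*samples)
    rows.append({"feature": feature, "F": f_stat, "p": p_val})

anova_table = pd.DataFrame(rows)
# Multiple-testing correction across features (Benjamini-Hochberg FDR)
anova_table["p_adj"] = multipletests(anova_table["p"], method="fdr_bh")[1]
print(anova_table)
```

Each row of `anova_table` answers a question about one feature; summary questions (for example, the median F-statistic) are then computed over the `F` column.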

Phase 2: Model Diagnostics

Goal: Check model assumptions and fit quality.

Key diagnostics by model type:

  • OLS: Shapiro-Wilk (normality), Breusch-Pagan (heteroscedasticity), VIF (multicollinearity)
  • Cox: Proportional hazards test via cph.check_assumptions(df), passing the same DataFrame used for fitting
  • Logistic: Hosmer-Lemeshow (calibration), ROC/AUC (discrimination)

See references/troubleshooting.md for diagnostic code and common issues.
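For orientation, the three OLS diagnostics listed above might be run as follows. This is a sketch on simulated data, assuming `scipy.stats.shapiro`, statsmodels' `het_breuschpagan`, and `variance_inflation_factor`; the variable names are illustrative and the code is not copied from references/troubleshooting.md.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
sim = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
sim["y"] = 1 + 2 * sim["x1"] - sim["x2"] + rng.normal(size=200)

ols_fit = smf.ols("y ~ x1 + x2", data=sim).fit()

# Normality of residuals (Shapiro-Wilk)
shapiro_stat, shapiro_p = stats.shapiro(ols_fit.resid)

# Heteroscedasticity (Breusch-Pagan); exog must include the intercept column
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(ols_fit.resid, ols_fit.model.exog)

# Multicollinearity (VIF per predictor, skipping the intercept at index 0)
exog = ols_fit.model.exog
names = ols_fit.model.exog_names
vif = {names[i]: variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])}

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Breusch-Pagan p = {bp_lm_p:.3f}")
print(vif)
```

Small p-values from the first two tests suggest a violated assumption; a common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity.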

Phase 3: Interpretation

Goal: Generate publication-quality summary.

For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.
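One way to assemble those reporting elements for a logistic model is sketched below. The data are simulated, the column names are hypothetical, and the four-decimal rounding is only an example; match the precision the question requests.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
sim = pd.DataFrame({
    "exposure": rng.integers(0, 2, n),
    "age": rng.normal(50, 10, n),
})
linear_pred = -3 + 0.8 * sim["exposure"] + 0.04 * sim["age"]
prob = (1 / (1 + np.exp(-linear_pred))).to_numpy()
sim["disease"] = rng.binomial(1, prob)

fit = smf.logit("disease ~ exposure + age", data=sim).fit(disp=0)

# Effect size, 95% CI, and p-value in one table (exponentiated to the OR scale)
ci = fit.conf_int()
report = pd.DataFrame({
    "OR": np.exp(fit.params),
    "CI_low": np.exp(ci[0]),
    "CI_high": np.exp(ci[1]),
    "p_value": fit.pvalues,
}).round(4)
print(report)

# Model fit statistics
print(f"AIC = {fit.aic:.2f}, pseudo R-squared = {fit.prsquared:.4f}")
```

The exponentiated intercept row is the baseline odds, not an odds ratio, and is usually omitted from the written summary.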

Common BixBench Patterns

| Pattern | Question Type | Key Steps |
|---------|---------------|-----------|
| 1 | Odds ratio from ordinal regression | Fit OrderedModel, exp(coef) |
| 2 | Percentage reduction in OR | Compare crude vs adjusted model |
| 3 | Interaction effects | Fit A * B, extract A:B coef |
| 4 | Hazard ratio | Cox PH model, exp(coef) |
| 5 | Multi-feature ANOVA | Per-feature F-stats (not aggregate) |

See bixbench_patterns_summary.md for solution code for each pattern. See references/bixbench_patterns.md for 15+ detailed question patterns.
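To make patterns 2 and 3 concrete, here is a sketch on simulated data in which age confounds the exposure-disease association. The percentage-reduction formula used, (OR_crude - OR_adjusted) / OR_crude x 100, is an assumption: some questions instead define the reduction on the excess odds ratio (OR - 1), so check the wording. All variable names are illustrative and this is not the code from the pattern files.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2000
age = rng.normal(50, 10, n)
# Older participants are more likely to be exposed, so age is a confounder
exposure = rng.binomial(1, 1 / (1 + np.exp(-(age - 50) / 5)))
linear_pred = -2 + 0.5 * exposure + 0.08 * (age - 50)
disease = rng.binomial(1, 1 / (1 + np.exp(-linear_pred)))
sim = pd.DataFrame({"disease": disease, "exposure": exposure, "age": age})

# Pattern 2: crude vs adjusted odds ratio
crude = smf.logit("disease ~ exposure", data=sim).fit(disp=0)
adjusted = smf.logit("disease ~ exposure + age", data=sim).fit(disp=0)
or_crude = np.exp(crude.params["exposure"])
or_adj = np.exp(adjusted.params["exposure"])
pct_reduction = (or_crude - or_adj) / or_crude * 100  # assumed definition
print(f"Crude OR = {or_crude:.3f}, adjusted OR = {or_adj:.3f}, "
      f"reduction = {pct_reduction:.1f}%")

# Pattern 3: interaction term; 'exposure * age' expands to both main effects
# plus the 'exposure:age' product term
interaction = smf.logit("disease ~ exposure * age", data=sim).fit(disp=0)
interaction_coef = interaction.params["exposure:age"]
interaction_p = interaction.pvalues["exposure:age"]
print(f"Interaction coef = {interaction_coef:.4f}, p = {interaction_p:.4f}")
```

For categorical predictors the interaction term name includes the level, for example `C(group)[T.b]:age`, so it helps to print `interaction.params.index` before extracting it.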

Statsmodels vs Scikit-learn

| Use Case | Library | Reason |
|----------|---------|--------|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Not available in scikit-learn |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |

General rule: Use statsmodels (and lifelines for survival models) for BixBench questions, which typically ask for p-values, ORs, and HRs.

Python Package Requirements

statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0

Key Principles

  1. Data-first approach - Always inspect and validate data before modeling
  2. Model selection by outcome type - Use decision tree above
  3. Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
  4. Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
  5. Confounder awareness - Adjust for confounders when specified or clinically relevant
  6. Reproducible analysis - All code must be deterministic and reproducible
  7. Robust error handling - Graceful handling of convergence failures, separation, collinearity
  8. Round correctly - Match the precision requested (typically 2-4 decimal places)

Completeness Checklist

Before finalizing any statistical analysis:

  • Outcome variable identified: Verified which column is the actual outcome
  • Data validated: N, missing values, variable types confirmed
  • Multi-feature data identified: If multiple features, use per-feature approach
  • Model appropriate: Outcome type matches model family
  • Assumptions checked: Relevant diagnostics performed
  • Effect sizes reported: OR/HR/Cohen's d with CIs
  • P-values reported: With appropriate correction if needed
  • Model fit assessed: R-squared, AIC/BIC, concordance
  • Results interpreted: Plain-language interpretation
  • Precision correct: Numbers rounded appropriately

File Structure

tooluniverse-statistical-modeling/
+-- SKILL.md                          # This file (workflow guide)
+-- QUICK_START.md                    # 8 quick examples
+-- EXAMPLES.md                       # Legacy examples
+-- TOOLS_REFERENCE.md                # ToolUniverse tool catalog
+-- anova_and_tests.md                # ANOVA decision tree and code
+-- bixbench_patterns_summary.md      # Common BixBench solution patterns
+-- test_skill.py                     # Test suite
+-- references/
|   +-- logistic_regression.md        # Detailed logistic examples
|   +-- ordinal_logistic.md           # Ordinal logit guide
|   +-- cox_regression.md             # Survival analysis guide
|   +-- linear_models.md              # OLS and mixed-effects
|   +-- bixbench_patterns.md          # 15+ question patterns
|   +-- troubleshooting.md            # Diagnostic issues
+-- scripts/
    +-- format_statistical_output.py  # Format results for reporting
    +-- model_diagnostics.py          # Automated diagnostics

ToolUniverse Integration

While this skill is primarily computational, ToolUniverse tools can provide data:

| Use Case | Tools |
|----------|-------|
| Clinical trial data | clinical_trials_search |
| Drug safety outcomes | FAERS_calculate_disproportionality |
| Gene-disease associations | OpenTargets_target_disease_evidence |
| Biomarker data | fda_pharmacogenomic_biomarkers |

See TOOLS_REFERENCE.md for complete tool catalog.

Support

For detailed examples and troubleshooting:

  • Logistic regression: references/logistic_regression.md
  • Ordinal models: references/ordinal_logistic.md
  • Survival analysis: references/cox_regression.md
  • Linear/mixed models: references/linear_models.md
  • BixBench patterns: references/bixbench_patterns.md
  • ANOVA and tests: anova_and_tests.md
  • Diagnostics: references/troubleshooting.md