# Evaluating ML Models
Use this skill for rigorously assessing model performance, comparing alternatives, diagnosing issues, and optimizing hyperparameters.
## When to use this skill
- Model training complete — need systematic performance assessment
- Comparing multiple models/algorithms — statistical model comparison
- Diagnosing overfitting/underfitting — bias-variance analysis
- Hyperparameter tuning — finding optimal configurations
- Selecting appropriate metrics — matching metrics to business objectives
- Experiment tracking — reproducible experimentation
- Production readiness check — validation before deployment
## When NOT to use this skill
- Feature engineering and preprocessing → use @engineering-ml-features
- Exploratory data analysis → use @analyzing-data
- Building interactive data apps → use @building-data-apps
- Notebook setup and workflows → use @working-in-notebooks
## Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series, grouped |
| Classification metrics | sklearn.metrics | Accuracy, precision, recall, F1, ROC-AUC, PR-AUC |
| Regression metrics | sklearn.metrics | MAE, RMSE, R², MAPE |
| Hyperparameter tuning | Optuna | Bayesian optimization with pruning |
| Distributed tuning | Ray Tune | Large-scale distributed search |
| Experiment tracking | MLflow | Open-source, model registry |
| Cloud experiment tracking | Weights & Biases | Collaboration-focused |
| Model comparison | scipy.stats | Paired t-test, McNemar's test |
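The scipy.stats row above can be applied to per-fold CV scores. A minimal sketch, assuming both models are scored on identical folds (the dataset and the two models are illustrative placeholders):

```python
# Compare two models on the same CV folds with a paired t-test.
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='roc_auc')
scores_b = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring='roc_auc')

# Paired t-test: folds are matched, so test the per-fold score differences
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"A: {scores_a.mean():.3f}, B: {scores_b.mean():.3f}, p={p_value:.3f}")
```

Fixing the fold assignment (shared `cv` object with a seed) is what makes the pairing valid; scoring each model on independently shuffled folds would break the paired test.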
## Evaluation workflow
### 1. Choose cross-validation strategy
Match CV to your data characteristics:
| Data Type | CV Strategy | Implementation |
|---|---|---|
| Standard tabular | K-Fold | KFold(n_splits=5, shuffle=True) |
| Classification (imbalanced) | Stratified K-Fold | StratifiedKFold(n_splits=5) |
| Time series | Time Series Split | TimeSeriesSplit(n_splits=5) |
| Grouped/clustered | Group K-Fold | GroupKFold(n_splits=5) |
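The strategies in the table can be instantiated as follows; the tiny synthetic arrays exist only to make the sketch runnable:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit, GroupKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)
groups = np.repeat(np.arange(5), 4)  # e.g. 5 patients, 4 samples each

for cv in (KFold(n_splits=5, shuffle=True, random_state=42),
           StratifiedKFold(n_splits=5),
           TimeSeriesSplit(n_splits=5)):
    n = sum(1 for _ in cv.split(X, y))
    print(type(cv).__name__, n, "splits")

# GroupKFold keeps every sample from a group in the same fold,
# so no group leaks from train into test
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    assert len(set(groups[train_idx]) & set(groups[test_idx])) == 0
```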
### 2. Select appropriate metrics
Classification:
- Balanced: Accuracy
- Imbalanced: F1, Precision-Recall, ROC-AUC
- Cost-sensitive: Custom business metrics
Regression:
- Scale-independent: MAPE, R²
- Error magnitude: MAE (robust), RMSE (penalizes large errors)
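A cost-sensitive business metric from the list above can be wrapped as an sklearn scorer. A sketch, in which the 10:1 false-negative-to-false-positive cost ratio is an illustrative assumption:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    # confusion_matrix().ravel() returns (tn, fp, fn, tp) for binary labels
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost

# Lower cost is better, so tell sklearn not to maximize it;
# cost_scorer can then be passed as scoring= to cross_val_score or GridSearchCV
cost_scorer = make_scorer(business_cost, greater_is_better=False)

print(business_cost([0, 1, 1, 0], [0, 1, 0, 1]))  # one FN (10) + one FP (1) = 11
```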
### 3. Analyze performance
- Cross-validation mean ± std (estimate variance)
- Validation curves (bias-variance tradeoff)
- Learning curves (data sufficiency)
- Error analysis by segment
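The learning-curve check above can be sketched with sklearn's `learning_curve`; the synthetic dataset and logistic regression model are illustrative stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, random_state=42)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A persistent train/validation gap suggests overfitting;
    # both curves plateauing low suggests underfitting;
    # a still-rising validation curve suggests more data would help.
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```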
### 4. Hyperparameter optimization
Use Bayesian optimization for efficient search:
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 200),
        'max_depth': trial.suggest_int('max_depth', 2, 32, log=True),
    }
    model = RandomForestClassifier(**params)
    # Maximize mean CV score across 5 folds
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```
### 5. Track and compare experiments
Log everything for reproducibility:
- Hyperparameters
- Metrics (CV and test)
- Model artifacts
- Dataset versions
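Before wiring up MLflow or W&B, the logging checklist above can be prototyped as a plain append-only JSONL record. This is a minimal sketch and not either tool's API; all names here (`log_experiment`, `config_hash`, `experiments.jsonl`, the dataset path) are illustrative:

```python
import hashlib
import json
import time

def log_experiment(params, metrics, dataset_path, path="experiments.jsonl"):
    record = {
        "timestamp": time.time(),
        "params": params,          # hyperparameters
        "metrics": metrics,        # CV and test metrics
        "dataset": dataset_path,   # dataset version/path
        # Hash the sorted config so identical runs are easy to spot
        "config_hash": hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()).hexdigest()[:12],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_experiment({"n_estimators": 100}, {"cv_auc": 0.91}, "data/train-v3.csv")
print(rec["config_hash"])
```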
## Core implementation rules
### 1. Always use proper validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
### 2. Match metrics to problem type
```python
# Classification with imbalance
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred))
y_proba = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Regression (root_mean_squared_error requires scikit-learn >= 1.4)
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
```
### 3. Validate on hold-out test set
```python
# Final evaluation on an untouched test set.
# Never optimize hyperparameters on test data!
from sklearn.metrics import accuracy_score

y_pred = best_model.predict(X_test)
test_score = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_score:.3f}")
```
### 4. Analyze errors systematically
```python
# Error by segment (assumes y_test and y_pred are arrays aligned with X_test)
errors = y_pred != y_test
error_df = X_test[errors].copy()
error_df['true'] = y_test[errors]
error_df['pred'] = y_pred[errors]

# Analyze patterns, e.g. error counts per category
print(error_df.groupby('category').size())
```
## Common anti-patterns
| Anti-pattern | Solution |
|---|---|
| ❌ Single train/test split without CV | Use stratified k-fold for robust estimates |
| ❌ Optimizing accuracy on imbalanced data | Use F1, PR-AUC, or class-balanced metrics |
| ❌ Data leakage in preprocessing | Fit preprocessors on train only, use pipelines |
| ❌ Not checking calibration for probabilities | Use sklearn.calibration.CalibratedClassifierCV |
| ❌ Ignoring inference speed/memory | Profile prediction time, consider model size |
| ❌ No error analysis | Segment errors by features to find patterns |
| ❌ Overfitting validation set | Keep final test set completely untouched |
| ❌ Not tracking random seeds | Set random_state everywhere for reproducibility |
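For the calibration anti-pattern, a quick check compares Brier scores before and after wrapping the model in `CalibratedClassifierCV`. A sketch, assuming a random forest and synthetic data as stand-ins:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

raw = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
# Isotonic calibration refits the probabilities via internal cross-validation
cal = CalibratedClassifierCV(RandomForestClassifier(random_state=42),
                             method='isotonic', cv=5).fit(X_tr, y_tr)

for name, m in [("raw", raw), ("calibrated", cal)]:
    p = m.predict_proba(X_te)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_te, p):.3f}")  # lower is better
```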
## Progressive disclosure
Detailed reference guides for specific topics:
- references/cross-validation.md — CV strategies for different data types (K-Fold, Stratified, Time Series, Group K-Fold)
- references/metrics-guide.md — Choosing and interpreting classification and regression metrics
- references/hyperparameter-tuning.md — Optuna and Ray Tune optimization patterns
- references/experiment-tracking.md — MLflow and Weights & Biases setup
## Related skills
- @engineering-ml-features — Upstream feature engineering before evaluation
- @orchestrating-data-pipelines — Production model deployment
- @assuring-data-pipelines — Model monitoring in production