# Evaluating ML Models
Use this skill for rigorously assessing model performance, comparing alternatives, diagnosing issues, and optimizing hyperparameters.
## When to use this skill
- Model training complete — need systematic performance assessment
- Comparing multiple models/algorithms — statistical model comparison
- Diagnosing overfitting/underfitting — bias-variance analysis
- Hyperparameter tuning — finding optimal configurations
- Selecting appropriate metrics — matching metrics to business objectives
- Experiment tracking — reproducible experimentation
- Production readiness check — validation before deployment
## When NOT to use this skill
- Feature engineering and preprocessing → use @engineering-ml-features
- Exploratory data analysis → use @analyzing-data
- Building interactive data apps → use @building-data-apps
- Notebook setup and workflows → use @working-in-notebooks
## Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series, grouped |
| Classification metrics | sklearn.metrics | Accuracy, precision, recall, F1, ROC-AUC, PR-AUC |
| Regression metrics | sklearn.metrics | MAE, RMSE, R², MAPE |
| Hyperparameter tuning | Optuna | Bayesian optimization with pruning |
| Distributed tuning | Ray Tune | Large-scale distributed search |
| Experiment tracking | MLflow | Open-source, model registry |
| Cloud experiment tracking | Weights & Biases | Collaboration-focused |
| Model comparison | scipy.stats / statsmodels | Paired t-test (scipy); McNemar's test (statsmodels); see the sketch below |
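As a sketch of the model-comparison row: given per-fold scores for two models from the same CV splits (the arrays below are illustrative), scipy's paired t-test applies directly; note that McNemar's test lives in statsmodels rather than scipy.

```python
from scipy.stats import ttest_rel

# Per-fold scores must come from identical CV splits for pairing to be valid.
scores_a = [0.81, 0.79, 0.83, 0.80, 0.82]  # model A (illustrative)
scores_b = [0.78, 0.77, 0.80, 0.79, 0.78]  # model B (illustrative)

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```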
## Evaluation workflow
### 1. Choose cross-validation strategy
Match the CV strategy to your data characteristics (an instantiation sketch follows the table):
| Data Type | CV Strategy | Implementation |
|---|---|---|
| Standard tabular | K-Fold | KFold(n_splits=5, shuffle=True) |
| Classification (imbalanced) | Stratified K-Fold | StratifiedKFold(n_splits=5) |
| Time series | Time Series Split | TimeSeriesSplit(n_splits=5) |
| Grouped/clustered | Group K-Fold | GroupKFold(n_splits=5) |
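A brief sketch instantiating the splitters above (`model`, `X`, `y`, and `groups` are assumed to already exist):

```python
from sklearn.model_selection import (
    GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score,
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)             # standard tabular
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios
tscv = TimeSeriesSplit(n_splits=5)                                # respects temporal order
gkf = GroupKFold(n_splits=5)                                      # keeps each group in one fold

# GroupKFold needs the group labels at split time:
scores = cross_val_score(model, X, y, cv=gkf, groups=groups)
```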
### 2. Select appropriate metrics
**Classification:**
- Balanced classes: Accuracy
- Imbalanced classes: F1, Precision-Recall, ROC-AUC
- Cost-sensitive: Custom business metrics (see the make_scorer sketch after this list)

**Regression:**
- Scale-independent: MAPE, R²
- Error magnitude: MAE (robust to outliers), RMSE (penalizes large errors)
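For the cost-sensitive case, a hedged sketch of wrapping a custom business cost with `make_scorer`; the 5:1 cost weights are an illustrative assumption:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def business_cost(y_true, y_pred):
    # Illustrative cost weights: a missed positive (FN) costs 5x a false alarm (FP).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * 1.0 + fn * 5.0

# Lower cost is better, so sklearn negates the value internally.
cost_scorer = make_scorer(business_cost, greater_is_better=False)
# Usage: cross_val_score(model, X, y, cv=5, scoring=cost_scorer)
```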
### 3. Analyze performance
- Cross-validation mean ± std (estimates variance)
- Validation curves (bias-variance tradeoff)
- Learning curves (data sufficiency; see the sketch below)
- Error analysis by segment
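A minimal learning-curve sketch for the data-sufficiency check (`model`, `X`, `y` assumed defined):

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Mean CV scores at increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
print("train:", train_scores.mean(axis=1).round(3))
print("val:  ", val_scores.mean(axis=1).round(3))
```

A persistent train/validation gap points to overfitting; two low, converged curves point to underfitting rather than too little data.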
### 4. Hyperparameter optimization
Use Bayesian optimization for efficient search:
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 200),
        'max_depth': trial.suggest_int('max_depth', 2, 32, log=True),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```
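After the search finishes, the best configuration lives on the study object:

```python
print(study.best_params)  # best hyperparameter dict found during the search
print(study.best_value)   # the corresponding mean CV score
```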
### 5. Track and compare experiments
Log everything needed for reproducibility (see the MLflow sketch after this list):
- Hyperparameters
- Metrics (CV and test)
- Model artifacts
- Dataset versions
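A minimal MLflow sketch covering those four items, assuming the trained `best_model`, the Optuna `study`, and the CV `scores` from the steps above; the experiment name and dataset tag are illustrative:

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("model-evaluation")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_params(study.best_params)           # hyperparameters
    mlflow.log_metric("cv_score", scores.mean())   # CV metric
    mlflow.set_tag("dataset_version", "v1")        # illustrative dataset tag
    mlflow.sklearn.log_model(best_model, "model")  # model artifact
```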
## Core implementation rules
### 1. Always use proper validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
### 2. Match metrics to problem type
```python
# Classification with imbalance
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred))
y_proba = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Regression
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.3f}")
```
### 3. Validate on hold-out test set
```python
from sklearn.metrics import accuracy_score

# Final evaluation on the untouched test set.
# Never optimize hyperparameters on test data!
y_pred = best_model.predict(X_test)
test_score = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_score:.3f}")
```
### 4. Analyze errors systematically
```python
# Error analysis by segment; X_test is assumed to be a pandas DataFrame
# and y_true/y_pred aligned arrays.
errors = y_pred != y_true
error_df = X_test[errors].copy()
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]

# Count errors per segment ('category' is an example column).
print(error_df.groupby('category').size())
```
## Common anti-patterns
| Anti-pattern | Solution |
|---|---|
| ❌ Single train/test split without CV | Use stratified k-fold for robust estimates |
| ❌ Optimizing accuracy on imbalanced data | Use F1, PR-AUC, or class-balanced metrics |
| ❌ Data leakage in preprocessing | Fit preprocessors on train only, use pipelines (see the sketch after this table) |
| ❌ Not checking calibration for probabilities | Use sklearn.calibration.CalibratedClassifierCV |
| ❌ Ignoring inference speed/memory | Profile prediction time, consider model size |
| ❌ No error analysis | Segment errors by features to find patterns |
| ❌ Overfitting validation set | Keep final test set completely untouched |
| ❌ Not tracking random seeds | Set random_state everywhere for reproducibility |
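To illustrate the data-leakage row above: fitting preprocessing inside a `Pipeline` makes leakage-safe cross-validation automatic. A minimal sketch; `StandardScaler` and `LogisticRegression` are illustrative stand-ins for any preprocessor/model pair:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is re-fit on the training folds only within each CV split,
# so the validation fold never influences preprocessing.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
```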
## Progressive disclosure
Detailed reference guides for specific topics:
- `references/cross-validation.md` - CV strategies for different data types (K-Fold, Stratified, Time Series, Group K-Fold)
- `references/metrics-guide.md` - Choosing and interpreting classification and regression metrics
- `references/hyperparameter-tuning.md` - Optuna and Ray Tune optimization patterns
- `references/experiment-tracking.md` - MLflow and Weights & Biases setup
## Related skills
- @engineering-ml-features - Upstream feature engineering before evaluation
- @orchestrating-data-pipelines - Production model deployment
- @assuring-data-pipelines - Model monitoring in production