# Evaluating ML Models
Use this skill for rigorously assessing model performance, comparing alternatives, diagnosing issues, and optimizing hyperparameters.
## When to use this skill
- Model training complete — need systematic performance assessment
- Comparing multiple models/algorithms — statistical model comparison
- Diagnosing overfitting/underfitting — bias-variance analysis
- Hyperparameter tuning — finding optimal configurations
- Selecting appropriate metrics — matching metrics to business objectives
- Experiment tracking — reproducible experimentation
- Production readiness check — validation before deployment
## When NOT to use this skill
- Feature engineering and preprocessing → use @engineering-ml-features
- Exploratory data analysis → use @analyzing-data
- Building interactive data apps → use @building-data-apps
- Notebook setup and workflows → use @working-in-notebooks
## Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series, grouped |
| Classification metrics | sklearn.metrics | Accuracy, precision, recall, F1, ROC-AUC, PR-AUC |
| Regression metrics | sklearn.metrics | MAE, RMSE, R², MAPE |
| Hyperparameter tuning | Optuna | Bayesian optimization with pruning |
| Distributed tuning | Ray Tune | Large-scale distributed search |
| Experiment tracking | MLflow | Open-source, model registry |
| Cloud experiment tracking | Weights & Biases | Collaboration-focused |
| Model comparison | scipy.stats / statsmodels | Paired t-test (scipy); McNemar's test (statsmodels); see the sketch below |
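As a sketch of the model-comparison row: given per-fold scores for two models from the same CV splits (the arrays below are illustrative), scipy's paired t-test applies directly; note that McNemar's test lives in statsmodels rather than scipy.

```python
from scipy.stats import ttest_rel

# Per-fold scores must come from identical CV splits for pairing to be valid.
scores_a = [0.81, 0.79, 0.83, 0.80, 0.82]  # model A (illustrative)
scores_b = [0.78, 0.77, 0.80, 0.79, 0.78]  # model B (illustrative)

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```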
## Evaluation workflow
### 1. Choose cross-validation strategy
Match the CV strategy to your data characteristics (an instantiation sketch follows the table):
| Data Type | CV Strategy | Implementation |
|---|---|---|
| Standard tabular | K-Fold | KFold(n_splits=5, shuffle=True) |
| Classification (imbalanced) | Stratified K-Fold | StratifiedKFold(n_splits=5) |
| Time series | Time Series Split | TimeSeriesSplit(n_splits=5) |
| Grouped/clustered | Group K-Fold | GroupKFold(n_splits=5) |
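A brief sketch instantiating the splitters above (`model`, `X`, `y`, and `groups` are assumed to already exist):

```python
from sklearn.model_selection import (
    GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score,
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)             # standard tabular
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios
tscv = TimeSeriesSplit(n_splits=5)                                # respects temporal order
gkf = GroupKFold(n_splits=5)                                      # keeps each group in one fold

# GroupKFold needs the group labels at split time:
scores = cross_val_score(model, X, y, cv=gkf, groups=groups)
```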
### 2. Select appropriate metrics
**Classification:**
- Balanced classes: Accuracy
- Imbalanced classes: F1, Precision-Recall, ROC-AUC
- Cost-sensitive: Custom business metrics (see the make_scorer sketch after this list)

**Regression:**
- Scale-independent: MAPE, R²
- Error magnitude: MAE (robust to outliers), RMSE (penalizes large errors)
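For the cost-sensitive case, a hedged sketch of wrapping a custom business cost with `make_scorer`; the 5:1 cost weights are an illustrative assumption:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def business_cost(y_true, y_pred):
    # Illustrative cost weights: a missed positive (FN) costs 5x a false alarm (FP).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * 1.0 + fn * 5.0

# Lower cost is better, so sklearn negates the value internally.
cost_scorer = make_scorer(business_cost, greater_is_better=False)
# Usage: cross_val_score(model, X, y, cv=5, scoring=cost_scorer)
```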
### 3. Analyze performance
- Cross-validation mean ± std (estimates variance)
- Validation curves (bias-variance tradeoff)
- Learning curves (data sufficiency; see the sketch below)
- Error analysis by segment
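A minimal learning-curve sketch for the data-sufficiency check (`model`, `X`, `y` assumed defined):

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Mean CV scores at increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
print("train:", train_scores.mean(axis=1).round(3))
print("val:  ", val_scores.mean(axis=1).round(3))
```

A persistent train/validation gap points to overfitting; two low, converged curves point to underfitting rather than too little data.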
### 4. Hyperparameter optimization
Use Bayesian optimization for efficient search:
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 200),
        'max_depth': trial.suggest_int('max_depth', 2, 32, log=True),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```
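After the search finishes, the best configuration lives on the study object:

```python
print(study.best_params)  # best hyperparameter dict found during the search
print(study.best_value)   # the corresponding mean CV score
```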
### 5. Track and compare experiments
Log everything needed for reproducibility (see the MLflow sketch after this list):
- Hyperparameters
- Metrics (CV and test)
- Model artifacts
- Dataset versions
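A minimal MLflow sketch covering those four items, assuming the trained `best_model`, the Optuna `study`, and the CV `scores` from the steps above; the experiment name and dataset tag are illustrative:

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("model-evaluation")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_params(study.best_params)           # hyperparameters
    mlflow.log_metric("cv_score", scores.mean())   # CV metric
    mlflow.set_tag("dataset_version", "v1")        # illustrative dataset tag
    mlflow.sklearn.log_model(best_model, "model")  # model artifact
```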
## Core implementation rules
### 1. Always use proper validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
### 2. Match metrics to problem type
```python
# Classification with imbalance
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred))
y_proba = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Regression
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.3f}")
```
### 3. Validate on hold-out test set
```python
from sklearn.metrics import accuracy_score

# Final evaluation on the untouched test set.
# Never optimize hyperparameters on test data!
y_pred = best_model.predict(X_test)
test_score = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_score:.3f}")
```
### 4. Analyze errors systematically
```python
# Error analysis by segment; X_test is assumed to be a pandas DataFrame
# and y_true/y_pred aligned arrays.
errors = y_pred != y_true
error_df = X_test[errors].copy()
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]

# Count errors per segment ('category' is an example column).
print(error_df.groupby('category').size())
```
## Common anti-patterns
| Anti-pattern | Solution |
|---|---|
| ❌ Single train/test split without CV | Use stratified k-fold for robust estimates |
| ❌ Optimizing accuracy on imbalanced data | Use F1, PR-AUC, or class-balanced metrics |
| ❌ Data leakage in preprocessing | Fit preprocessors on train only, use pipelines (see the sketch after this table) |
| ❌ Not checking calibration for probabilities | Use sklearn.calibration.CalibratedClassifierCV |
| ❌ Ignoring inference speed/memory | Profile prediction time, consider model size |
| ❌ No error analysis | Segment errors by features to find patterns |
| ❌ Overfitting validation set | Keep final test set completely untouched |
| ❌ Not tracking random seeds | Set random_state everywhere for reproducibility |
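To illustrate the data-leakage row above: fitting preprocessing inside a `Pipeline` makes leakage-safe cross-validation automatic. A minimal sketch; `StandardScaler` and `LogisticRegression` are illustrative stand-ins for any preprocessor/model pair:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is re-fit on the training folds only within each CV split,
# so the validation fold never influences preprocessing.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
```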
## Progressive disclosure
Detailed reference guides for specific topics:
- `references/cross-validation.md` - CV strategies for different data types (K-Fold, Stratified, Time Series, Group K-Fold)
- `references/metrics-guide.md` - Choosing and interpreting classification and regression metrics
- `references/hyperparameter-tuning.md` - Optuna and Ray Tune optimization patterns
- `references/experiment-tracking.md` - MLflow and Weights & Biases setup
## Related skills
- @engineering-ml-features - Upstream feature engineering before evaluation
- @orchestrating-data-pipelines - Production model deployment
- @assuring-data-pipelines - Model monitoring in production