# Model Evaluation
Use this skill for rigorously assessing model performance, comparing alternatives, and diagnosing issues.
## When to use this skill
- Model training complete — need performance assessment
- Comparing multiple models/algorithms
- Diagnosing overfitting/underfitting
- Hyperparameter tuning
- Production readiness check
## Evaluation workflow
1) **Choose a cross-validation strategy**
   - K-fold (default for most cases)
   - Stratified K-fold (classification with class imbalance)
   - TimeSeriesSplit (temporal data)
   - GroupKFold (grouped/clustered data)
2) **Choose appropriate metrics**
   - Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
   - Regression: MAE, RMSE, R², MAPE
   - Ranking: NDCG, MAP
   - Business: custom metrics tied to outcomes
3) **Analyze performance**
   - Cross-validation mean ± std
   - Validation curves (bias-variance tradeoff)
   - Learning curves (data sufficiency)
   - Error analysis by segment
4) **Model comparison** (see the sketch after this list)
   - Statistical significance (paired t-test, McNemar's test)
   - Calibration (for probability outputs)
   - Speed vs accuracy tradeoffs
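For the statistical comparison step, a minimal sketch using a paired t-test on per-fold CV scores (`model_a`, `model_b`, `X`, and `y` are placeholders; SciPy is assumed to be available):

```python
# Score both models on identical folds, then test whether the
# per-fold score differences are statistically significant.
from scipy.stats import ttest_rel
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # same folds for both models
scores_a = cross_val_score(model_a, X, y, cv=cv, scoring='roc_auc')
scores_b = cross_val_score(model_b, X, y, cv=cv, scoring='roc_auc')

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"Paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```

Note that with only 5 folds the test is underpowered; repeated CV or more folds gives a more reliable comparison.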
## Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series |
| Metrics | sklearn.metrics | Comprehensive metric suite |
| Hyperparameter tuning | Optuna or Ray Tune | Efficient search algorithms |
| Model comparison | scikit-learn + statistical tests | Paired comparisons |
| Experiment tracking | MLflow or Weights & Biases | Track runs, metrics, artifacts |
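As one example of the tuning row above, a minimal Optuna sketch (the objective, search space, and `GradientBoostingClassifier` choice are illustrative assumptions, not part of this skill):

```python
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search space is illustrative; adapt the ranges to your model.
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 2, 8),
    }
    model = GradientBoostingClassifier(**params)
    return cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)
```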
## Core implementation rules
### 1) Always use proper validation

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
### 2) Match metrics to the problem

```python
# Classification with class imbalance: focus on precision/recall/F1
# for the minority class rather than overall accuracy.
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))

# Regression (root_mean_squared_error requires scikit-learn >= 1.4)
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
```
### 3) Analyze errors systematically

```python
# Slice the misclassified rows and look for patterns by segment.
errors = y_pred != y_true
error_df = X_test[errors].copy()  # copy to avoid pandas chained-assignment warnings
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]

# Group by a segment column to see where the model fails most
# ('category' is a placeholder for any segment feature in X_test).
print(error_df.groupby('category').size().sort_values(ascending=False))
```
### 4) Track experiments

```python
import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metrics({'auc': auc, 'f1': f1})
    mlflow.sklearn.log_model(model, 'model')
```
## Common anti-patterns
- ❌ Single train/test split without CV
- ❌ Optimizing wrong metric (accuracy on imbalanced data)
- ❌ Data leakage in preprocessing
- ❌ Not checking calibration for probability outputs (see the sketch below)
- ❌ Ignoring inference speed/memory constraints
- ❌ No error analysis or debugging bad predictions
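A quick calibration check, referenced above (a sketch; `y_proba` is again assumed to come from `predict_proba`):

```python
from sklearn.calibration import calibration_curve

# Bin predicted probabilities and compare to observed positive rates;
# a well-calibrated model has prob_true close to prob_pred in each bin.
prob_true, prob_pred = calibration_curve(y_true, y_proba, n_bins=10)
for p_pred, p_true in zip(prob_pred, prob_true):
    print(f"predicted {p_pred:.2f} -> observed {p_true:.2f}")
```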
## Progressive disclosure

- `../references/cross-validation.md` — CV strategies for different data types
- `../references/metrics-guide.md` — Choosing and interpreting metrics
- `../references/hyperparameter-tuning.md` — Optuna, Ray Tune patterns
- `../references/experiment-tracking.md` — MLflow, W&B setup
## Related skills

- @data-science-feature-engineering — Features to evaluate
- @data-engineering-orchestration — Production model deployment
- @data-engineering-observability — Model monitoring in production