# Model Evaluator

## Overview

Provides comprehensive, unbiased model evaluation following ML best practices. Goes beyond simple accuracy to evaluate models across multiple dimensions, ensuring confident deployment decisions.

## Core Evaluation Framework
### 1. Classification Metrics
- Accuracy, Precision, Recall, F1-score
- ROC AUC, PR AUC
- Confusion matrix
- Per-class metrics (for multi-class)
- Class imbalance handling
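
For orientation, these metrics map directly onto scikit-learn functions. A minimal sketch, independent of the `ModelEvaluator` API, assuming you already have `y_test`, predicted labels `y_pred`, and positive-class probabilities `y_prob` from your own model:

```python
# Minimal sketch: core classification metrics with scikit-learn.
# y_test, y_pred, and y_prob are illustrative names from your own pipeline.
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    roc_auc_score, average_precision_score, confusion_matrix,
)

accuracy = accuracy_score(y_test, y_pred)

# average=None returns per-class arrays (precision, recall, F1, support);
# use average="macro" or "weighted" to aggregate under class imbalance
precision, recall, f1, support = precision_recall_fscore_support(
    y_test, y_pred, average=None
)

roc_auc = roc_auc_score(y_test, y_prob)           # binary case
pr_auc = average_precision_score(y_test, y_prob)  # area under the PR curve
cm = confusion_matrix(y_test, y_pred)
```

The per-class arrays are exactly what feeds a per-class table like the one in the report example below.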
### 2. Regression Metrics
- RMSE, MAE, MAPE
- R² score, Adjusted R²
- Residual analysis
- Prediction interval coverage
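
The regression metrics are equally mechanical. A minimal sketch, assuming NumPy arrays `y_test` and `y_pred` plus a feature matrix `X_test` from your own pipeline:

```python
# Minimal sketch: regression metrics with scikit-learn and NumPy.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100  # undefined if y_test contains zeros
r2 = r2_score(y_test, y_pred)

# Adjusted R² penalizes extra features: n samples, p features
n, p = len(y_test), X_test.shape[1]
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residual analysis: plot residuals vs. predictions to spot bias
# or heteroscedasticity that a single score would hide
residuals = y_test - y_pred
```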
### 3. Ranking Metrics (Recommendations)
- Precision@K, Recall@K
- NDCG@K, MAP@K
- MRR (Mean Reciprocal Rank)
- Coverage, Diversity
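
The ranking metrics are easiest to pin down in code. A minimal sketch for a single query, with illustrative inputs (`recommended` is a ranked list of item ids, `relevant` a set of ground-truth relevant ids); in practice you would average these over all users or queries:

```python
# Minimal sketch: per-query ranking metrics. Names are illustrative.
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-K recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top K."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def reciprocal_rank(recommended, relevant):
    """1/rank of the first relevant item; average over queries for MRR."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0
```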
### 4. Statistical Validation
- Cross-validation (K-fold, stratified, time-series)
- Confidence intervals
- Statistical significance testing
- Calibration curves
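
A minimal sketch of the cross-validation piece, assuming a scikit-learn-compatible `model` and full training data `X`, `y`; the confidence interval here is a simple normal approximation over fold scores:

```python
# Minimal sketch: stratified K-fold CV with a 95% CI on accuracy.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# For temporal data, swap in TimeSeriesSplit (no shuffling)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean, std = scores.mean(), scores.std()
half_width = 1.96 * std / np.sqrt(len(scores))
print(f"Accuracy: {mean:.3f} ± {std:.3f} "
      f"(95% CI: [{mean - half_width:.3f}, {mean + half_width:.3f}])")
```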
## Usage
```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(
    model=trained_model,
    X_test=X_test,
    y_test=y_test,
    increment="0042"
)

# Comprehensive evaluation
report = evaluator.evaluate_all()

# Generates:
# - .specweave/increments/0042.../evaluation-report.md
# - Visualizations (confusion matrix, ROC curves, etc.)
# - Statistical tests
```
## Evaluation Report Structure
```markdown
# Model Evaluation Report: XGBoost Classifier

## Overall Performance

- **Accuracy**: 0.87 ± 0.02 (95% CI: [0.85, 0.89])
- **ROC AUC**: 0.92 ± 0.01
- **F1 Score**: 0.85 ± 0.02

## Per-Class Performance

| Class   | Precision | Recall | F1   | Support |
|---------|-----------|--------|------|---------|
| Class 0 | 0.88      | 0.85   | 0.86 | 1000    |
| Class 1 | 0.84      | 0.87   | 0.86 | 800     |

## Confusion Matrix

[Visualization embedded]

## Cross-Validation Results

- 5-fold CV accuracy: 0.86 ± 0.03
- Fold scores: [0.85, 0.88, 0.84, 0.87, 0.86]
- No overfitting detected (train=0.89, val=0.86, gap=0.03)

## Statistical Tests

- Comparison vs baseline: p=0.001 (highly significant)
- Comparison vs previous model: p=0.042 (significant)

## Recommendations

✅ Deploy: Model meets accuracy threshold (>0.85)
✅ Stable: Low variance across folds
⚠️ Monitor: Class 1 precision slightly lower (0.84)
```
## Model Comparison
```python
from specweave import compare_models

models = {
    "baseline": baseline_model,
    "xgboost": xgb_model,
    "lightgbm": lgbm_model,
    "neural-net": nn_model,
}

comparison = compare_models(
    models,
    X_test,
    y_test,
    metrics=["accuracy", "auc", "f1"],
    increment="0042"
)
```
Output:

```text
Model Comparison Report
=======================

| Model      | Accuracy | ROC AUC | F1   | Inference Time | Model Size |
|------------|----------|---------|------|----------------|------------|
| baseline   | 0.65     | 0.70    | 0.62 | 1ms            | 10KB       |
| xgboost    | 0.87     | 0.92    | 0.85 | 35ms           | 12MB       |
| lightgbm   | 0.86     | 0.91    | 0.84 | 28ms           | 8MB        |
| neural-net | 0.85     | 0.90    | 0.83 | 120ms          | 45MB       |

Recommendation: XGBoost
- Best accuracy and AUC
- Acceptable inference time (<50ms requirement)
- Good size/performance tradeoff
```
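
The latency and size columns are straightforward to estimate yourself. A hedged sketch of one way to do it (an illustration, not the `compare_models` internals):

```python
# Minimal sketch: rough inference latency and serialized model size.
import io
import pickle
import time

def profile_model(model, X_test, n_runs=100):
    """Average wall-clock time per predict() call on the full test batch,
    plus the pickled size of the estimator in KB. Names are illustrative."""
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict(X_test)
    latency_ms = (time.perf_counter() - start) / n_runs * 1000

    buf = io.BytesIO()
    pickle.dump(model, buf)
    size_kb = buf.getbuffer().nbytes / 1024
    return latency_ms, size_kb
```

Note the latency here is per batch; divide by `len(X_test)` if you need a per-sample figure.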
## Best Practices

- **Always compare to a baseline**: random, majority-class, or rule-based models
- **Use cross-validation**: never trust a single train/test split
- **Check calibration**: are the predicted probabilities meaningful?
- **Analyze errors**: what kinds of mistakes does the model make?
- **Test statistical significance**: is the improvement real? (see the sketch below)
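
For the significance check, McNemar's test is a standard way to compare two classifiers evaluated on the same test set. A minimal sketch using `statsmodels`; the helper name and prediction arrays are illustrative:

```python
# Minimal sketch: McNemar's test on paired predictions.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, preds_a, preds_b, alpha=0.05):
    """Do the two models make significantly different errors?"""
    correct_a = preds_a == y_true
    correct_b = preds_b == y_true

    # 2x2 table: rows = model A correct/wrong, cols = model B correct/wrong
    table = np.array([
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])

    result = mcnemar(table, exact=True)  # exact test on the discordant pairs
    return result.pvalue, result.pvalue < alpha

# Example:
# p, significant = mcnemar_compare(y_test, xgb_model.predict(X_test),
#                                  baseline_model.predict(X_test))
```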
## Integration with SpecWeave
```bash
# Evaluate model in increment
/ml:evaluate-model 0042

# Compare all models in increment
/ml:compare-models 0042

# Generate full evaluation report
/ml:evaluation-report 0042
```
Evaluation results are automatically included in the increment's COMPLETION-SUMMARY.md.