AutoML Optimizer

Overview

Automates the tedious process of hyperparameter tuning and model selection. Instead of manually trying different configurations, define a search space and let AutoML find the optimal configuration through intelligent exploration.

Why AutoML?

Manual Tuning Problems:

  • Time-consuming (hours/days of trial and error)
  • Subjective (depends on intuition)
  • Incomplete (can't try all combinations)
  • Not reproducible (hard to document search process)

AutoML Benefits:

  • ✅ Systematic exploration of search space
  • ✅ Intelligent sampling (Bayesian optimization)
  • ✅ All experiments tracked automatically
  • ✅ Find optimal configuration faster
  • ✅ Reproducible (search process documented)

AutoML Strategies

Strategy 1: Hyperparameter Optimization (Optuna)

from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from specweave import OptunaOptimizer

# Define search space
def objective(trial):
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)
    }
    
    # Build model with the suggested hyperparameters
    model = XGBClassifier(**params)
    
    # Cross-validation score
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    
    return scores.mean()

# Run optimization
optimizer = OptunaOptimizer(
    objective=objective,
    n_trials=100,
    direction='maximize',
    increment="0042"
)

best_params = optimizer.optimize()

# Creates:
# - .specweave/increments/0042.../experiments/optuna-study/
#   ├── study.db (Optuna database)
#   ├── optimization_history.png
#   ├── param_importances.png
#   ├── parallel_coordinate.png
#   └── best_params.json

Optimization Report:

# Optuna Optimization Report

## Search Space
- n_estimators: [100, 1000]
- max_depth: [3, 10]
- learning_rate: [0.01, 0.3] (log scale)
- subsample: [0.5, 1.0]
- colsample_bytree: [0.5, 1.0]

## Trials: 100
- Completed: 98
- Pruned: 2 (early stopping)
- Failed: 0

## Best Trial (#47)
- ROC AUC: 0.892 ± 0.012
- Parameters:
  - n_estimators: 673
  - max_depth: 6
  - learning_rate: 0.094
  - subsample: 0.78
  - colsample_bytree: 0.91

## Parameter Importance
1. learning_rate (0.42) - Most important
2. n_estimators (0.28)
3. max_depth (0.18)
4. colsample_bytree (0.08)
5. subsample (0.04) - Least important

## Improvement over Default
- Default params: ROC AUC = 0.856
- Optimized params: ROC AUC = 0.892
- Improvement: +4.2%
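With the study complete, the winning configuration is normally refit on the full training set before any hold-out evaluation. A minimal sketch using the best trial's parameters from the report above (equivalently, the dict returned by optimizer.optimize()):

from xgboost import XGBClassifier

# Parameters of the best trial (#47) from the report above
best_params = {
    'n_estimators': 673,
    'max_depth': 6,
    'learning_rate': 0.094,
    'subsample': 0.78,
    'colsample_bytree': 0.91
}

# Refit on the full training set with the optimized configuration
final_model = XGBClassifier(**best_params)
final_model.fit(X_train, y_train)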

Strategy 2: Algorithm Selection + Tuning

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from specweave import AutoMLPipeline

# Define candidate algorithms with search spaces
pipeline = AutoMLPipeline(increment="0042")

# Add candidates
pipeline.add_candidate(
    name="xgboost",
    model=XGBClassifier,
    search_space={
        'n_estimators': (100, 1000),
        'max_depth': (3, 10),
        'learning_rate': (0.01, 0.3)
    }
)

pipeline.add_candidate(
    name="lightgbm",
    model=LGBMClassifier,
    search_space={
        'n_estimators': (100, 1000),
        'max_depth': (3, 10),
        'learning_rate': (0.01, 0.3)
    }
)

pipeline.add_candidate(
    name="random_forest",
    model=RandomForestClassifier,
    search_space={
        'n_estimators': (100, 500),
        'max_depth': (3, 20),
        'min_samples_split': (2, 20)
    }
)

pipeline.add_candidate(
    name="logistic_regression",
    model=LogisticRegression,
    search_space={
        'C': (0.001, 100),
        'penalty': ['l1', 'l2']  # Note: l1 requires a compatible solver (liblinear/saga)
    }
)

# Run AutoML (tries all algorithms + hyperparameters)
results = pipeline.fit(
    X_train, y_train,
    n_trials_per_model=50,
    cv_folds=5,
    metric='roc_auc'
)

# Best model automatically selected
best_model = pipeline.best_model_
best_params = pipeline.best_params_

AutoML Comparison:

| Model               | Trials | Best Score | Mean Score | Std   | Best Params                          |
|---------------------|--------|------------|------------|-------|--------------------------------------|
| xgboost             | 50     | 0.892      | 0.876      | 0.012 | n_est=673, depth=6, lr=0.094         |
| lightgbm            | 50     | 0.889      | 0.873      | 0.011 | n_est=542, depth=7, lr=0.082         |
| random_forest       | 50     | 0.871      | 0.858      | 0.015 | n_est=384, depth=12, min_split=5     |
| logistic_regression | 50     | 0.845      | 0.840      | 0.008 | C=1.234, penalty=l2                  |

**Winner: XGBoost** (ROC AUC = 0.892)
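A typical follow-up, sketched below rather than part of the AutoMLPipeline API shown above, is to refit the winner on the full training data and persist it (the file path is illustrative):

import joblib

# Winner selected by the pipeline (refit defensively in case it was only fitted on CV folds)
final_model = pipeline.best_model_
final_model.fit(X_train, y_train)

# Persist for serving or later increments (illustrative path)
joblib.dump(final_model, "best_model.joblib")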

Strategy 3: Neural Architecture Search (NAS)

from specweave import NeuralArchitectureSearch

# For deep learning
nas = NeuralArchitectureSearch(increment="0042")

# Define search space
search_space = {
    'num_layers': (2, 5),
    'layer_sizes': (32, 512),
    'activation': ['relu', 'tanh', 'elu'],
    'dropout': (0.0, 0.5),
    'optimizer': ['adam', 'sgd', 'rmsprop'],
    'learning_rate': (0.0001, 0.01)
}

# Search for best architecture
best_architecture = nas.search(
    X_train, y_train,
    search_space=search_space,
    n_trials=100,
    max_epochs=50
)

# Creates: Best neural network architecture
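Conceptually, a search like this boils down to an objective that builds a network from sampled architectural choices. A minimal PyTorch sketch of that idea (illustrative only, not the NeuralArchitectureSearch internals; build_mlp is a hypothetical helper):

import torch.nn as nn

def build_mlp(trial, input_dim, output_dim):
    # Sample architecture choices from the search space
    activations = {'relu': nn.ReLU, 'tanh': nn.Tanh, 'elu': nn.ELU}
    num_layers = trial.suggest_int('num_layers', 2, 5)
    dropout = trial.suggest_float('dropout', 0.0, 0.5)
    act_cls = activations[trial.suggest_categorical('activation', list(activations))]

    # Stack Linear -> activation -> Dropout blocks of sampled width
    layers, in_features = [], input_dim
    for i in range(num_layers):
        out_features = trial.suggest_int(f'layer_size_{i}', 32, 512)
        layers += [nn.Linear(in_features, out_features), act_cls(), nn.Dropout(dropout)]
        in_features = out_features
    layers.append(nn.Linear(in_features, output_dim))
    return nn.Sequential(*layers)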

AutoML Frameworks Integration

Optuna (Recommended)

import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from specweave import configure_optuna

# Auto-configures Optuna to log to increment
configure_optuna(increment="0042")

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }
    
    model = XGBClassifier(**params)
    score = cross_val_score(model, X, y, cv=5).mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

# Automatically logged to increment folder
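The study.db artifact mentioned earlier is Optuna's SQLite storage. Plain Optuna can also write to such a file directly, which makes a study resumable and inspectable outside SpecWeave (study name and path below are illustrative):

# Persist trials to SQLite so the study can be resumed or inspected later
study = optuna.create_study(
    study_name="xgboost-tuning",          # illustrative name
    storage="sqlite:///optuna-study.db",  # illustrative path
    direction="maximize",
    load_if_exists=True                   # resume if the study already exists
)
study.optimize(objective, n_trials=100)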

Auto-sklearn

from specweave import AutoSklearnOptimizer

# Automated model selection + feature engineering
optimizer = AutoSklearnOptimizer(
    time_left_for_this_task=3600,  # 1 hour
    increment="0042"
)

optimizer.fit(X_train, y_train)

# Auto-sklearn tries:
# - Multiple algorithms
# - Feature preprocessing combinations
# - Ensemble methods
# Returns best pipeline
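This presumably wraps auto-sklearn itself; the equivalent direct call, without the SpecWeave logging, looks roughly like this (per_run_time_limit is an illustrative choice):

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # total search budget in seconds
    per_run_time_limit=300         # cap per candidate pipeline
)
automl.fit(X_train, y_train)
print(automl.leaderboard())        # ranked candidate pipelines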

H2O AutoML

from specweave import H2OAutoMLOptimizer

optimizer = H2OAutoMLOptimizer(
    max_runtime_secs=3600,  # 1 hour
    max_models=50,
    increment="0042"
)

optimizer.fit(X_train, y_train)

# H2O tries many algorithms in parallel
# Returns leaderboard + best model
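For reference, the underlying H2O workflow looks roughly like the sketch below (the pandas frame and 'target' column name are illustrative assumptions):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.H2OFrame(train_df)  # train_df: pandas DataFrame containing a 'target' column

aml = H2OAutoML(max_runtime_secs=3600, max_models=50)
aml.train(y="target", training_frame=train)

print(aml.leaderboard)  # ranked models
best_model = aml.leader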

Best Practices

1. Start with Default Baseline

# Always compare AutoML against default hyperparameters
baseline_model = XGBClassifier()  # Default params
baseline_score = cross_val_score(baseline_model, X, y, cv=5).mean()

# Then optimize
optimizer = OptunaOptimizer(objective, n_trials=100)
optimized_params = optimizer.optimize()

# Re-score with the optimized hyperparameters before comparing
optimized_model = XGBClassifier(**optimized_params)
optimized_score = cross_val_score(optimized_model, X, y, cv=5).mean()

improvement = (optimized_score - baseline_score) / baseline_score * 100
print(f"Improvement: {improvement:.1f}%")

# Only adopt the optimized params if the improvement is meaningful (>2-3%)

2. Use Cross-Validation

# ❌ Wrong: Single train/test split
score = model.score(X_test, y_test)

# ✅ Correct: Cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
score = scores.mean()

# Prevents overfitting to specific train/test split

3. Set Reasonable Search Budgets

# Quick exploration (development)
optimizer.optimize(n_trials=20)  # ~5-10 minutes

# Moderate search (iteration)
optimizer.optimize(n_trials=100)  # ~30-60 minutes

# Thorough search (final model)
optimizer.optimize(n_trials=500)  # ~2-4 hours

# Don't overdo it: diminishing returns after ~100-200 trials

4. Prune Unpromising Trials

# Optuna can stop bad trials early
study = optuna.create_study(
    direction='maximize',
    pruner=optuna.pruners.MedianPruner()
)

# If trial is performing worse than median at epoch N, stop it
# Saves time by not fully training bad models
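Pruning only takes effect if the objective reports intermediate values. A minimal sketch (assumes a separate validation split X_valid/y_valid; SGDClassifier is used here only because it supports incremental fitting):

import numpy as np
import optuna
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

def objective(trial):
    alpha = trial.suggest_float('alpha', 1e-5, 1e-1, log=True)
    model = SGDClassifier(loss='log_loss', alpha=alpha)
    classes = np.unique(y_train)

    for epoch in range(20):
        model.partial_fit(X_train, y_train, classes=classes)
        score = roc_auc_score(y_valid, model.decision_function(X_valid))

        # Report the intermediate score so the pruner can stop bad trials early
        trial.report(score, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return score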

5. Document Search Space Rationale

# Document why you chose specific ranges
search_space = {
    # XGBoost recommends max_depth 3-10 for most tasks
    'max_depth': (3, 10),
    
    # Learning rate: 0.01-0.3 covers slow to fast learning
    # Log scale to spend more trials on smaller values
    'learning_rate': (0.01, 0.3, 'log'),
    
    # n_estimators: Balance accuracy vs training time
    'n_estimators': (100, 1000)
}

Integration with SpecWeave

Automatic Experiment Tracking

# All AutoML trials logged automatically
optimizer = OptunaOptimizer(objective, increment="0042")
optimizer.optimize(n_trials=100)

# Creates:
# .specweave/increments/0042.../experiments/
# ├── optuna-trial-001/
# ├── optuna-trial-002/
# ├── ...
# ├── optuna-trial-100/
# └── optuna-summary.md

Living Docs Integration

/sw:sync-docs update

Updates:

<!-- .specweave/docs/internal/architecture/ml-optimization.md -->

## Hyperparameter Optimization (Increment 0042)

### Optimization Strategy
- Framework: Optuna (Bayesian optimization)
- Trials: 100
- Search space: 5 hyperparameters
- Metric: ROC AUC (5-fold CV)

### Results
- Best score: 0.892 ± 0.012
- Improvement over default: +4.2%
- Most important param: learning_rate (0.42)

### Selected Hyperparameters
{
    'n_estimators': 673,
    'max_depth': 6,
    'learning_rate': 0.094,
    'subsample': 0.78,
    'colsample_bytree': 0.91
}

### Recommendation
XGBoost with optimized hyperparameters for production deployment.

Commands

# Run AutoML optimization
/ml:optimize 0042 --trials 100

# Compare algorithms
/ml:compare-algorithms 0042

# Show optimization history
/ml:optimization-report 0042

Common Patterns

Pattern 1: Coarse-to-Fine Optimization

# Step 1: Coarse search (wide ranges, few trials)
coarse_space = {
    'n_estimators': (100, 1000, 'int'),
    'max_depth': (3, 10, 'int'),
    'learning_rate': (0.01, 0.3, 'log')
}
coarse_results = optimizer.optimize(coarse_space, n_trials=50)

# Step 2: Fine search (narrow ranges around best)
best_params = coarse_results['best_params']
fine_space = {
    'n_estimators': (best_params['n_estimators'] - 100, 
                     best_params['n_estimators'] + 100),
    'max_depth': (max(3, best_params['max_depth'] - 1),
                  min(10, best_params['max_depth'] + 1)),
    'learning_rate': (best_params['learning_rate'] * 0.5,
                      best_params['learning_rate'] * 1.5, 'log')
}
fine_results = optimizer.optimize(fine_space, n_trials=50)

Pattern 2: Multi-Objective Optimization

# Optimize for multiple objectives (accuracy + speed)
import time

def multi_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
    }
    
    model = XGBClassifier(**params)
    
    # Objective 1: Accuracy
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    
    # Objective 2: Training time
    start = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start
    
    return accuracy, training_time  # Maximize accuracy, minimize training time (see directions below)

# Optuna will find Pareto-optimal solutions
study = optuna.create_study(directions=['maximize', 'minimize'])
study.optimize(multi_objective, n_trials=100)
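With two directions there is no single best trial; Optuna exposes the Pareto-optimal set instead:

# Pareto-optimal trials: each is best for some trade-off of accuracy vs. training time
for trial in study.best_trials:
    accuracy, training_time = trial.values
    print(f"accuracy={accuracy:.3f}  time={training_time:.1f}s  params={trial.params}")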

Summary

AutoML accelerates ML development by:

  • ✅ Automating tedious hyperparameter tuning
  • ✅ Exploring search space systematically
  • ✅ Finding optimal configurations faster
  • ✅ Tracking all experiments automatically
  • ✅ Documenting optimization process

Don't spend days manually tuning—let AutoML do it in hours.
