
Experiment Tracker

Overview

Transforms chaotic ML experimentation into organized, reproducible research. Every experiment is logged, versioned, and tied to a SpecWeave increment, ensuring team knowledge is preserved and experiments are reproducible.

Problem This Solves

Without structured tracking:

  • ❌ "Which hyperparameters did we use for model v2?"
  • ❌ "Why did we choose XGBoost over LightGBM?"
  • ❌ "Can't reproduce results from 3 months ago"
  • ❌ "Team member left, all knowledge in their notebooks"

With experiment tracking:

  • ✅ All experiments logged with params, metrics, artifacts
  • ✅ Decisions documented ("XGBoost: 5% better precision, chose it")
  • ✅ Reproducible (environment, data version, code hash)
  • ✅ Team knowledge in living docs, not individual notebooks

How It Works

Auto-Configuration

When you create an ML increment, the skill detects tracking tools:

# No configuration needed - automatically detects and configures
from specweave import track_experiment

# Automatically logs to:
# .specweave/increments/0042.../experiments/exp-001/
with track_experiment("baseline-model") as exp:
    model.fit(X_train, y_train)
    exp.log_metric("accuracy", accuracy)

Tracking Backends

Option 1: SpecWeave Built-in (default, zero-config)

from specweave import track_experiment

# Logs to increment folder automatically
with track_experiment("xgboost-v1") as exp:
    exp.log_param("n_estimators", 100)
    exp.log_metric("auc", 0.87)
    exp.save_model(model, "model.pkl")

# Creates:
# .specweave/increments/0042.../experiments/xgboost-v1/
# ├── params.json
# ├── metrics.json
# ├── model.pkl
# └── metadata.yaml
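
Because each run is stored as plain JSON and YAML inside the increment folder, past results can be inspected with standard tooling and no tracking server. A minimal sketch, assuming the built-in backend layout shown above (the increment slug is illustrative):

import json
from pathlib import Path

# Path follows the layout above; adjust the increment slug to your project
run_dir = Path(".specweave/increments/0042-recommendation-model/experiments/xgboost-v1")

# params.json and metrics.json are plain JSON, so reloading them is trivial
params = json.loads((run_dir / "params.json").read_text())
metrics = json.loads((run_dir / "metrics.json").read_text())

print(params.get("n_estimators"), metrics.get("auc"))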

Option 2: MLflow (if detected in project)

import mlflow
from specweave import configure_mlflow

# Auto-configures MLflow to log to increment
configure_mlflow(increment="0042")

with mlflow.start_run(run_name="xgboost-v1"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("auc", 0.87)
    mlflow.sklearn.log_model(model, "model")

# Still logs to increment folder, just uses MLflow as backend

Option 3: Weights & Biases

import wandb
from specweave import configure_wandb

# Auto-configures W&B project = increment ID
configure_wandb(increment="0042")

run = wandb.init(name="xgboost-v1")
run.log({"auc": 0.87})
run.log_model("model.pkl")

# W&B dashboard + local logs in increment folder

Experiment Comparison

from specweave import compare_experiments

# Compare all experiments in increment
comparison = compare_experiments(increment="0042")

# Generates:
# .specweave/increments/0042.../experiments/comparison.md

Output:

| Experiment         | Accuracy | Precision | Recall | F1   | Training Time |
|--------------------|----------|-----------|--------|------|---------------|
| exp-001-baseline   | 0.65     | 0.60      | 0.55   | 0.57 | 2s            |
| exp-002-xgboost    | 0.87     | 0.85      | 0.83   | 0.84 | 45s           |
| exp-003-lightgbm   | 0.86     | 0.84      | 0.82   | 0.83 | 32s           |
| exp-004-neural-net | 0.85     | 0.83      | 0.81   | 0.82 | 320s          |

**Best Model**: exp-002-xgboost
- Highest accuracy (0.87)
- Good precision/recall balance
- Reasonable training time (45s)
- Selected for deployment

Living Docs Integration

After completing an increment:

/sw:sync-docs update

Automatically updates:

<!-- .specweave/docs/internal/architecture/ml-experiments.md -->

## Recommendation Model (Increment 0042)

### Experiments Conducted: 7
- exp-001-baseline: Random classifier (acc=0.12)
- exp-002-popularity: Popularity baseline (acc=0.18)
- exp-003-xgboost: XGBoost classifier (acc=0.26) ✅ **SELECTED**
- ...

### Selection Rationale
XGBoost chosen for:
- Best accuracy (0.26 vs baseline 0.18, +44% improvement)
- Fast inference (<50ms)
- Good explainability (SHAP values)
- Stable across cross-validation (std=0.02)

### Hyperparameters (exp-003)
- n_estimators: 200
- max_depth: 6
- learning_rate: 0.1
- subsample: 0.8

When to Use This Skill

Activate when you need to:

  • Track ML experiments systematically
  • Compare multiple models objectively
  • Document experiment decisions for team
  • Reproduce past results exactly
  • Maintain experiment history across increments

Key Features

1. Automatic Logging

# Logs everything automatically
from xgboost import XGBClassifier
from specweave import AutoTracker

tracker = AutoTracker(increment="0042")
params = {"n_estimators": 200, "max_depth": 6}

# Just wrap your training code
@tracker.track(name="xgboost-auto")
def train_model():
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return model, score

# Automatically logs: params, metrics, model, environment, git hash
model, score = train_model()

2. Hyperparameter Tracking

from specweave import track_hyperparameters

params_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3]
}

# Tracks all parameter combinations
results = track_hyperparameters(
    model=XGBClassifier,
    param_grid=params_grid,
    X_train=X_train,
    y_train=y_train,
    increment="0042"
)

# Generates parameter importance analysis
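
The importance analysis itself is generated by the skill; for intuition, here is one way a comparable summary can be derived from plain scikit-learn grid-search results (standard GridSearchCV API, synthetic data, no SpecWeave calls):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X_train, y_train = make_classification(n_samples=500, random_state=42)
grid = GridSearchCV(
    XGBClassifier(),
    {"n_estimators": [100, 200], "max_depth": [3, 6, 9]},
    cv=3,
)
grid.fit(X_train, y_train)

# Mean CV score per parameter value hints at which parameter matters most
results = pd.DataFrame(grid.cv_results_)
for col in ["param_n_estimators", "param_max_depth"]:
    print(results.groupby(col)["mean_test_score"].agg(["mean", "std"]))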

3. Cross-Validation Tracking

from specweave import track_cross_validation

# Tracks each fold separately
cv_results = track_cross_validation(
    model=model,
    X=X,
    y=y,
    cv=5,
    increment="0042"
)

# Logs: mean, std, per-fold scores, fold distribution
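
The logged statistics line up with what plain scikit-learn cross-validation produces; a sketch of the equivalent manual computation (synthetic data, sklearn only):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)

# These are the values the tracker records for you
print("per-fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())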

4. Artifact Management

from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Training artifacts
    exp.save_artifact("preprocessor.pkl", preprocessor)
    exp.save_artifact("model.pkl", model)
    
    # Evaluation artifacts
    exp.save_artifact("confusion_matrix.png", cm_plot)
    exp.save_artifact("roc_curve.png", roc_plot)
    
    # Data artifacts
    exp.save_artifact("feature_importance.csv", importance_df)
    
    # Environment artifacts
    exp.save_artifact("requirements.txt", requirements)
    exp.save_artifact("conda_env.yaml", conda_env)

5. Experiment Metadata

from specweave import ExperimentMetadata

metadata = ExperimentMetadata(
    name="xgboost-v3",
    description="XGBoost with feature engineering v2",
    tags=["production-candidate", "feature-eng-v2"],
    git_commit="a3b8c9d",
    data_version="v2024-01",
    author="[email protected]"
)

with track_experiment(metadata) as exp:
    # ... training ...
    pass
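
Fields like git_commit do not have to be hard-coded; a minimal sketch that captures the current commit with the standard git CLI (assumes the code runs inside a git checkout):

import subprocess

# Short commit hash of the current checkout, e.g. "a3b8c9d"
git_commit = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()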

Best Practices

1. Name Experiments Clearly

# ❌ Bad: Generic names
with track_experiment("exp1"):
    ...

# ✅ Good: Descriptive names
with track_experiment("xgboost-tuned-depth6-lr0.1"):
    ...

2. Log Everything

# Log more than you think you need
exp.log_param("random_seed", 42)
exp.log_param("data_version", "2024-01")
exp.log_param("python_version", sys.version)
exp.log_param("sklearn_version", sklearn.__version__)

# Future you will thank present you
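
Version logging can also be generalized so no library is forgotten; a sketch using the log_param API shown above plus the standard library (the package list is illustrative):

import sys
from importlib import metadata

# One loop instead of one hand-written log_param call per library
for pkg in ["scikit-learn", "xgboost", "pandas"]:
    exp.log_param(f"{pkg}_version", metadata.version(pkg))
exp.log_param("python_version", sys.version)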

3. Document Failures

with track_experiment("neural-net-attempt") as exp:
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        exp.log_note(f"FAILED: {e}")
        exp.log_note("Reason: Out of memory, need smaller batch size")
        exp.set_status("failed")

# Failure documentation prevents repeating mistakes

4. Use Experiment Series

# Related experiments in series
experiments = [
    "xgboost-baseline",
    "xgboost-tuned-v1",
    "xgboost-tuned-v2",
    "xgboost-tuned-v3-final"
]

# Track progression and improvements
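
A shared tag makes the series easy to filter later; a hypothetical sketch built from the track_experiment, add_tag, and log_param calls shown elsewhere in this document:

from specweave import track_experiment

for step, name in enumerate(experiments, start=1):
    with track_experiment(name) as exp:
        exp.add_tag("series:xgboost-tuning")  # shared tag for the whole series
        exp.log_param("series_step", step)
        # ... train and log metrics as usual ...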

5. Link to Data Versions

with track_experiment("xgboost-v1") as exp:
    exp.log_param("data_commit", "dvc:a3b8c9d")
    exp.log_param("data_url", "s3://bucket/data/v2024-01")
    
# Enables exact reproduction
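
When no DVC remote is available, a content hash of the training data pins the exact bytes; a minimal sketch using the standard library (the file path is illustrative):

import hashlib
from pathlib import Path

from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Hash the raw bytes of the training file and log a short digest
    data_hash = hashlib.sha256(Path("data/train.parquet").read_bytes()).hexdigest()
    exp.log_param("data_sha256", data_hash[:12])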

Integration with SpecWeave

With Increments

# Experiments automatically tied to increment
/sw:inc "0042-recommendation-model"
# All experiments logged to: .specweave/increments/0042.../experiments/

With Living Docs

# Sync experiment findings to docs
/sw:sync-docs update
# Updates: architecture/ml-models.md, runbooks/model-training.md

With GitHub

# Create issue for model retraining
/sw:github:create-issue "Retrain model with Q1 2024 data"
# Links to previous experiments in increment

Examples

Example 1: Baseline Experiments

from sklearn.dummy import DummyClassifier
from specweave import track_experiment

baselines = ["uniform", "most_frequent", "stratified"]

for strategy in baselines:
    with track_experiment(f"baseline-{strategy}") as exp:
        model = DummyClassifier(strategy=strategy)
        model.fit(X_train, y_train)
        
        accuracy = model.score(X_test, y_test)
        exp.log_metric("accuracy", accuracy)
        exp.log_note(f"Baseline: {strategy}")

# Generates baseline comparison report

Example 2: Hyperparameter Grid Search

from xgboost import XGBClassifier
from specweave import track_grid_search

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9]
}

# Automatically logs all combinations
best_model, results = track_grid_search(
    XGBClassifier(),
    param_grid,
    X_train,
    y_train,
    increment="0042"
)

# Creates visualization of parameter importance

Example 3: Model Comparison

from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from specweave import compare_models

models = {
    "xgboost": XGBClassifier(),
    "lightgbm": LGBMClassifier(),
    "random-forest": RandomForestClassifier()
}

# Trains and compares all models
comparison = compare_models(
    models,
    X_train,
    y_train,
    X_test,
    y_test,
    increment="0042"
)

# Generates markdown comparison table

Tool Compatibility

MLflow

# Option 1: Pure MLflow (auto-configured)
import mlflow
mlflow.set_tracking_uri(".specweave/increments/0042.../experiments")

# Option 2: SpecWeave wrapper (recommended)
from specweave import mlflow as sw_mlflow
with sw_mlflow.start_run("xgboost"):
    # Logs to both MLflow and increment docs
    pass

Weights & Biases

# Option 1: Pure wandb
import wandb
wandb.init(project="0042-recommendation-model")

# Option 2: SpecWeave wrapper (recommended)
from specweave import wandb as sw_wandb
run = sw_wandb.init(increment="0042", name="xgboost")
# Syncs to increment folder + W&B dashboard

TensorBoard

from specweave import TensorBoardCallback

# Keras callback
model.fit(
    X_train,
    y_train,
    callbacks=[
        TensorBoardCallback(
            increment="0042",
            log_dir=".specweave/increments/0042.../tensorboard"
        )
    ]
)
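
If the SpecWeave callback is unavailable, the standard Keras TensorBoard callback pointed at the increment folder yields similar logs (model and data assumed as above; the increment slug is illustrative):

import tensorflow as tf

# Plain Keras equivalent: write event files into the increment folder
tb = tf.keras.callbacks.TensorBoard(
    log_dir=".specweave/increments/0042-recommendation-model/tensorboard"
)
model.fit(X_train, y_train, callbacks=[tb])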

Commands

# List all experiments in increment
/ml:list-experiments 0042

# Compare experiments
/ml:compare-experiments 0042

# Load experiment details
/ml:show-experiment exp-003-xgboost

# Export experiment data
/ml:export-experiments 0042 --format csv

Tips

  1. Start tracking early - Track from the first experiment, not after 20 failed attempts
  2. Tag production models - exp.add_tag("production") for deployed models
  3. Version everything - Data, code, environment, dependencies
  4. Document decisions - Why model A over model B (not just metrics)
  5. Prune old experiments - Archive experiments >6 months old (see the sketch below)
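
A minimal archiving sketch, assuming the built-in backend layout (increment slug and 6-month cutoff are illustrative):

import shutil
import time
from pathlib import Path

exp_root = Path(".specweave/increments/0042-recommendation-model/experiments")
cutoff = time.time() - 180 * 24 * 3600  # roughly 6 months

for run_dir in exp_root.iterdir():
    if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
        # Compress the run to run_dir.tar.gz, then remove the original folder
        shutil.make_archive(str(run_dir), "gztar", root_dir=str(run_dir))
        shutil.rmtree(run_dir)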

Advanced: Multi-Stage Experiments

For complex pipelines with multiple stages:

from specweave import ExperimentPipeline

pipeline = ExperimentPipeline("recommendation-full-pipeline")

# Stage 1: Data preprocessing
with pipeline.stage("preprocessing") as stage:
    stage.log_metric("rows_before", len(df))
    df_clean = preprocess(df)
    stage.log_metric("rows_after", len(df_clean))

# Stage 2: Feature engineering
with pipeline.stage("features") as stage:
    features = engineer_features(df_clean)
    stage.log_metric("num_features", features.shape[1])

# Stage 3: Model training
with pipeline.stage("training") as stage:
    model = train_model(features)
    stage.log_metric("accuracy", accuracy)

# Logs entire pipeline with stage dependencies

Integration Points

  • ml-pipeline-orchestrator: Auto-tracks experiments during pipeline execution
  • model-evaluator: Uses experiment data for model comparison
  • ml-engineer agent: Reviews experiment results and suggests improvements
  • Living docs: Syncs experiment findings to architecture docs

This skill ensures ML experimentation is never lost, always reproducible, and well-documented.
