# Experiment Tracker

## Overview
Transforms chaotic ML experimentation into organized, reproducible research. Every experiment is logged, versioned, and tied to a SpecWeave increment, ensuring team knowledge is preserved and experiments are reproducible.
## Problem This Solves
Without structured tracking:
- ❌ "Which hyperparameters did we use for model v2?"
- ❌ "Why did we choose XGBoost over LightGBM?"
- ❌ "Can't reproduce results from 3 months ago"
- ❌ "Team member left, all knowledge in their notebooks"
With experiment tracking:
- ✅ All experiments logged with params, metrics, artifacts
- ✅ Decisions documented ("XGBoost: 5% better precision, chose it")
- ✅ Reproducible (environment, data version, code hash)
- ✅ Team knowledge in living docs, not individual notebooks
## How It Works

### Auto-Configuration
When you create an ML increment, the skill detects tracking tools:
```python
# No configuration needed - automatically detects and configures
from specweave import track_experiment

# Automatically logs to:
# .specweave/increments/0042.../experiments/exp-001/
with track_experiment("baseline-model") as exp:
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    exp.log_metric("accuracy", accuracy)
```
### Tracking Backends

#### Option 1: SpecWeave Built-in (default, zero-config)
```python
from specweave import track_experiment

# Logs to increment folder automatically
with track_experiment("xgboost-v1") as exp:
    exp.log_param("n_estimators", 100)
    exp.log_metric("auc", 0.87)
    exp.save_model(model, "model.pkl")

# Creates:
# .specweave/increments/0042.../experiments/xgboost-v1/
# ├── params.json
# ├── metrics.json
# ├── model.pkl
# └── metadata.yaml
```
#### Option 2: MLflow (if detected in project)
```python
import mlflow
from specweave import configure_mlflow

# Auto-configures MLflow to log to the increment
configure_mlflow(increment="0042")

with mlflow.start_run(run_name="xgboost-v1"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("auc", 0.87)
    mlflow.sklearn.log_model(model, "model")

# Still logs to the increment folder, just uses MLflow as the backend
```
#### Option 3: Weights & Biases
```python
import wandb
from specweave import configure_wandb

# Auto-configures the W&B project to the increment ID
configure_wandb(increment="0042")

run = wandb.init(name="xgboost-v1")
run.log({"auc": 0.87})
run.log_model("model.pkl")

# W&B dashboard + local logs in the increment folder
```
### Experiment Comparison
```python
from specweave import compare_experiments

# Compare all experiments in the increment
comparison = compare_experiments(increment="0042")

# Generates:
# .specweave/increments/0042.../experiments/comparison.md
```
Output:
| Experiment | Accuracy | Precision | Recall | F1 | Training Time |
|--------------------|----------|-----------|--------|------|---------------|
| exp-001-baseline | 0.65 | 0.60 | 0.55 | 0.57 | 2s |
| exp-002-xgboost | 0.87 | 0.85 | 0.83 | 0.84 | 45s |
| exp-003-lightgbm | 0.86 | 0.84 | 0.82 | 0.83 | 32s |
| exp-004-neural-net | 0.85 | 0.83 | 0.81 | 0.82 | 320s |
**Best Model**: exp-002-xgboost
- Highest accuracy (0.87)
- Good precision/recall balance
- Reasonable training time (45s)
- Selected for deployment
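Because the built-in backend writes each experiment's `metrics.json` as plain JSON (see the Option 1 layout above), you can also build a quick leaderboard without the helper. A minimal sketch, assuming the documented folder layout and a flat metric-name → value mapping inside `metrics.json`:

```python
import json
from pathlib import Path

# Increment folder as created by /sw:inc "0042-recommendation-model"
exp_root = Path(".specweave/increments/0042-recommendation-model/experiments")

# Collect (experiment, accuracy) pairs from every metrics.json
results = []
for metrics_file in exp_root.glob("*/metrics.json"):
    metrics = json.loads(metrics_file.read_text())  # assumed flat: {"accuracy": 0.87, ...}
    if "accuracy" in metrics:
        results.append((metrics_file.parent.name, metrics["accuracy"]))

# Print best-first
for name, acc in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{name}: accuracy={acc:.2f}")
```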
### Living Docs Integration

After completing an increment:

```bash
/sw:sync-docs update
```
Automatically updates:
```markdown
<!-- .specweave/docs/internal/architecture/ml-experiments.md -->

## Recommendation Model (Increment 0042)

### Experiments Conducted: 7

- exp-001-baseline: Random classifier (acc=0.12)
- exp-002-popularity: Popularity baseline (acc=0.18)
- exp-003-xgboost: XGBoost classifier (acc=0.26) ✅ **SELECTED**
- ...

### Selection Rationale

XGBoost chosen for:
- Best accuracy (0.26 vs baseline 0.18, +44% improvement)
- Fast inference (<50ms)
- Good explainability (SHAP values)
- Stable across cross-validation (std=0.02)

### Hyperparameters (exp-003)

- n_estimators: 200
- max_depth: 6
- learning_rate: 0.1
- subsample: 0.8
```
## When to Use This Skill
Activate when you need to:
- Track ML experiments systematically
- Compare multiple models objectively
- Document experiment decisions for team
- Reproduce past results exactly
- Maintain experiment history across increments
## Key Features

### 1. Automatic Logging
```python
# Logs everything automatically
from specweave import AutoTracker

tracker = AutoTracker(increment="0042")

# Just wrap your training code
@tracker.track(name="xgboost-auto")
def train_model():
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return model, score

# Automatically logs: params, metrics, model, environment, git hash
model, score = train_model()
```
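For intuition, a decorator like this only needs to wrap the function, time it, capture context such as the git commit, and persist the result. A minimal sketch of the pattern, not SpecWeave's actual implementation (the output paths and JSON layout are assumptions):

```python
import functools
import json
import subprocess
import time
from pathlib import Path

def track(name, log_dir="experiments"):
    """Minimal tracking decorator: logs duration, git commit, and the returned score."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            model, score = fn(*args, **kwargs)  # assumes fn returns (model, score)
            git_hash = subprocess.run(
                ["git", "rev-parse", "--short", "HEAD"],
                capture_output=True, text=True,
            ).stdout.strip()
            record = {
                "name": name,
                "score": score,
                "duration_s": round(time.time() - start, 2),
                "git_commit": git_hash,
            }
            out = Path(log_dir) / name
            out.mkdir(parents=True, exist_ok=True)
            (out / "metrics.json").write_text(json.dumps(record, indent=2))
            return model, score
        return wrapper
    return decorator
```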
### 2. Hyperparameter Tracking
```python
from specweave import track_hyperparameters

params_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}

# Tracks all parameter combinations
results = track_hyperparameters(
    model=XGBClassifier,
    param_grid=params_grid,
    X_train=X_train,
    y_train=y_train,
    increment="0042",
)

# Generates parameter importance analysis
```
### 3. Cross-Validation Tracking
```python
from specweave import track_cross_validation

# Tracks each fold separately
cv_results = track_cross_validation(
    model=model,
    X=X,
    y=y,
    cv=5,
    increment="0042",
)

# Logs: mean, std, per-fold scores, fold distribution
```
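For reference, the same summary statistics can be computed directly with scikit-learn; something along these lines is presumably what gets logged per fold (a sketch, not the skill's internals):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Per-fold scores for the same 5-fold split
scores = cross_val_score(model, X, y, cv=5)

print("per-fold:", np.round(scores, 3))
print("mean:", scores.mean(), "std:", scores.std())
```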
### 4. Artifact Management
```python
from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Training artifacts
    exp.save_artifact("preprocessor.pkl", preprocessor)
    exp.save_artifact("model.pkl", model)

    # Evaluation artifacts
    exp.save_artifact("confusion_matrix.png", cm_plot)
    exp.save_artifact("roc_curve.png", roc_plot)

    # Data artifacts
    exp.save_artifact("feature_importance.csv", importance_df)

    # Environment artifacts
    exp.save_artifact("requirements.txt", requirements)
    exp.save_artifact("conda_env.yaml", conda_env)
```
### 5. Experiment Metadata
```python
from specweave import ExperimentMetadata, track_experiment

metadata = ExperimentMetadata(
    name="xgboost-v3",
    description="XGBoost with feature engineering v2",
    tags=["production-candidate", "feature-eng-v2"],
    git_commit="a3b8c9d",
    data_version="v2024-01",
    author="[email protected]",
)

with track_experiment(metadata) as exp:
    # ... training ...
    pass
```
## Best Practices

### 1. Name Experiments Clearly
```python
# ❌ Bad: generic names
with track_experiment("exp1"):
    ...

# ✅ Good: descriptive names
with track_experiment("xgboost-tuned-depth6-lr0.1"):
    ...
```
### 2. Log Everything
```python
import sys
import sklearn

# Log more than you think you need
exp.log_param("random_seed", 42)
exp.log_param("data_version", "2024-01")
exp.log_param("python_version", sys.version)
exp.log_param("sklearn_version", sklearn.__version__)

# Future you will thank present you
```
### 3. Document Failures
```python
with track_experiment("neural-net-attempt") as exp:
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        # Catch inside the context manager so exp is still in scope
        exp.log_note(f"FAILED: {e}")
        exp.log_note("Reason: Out of memory, need smaller batch size")
        exp.set_status("failed")

# Failure documentation prevents repeating mistakes
```
### 4. Use Experiment Series
```python
# Related experiments in a series
experiments = [
    "xgboost-baseline",
    "xgboost-tuned-v1",
    "xgboost-tuned-v2",
    "xgboost-tuned-v3-final",
]

# Track progression and improvements
```
### 5. Link to Data Versions
with track_experiment("xgboost-v1") as exp:
exp.log_param("data_commit", "dvc:a3b8c9d")
exp.log_param("data_url", "s3://bucket/data/v2024-01")
# Enables exact reproduction
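Reproduction then amounts to reading those parameters back and rebuilding the same setup. A minimal sketch against the built-in backend's `params.json`, assuming a flat name → value mapping; `fetch_data` is a hypothetical helper for your data store:

```python
import json
from pathlib import Path

from xgboost import XGBClassifier

exp_dir = Path(".specweave/increments/0042-recommendation-model/experiments/xgboost-v1")
params = json.loads((exp_dir / "params.json").read_text())  # assumed flat mapping

# Re-fetch the exact data version that was logged
data = fetch_data(params["data_url"])  # hypothetical helper

# Rebuild the model with identical hyperparameters and seed
model = XGBClassifier(
    n_estimators=params["n_estimators"],
    random_state=params.get("random_seed", 42),
)
```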
## Integration with SpecWeave

### With Increments
```bash
# Experiments automatically tied to the increment
/sw:inc "0042-recommendation-model"

# All experiments logged to: .specweave/increments/0042.../experiments/
```
### With Living Docs
```bash
# Sync experiment findings to docs
/sw:sync-docs update

# Updates: architecture/ml-models.md, runbooks/model-training.md
```
### With GitHub
```bash
# Create an issue for model retraining
/sw:github:create-issue "Retrain model with Q1 2024 data"

# Links to previous experiments in the increment
```
## Examples

### Example 1: Baseline Experiments
```python
from sklearn.dummy import DummyClassifier
from specweave import track_experiment

# Valid DummyClassifier strategies: uniform (random), most_frequent (majority), stratified
baselines = ["uniform", "most_frequent", "stratified"]

for strategy in baselines:
    with track_experiment(f"baseline-{strategy}") as exp:
        model = DummyClassifier(strategy=strategy)
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        exp.log_metric("accuracy", accuracy)
        exp.log_note(f"Baseline: {strategy}")

# Generates a baseline comparison report
```
### Example 2: Hyperparameter Grid Search
```python
from specweave import track_grid_search

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
}

# Automatically logs all combinations
best_model, results = track_grid_search(
    XGBClassifier(),
    param_grid,
    X_train,
    y_train,
    increment="0042",
)

# Creates a visualization of parameter importance
```
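For comparison, the plain scikit-learn equivalent of what this wrapper appears to orchestrate, without the automatic logging or the increment link:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
}

# Exhaustive search over all 9 combinations with 5-fold CV
# (X_train, y_train as defined elsewhere in this document)
search = GridSearchCV(XGBClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```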
### Example 3: Model Comparison
```python
from specweave import compare_models

models = {
    "xgboost": XGBClassifier(),
    "lightgbm": LGBMClassifier(),
    "random-forest": RandomForestClassifier(),
}

# Trains and compares all models
comparison = compare_models(
    models,
    X_train,
    y_train,
    X_test,
    y_test,
    increment="0042",
)

# Generates a markdown comparison table
```
## Tool Compatibility

### MLflow
```python
# Option 1: Pure MLflow (auto-configured)
import mlflow

mlflow.set_tracking_uri(".specweave/increments/0042.../experiments")

# Option 2: SpecWeave wrapper (recommended)
from specweave import mlflow as sw_mlflow

with sw_mlflow.start_run("xgboost"):
    # Logs to both MLflow and increment docs
    pass
```
### Weights & Biases
```python
# Option 1: Pure wandb
import wandb

wandb.init(project="0042-recommendation-model")

# Option 2: SpecWeave wrapper (recommended)
from specweave import wandb as sw_wandb

run = sw_wandb.init(increment="0042", name="xgboost")

# Syncs to the increment folder + W&B dashboard
```
### TensorBoard
```python
from specweave import TensorBoardCallback

# Keras callback
model.fit(
    X_train,
    y_train,
    callbacks=[
        TensorBoardCallback(
            increment="0042",
            log_dir=".specweave/increments/0042.../tensorboard",
        )
    ],
)
```
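If you skip the wrapper, the stock Keras callback can point at the same directory so the logs still live with the increment (the full increment folder name is assumed from the `/sw:inc` example above):

```python
import tensorflow as tf

# Stock Keras TensorBoard callback writing into the increment folder
tb = tf.keras.callbacks.TensorBoard(
    log_dir=".specweave/increments/0042-recommendation-model/tensorboard"
)

model.fit(X_train, y_train, callbacks=[tb])
```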
## Commands
```bash
# List all experiments in an increment
/ml:list-experiments 0042

# Compare experiments
/ml:compare-experiments 0042

# Show experiment details
/ml:show-experiment exp-003-xgboost

# Export experiment data
/ml:export-experiments 0042 --format csv
```
## Tips

- **Start tracking early** - Track from the first experiment, not after 20 failed attempts
- **Tag production models** - `exp.add_tag("production")` for deployed models
- **Version everything** - Data, code, environment, dependencies
- **Document decisions** - Why model A over model B (not just metrics)
- **Prune old experiments** - Archive experiments older than 6 months; see the sketch below
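Since built-in experiments are plain folders, pruning needs nothing special; a short standard-library script can move stale runs into an archive. A sketch, with the cutoff and paths as assumptions:

```python
import shutil
import time
from pathlib import Path

exp_root = Path(".specweave/increments/0042-recommendation-model/experiments")
archive = exp_root / "_archive"
archive.mkdir(exist_ok=True)

cutoff = time.time() - 180 * 24 * 3600  # roughly six months

for exp_dir in exp_root.iterdir():
    # Move experiment folders not touched since the cutoff
    if exp_dir.is_dir() and exp_dir != archive and exp_dir.stat().st_mtime < cutoff:
        shutil.move(str(exp_dir), str(archive / exp_dir.name))
```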
## Advanced: Multi-Stage Experiments
For complex pipelines with multiple stages:
```python
from specweave import ExperimentPipeline

pipeline = ExperimentPipeline("recommendation-full-pipeline")

# Stage 1: Data preprocessing
with pipeline.stage("preprocessing") as stage:
    stage.log_metric("rows_before", len(df))
    df_clean = preprocess(df)
    stage.log_metric("rows_after", len(df_clean))

# Stage 2: Feature engineering
with pipeline.stage("features") as stage:
    features = engineer_features(df_clean)
    stage.log_metric("num_features", features.shape[1])

# Stage 3: Model training
with pipeline.stage("training") as stage:
    model, accuracy = train_model(features)
    stage.log_metric("accuracy", accuracy)

# Logs the entire pipeline with stage dependencies
```
## Integration Points
- ml-pipeline-orchestrator: Auto-tracks experiments during pipeline execution
- model-evaluator: Uses experiment data for model comparison
- ml-engineer agent: Reviews experiment results and suggests improvements
- Living docs: Syncs experiment findings to architecture docs
This skill ensures ML experimentation is never lost, always reproducible, and well-documented.