ml-experiment-tracker
ML Experiment Tracker
This skill provides guidance for systematic machine learning experimentation with proper tracking, versioning, and reproducibility practices.
Core Competencies
- Experiment Tracking: MLflow, Weights & Biases (wandb), Neptune, Comet
- Data Versioning: DVC, Delta Lake, LakeFS
- Model Registry: Version control for trained models
- Reproducibility: Environment, code, data, and hyperparameter tracking
Experiment Tracking Fundamentals
What to Track
Every experiment should log:
| Category | Items | Why |
|---|---|---|
| Code | Git commit hash, branch, diff | Reproduce exact code state |
| Data | Dataset version, hash, lineage | Know which data was used |
| Environment | Python version, dependencies, hardware | Reproduce runtime |
| Hyperparameters | All config values | Understand what changed |
| Metrics | Loss, accuracy, custom metrics | Compare performance |
| Artifacts | Models, plots, predictions | Preserve outputs |
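The code, data, and environment rows of the table can be gathered by a small helper before a run starts. The following is a minimal sketch using only the standard library. The names `_git`, `file_sha256`, and `collect_run_metadata` are hypothetical and not part of any tracking library, and the Git fields fall back to "unknown" when the script is not run inside a Git working tree.

```python
import hashlib
import platform
import subprocess


def _git(*args: str) -> str:
    """Run a git command and return its stripped output, or "unknown" on failure."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_run_metadata(data_path: str) -> dict:
    """Gather code, data, and environment identifiers for one run."""
    status = _git("status", "--porcelain")
    return {
        "git_commit": _git("rev-parse", "HEAD"),
        "git_branch": _git("rev-parse", "--abbrev-ref", "HEAD"),
        # True only when git reports uncommitted changes; False if git is unavailable.
        "git_dirty": status not in ("", "unknown"),
        "data_sha256": file_sha256(data_path),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

The returned dictionary can then be attached to a run, for example with `mlflow.set_tags(...)` or `wandb.config.update(...)`.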
Experiment Organization
project/
├── experiments/
│   ├── baseline/              # Initial experiments
│   ├── feature-engineering/   # Data improvements
│   ├── architecture/          # Model changes
│   └── hyperparameter/        # Tuning runs
├── data/
│   ├── raw/                   # Original data (versioned)
│   ├── processed/             # Cleaned data
│   └── features/              # Feature store
└── models/
    ├── staging/               # Candidates
    └── production/            # Deployed models
MLflow Patterns
Basic Experiment Logging
import mlflow
import mlflow.pytorch

# Hyperparameters for this run
params = {"learning_rate": 0.01, "batch_size": 32, "epochs": 100}

# Set experiment (creates it if it does not exist)
mlflow.set_experiment("my-classification-project")

# model, train_loader, val_loader, train_epoch, and evaluate are assumed
# to be defined elsewhere in the training script.
with mlflow.start_run(run_name="baseline-v1"):
    # Log parameters
    mlflow.log_params(params)

    # Training loop
    for epoch in range(params["epochs"]):
        train_loss = train_epoch(model, train_loader)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log metrics with step
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc,
        }, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts (plots, configs)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("config.yaml")
Model Registry Workflow
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Training   │───▶│   Staging    │───▶│  Production  │
│     Runs     │    │    Review    │    │   Deployed   │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   Candidate           Validated           Monitored
     Models              Models              Models
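Stages (the classic registry workflow; MLflow 2.9 and later deprecate stages in favor of model version aliases and tags):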
- None: Just logged, not registered
- Staging: Candidate for production
- Production: Active serving
- Archived: Historical reference
Weights & Biases Patterns
Project Structure
import wandb

# Initialize with config
config = {
    "learning_rate": 0.01,
    "architecture": "ResNet50",
    "dataset": "imagenet-subset",
    "epochs": 100,
}

run = wandb.init(
    project="image-classification",
    group="architecture-experiments",  # Group related runs
    tags=["baseline", "resnet"],
    config=config,
    notes="Testing ResNet50 baseline on subset",
)

# model, train_loader, val_loader, train_and_eval, and pred_grid are assumed
# to be defined elsewhere; train_and_eval returns a dict of metric values.
for epoch in range(config["epochs"]):
    metrics = train_and_eval(model, train_loader, val_loader)
    wandb.log(metrics)

# Log media
wandb.log({"predictions": wandb.Image(pred_grid)})
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(...)})

wandb.finish()
Hyperparameter Sweeps
# sweep_config.yaml
program: train.py
method: bayes  # or grid, random
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  optimizer:
    values: ["adam", "sgd", "adamw"]
early_terminate:
  type: hyperband
  min_iter: 10
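To make the `log_uniform_values` setting for `learning_rate` concrete: it draws values whose logarithm is uniformly distributed between log(min) and log(max), so each decade of the range is sampled about equally often. The sketch below illustrates that distribution with the standard library only. It is not W&B's implementation, and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    if not 0 < low < high:
        raise ValueError("require 0 < low < high")
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.0001, 0.1, rng) for _ in range(10_000)]

# The range spans three decades, so about one third of the draws
# should fall below 0.001 (the first decade).
share_first_decade = sum(s < 0.001 for s in samples) / len(samples)
```

A plain uniform distribution over the same range would place only about 1% of its draws below 0.001, which is why log-uniform sampling is the usual choice for learning rates.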
DVC for Data Versioning
Setup and Usage
# Initialize DVC in git repo
dvc init

# Track large files
dvc add data/training.csv
git add data/training.csv.dvc data/.gitignore
git commit -m "Add training data v1"

# Push to remote storage
dvc remote add -d storage s3://bucket/dvc
dvc push

# Create a pipeline stage (dvc stage add replaces the older dvc run, removed in DVC 3.0)
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/raw \
    -o data/processed \
    python src/preprocess.py

# Reproduce pipeline
dvc repro
DVC Pipeline Definition
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.epochs
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
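The `train` stage above declares `metrics.json` as a metrics file, so the training script has to write it. A minimal sketch of that step follows. The helper name `write_metrics` and the metric keys in the usage note are assumptions for illustration, not something the pipeline definition prescribes.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, path: str = "metrics.json") -> None:
    """Write scalar metrics as JSON so DVC can read them as a metrics file."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable between runs, which makes diffs readable.
    out.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n", encoding="utf-8")
```

At the end of `src/train.py` this might be called as `write_metrics({"val_accuracy": val_acc, "val_loss": val_loss})`, after which `dvc metrics show` can display the values.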
Reproducibility Checklist
Code Reproducibility
- Pin git commit for each experiment
- Track uncommitted changes (git diff)
- Version control notebooks (nbstripout)
- Document manual steps
Environment Reproducibility
- Lock dependencies (pip freeze, poetry.lock)
- Specify Python version
- Document CUDA/GPU requirements
- Use containers for full isolation
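Alongside a lock file, it can help to record what was actually installed at the time of a run. The sketch below captures the interpreter, operating system, and installed package versions using only the standard library. The function names and the `environment.json` file name are hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed-package versions for logging."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def save_environment(path: str = "environment.json") -> None:
    """Write the snapshot to a file that can be logged as a run artifact."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(environment_snapshot(), handle, indent=2)
```

This does not cover CUDA or driver versions, which still need to be documented separately.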
Data Reproducibility
- Version datasets with DVC or similar
- Document data collection process
- Track preprocessing steps
- Save train/val/test split indices
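Saving split indices, the last item above, can be as simple as shuffling with a fixed seed and writing the result to JSON. This is a minimal sketch under that assumption. The function names, default fractions, and `split_indices.json` file name are illustrative choices.

```python
import json
import random


def make_split_indices(n_samples: int, val_fraction: float = 0.1,
                       test_fraction: float = 0.1, seed: int = 42) -> dict:
    """Shuffle indices with a fixed seed and cut them into train/val/test."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    n_test = int(n_samples * test_fraction)
    return {
        "seed": seed,
        "val": indices[:n_val],
        "test": indices[n_val:n_val + n_test],
        "train": indices[n_val + n_test:],
    }


def save_split(split: dict, path: str = "split_indices.json") -> None:
    """Persist the split so later runs evaluate on exactly the same rows."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(split, handle)


def load_split(path: str = "split_indices.json") -> dict:
    """Load a previously saved split."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Storing the indices themselves, rather than only the seed, protects against changes in shuffling behavior across library versions.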
Training Reproducibility
- Set random seeds (Python, NumPy, PyTorch/TF)
- Log all hyperparameters
- Save model checkpoints
- Document non-deterministic operations
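Seeding every random number generator in one place makes the first item above harder to forget. The sketch below seeds Python's `random` module and, when they are installed, NumPy and PyTorch. The function name `set_global_seed` is hypothetical.

```python
import os
import random


def set_global_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    # Only affects subprocesses started after this point; hash randomization
    # for the current interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
```

Seeding alone does not guarantee bit-identical results on GPUs. Some operations remain non-deterministic unless they are explicitly restricted, for example with `torch.use_deterministic_algorithms(True)` in PyTorch, which is why the last checklist item still applies.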
Best Practices
Naming Conventions
experiment: {project}-{objective}
run: {date}-{description}-{variant}
model: {architecture}-{dataset}-{version}
Examples:
experiment: fraud-detection-baseline
run: 2024-01-15-xgboost-tuning-lr001
model: xgboost-transactions-v2.3.1
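The templates above are easier to apply consistently when names are generated rather than typed by hand. The helpers below are a hypothetical sketch. The slug rules (lowercase, hyphen-separated) and the `v` prefix on the version are assumptions inferred from the examples.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of other characters into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def experiment_name(project: str, objective: str) -> str:
    return f"{slugify(project)}-{slugify(objective)}"


def run_name(day: date, description: str, variant: str) -> str:
    return f"{day.isoformat()}-{slugify(description)}-{slugify(variant)}"


def model_name(architecture: str, dataset: str, version: str) -> str:
    # The version is kept verbatim so dots in "2.3.1" survive.
    return f"{slugify(architecture)}-{slugify(dataset)}-v{version}"


print(experiment_name("Fraud Detection", "baseline"))         # fraud-detection-baseline
print(run_name(date(2024, 1, 15), "XGBoost tuning", "lr001"))  # 2024-01-15-xgboost-tuning-lr001
print(model_name("xgboost", "transactions", "2.3.1"))          # xgboost-transactions-v2.3.1
```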
Comparison Dashboards
Track these metrics for model comparison:
- Primary metric (what you optimize)
- Secondary metrics (constraints)
- Resource usage (training time, memory)
- Inference performance (latency, throughput)
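One way to combine a primary metric with a constraint is to filter runs by the constraint first and then rank the survivors by the primary metric. The sketch below assumes validation accuracy as the primary metric and inference latency as the constraint. The `RunSummary` fields and `best_run` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunSummary:
    name: str
    val_accuracy: float   # primary metric (maximize)
    latency_ms: float     # secondary metric, treated as a constraint
    train_minutes: float  # resource usage, kept for reporting


def best_run(runs: Iterable[RunSummary], max_latency_ms: float) -> Optional[RunSummary]:
    """Return the most accurate run that meets the latency constraint, or None."""
    eligible = [r for r in runs if r.latency_ms <= max_latency_ms]
    return max(eligible, key=lambda r: r.val_accuracy, default=None)
```

With this rule, a run that is slightly more accurate but exceeds the latency budget is never selected, which keeps the comparison aligned with deployment requirements.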
Experiment Documentation
Each significant experiment should document:
- Hypothesis: What was changed and the expected outcome
- Method: What was actually done
- Results: Metrics and observations
- Conclusions: What was learned, next steps
References
- references/mlflow-setup.md: MLflow installation and configuration
- references/wandb-patterns.md: Advanced W&B features and sweeps
- references/reproducibility-checklist.md: Detailed reproducibility guide