monitoring-dashboard
Monitoring Dashboard
Purpose: Provide complete monitoring dashboard templates and setup scripts for ML training with TensorBoard and Weights & Biases (WandB).
Activation Triggers:
- Setting up training monitoring dashboards
- Tracking experiments and metrics in real-time
- Comparing multiple training runs
- Visualizing hyperparameters and results
- Integrating monitoring into existing training pipelines
- Logging custom metrics, images, and model artifacts
Key Resources:
- `scripts/setup-tensorboard.sh` - Install and configure TensorBoard
- `scripts/setup-wandb.sh` - Install and configure Weights & Biases
- `scripts/launch-monitoring.sh` - Launch monitoring dashboards
- `templates/tensorboard-config.yaml` - TensorBoard configuration template
- `templates/wandb-config.py` - WandB integration template
- `templates/logging-config.json` - Unified logging configuration
- `examples/tensorboard-integration.md` - Complete TensorBoard integration guide
- `examples/wandb-integration.md` - Complete WandB integration guide
Quick Start
1. Choose Monitoring Solution
TensorBoard (Local/Open Source):
- Free, runs locally
- Best for: Single-user development, offline work
- Features: Metrics, histograms, graphs, images, embeddings
- Storage: Local filesystem
Weights & Biases (Cloud/Collaboration):
- Free tier available, cloud-hosted
- Best for: Team collaboration, experiment comparison, production
- Features: All TensorBoard features + collaboration, alerts, reports
- Storage: Cloud-hosted; run history retention depends on your plan
Both (Recommended for Production):
- Use TensorBoard for local development
- Use WandB for team collaboration and production tracking
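One lightweight way to wire up the "both" setup is to select backends from an environment variable. This is a sketch, not part of the bundled scripts; the variable name `MONITOR_BACKENDS` is an assumption for illustration.

```python
import os

def enabled_backends(default="tensorboard"):
    """Parse a comma-separated backend list from the environment.

    MONITOR_BACKENDS is an illustrative variable name, not something
    the bundled scripts define.
    """
    raw = os.environ.get("MONITOR_BACKENDS", default)
    backends = {b.strip().lower() for b in raw.split(",") if b.strip()}
    unknown = backends - {"tensorboard", "wandb"}
    if unknown:
        raise ValueError(f"Unknown backends: {sorted(unknown)}")
    return sorted(backends)
```

With `export MONITOR_BACKENDS=tensorboard,wandb` the training script can then initialize only the loggers the user asked for.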
2. Setup TensorBoard
```bash
# Install and configure TensorBoard
./scripts/setup-tensorboard.sh

# Launch TensorBoard
./scripts/launch-monitoring.sh tensorboard --logdir ./runs
```
Access: Open browser to http://localhost:6006
3. Setup Weights & Biases
```bash
# Install and configure WandB
./scripts/setup-wandb.sh

# Login with API key
wandb login

# Launch monitoring
./scripts/launch-monitoring.sh wandb
```
Access: Dashboard at https://wandb.ai/your-username/your-project
TensorBoard Integration
Basic Setup
Template: templates/tensorboard-config.yaml
```python
from torch.utils.tensorboard import SummaryWriter
import datetime

# Create TensorBoard writer
log_dir = f"runs/experiment_{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}"
writer = SummaryWriter(log_dir=log_dir)

# Log scalar metrics
writer.add_scalar('Loss/train', train_loss, epoch)
writer.add_scalar('Loss/validation', val_loss, epoch)
writer.add_scalar('Accuracy/train', train_acc, epoch)
writer.add_scalar('Accuracy/validation', val_acc, epoch)

# Log learning rate
writer.add_scalar('Learning_Rate', optimizer.param_groups[0]['lr'], epoch)

# Close writer when done
writer.close()
```
Advanced Logging
Histograms (Weight Distributions):
```python
# Log model weight and gradient distributions; param.grad is None for
# frozen parameters or before the first backward pass, so guard it
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    if param.grad is not None:
        writer.add_histogram(f'gradients/{name}', param.grad, epoch)
```
Images:
```python
# Log sample predictions
writer.add_image('predictions', image_grid, epoch)
writer.add_images('batch_samples', image_batch, epoch)
```
Text:
```python
# Log hyperparameters as text
config_text = '\n'.join([f'{k}: {v}' for k, v in config.items()])
writer.add_text('hyperparameters', config_text, 0)
```
Model Graph:
```python
# Log model architecture
writer.add_graph(model, input_tensor)
```
Embeddings (t-SNE, PCA):
```python
# Visualize embeddings
writer.add_embedding(embeddings, metadata=labels, label_img=images)
```
Launch TensorBoard
```bash
# Basic launch
tensorboard --logdir runs

# Specify port
tensorboard --logdir runs --port 6007

# Downsample data for faster loading
tensorboard --logdir runs --samples_per_plugin scalars=1000

# Reload new data every 5 seconds
tensorboard --logdir runs --reload_interval 5
```
Weights & Biases Integration
Basic Setup
Template: templates/wandb-config.py
```python
import wandb
from datetime import datetime

# Initialize WandB run
wandb.init(
    project="my-ml-project",
    name=f"experiment-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    config={
        "learning_rate": 0.001,
        "epochs": 100,
        "batch_size": 32,
        "model": "resnet50",
        "dataset": "imagenet"
    }
)

# Log metrics
wandb.log({
    "train_loss": train_loss,
    "val_loss": val_loss,
    "train_acc": train_acc,
    "val_acc": val_acc,
    "epoch": epoch
})

# Finish run
wandb.finish()
```
Advanced Features
Log Media:
```python
# Log images
wandb.log({"predictions": [wandb.Image(img, caption=f"Pred: {pred}")]})

# Log tables
table = wandb.Table(columns=["epoch", "loss", "accuracy"], data=data)
wandb.log({"results_table": table})

# Log audio
wandb.log({"audio": wandb.Audio(audio_array, sample_rate=16000)})

# Log videos
wandb.log({"video": wandb.Video(video_path, fps=30)})
```
Track Model Artifacts:
```python
# Save model checkpoint
artifact = wandb.Artifact('model-checkpoint', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

# Load model from artifact
artifact = wandb.use_artifact('model-checkpoint:latest')
model_path = artifact.download()
```
Hyperparameter Sweeps:
```python
# Define sweep configuration
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'parameters': {
        'learning_rate': {'min': 0.0001, 'max': 0.1},
        'batch_size': {'values': [16, 32, 64]},
        'optimizer': {'values': ['adam', 'sgd', 'adamw']}
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Run sweep agent
wandb.agent(sweep_id, function=train_model, count=10)
```
Custom Charts:
```python
# Create custom plot
data = [[x, y] for (x, y) in zip(x_values, y_values)]
table = wandb.Table(data=data, columns=["x", "y"])
wandb.log({
    "custom_plot": wandb.plot.line(table, "x", "y", title="Custom Plot")
})
```
Alerts:
```python
# Alert on metric threshold
if val_loss < 0.1:
    wandb.alert(
        title="Low Validation Loss",
        text=f"Validation loss dropped to {val_loss:.4f}",
        level=wandb.AlertLevel.INFO
    )
```
Unified Logging Configuration
Template: templates/logging-config.json
Use this configuration to log to both TensorBoard and WandB simultaneously:
```python
import wandb
from torch.utils.tensorboard import SummaryWriter

class UnifiedLogger:
    def __init__(self, project_name, experiment_name, config):
        # TensorBoard
        self.tb_writer = SummaryWriter(log_dir=f"runs/{experiment_name}")
        # WandB
        wandb.init(
            project=project_name,
            name=experiment_name,
            config=config
        )

    def log_metrics(self, metrics_dict, step):
        """Log to both TensorBoard and WandB"""
        # TensorBoard
        for key, value in metrics_dict.items():
            self.tb_writer.add_scalar(key, value, step)
        # WandB
        wandb.log(metrics_dict, step=step)

    def log_images(self, images_dict, step):
        """Log images to both platforms"""
        for key, image in images_dict.items():
            # TensorBoard
            self.tb_writer.add_image(key, image, step)
            # WandB
            wandb.log({key: wandb.Image(image)}, step=step)

    def log_model(self, model, input_sample):
        """Log model architecture"""
        # TensorBoard graph
        self.tb_writer.add_graph(model, input_sample)
        # WandB watches gradients
        wandb.watch(model, log="all", log_freq=100)

    def close(self):
        """Close both loggers"""
        self.tb_writer.close()
        wandb.finish()

# Usage
logger = UnifiedLogger(
    project_name="my-project",
    experiment_name="exp-001",
    config={"lr": 0.001, "batch_size": 32}
)
logger.log_metrics({"train_loss": 0.5, "val_loss": 0.6}, step=epoch)
logger.close()
```
Common Monitoring Patterns
1. Training Loop Integration
```python
for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(model, data, target, optimizer)
        train_loss += loss.item()

        # Log batch-level metrics
        global_step = epoch * len(train_loader) + batch_idx
        logger.log_metrics({
            "batch_loss": loss.item(),
            "learning_rate": optimizer.param_groups[0]['lr']
        }, step=global_step)

    # Validation phase
    model.eval()
    val_loss, val_acc = validate(model, val_loader)

    # Log epoch-level metrics
    logger.log_metrics({
        "epoch": epoch,
        "train_loss": train_loss / len(train_loader),
        "val_loss": val_loss,
        "val_acc": val_acc
    }, step=epoch)

    # Log model weights distribution
    for name, param in model.named_parameters():
        logger.tb_writer.add_histogram(f'weights/{name}', param, epoch)
```
2. Experiment Comparison
TensorBoard:
```bash
# Compare multiple runs under named aliases
tensorboard --logdir_spec exp1:runs/experiment_1,exp2:runs/experiment_2,exp3:runs/experiment_3
```
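Typing the `--logdir_spec` value by hand gets tedious with many runs; a small helper can build it from the subdirectories of a log root. This is an illustrative sketch (the `logdir_spec` function name is an assumption, not part of the bundled scripts).

```python
from pathlib import Path

def logdir_spec(root="runs"):
    """Build a TensorBoard --logdir_spec value ("name:path,...")
    from the immediate subdirectories of `root`."""
    dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return ",".join(f"{p.name}:{p}" for p in dirs)
```

For example, with `runs/experiment_1` and `runs/experiment_2` on disk, `logdir_spec("runs")` yields `"experiment_1:runs/experiment_1,experiment_2:runs/experiment_2"`, which can be passed straight to `tensorboard --logdir_spec`.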
WandB:
WandB automatically compares all runs in the project dashboard; filter and group runs by tags, config values, or custom fields.
3. Real-Time Monitoring
TensorBoard:
```bash
# Auto-reload new data every 5 seconds
tensorboard --logdir runs --reload_interval 5
```
WandB:
```python
# WandB streams metrics in real time by default.
# Enable email/Slack alerts for key metrics:
wandb.alert(
    title="Training Alert",
    text=f"Accuracy reached {acc:.2%}",
    level=wandb.AlertLevel.INFO
)
```
Best Practices
1. Metric Naming Conventions
Organize by category:
```python
# Good: hierarchical naming groups related metrics in the UI
"Loss/train"
"Loss/validation"
"Accuracy/train"
"Accuracy/validation"
"Metrics/precision"
"Metrics/recall"

# Bad: flat naming scatters related metrics
"train_loss"
"validation_loss"
"train_accuracy"
```
2. Logging Frequency
Guidelines:
- Scalars: Every batch or every N batches
- Histograms: Every epoch
- Images: Every epoch or every N epochs
- Model graph: Once at start
- Embeddings: Once per major checkpoint
```python
# Log batch metrics every 10 batches
if batch_idx % 10 == 0:
    logger.log_metrics({"batch_loss": loss}, step)

# Log epoch metrics on the last batch
if batch_idx == len(train_loader) - 1:
    logger.log_metrics({"epoch_loss": epoch_loss}, epoch)

# Log images every 5 epochs
if epoch % 5 == 0:
    logger.log_images({"samples": sample_images}, epoch)
```
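The interval checks above can be centralized in one small predicate so every call site uses the same rule; `should_log` is an illustrative name, not part of the templates.

```python
def should_log(step, every):
    """Return True when `step` falls on the logging interval.

    Step 0 always logs so the first value is recorded.
    """
    if every <= 0:
        raise ValueError("every must be positive")
    return step % every == 0
```

Usage: `if should_log(batch_idx, 10): logger.log_metrics({"batch_loss": loss}, step)`.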
3. Disk Space Management
TensorBoard:
```bash
# Limit log retention: remove run directories older than 30 days
find runs/ -type d -mtime +30 -exec rm -rf {} +

# Compress old logs
tar -czf archive_$(date +%Y%m%d).tar.gz runs/old_experiments/
rm -rf runs/old_experiments/
```
WandB:
```python
# Cloud storage handles retention; configure it in project settings.
# Download important runs for local backup:
wandb.restore('model.pth', run_path="user/project/run_id")
```
4. Security & Privacy
TensorBoard:
```bash
# Restrict access to localhost only
tensorboard --logdir runs --host 127.0.0.1

# Or use SSH tunnel for remote access
ssh -L 6006:localhost:6006 user@remote-server
```
WandB:
```python
# Use private projects
wandb.init(project="my-project", entity="private-team")

# Disable cloud sync for sensitive data (logs locally only)
wandb.init(mode="offline")
```
Troubleshooting
TensorBoard Issues
Problem: Dashboard not updating
```bash
# Force frequent reloads
tensorboard --logdir runs --reload_interval 1

# Clear cache
rm -rf /tmp/.tensorboard-info/
```
Problem: Port already in use
```bash
# Use a different port
tensorboard --logdir runs --port 6007

# Or kill the existing process
pkill -f tensorboard
```
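Instead of guessing, a few lines of stdlib Python can find a free port before launching; `find_free_port` is an illustrative helper, not part of the bundled scripts.

```python
import socket

def find_free_port(start=6006, limit=20):
    """Return the first TCP port >= start that is free on localhost."""
    for port in range(start, start + limit):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # Port in use; try the next one
    raise RuntimeError(f"No free port in {start}-{start + limit - 1}")
```

The returned port can then be passed to `tensorboard --port`.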
WandB Issues
Problem: Login fails
```bash
# Re-login with API key
wandb login --relogin

# Or set via environment
export WANDB_API_KEY=your_api_key
```
Problem: Slow logging
```python
# Reduce logging overhead by disabling background collection
wandb.init(settings=wandb.Settings(
    _disable_stats=True,  # Disable system metrics
    _disable_meta=True    # Disable metadata
))
```
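Another way to cut overhead is to batch several metrics into one `wandb.log` call per step instead of many small ones. The accumulator below is a sketch; the `flush_fn` callback stands in for `wandb.log(metrics, step=step)`, and the class name is an assumption.

```python
class MetricBuffer:
    """Accumulate metrics for a step and flush them as one dict."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn  # e.g. lambda d, s: wandb.log(d, step=s)
        self.buffer = {}
        self.step = None

    def add(self, name, value, step):
        # Starting a new step flushes the previous one automatically
        if self.step is not None and step != self.step:
            self.flush()
        self.step = step
        self.buffer[name] = value

    def flush(self):
        if self.buffer:
            self.flush_fn(dict(self.buffer), self.step)
            self.buffer.clear()
        self.step = None
```

Each training step then produces a single logging call, which keeps per-step overhead roughly constant no matter how many metrics are tracked.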
Scripts Usage
Setup TensorBoard
```bash
./scripts/setup-tensorboard.sh
# Verifies:
# - Python environment
# - TensorBoard installation
# - Creates default log directory structure
```
Setup WandB
```bash
./scripts/setup-wandb.sh
# Verifies:
# - WandB installation
# - API key configuration
# - Creates wandb config file
```
Launch Monitoring
```bash
# TensorBoard
./scripts/launch-monitoring.sh tensorboard --logdir ./runs --port 6006

# WandB (opens browser to dashboard)
./scripts/launch-monitoring.sh wandb --project my-project

# Both
./scripts/launch-monitoring.sh both --logdir ./runs --project my-project
```
Resources
Scripts:
- `setup-tensorboard.sh` - Install and configure TensorBoard
- `setup-wandb.sh` - Install and configure WandB
- `launch-monitoring.sh` - Launch monitoring dashboards
Templates:
- `tensorboard-config.yaml` - TensorBoard setup configuration
- `wandb-config.py` - WandB integration template
- `logging-config.json` - Unified logging configuration
Examples:
- `tensorboard-integration.md` - Complete TensorBoard integration
- `wandb-integration.md` - Complete WandB integration with sweeps
Supported Frameworks: PyTorch, TensorFlow, JAX, Hugging Face Transformers
Python Version: 3.8+
Best Practice: Use both TensorBoard (local dev) and WandB (team collaboration)