# MLOps Observability
## Goal

To implement a "Glass Box" system where every result is Reproducible, every asset has Lineage, and system health is Monitored, Alerted on, and Explained.
## Prerequisites

- Language: Python
- Context: Production monitoring and debugging.
- Platform Suggestions: MLflow, SHAP, Evidently, ...
## Instructions
### 1. Guarantee Reproducibility

Consistency is key. For instance:

- Randomness: Set seeds for `random`, `numpy`, `torch`, `tensorflow`.
- Environment: Use `docker` and locked dependencies (`uv.lock`).
- Builds: Use a `justfile` with `uv build --build-constraint` for deterministic wheels.
- Code: Track the git commit hash for every run.
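The seed-setting step can be sketched as a single helper. This is a minimal sketch: each framework import is guarded so the helper degrades gracefully when that library is not installed, and `42` is just a placeholder default.

```python
import os
import random


def set_global_seed(seed: int = 42) -> None:
    """Pin every common source of randomness for a reproducible run."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:  # numpy/torch/tensorflow are optional in this sketch
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
```

Call it once at the top of every entry point (training script, notebook, evaluation job) so reruns of the same commit produce the same numbers.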
### 2. Track Data Lineage

Know the origin of your data. For instance:

- Datasets: Create MLflow Datasets with `mlflow.data.from_pandas`.
- Logging: Log inputs to the MLflow run context with `mlflow.log_input`.
- Versioning: Version data files (e.g., `data/v1.csv`) or use DVC.
- Transformations: Log preprocessing parameters mapping data versions to model versions.
### 3. Monitoring & Drift Detection

Watch for silent failures. For instance:

- Validation: Use MLflow Evaluate to gate models against quality thresholds.
- Drift: Use `evidently` to compare `reference` (training) vs. `current` (production) data.
  - Detect Data Drift (input distribution changes) and Concept Drift (relationship changes).
- System: Enable MLflow System Metrics (`log_system_metrics=True`) for CPU/GPU.
### 4. Alerting

Don't stare at dashboards. For instance:

- Local: Use `plyer` for desktop notifications during long training runs.
- Production: Use PagerDuty (critical) or Slack (warnings).
- Thresholds: Use Static (fixed value) or Dynamic (anomaly detection) rules.
- Action: Alerts must link to a dashboard or playbook.
### 5. Explainability (XAI)

Trust but verify. For instance:

- Global: Use Feature Importance (e.g., from a Random Forest) to understand overall logic.
- Local: Use SHAP values to explain individual predictions.
- Artifacts: Save explanations (plots/tables) as MLflow artifacts.
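SHAP handles the local, per-prediction side. As a dependency-free sketch of the *global* side, here is permutation importance: shuffle one feature column at a time and measure how much the model's score degrades. The toy linear model and scoring function in the test are illustrative assumptions, not a real pipeline:

```python
import random


def permutation_importance(model, X, y, score, seed: int = 0) -> list[float]:
    """Per-feature score drop when that feature's column is shuffled.

    `model` maps a feature row to a prediction; `score(model, X, y)` returns a
    higher-is-better quality metric. Larger drops mean more important features.
    """
    rng = random.Random(seed)
    baseline = score(model, X, y)
    drops = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature/target relationship for column j
        X_perm = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
        drops.append(baseline - score(model, X_perm, y))
    return drops
```

Saving the resulting drops as a table or bar chart via `mlflow.log_artifact` gives reviewers a global explanation artifact alongside the SHAP plots.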
### 6. Infrastructure & Costs

Optimize resources. For instance:

- Tags: Tag runs with `project`, `env`, `user`.
- Costs: Log `run_time` and instance type to estimate ROI.
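The cost side can be as simple as multiplying run duration by the instance's hourly rate. The rates below are made-up placeholders (look up your provider's real pricing), and with MLflow you would attach the result to the run via `mlflow.set_tag` and `mlflow.log_metric`:

```python
# Hypothetical per-hour rates; substitute your provider's real pricing.
HOURLY_RATE_USD = {"cpu.small": 0.10, "gpu.a10": 1.20}


def estimate_run_cost(run_time_s: float, instance_type: str) -> float:
    """Rough run cost: wall-clock hours times the instance's hourly rate."""
    rate = HOURLY_RATE_USD[instance_type]
    return round(run_time_s / 3600 * rate, 4)
```

Logging this per run lets you rank experiments by cost as well as accuracy when estimating ROI.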
## Self-Correction Checklist

- Seeds: Are random seeds fixed?
- Inputs: Are input datasets logged to MLflow?
- System Metrics: Is `log_system_metrics` enabled?
- Explanations: Are SHAP values generated?
- Alerts: Are thresholds defined for failures?