# MLOps Observability
## Goal

To implement a "Glass Box" system where every result is Reproducible, every asset has Lineage, and system health is Monitored, Alerted on, and Explained.
## Prerequisites

- Language: Python
- Context: Production monitoring and debugging.
- Platform Suggestions: MLflow, SHAP, Evidently, ...
## Instructions
### 1. Guarantee Reproducibility

Consistency is key. For instance:

- Randomness: Set seeds for `random`, `numpy`, `torch`, `tensorflow`.
- Environment: Use `docker` and locked dependencies (`uv.lock`).
- Builds: Use a `justfile` with `uv build --build-constraint` for deterministic wheels.
- Code: Track the git commit hash for every run.
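The seed-setting step can be sketched as a single helper. This is a minimal sketch: each framework import is guarded so the helper degrades gracefully when that library is not installed, and `42` is just a placeholder default.

```python
import os
import random


def set_global_seed(seed: int = 42) -> None:
    """Pin every common source of randomness for a reproducible run."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:  # numpy/torch/tensorflow are optional in this sketch
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
```

Call it once at the top of every entry point (training script, notebook, evaluation job) so reruns of the same commit produce the same numbers.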
### 2. Track Data Lineage

Know the origin of your data. For instance:

- Datasets: Create MLflow Datasets with `mlflow.data.from_pandas`.
- Logging: Log inputs to the MLflow run context with `mlflow.log_input`.
- Versioning: Version data files (e.g., `data/v1.csv`) or use DVC.
- Transformations: Log preprocessing parameters mapping data versions to model versions.
### 3. Monitoring & Drift Detection

Watch for silent failures. For instance:

- Validation: Use MLflow Evaluate to gate models against quality thresholds.
- Drift: Use `evidently` to compare `reference` (training) vs. `current` (production) data.
  - Detect Data Drift (input distribution changes) and Concept Drift (relationship changes).
- System: Enable MLflow System Metrics (`log_system_metrics=True`) for CPU/GPU.
### 4. Alerting

Don't stare at dashboards. For instance:

- Local: Use `plyer` for desktop notifications during long training runs.
- Production: Use PagerDuty (critical) or Slack (warnings).
- Thresholds: Use Static (fixed value) or Dynamic (anomaly detection) rules.
- Action: Alerts must link to a dashboard or playbook.
### 5. Explainability (XAI)

Trust but verify. For instance:

- Global: Use Feature Importance (e.g., from a Random Forest) to understand overall logic.
- Local: Use SHAP values to explain individual predictions.
- Artifacts: Save explanations (plots/tables) as MLflow artifacts.
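SHAP handles the local, per-prediction side. As a dependency-free sketch of the *global* side, here is permutation importance: shuffle one feature column at a time and measure how much the model's score degrades. The toy linear model and scoring function in the test are illustrative assumptions, not a real pipeline:

```python
import random


def permutation_importance(model, X, y, score, seed: int = 0) -> list[float]:
    """Per-feature score drop when that feature's column is shuffled.

    `model` maps a feature row to a prediction; `score(model, X, y)` returns a
    higher-is-better quality metric. Larger drops mean more important features.
    """
    rng = random.Random(seed)
    baseline = score(model, X, y)
    drops = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature/target relationship for column j
        X_perm = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
        drops.append(baseline - score(model, X_perm, y))
    return drops
```

Saving the resulting drops as a table or bar chart via `mlflow.log_artifact` gives reviewers a global explanation artifact alongside the SHAP plots.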
### 6. Infrastructure & Costs

Optimize resources. For instance:

- Tags: Tag runs with `project`, `env`, `user`.
- Costs: Log `run_time` and instance type to estimate ROI.
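The cost side can be as simple as multiplying run duration by the instance's hourly rate. The rates below are made-up placeholders (look up your provider's real pricing), and with MLflow you would attach the result to the run via `mlflow.set_tag` and `mlflow.log_metric`:

```python
# Hypothetical per-hour rates; substitute your provider's real pricing.
HOURLY_RATE_USD = {"cpu.small": 0.10, "gpu.a10": 1.20}


def estimate_run_cost(run_time_s: float, instance_type: str) -> float:
    """Rough run cost: wall-clock hours times the instance's hourly rate."""
    rate = HOURLY_RATE_USD[instance_type]
    return round(run_time_s / 3600 * rate, 4)
```

Logging this per run lets you rank experiments by cost as well as accuracy when estimating ROI.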
## Self-Correction Checklist

- Seeds: Are random seeds fixed?
- Inputs: Are input datasets logged to MLflow?
- System Metrics: Is `log_system_metrics` enabled?
- Explanations: Are SHAP values generated?
- Alerts: Are thresholds defined for failures?