mlops-engineer/SKILL.md
MLOps Engineer
Purpose
Provides expertise in Machine Learning Operations (MLOps), bridging data science and DevOps. Specializes in the end-to-end ML lifecycle: training pipelines, model versioning, production serving, and monitoring.
When to Use
- Building ML training and serving pipelines
- Implementing model versioning and registry
- Setting up feature stores
- Deploying models to production
- Monitoring model performance and drift
- Automating ML workflows (CI/CD for ML)
- Implementing A/B testing for models
- Managing experiment tracking
Quick Start
Invoke this skill when:
- Building ML pipelines and workflows
- Deploying models to production
- Setting up model versioning and registry
- Implementing feature stores
- Monitoring production ML systems
Do NOT invoke when:
- Model development and training → use /ml-engineer
- Data pipeline ETL → use /data-engineer
- Kubernetes infrastructure → use /kubernetes-specialist
- General CI/CD without ML → use /devops-engineer
Decision Framework
ML Lifecycle Stage?
├── Experimentation
│   └── MLflow/Weights & Biases for tracking
├── Training Pipeline
│   └── Kubeflow/Airflow/Vertex AI
├── Model Registry
│   └── MLflow Registry/Vertex AI Model Registry
├── Serving
│   ├── Batch → Spark/Dataflow
│   └── Real-time → TF Serving/Seldon/KServe
└── Monitoring
    └── Evidently/Fiddler/custom metrics
Core Workflows
1. ML Pipeline Setup
- Define pipeline stages (data prep, training, eval)
- Choose orchestrator (Kubeflow, Airflow, Vertex)
- Containerize each pipeline step
- Implement artifact storage
- Add experiment tracking
- Configure automated retraining triggers
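The stages above can be sketched as a minimal staged pipeline with content-addressed artifact storage; the `ArtifactStore` class, stage payloads, and metric values below are illustrative stand-ins, not a real orchestrator API:

```python
import hashlib
import json
from pathlib import Path

class ArtifactStore:
    """Writes each stage's output to a content-addressed path."""
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, stage: str, payload: dict) -> Path:
        blob = json.dumps(payload, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()[:12]
        path = self.root / f"{stage}-{digest}.json"
        path.write_bytes(blob)
        return path

def data_prep(store):
    data = {"rows": 1000, "features": 20}  # stand-in for real preprocessing
    return store.save("data_prep", data)

def train(store, data_path):
    params = {"lr": 0.01, "epochs": 10, "data": str(data_path)}
    return store.save("train", params)

def evaluate(store, model_path):
    metrics = {"accuracy": 0.91, "model": str(model_path)}  # placeholder metrics
    return store.save("eval", metrics)

def run_pipeline(root="artifacts"):
    # Each stage records the artifact path of the stage before it,
    # giving a traceable lineage from evaluation back to data prep.
    store = ArtifactStore(Path(root))
    data_path = data_prep(store)
    model_path = train(store, data_path)
    return evaluate(store, model_path)
```

In a real setup each stage would run in its own container and the store would point at object storage (S3/GCS) rather than the local filesystem.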
2. Model Deployment
- Register model in model registry
- Build serving container
- Deploy to serving infrastructure
- Configure autoscaling
- Implement canary/shadow deployment
- Set up monitoring and alerts
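The canary step above can be sketched as a simple traffic splitter; the predict functions and the 5% fraction are hypothetical stand-ins for real serving endpoints:

```python
import random

def stable_predict(x):
    return "stable"   # stand-in for the current production model

def canary_predict(x):
    return "canary"   # stand-in for the candidate model

def route(x, canary_fraction=0.05, rng=random.random):
    """Route one request; raise canary_fraction as confidence grows."""
    if rng() < canary_fraction:
        return canary_predict(x)
    return stable_predict(x)
```

Injecting `rng` keeps the split testable; a shadow deployment would instead call both models and discard the canary's response after logging it.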
3. Model Monitoring
- Define key metrics (latency, throughput, accuracy)
- Implement data drift detection
- Set up prediction monitoring
- Create alerting thresholds
- Build dashboards for visibility
- Automate retraining triggers
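The drift-detection step can be illustrated with the Population Stability Index (PSI), a common heuristic: PSI below 0.1 suggests a stable distribution, 0.1-0.25 a moderate shift, and above 0.25 significant drift. The bin count and thresholds here are illustrative choices:

```python
import math

def psi(reference, current, bins=10):
    """Compare the distribution of `current` against `reference`."""
    lo, hi = min(reference), max(reference)
    span = hi - lo

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # clamp each value into the reference range, then bin it
            idx = int((v - lo) / span * bins) if span > 0 else 0
            counts[max(0, min(idx, bins - 1))] += 1
        # a small floor avoids log(0) when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

In production this check would run on a schedule against recent prediction inputs, firing an alert (or a retraining trigger) when the score crosses the chosen threshold.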
Best Practices
- Version everything: code, data, models, configs
- Use feature stores for consistency between training and serving
- Implement CI/CD specifically designed for ML workflows
- Monitor data drift and model performance continuously
- Use canary deployments for model rollouts
- Keep training and serving environments consistent
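The first practice ("version everything") can be made concrete with a deterministic run fingerprint tying together code revision, data digest, and hyperparameters; the field names and hash truncation below are illustrative choices:

```python
import hashlib
import json

def run_fingerprint(code_rev: str, data_sha256: str, params: dict) -> str:
    """Deterministic ID for a training run; same inputs -> same ID."""
    record = {
        "code": code_rev,
        "data": data_sha256,
        "params": params,
    }
    # sort_keys makes the hash independent of dict insertion order
    blob = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```

Storing this fingerprint alongside each registered model makes "can't reproduce or rollback" (see the anti-patterns table) much harder to hit.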
Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|---|---|---|
| Manual deployments | Error-prone, slow | Automated ML CI/CD |
| Training-serving skew | Prediction errors | Feature stores |
| No model versioning | Can't reproduce or rollback | Model registry |
| Ignoring data drift | Silent degradation | Continuous monitoring |
| Notebook-to-production | Unmaintainable | Proper pipeline code |
Weekly Installs: 1
Repository: anton-abyzov/specweave
Installed on: windsurf (1), opencode (1), codex (1), claude-code (1), antigravity (1), gemini-cli (1)