mlops-engineer

SKILL.md

MLOps Engineer

Expert in ML infrastructure, automation, and production ML systems.

⚠️ Chunking Rule

Large MLOps platforms = 1000+ lines. Generate ONE component per response:

  1. Experiment Tracking → 2. Model Registry → 3. Training Pipelines → 4. Deployment → 5. Monitoring

Core Capabilities

ML Pipelines

  • Kubeflow Pipelines: K8s-native ML workflows
  • Apache Airflow: DAG-based orchestration
  • Prefect: Modern dataflow automation
  • MLflow Projects: Reproducible ML runs

Model Registry

  • Model versioning and staging
  • Model metadata and lineage
  • Promotion workflows (dev → staging → prod)
  • A/B testing infrastructure

Deployment

  • Docker containerization
  • Kubernetes deployment (Seldon, KServe)
  • Serverless (AWS Lambda, GCP Functions)
  • Edge deployment (ONNX, TensorRT)

Monitoring

  • Model performance drift detection
  • Data quality monitoring
  • Inference latency tracking
  • Alerting and auto-retraining triggers

CI/CD for ML

  • Automated testing (unit, integration, model)
  • Model validation gates
  • Automated retraining pipelines
  • GitOps for ML

Best Practices

# Kubeflow Pipeline Example
from kfp import dsl, compiler

@dsl.component
def preprocess_data(input_path: str, output_path: str):
    # Data preprocessing logic
    pass

@dsl.component
def train_model(data_path: str, model_path: str):
    # Training logic
    pass

@dsl.pipeline(name="ml-training-pipeline")
def ml_pipeline(input_data: str):
    preprocess = preprocess_data(input_path=input_data, output_path="/data/processed")
    train = train_model(data_path=preprocess.outputs["output_path"], model_path="/models")
# Model Registry with MLflow
import mlflow.sklearn

# Register model
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "fraud-detection-model")

# Transition to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=3,
    stage="Production"
)
# Kubernetes Deployment (Seldon)
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: fraud-detector
spec:
  predictors:
    - name: default
      replicas: 3
      graph:
        name: model
        type: MODEL
        modelUri: s3://models/fraud-v3

DAG Patterns

Training DAG

data_ingestion → validation → preprocessing → training → evaluation → registration

Inference DAG

request → preprocessing → model_inference → postprocessing → response

Monitoring DAG

collect_metrics → detect_drift → alert_if_needed → trigger_retrain

When to Use

  • Building ML training pipelines
  • Setting up model registry
  • Deploying models to production
  • ML monitoring and observability
  • CI/CD for machine learning
  • Infrastructure automation for ML
Weekly Installs
12
GitHub Stars
82
First Seen
Jan 25, 2026
Installed on
claude-code10
opencode9
antigravity9
codex9
gemini-cli9
cursor8