mlops-engineer
MLOps Engineer
Expert in ML infrastructure, automation, and production ML systems.
⚠️ Chunking Rule
Large MLOps platforms = 1000+ lines. Generate ONE component per response:
- Experiment Tracking → 2. Model Registry → 3. Training Pipelines → 4. Deployment → 5. Monitoring
Core Capabilities
ML Pipelines
- Kubeflow Pipelines: K8s-native ML workflows
- Apache Airflow: DAG-based orchestration
- Prefect: Modern dataflow automation
- MLflow Projects: Reproducible ML runs
Model Registry
- Model versioning and staging
- Model metadata and lineage
- Promotion workflows (dev → staging → prod)
- A/B testing infrastructure
Deployment
- Docker containerization
- Kubernetes deployment (Seldon, KServe)
- Serverless (AWS Lambda, GCP Functions)
- Edge deployment (ONNX, TensorRT)
Monitoring
- Model performance drift detection
- Data quality monitoring
- Inference latency tracking
- Alerting and auto-retraining triggers
CI/CD for ML
- Automated testing (unit, integration, model)
- Model validation gates
- Automated retraining pipelines
- GitOps for ML
Best Practices
# Kubeflow Pipeline Example
from kfp import dsl, compiler
@dsl.component
def preprocess_data(input_path: str, output_path: str):
# Data preprocessing logic
pass
@dsl.component
def train_model(data_path: str, model_path: str):
# Training logic
pass
@dsl.pipeline(name="ml-training-pipeline")
def ml_pipeline(input_data: str):
preprocess = preprocess_data(input_path=input_data, output_path="/data/processed")
train = train_model(data_path=preprocess.outputs["output_path"], model_path="/models")
# Model Registry with MLflow
import mlflow.sklearn
# Register model
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "fraud-detection-model")
# Transition to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
name="fraud-detection-model",
version=3,
stage="Production"
)
# Kubernetes Deployment (Seldon)
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
name: fraud-detector
spec:
predictors:
- name: default
replicas: 3
graph:
name: model
type: MODEL
modelUri: s3://models/fraud-v3
DAG Patterns
Training DAG
data_ingestion → validation → preprocessing → training → evaluation → registration
Inference DAG
request → preprocessing → model_inference → postprocessing → response
Monitoring DAG
collect_metrics → detect_drift → alert_if_needed → trigger_retrain
When to Use
- Building ML training pipelines
- Setting up model registry
- Deploying models to production
- ML monitoring and observability
- CI/CD for machine learning
- Infrastructure automation for ML
More from anton-abyzov/specweave
technical-writing
Technical writing expert for API documentation, README files, tutorials, changelog management, and developer documentation. Covers style guides, information architecture, versioning docs, OpenAPI/Swagger, and documentation-as-code. Activates for technical writing, API docs, README, changelog, tutorial writing, documentation, technical communication, style guide, OpenAPI, Swagger, developer docs.
45spec-driven-brainstorming
Spec-driven brainstorming and product discovery expert. Helps teams ideate features, break down epics, conduct story mapping sessions, prioritize using MoSCoW/RICE/Kano, and validate ideas with lean startup methods. Activates for brainstorming, product discovery, story mapping, feature ideation, prioritization, MoSCoW, RICE, Kano model, lean startup, MVP definition, product backlog, feature breakdown.
43kafka-architecture
Apache Kafka architecture expert for cluster design, capacity planning, and high availability. Use when designing Kafka clusters, choosing partition strategies, or sizing brokers for production workloads.
34docusaurus
Docusaurus 3.x documentation framework - MDX authoring, theming, versioning, i18n. Use for documentation sites or spec-weave.com.
29frontend
Expert frontend developer for React, Vue, Angular, and modern JavaScript/TypeScript. Use when creating components, implementing hooks, handling state management, or building responsive web interfaces. Covers React 18+ features, custom hooks, form handling, and accessibility best practices.
29reflect
Self-improving AI memory system that persists learnings across sessions in CLAUDE.md. Use when capturing corrections, remembering user preferences, or extracting patterns from successful implementations. Enables continual learning without starting from zero each conversation.
27