# Senior ML Engineer
Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.
## Table of Contents
- Model Deployment Workflow
- MLOps Pipeline Setup
- LLM Integration Workflow
- RAG System Implementation
- Model Monitoring
- Reference Documentation
- Tools
## Model Deployment Workflow
Deploy a trained model to production with monitoring:
- Export model to standardized format (ONNX, TorchScript, SavedModel)
- Package model with dependencies in Docker container
- Deploy to staging environment
- Run integration tests against staging
- Deploy canary (5% traffic) to production
- Monitor latency and error rates for 1 hour
- Promote to full production if metrics pass (see the promotion-gate sketch below)
- Validation: p95 latency < 100ms, error rate < 0.1%
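The promotion gate in the last two steps reduces to a threshold check; a minimal sketch, assuming the metric values come from your metrics backend (Prometheus, Datadog, or similar):

```python
# Canary promotion gate (sketch). Thresholds mirror the validation criteria above.
P95_LATENCY_MS_MAX = 100
ERROR_RATE_MAX = 0.001  # 0.1%

def should_promote(p95_latency_ms: float, error_rate: float) -> bool:
    """True when the canary's 1-hour window meets both thresholds."""
    return p95_latency_ms < P95_LATENCY_MS_MAX and error_rate < ERROR_RATE_MAX

# Example values; in practice these come from your metrics backend
print(should_promote(p95_latency_ms=82.0, error_rate=0.0004))  # True -> promote
```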
### Container Template
```dockerfile
FROM python:3.11-slim

# curl is needed by the HEALTHCHECK below; slim images do not ship it
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ model/
COPY src/ src/

HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]
```
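The `CMD` above assumes an ASGI app at `src/server.py`. A minimal sketch with the `/health` and `/predict` endpoints the container expects; the model path, joblib format, and input schema are illustrative:

```python
# src/server.py -- minimal FastAPI serving app (illustrative)
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "model/model.pkl"  # assumed path, matching the COPY in the Dockerfile
model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = joblib.load(MODEL_PATH)  # load once at startup, not per request
    yield

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    features: list[float]  # illustrative input schema

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])[0]
    return {"prediction": float(prediction)}
```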
### Serving Options
| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |
## MLOps Pipeline Setup
Establish automated training and deployment:
- Configure feature store (Feast, Tecton) for training data
- Set up experiment tracking (MLflow, Weights & Biases)
- Create training pipeline with hyperparameter logging (see the MLflow sketch after this list)
- Register model in model registry with version metadata
- Configure staging deployment triggered by registry events
- Set up A/B testing infrastructure for model comparison
- Enable drift monitoring with alerting
- Validation: New models automatically evaluated against baseline
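Steps 2-4 map to a few MLflow calls; a minimal sketch (the experiment name, model, and params are placeholders):

```python
# Experiment tracking + model registry sketch (MLflow; names are placeholders)
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

mlflow.set_experiment("churn-model")
with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)  # hyperparameter logging

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", model.score(X_val, y_val))

    # Registering under a name creates a new version in the registry;
    # registry events can then trigger the staging deployment.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```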
### Feature Store Pattern
```python
# Feast definitions (current Feast API; entity, fields, and path are illustrative)
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

user = Entity(name="user", join_keys=["user_id"])

user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="purchase_count_30d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
    ],
    online=True,
    source=FileSource(
        path="data/user_features.parquet",
        timestamp_field="event_timestamp",
    ),
)
```
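After `feast apply` and materialization, the view serves online reads; a usage sketch (the entity value is illustrative):

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml
features = store.get_online_features(
    features=["user_features:purchase_count_30d", "user_features:avg_order_value"],
    entity_rows=[{"user_id": 1001}],  # illustrative entity key
).to_dict()
```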
### Retraining Triggers
| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |
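The PSI figure in the drift row can be computed as below; a sketch that bins on reference quantiles, which is one common choice and assumes a continuous feature:

```python
import numpy as np

def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between reference and current samples, binned on reference quantiles.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 retrain-worthy."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) and division by zero in empty bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```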
## LLM Integration Workflow
Integrate LLM APIs into production applications:
- Create provider abstraction layer for vendor flexibility
- Implement retry logic with exponential backoff
- Configure fallback to secondary provider
- Set up token counting and context truncation
- Add response caching for repeated queries
- Implement cost tracking per request
- Add structured output validation with Pydantic (sketched below)
- Validation: Response parses correctly, cost within budget
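A sketch of the structured-output step; the schema is illustrative and `raw` stands in for the LLM's text response:

```python
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):  # illustrative schema
    label: str
    confidence: float

raw = '{"label": "positive", "confidence": 0.93}'  # stand-in for an LLM response

try:
    result = SentimentResult.model_validate_json(raw)  # Pydantic v2
except ValidationError as err:
    # Common recovery: re-prompt the model with the validation errors appended
    print(err)
```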
### Provider Abstraction
```python
from abc import ABC, abstractmethod

from tenacity import retry, stop_after_attempt, wait_exponential

class LLMProvider(ABC):
    """Vendor-neutral interface; implement once per provider."""

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    # Retries up to 3 times with exponential backoff between 1s and 10s
    return provider.complete(prompt)
```
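Fallback to a secondary provider can layer on the same abstraction; a sketch reusing the classes above:

```python
class FallbackProvider(LLMProvider):
    """Tries providers in order; raises only when all of them fail."""

    def __init__(self, providers: list[LLMProvider]):
        self.providers = providers

    def complete(self, prompt: str, **kwargs) -> str:
        last_error: Exception | None = None
        for provider in self.providers:
            try:
                return call_llm_with_retry(provider, prompt)
            except Exception as err:  # in practice, catch provider-specific errors
                last_error = err
        raise RuntimeError("All LLM providers failed") from last_error
```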
### Cost Management
Prices below are indicative, per 1K tokens, and change often; check each provider's pricing page.

| Provider | Input (per 1K tokens) | Output (per 1K tokens) |
|---|---|---|
| GPT-4 | $0.03 | $0.06 |
| GPT-3.5 | $0.0005 | $0.0015 |
| Claude 3 Opus | $0.015 | $0.075 |
| Claude 3 Haiku | $0.00025 | $0.00125 |
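Per-request cost tracking follows directly from the table; a sketch counting with tiktoken and the GPT-4 rates above (most APIs also return exact token usage in the response, which is preferable when available):

```python
import tiktoken

GPT4_INPUT_PER_1K = 0.03   # from the table above
GPT4_OUTPUT_PER_1K = 0.06

def estimate_cost(prompt: str, completion: str, model: str = "gpt-4") -> float:
    """Estimated USD cost of one request, counted with the model's tokenizer."""
    enc = tiktoken.encoding_for_model(model)
    input_tokens = len(enc.encode(prompt))
    output_tokens = len(enc.encode(completion))
    return (
        input_tokens / 1000 * GPT4_INPUT_PER_1K
        + output_tokens / 1000 * GPT4_OUTPUT_PER_1K
    )
```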
## RAG System Implementation
Build retrieval-augmented generation pipeline:
- Choose vector database (Pinecone, Qdrant, Weaviate)
- Select embedding model based on quality/cost tradeoff
- Implement document chunking strategy
- Create ingestion pipeline with metadata extraction
- Build retrieval with query embedding
- Add reranking for relevance improvement
- Format context and send to LLM (see the sketch after this list)
- Validation: Response references retrieved context, no hallucinations
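End to end, the retrieval steps compose roughly as below; a minimal in-memory sketch where `embed()` and `llm_complete()` are hypothetical stand-ins for a real embedding model and LLM provider, and the cosine top-k stands in for a vector database query:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    """Cosine-similarity top-k over pre-embedded chunks (vector-DB stand-in)."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Grounding prompt: instructs the model to answer only from retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )

# Usage (hypothetical helpers):
# answer = llm_complete(build_prompt(q, retrieve(embed(q), doc_vecs, chunks)))
```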
### Vector Database Selection
| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | Medium | Existing Postgres |
### Chunking Strategies
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 tokens | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |
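A sketch of the fixed strategy, token-based with overlap; tiktoken's `cl100k_base` is one reasonable encoding choice:

```python
import tiktoken

def fixed_size_chunks(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows with overlapping boundaries."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # last window reached the end of the text
    return chunks
```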
## Model Monitoring
Monitor production models for drift and degradation:
- Set up latency tracking (p50, p95, p99)
- Configure error rate alerting
- Implement input data drift detection
- Track prediction distribution shifts
- Log ground truth when available
- Compare model versions with A/B metrics
- Set up automated retraining triggers
- Validation: Alerts fire before user-visible degradation
### Drift Detection
```python
from scipy.stats import ks_2samp

def detect_drift(reference, current, threshold: float = 0.05):
    """Two-sample Kolmogorov-Smirnov test comparing the training-time
    (reference) and live (current) distributions of one continuous feature."""
    statistic, p_value = ks_2samp(reference, current)
    return {
        # Low p-value => reject "same distribution" => flag drift
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value,
    }
```
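Usage on synthetic data, where a mean shift should be flagged; note that with large samples the KS test flags even tiny shifts, so pair it with an effect-size measure such as PSI (sketched under Retraining Triggers above):

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)  # training-time feature values
current = rng.normal(0.4, 1.0, size=5000)    # live values with a mean shift
print(detect_drift(reference, current))       # expect drift_detected: True
```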
### Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| p95 latency | > 100ms | > 200ms |
| Error rate | > 0.1% | > 1% |
| PSI (drift) | > 0.1 | > 0.2 |
| Accuracy drop | > 2% | > 5% |
## Reference Documentation
### MLOps Production Patterns
`references/mlops_production_patterns.md` contains:
- Model deployment pipeline with Kubernetes manifests
- Feature store architecture with Feast examples
- Model monitoring with drift detection code
- A/B testing infrastructure with traffic splitting
- Automated retraining pipeline with MLflow
### LLM Integration Guide
`references/llm_integration_guide.md` contains:
- Provider abstraction layer pattern
- Retry and fallback strategies with tenacity
- Prompt engineering templates (few-shot, CoT)
- Token optimization with tiktoken
- Cost calculation and tracking
### RAG System Architecture
`references/rag_system_architecture.md` contains:
- RAG pipeline implementation with code
- Vector database comparison and integration
- Chunking strategies (fixed, semantic, recursive)
- Embedding model selection guide
- Hybrid search and reranking patterns
## Tools
### Model Deployment Pipeline
```bash
python scripts/model_deployment_pipeline.py --model model.pkl --target staging
```
Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.
### RAG System Builder
```bash
python scripts/rag_system_builder.py --config rag_config.yaml --analyze
```
Scaffolds RAG pipeline with vector store integration and retrieval logic.
### ML Monitoring Suite
```bash
python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy
```
Sets up drift detection, alerting, and performance dashboards.
### Tech Stack
| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| MLOps | MLflow, Weights & Biases, Kubeflow |
| Data | Spark, Airflow, dbt, Kafka |
| Deployment | Docker, Kubernetes, Triton |
| Databases | PostgreSQL, BigQuery, Pinecone, Redis |