# ML Engineering Guide

Production-grade ML/AI systems, MLOps, and model deployment.
## When to Use
- Deploying ML models to production
- Building ML platforms and infrastructure
- Implementing MLOps pipelines
- Integrating LLMs into production systems
- Setting up model monitoring and drift detection
## Tech Stack
| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| Data Tools | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS/GCP/Azure |
| Monitoring | MLflow, Weights & Biases, Prometheus |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |
## Production Patterns

### Model Deployment Pipeline
```python
# Model serving with FastAPI
from fastapi import FastAPI
import torch

app = FastAPI()
model = torch.load("model.pth")
model.eval()  # disable dropout/batch-norm training behavior for inference

@app.post("/predict")
async def predict(data: dict):
    tensor = preprocess(data)  # preprocess: your feature-to-tensor transform
    with torch.no_grad():
        prediction = model(tensor)
    return {"prediction": prediction.tolist()}
```
### Feature Store Integration
```python
# Feast online feature lookup
from feast import FeatureStore

store = FeatureStore(repo_path=".")
features = store.get_online_features(
    features=["user_features:age", "user_features:location"],
    entity_rows=[{"user_id": 123}],
).to_dict()
```
### Model Monitoring
```python
# Drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=curr_df)
```
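To make the idea behind drift detection concrete, here is a hand-rolled Population Stability Index (PSI) for a single numeric feature, a stdlib-only sketch rather than Evidently's method; the 10-bucket binning and the 0.2 "significant drift" threshold are common rules of thumb, not library defaults.

```python
# Minimal Population Stability Index (PSI) sketch for one numeric feature.
# Illustrative only; binning and thresholds are conventions, not library defaults.
import math

def psi(reference, current, buckets=10):
    """PSI between two samples, using quantile bins from the reference."""
    ref_sorted = sorted(reference)
    # Quantile cut points taken from the reference distribution
    cuts = [ref_sorted[int(len(ref_sorted) * i / buckets)] for i in range(1, buckets)]

    def bucket_fracs(values):
        counts = [0] * buckets
        for v in values:
            idx = sum(v > c for c in cuts)  # index of the bin v falls into
            counts[idx] += 1
        # Smooth zero counts so the log term stays finite
        return [max(c, 1) / len(values) for c in counts]

    ref_f, cur_f = bucket_fracs(reference), bucket_fracs(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_f, cur_f))

reference = [i / 100 for i in range(1000)]
drifted = [v + 5.0 for v in reference]  # large shift -> large PSI
print(psi(reference, reference), psi(reference, drifted))
```

Identical samples score near 0; a shifted distribution scores well above the usual 0.2 alert threshold. In production you would compute this per feature over a sliding window and page when it trips.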
## MLOps Best Practices

### Development
- Test-driven development for ML pipelines
- Version control models and data
- Reproducible experiments with MLflow
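Versioning data alongside code can start as simple content-addressing: the stdlib-only sketch below fingerprints a dataset file so a training run can record exactly which bytes it saw (tools like DVC do this at scale; the `train.csv` filename is illustrative).

```python
# Content-hash a dataset so experiment runs can pin the exact data version.
# Stdlib-only sketch; DVC or lakeFS provide this at scale.
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's bytes, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()[:12]  # short id, usable as an MLflow tag

# Example: write a tiny file and fingerprint it.
Path("train.csv").write_text("user_id,age\n123,42\n")
print(dataset_fingerprint("train.csv"))
```

Logging this fingerprint as a run tag makes "which data trained this model?" answerable months later.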
### Production
- A/B testing infrastructure
- Canary deployments for models
- Automated retraining pipelines
- Model monitoring and drift detection
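The canary idea above can be sketched as a deterministic traffic splitter; the 5% split and the `route` helper are illustrative, not a framework API.

```python
# Canary routing sketch: send a fixed fraction of requests to the candidate
# model. Hashing the request id keeps assignment sticky across retries,
# unlike random sampling.
import hashlib

def route(request_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically assign a request to 'canary' or 'stable'."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_fraction * 100 else "stable"

counts = {"stable": 0, "canary": 0}
for i in range(10_000):
    counts[route(f"req-{i}")] += 1
print(counts)  # roughly a 95/5 split
```

Promotion then becomes a config change: raise `canary_fraction` as the candidate's metrics hold, or drop it to 0 to roll back.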
## Performance Targets
| Metric | Target |
|---|---|
| P50 Latency | < 50ms |
| P95 Latency | < 100ms |
| P99 Latency | < 200ms |
| Throughput | > 1000 RPS |
| Availability | 99.9% |
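Checking these targets against recorded timings needs nothing beyond the stdlib; the sample latencies below are made up for illustration.

```python
# Computing latency percentiles from a window of request timings (stdlib only).
import statistics

latencies_ms = [12, 18, 25, 30, 35, 42, 55, 61, 78, 95, 110, 140]  # sample window

def percentile(samples, p):
    """p-th percentile via statistics.quantiles (inclusive interpolation)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[p - 1]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)
```

In a real service these come from a histogram in Prometheus rather than raw samples, but the interpretation is the same: compare each percentile against its SLO and alert on breach.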
## LLM Integration Patterns

### RAG System
```python
# Basic RAG with LangChain
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

vectorstore = Pinecone.from_existing_index(
    index_name="docs",
    embedding=OpenAIEmbeddings(),
)
qa = RetrievalQA.from_chain_type(
    llm=llm,  # any LangChain-compatible LLM, e.g. ChatOpenAI()
    retriever=vectorstore.as_retriever(),
)
```
### Prompt Management
```python
# Structured prompts with DSPy
import dspy

class QA(dspy.Signature):
    """Answer questions based on context."""
    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

qa = dspy.Predict(QA)
```
## Common Commands
```shell
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
mlflow run . -P epochs=10

# Deployment
docker build -t model:v1 .
kubectl apply -f k8s/model-serving.yaml

# Monitoring
mlflow ui --port 5000
```
## Security & Compliance
- Authentication for model endpoints
- Data encryption (at rest & in transit)
- PII handling and anonymization
- GDPR/CCPA compliance
- Model access audit logging
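PII pseudonymization from the list above can be sketched with a keyed hash, so records stay joinable without exposing raw identifiers; the key value and field names are illustrative, and a real deployment also needs a key-rotation policy.

```python
# PII pseudonymization sketch: replace identifiers with keyed hashes so
# records remain joinable without exposing the raw value.
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative; store in a secrets manager, not in code

def pseudonymize(value: str) -> str:
    """HMAC-SHA256 so the mapping can't be rebuilt without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"email": "a@example.com", "age": 42}
safe = {**record, "email": pseudonymize(record["email"])}
print(safe)
```

A plain unkeyed hash would be reversible by brute force over known emails; the HMAC key is what makes the mapping one-way for anyone without it.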
Repository: eyadsibai/ltk (first seen Jan 28, 2026)