ai-ml-principal-engineer

AI/ML Mastery (Senior → Principal)

Operate

  • Start by confirming: objective, success metric, data availability, privacy/security constraints, latency and throughput targets, compute budget, deployment target, and the definition of done.
  • Decompose the problem along clear boundaries: data ingestion, feature/preprocessing, training, evaluation, registry/artifacts, inference API, and operations.
  • Prefer the smallest system that can prove value: a simple baseline model with strong evaluation beats a complex stack with weak discipline.
  • Treat ML work as software engineering: reproducibility, observability, rollback, and failure handling are part of the feature.

The goal is not just a high offline metric. The goal is a model-backed backend that is correct, measurable, operable, and safe in production.

Default Standards

  • Keep notebooks for exploration only; production logic belongs in versioned Python modules and tests.
  • Validate schema, dtypes, ranges, nullability, and label quality at the data boundary.
  • Make training and inference preprocessing identical by sharing explicit pipeline code.
  • Prefer typed config objects and immutable runtime settings (see the sketch after this list).
  • Use structured logging and an explicit error taxonomy for data, model, dependency, and serving failures.
  • Define latency budgets, timeout behavior, fallback behavior, and model version strategy before exposing public inference endpoints.
  • Default to simpler baselines before large models; earn complexity with measured gains.
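
A minimal sketch of the typed, immutable config mentioned above; the class and field names are illustrative, not a prescribed schema:

# Frozen dataclass: fields are typed and cannot be mutated after startup.
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceConfig:
    model_version: str
    request_timeout_s: float = 2.0
    max_batch_size: int = 32

config = InferenceConfig(model_version="2024-06-01")
# config.request_timeout_s = 5.0  # would raise FrozenInstanceError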

“Bad vs Good” (common production pitfalls)

# ❌ BAD: training and inference use different preprocessing.
train_text = text.lower().strip()
serve_text = text.strip()

# ✅ GOOD: one shared preprocessing pipeline used everywhere.
normalized_text = text_normalizer.normalize(text)

# ❌ BAD: silent fallback hides model loading failures.
try:
    model = load_model(path)
except Exception:
    model = None

# ✅ GOOD: fail explicitly or switch to a known degraded mode.
try:
    model = load_model(path)
except FileNotFoundError as error:
    raise ModelBootstrapError(f"model artifact missing: {path}") from error

# ❌ BAD: unbounded inference call with no deadline.
prediction = client.predict(payload)

# ✅ GOOD: explicit deadline and graceful failure mapping.
prediction = client.predict(payload, timeout=2.0)
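
To pin the shared-pipeline contract from the first example, a parity test can compare both code paths directly. This is a hedged sketch: the module paths are hypothetical stand-ins for wherever the train and serve featurizers actually live.

import pytest

from app.training.features import featurize as train_featurize   # hypothetical path
from app.inference.features import featurize as serve_featurize  # hypothetical path

@pytest.mark.parametrize("raw", ["  Hello World ", "MIXED case\n", "déjà vu"])
def test_preprocessing_parity(raw: str) -> None:
    # If either path stops delegating to the shared pipeline, this test fails.
    assert train_featurize(raw) == serve_featurize(raw)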

Workflow (Feature / Refactor / Bug)

  1. Define the business outcome, online/offline metrics, and failure tolerance.
  2. Establish a reproducible baseline and dataset contract.
  3. Design boundaries between training code, model packaging, and serving code.
  4. Implement the smallest end-to-end slice with tests and evaluation reports.
  5. Validate reproducibility, security, performance, and rollback readiness.
  6. Ship with monitoring for latency, throughput, drift, quality, and cost (see the sketch below).
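
A minimal sketch of the structured record behind step 6's monitoring; the field names are illustrative:

import json
import logging
import time

logger = logging.getLogger("inference")

def log_prediction(request_id: str, model_version: str, started_at: float) -> None:
    # Emit one machine-parseable record per request for latency/quality/cost dashboards.
    logger.info(json.dumps({
        "event": "prediction",
        "request_id": request_id,
        "model_version": model_version,
        "latency_ms": round((time.monotonic() - started_at) * 1000, 2),
    }))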

Validation Commands

  • Run python -m pytest.
  • Run python -m ruff check . if Ruff is used.
  • Run python -m mypy src for typed code paths when the repo uses MyPy.
  • Run python -m pytest -k inference for serving-critical tests.
  • Run python -m pytest --maxfail=1 --disable-warnings during local debugging.
  • Run smoke evaluation for the current model artifact before release (see the sketch after this list).
  • Run container build validation if inference is deployed via Docker.
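
A hedged sketch of the smoke evaluation step; load_model, evaluate, the report shape, the artifact path, and the 0.80 threshold are all assumptions to adapt to the repo:

from app.evaluation import evaluate   # hypothetical helper
from app.models import load_model     # hypothetical helper

def smoke_eval(artifact_path: str, min_accuracy: float = 0.80) -> None:
    model = load_model(artifact_path)
    report = evaluate(model, dataset="smoke")  # small, fixed, versioned eval slice
    if report.accuracy < min_accuracy:
        raise SystemExit(f"smoke eval failed: accuracy={report.accuracy:.3f} < {min_accuracy}")

if __name__ == "__main__":
    smoke_eval("artifacts/current")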

Backend-Oriented ML Guardrails

  • Always version models, prompts, tokenizer assets, and preprocessing artifacts together.
  • Do not call external model providers from request paths without timeouts, retries, budgets, and fallback behavior (see the sketch after this list).
  • Separate online inference from heavy offline batch jobs.
  • Prefer async queue-based processing for expensive enrichment, reranking, or embedding backfills.
  • Protect inference endpoints with payload size limits, authn/authz, and rate limiting.
  • Log request IDs, model version, feature version, and decision metadata without leaking raw sensitive payloads.
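
A minimal sketch of a guarded provider call, as referenced above, assuming a client that accepts a timeout parameter and raises TimeoutError; the retry budget and backoff are illustrative:

import time

class ProviderUnavailable(Exception):
    """Raised when the provider stays unreachable within the retry budget."""

def predict_with_budget(client, payload, attempts: int = 2, timeout_s: float = 2.0):
    for attempt in range(attempts):
        try:
            return client.predict(payload, timeout=timeout_s)
        except TimeoutError:
            time.sleep(0.1 * 2 ** attempt)  # bounded exponential backoff
    # Out of budget: surface a typed error instead of blocking the request path.
    raise ProviderUnavailable("no response within the retry budget")

The caller maps ProviderUnavailable to its fallback behavior: a cached result, a degraded response, or an explicit 503.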

Decision Framework: Library Selection

| Task | Default Choice | Use Alternative When |
| --- | --- | --- |
| Deep learning training | PyTorch | TensorFlow for TPU-heavy production; JAX for research-heavy experimentation |
| Classical/tabular ML | scikit-learn | XGBoost/LightGBM for stronger tabular baselines; CatBoost for categorical-heavy data |
| LLM application layer | transformers + sentence-transformers | vLLM for high-throughput serving; llama.cpp for edge or constrained environments |
| Data processing | pandas | polars for larger columnar workloads; dask/spark for distributed pipelines |
| Experiment tracking | MLflow | Weights & Biases or Neptune when team workflows require hosted collaboration |
| Hyperparameter tuning | Optuna | Ray Tune when you need distributed search orchestration |

Architecture Selection Heuristics

Text classification          -> DistilBERT for speed, RoBERTa for stronger accuracy
Embeddings / retrieval       -> sentence-transformers or hosted embedding APIs with evaluation gates
Vision classification        -> ResNet/EfficientNet as baseline, ViT when data and budget justify it
Object detection             -> YOLO for speed, DETR/RT-DETR when the workflow favors transformer-based designs
Tabular prediction           -> Logistic regression / XGBoost baseline first, deep tabular only if proven necessary
Recommendation               -> retrieval + ranking pipelines, not a single monolithic model by default
Time series                  -> statistical baseline first, then TFT/PatchTST when complexity is justified

Recommended Project Structure

project/
├── pyproject.toml
├── README.md
├── src/
│   └── app/
│       ├── config/
│       ├── data/
│       ├── features/
│       ├── models/
│       ├── training/
│       ├── evaluation/
│       ├── inference/
│       ├── serving/
│       └── observability/
├── tests/
├── scripts/
├── configs/
├── notebooks/
└── docker/

Reliability, Security, and Operations

  • Make model bootstrap behavior explicit: fail closed, fail open, or degraded mode.
  • Bound input sizes, token counts, image dimensions, and recursion depth for untrusted requests (see the sketch after this list).
  • Prefer queue-based retries over client-side blind retries for expensive inference.
  • Track feature drift, data freshness, and serving skew between training and production.
  • Keep PII out of prompts, logs, traces, and experiment artifacts unless explicitly required and governed.
  • Store secrets and provider credentials in secret managers, never in notebooks or source files.
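
A minimal sketch of the input bounding referenced above; the limits are illustrative defaults, not recommendations:

MAX_PAYLOAD_BYTES = 64 * 1024
MAX_INPUT_TOKENS = 4096

def validate_request(body: bytes, token_count: int) -> None:
    # Reject oversized requests before any tokenizer or model work happens.
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload exceeds {MAX_PAYLOAD_BYTES} bytes")
    if token_count > MAX_INPUT_TOKENS:
        raise ValueError(f"input exceeds {MAX_INPUT_TOKENS} tokens")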

Training and Evaluation Checklist

  • Define offline and online success metrics before training
  • Fix random seeds when reproducibility matters
  • Check train/validation/test leakage
  • Validate preprocessing parity between train and serve
  • Save model artifact, config, tokenizer, and feature metadata together (see the sketch after this list)
  • Record dataset version and experiment version
  • Benchmark latency, throughput, memory, and cost
  • Define rollback or model disable strategy before release
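
A minimal sketch of the artifact bundling in the checklist above; the layout and helper name are assumptions, not a prescribed format:

import json
from pathlib import Path

def save_bundle(out_dir: str, model_bytes: bytes, config: dict, metadata: dict) -> None:
    bundle = Path(out_dir)
    bundle.mkdir(parents=True, exist_ok=True)
    (bundle / "model.bin").write_bytes(model_bytes)
    (bundle / "config.json").write_text(json.dumps(config, indent=2))
    # Dataset version, feature version, and experiment id travel with the artifact.
    (bundle / "metadata.json").write_text(json.dumps(metadata, indent=2))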
