AI/ML Expert

Core Framework Guidelines

PyTorch

When reviewing or writing PyTorch code, apply these guidelines:

  • Use torch.nn.Module for all model definitions; avoid raw function-based models
  • Move tensors and models to the correct device explicitly: model.to(device), tensor.to(device)
  • Use model.train() and model.eval() context switches appropriately
  • Call optimizer.zero_grad() at the top of each training step; gradients otherwise accumulate across iterations
  • Use torch.no_grad() or @torch.inference_mode() for all inference code
  • Pin memory (pin_memory=True) and use multiple workers in DataLoader for GPU training
  • Use torch.compile() (PyTorch 2.x) for production inference speedups
  • Prefer F.cross_entropy over manual log_softmax + NLLLoss; the fused version is numerically stable
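
A minimal NumPy sketch of why the fused loss is preferred: it applies the log-sum-exp trick internally, so extreme logits never overflow (function name here is illustrative, not a PyTorch API):

```python
import numpy as np

def stable_cross_entropy(logits, label):
    """Cross-entropy via the log-sum-exp trick, which is what the fused
    F.cross_entropy applies internally."""
    shifted = logits - logits.max()          # subtract the max so exp() cannot overflow
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

# Extreme logits overflow a naive softmax but are handled here
logits = np.array([1000.0, 0.0, -1000.0])
loss = stable_cross_entropy(logits, label=0)
```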

TensorFlow / Keras

When reviewing or writing TensorFlow code, apply these guidelines:

  • Use the Keras functional API or subclassing API; avoid Sequential for complex models
  • Prefer tf.data.Dataset pipelines over manual batching for scalability
  • Use tf.function for graph execution on performance-critical paths
  • Apply mixed precision training: tf.keras.mixed_precision.set_global_policy('mixed_float16')
  • Use tf.saved_model for portable model export; avoid pickling

Hugging Face Transformers

When reviewing or writing Hugging Face code, apply these guidelines:

  • Always use the tokenizer associated with the model checkpoint
  • Set padding=True and truncation=True when tokenizing batches
  • Use AutoModel, AutoTokenizer, and AutoConfig for checkpoint portability
  • Apply model.gradient_checkpointing_enable() to reduce memory for large models
  • Use Trainer API for standard fine-tuning; use custom loops only when Trainer is insufficient
  • Cache models with TRANSFORMERS_CACHE environment variable in CI/CD pipelines

scikit-learn

When reviewing or writing scikit-learn code, apply these guidelines:

  • Use Pipeline to chain preprocessing and model steps; this prevents data leakage during cross-validation
  • Use StratifiedKFold for classification tasks with class imbalance
  • Prefer GridSearchCV or RandomizedSearchCV for hyperparameter tuning
  • Always call .fit() only on training data; transform test data with the fitted transformer
  • Serialize models with joblib.dump / joblib.load (faster than pickle for large arrays)
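
The first and fourth bullets combine into one leak-free workflow; a runnable sketch on a synthetic dataset:

```python
# Scaling lives inside the Pipeline, so cross-validation and grid search
# refit the scaler on each training fold only -- no leakage into held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)            # scaler is fit on training data only
accuracy = pipe.score(X_test, y_test)
```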

LLM Integration Patterns

Prompt Engineering

  • Structure prompts with a clear system message, context, and user instruction
  • Use few-shot examples in the system prompt for consistent output formatting
  • Apply chain-of-thought prompting ("Think step by step...") for complex reasoning tasks
  • Set temperature=0 for deterministic, fact-based outputs; increase for creative tasks
  • Manage token budgets explicitly: estimate prompt tokens before sending
  • Implement output parsing with structured formats (JSON mode, XML tags)
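
Token budgets can be estimated before sending; a sketch using the rough ~4-characters-per-token heuristic (helper names are illustrative; use the provider's real tokenizer, e.g. tiktoken, for exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_budget(system: str, context: str, user: str,
                max_context: int = 8192, reserve_for_output: int = 1024) -> bool:
    """Check that the prompt leaves room for the completion."""
    prompt_tokens = sum(estimate_tokens(t) for t in (system, context, user))
    return prompt_tokens + reserve_for_output <= max_context

fits = fits_budget("You are a helpful assistant.", "context " * 100, "Summarize.")
```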

RAG Pipelines

# Standard RAG pipeline components (classic import paths; newer LangChain
# releases move these into langchain_community and langchain_huggingface)
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS  # or Chroma, Pinecone, Weaviate
from langchain.chains import RetrievalQA

# 1. Embed and index documents
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 3. Generate with retrieved context
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

RAG best practices:

  • Chunk documents at natural boundaries (paragraphs, sections), not fixed character counts
  • Use hybrid retrieval: combine dense embeddings with sparse BM25 for better recall
  • Implement semantic caching for repeated queries to reduce latency and cost
  • Validate retrieved context relevance before passing to the LLM
  • Store metadata alongside embeddings for filtering (date, source, author)
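
The first bullet can be sketched directly: split on blank lines (paragraph boundaries) and pack paragraphs into chunks up to a size cap, rather than cutting at fixed offsets (function name illustrative):

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split text at paragraph boundaries, then pack paragraphs into chunks
    no larger than max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # flush before the chunk would overflow
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 900
chunks = chunk_by_paragraph(doc, max_chars=100)
```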

LangChain / LangGraph

  • Use LCEL (LangChain Expression Language) for composable chains
  • Apply RunnableParallel for concurrent retrieval steps
  • Use LangGraph for stateful multi-agent workflows with cycles
  • Implement retry logic with RunnableRetry for unreliable external calls
  • Trace and evaluate chains with LangSmith in development
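
Outside LangChain the retry pattern is a few lines of plain Python; an illustrative sketch of the RunnableRetry-style backoff (function names are my own):

```python
import time
import random

def with_retry(fn, max_attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Exponential backoff with +-25% jitter: 0.5s, 1s, 2s, ...
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay * random.uniform(0.75, 1.25))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retry(flaky, max_attempts=5, base_delay=0.01)
```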

Training Loop Standards

# Standard PyTorch training loop with best practices
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()

    # Validation loop
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_dataloader:
            inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
            val_loss += criterion(model(inputs), labels).item()
    val_loss /= len(val_dataloader)

Key standards:

  • Proper train/validation/test splits: 80/10/10 or stratified for imbalanced datasets
  • Gradient clipping (max_norm=1.0) for stability in Transformer training
  • Learning rate scheduling: cosine annealing with warmup for Transformers
  • Early stopping based on validation loss, not training loss
  • Checkpoint the best model by validation metric, not the final epoch
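
Early stopping on validation loss reduces to a small tracker (an illustrative class, not a library API):

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: also checkpoint the model here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [1.0, 0.8, 0.79, 0.81, 0.82]   # validation loss plateaus after epoch 2
stopped_at = next(i for i, v in enumerate(history) if stopper.step(v))
```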

Fine-Tuning Standards

Full Fine-Tuning

  • Reduce learning rate 10-100x compared to training from scratch
  • Freeze early layers; fine-tune upper layers and task head first
  • Use discriminative learning rates: smaller LRs for pretrained lower layers, larger LRs for newly initialized layers
  • Apply label smoothing (smoothing=0.1) to reduce overconfidence
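
Discriminative learning rates come down to building per-layer parameter groups; a framework-free sketch of the list-of-dicts format that torch.optim optimizers accept (function name, LRs, and decay factor are illustrative):

```python
def build_param_groups(named_layers, base_lr=1e-5, head_lr=1e-4, decay=0.9):
    """Layer-wise LR decay: earlier (more general) layers get smaller LRs,
    the freshly initialized head gets the largest."""
    groups = []
    n = len(named_layers)
    for depth, (name, params) in enumerate(named_layers):
        if name == "head":
            lr = head_lr
        else:
            # Deepest pretrained layer gets ~base_lr; earlier layers decay further
            lr = base_lr * decay ** (n - 1 - depth)
        groups.append({"params": params, "lr": lr, "name": name})
    return groups

layers = [("embeddings", []), ("encoder.0", []), ("encoder.1", []), ("head", [])]
groups = build_param_groups(layers)
```

The resulting list can be passed directly as the first argument to a torch.optim optimizer.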

Parameter-Efficient Fine-Tuning (PEFT)

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # LoRA rank
    lora_alpha=32,      # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # verify < 1% parameters trainable

PEFT guidelines:

  • Use LoRA rank r=8 to r=64; higher rank = more capacity, more memory
  • QLoRA (4-bit quantization + LoRA) for fine-tuning 7B+ models on consumer GPUs
  • Merge adapter weights before serving to eliminate inference overhead
  • Prefer adapter-based methods over full fine-tuning for limited data (< 10K examples)
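
The LoRA update itself is simple linear algebra; a NumPy sketch (not the peft internals) showing the low-rank delta and why zero-initializing B makes the adapted model start out identical to the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                             # hidden size, LoRA rank
W = rng.normal(size=(d, d))                # frozen pretrained weight

# LoRA learns a low-rank update dW = (alpha / r) * B @ A
A = rng.normal(scale=0.01, size=(r, d))    # trainable, small random init
B = np.zeros((d, r))                       # trainable, zero init => dW starts at 0
alpha = 32

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d))
y = lora_forward(x)                        # identical to the base model until B is trained

# Trainable fraction: 2*d*r adapter params vs d*d frozen params
trainable_fraction = (2 * d * r) / (d * d)
```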

MLOps and Experiment Tracking

MLflow

import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": lr, "batch_size": bs, "epochs": epochs})
    mlflow.log_metrics({"train_loss": loss, "val_accuracy": acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")

Weights & Biases

import wandb

wandb.init(project="my-project", config={"lr": 1e-4, "epochs": 10})
wandb.log({"train_loss": loss, "val_f1": f1_score})
wandb.finish()

MLOps standards:

  • Log every hyperparameter and dataset version before training starts
  • Track system metrics (GPU utilization, memory, throughput) alongside model metrics
  • Version datasets with DVC or Delta Lake; never overwrite raw data
  • Use reproducible seeds: torch.manual_seed(42), np.random.seed(42), random.seed(42)
  • Register production models in a model registry with stage gates (Staging → Production)

Model Evaluation Standards

Metrics by Task Type

| Task | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Binary Classification | AUC-ROC, F1, Precision/Recall | Calibration (Brier Score) |
| Multi-class | Macro F1, Weighted F1, Cohen's Kappa | Confusion Matrix |
| Regression | RMSE, MAE, R² | Residual Analysis |
| NLP Generation | BLEU, ROUGE, BERTScore | Human Evaluation |
| Ranking/Retrieval | NDCG@k, MRR, MAP | Hit Rate@k |
| LLM Evaluation | LLM-as-judge, exact match, pass@k | Hallucination Rate |

Evaluation Best Practices

  • Never tune hyperparameters on the test set; use a held-out validation set
  • Report confidence intervals (bootstrap or cross-validation) for all metrics
  • Disaggregate metrics by subgroup for fairness analysis
  • Use statistical significance tests (McNemar, paired t-test) when comparing models
  • Establish a simple baseline before reporting model results
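
Confidence intervals need no extra libraries; a percentile-bootstrap sketch over per-example correctness (function name illustrative):

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for a metric."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Per-example correctness (1 = correct) on a 100-example test set, 85% accuracy
correct = [1] * 85 + [0] * 15
lo, hi = bootstrap_ci(correct)
```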

Production ML Systems

Model Deployment

  • Export to ONNX for cross-platform inference: torch.onnx.export(model, ...)
  • Use TorchServe, Triton Inference Server, or BentoML for serving
  • Apply quantization for CPU deployment: torch.quantization.quantize_dynamic(model, ...)
  • Set up batching with a maximum batch size and timeout for throughput vs latency tradeoffs
  • Use model warming (pre-load and dummy inference) to eliminate cold-start latency
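
The batching bullet reduces to a collect-until-size-or-deadline loop; a simplified single-threaded sketch (production servers such as Triton do this across request threads):

```python
import time
from queue import Queue, Empty

def collect_batch(q: Queue, max_batch: int = 8, timeout_s: float = 0.01):
    """Pull up to max_batch requests, waiting at most timeout_s in total:
    bounded latency when traffic is light, bigger batches under load."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except Empty:
            break
    return batch

q = Queue()
for i in range(20):            # 20 requests already queued
    q.put(i)
batch = collect_batch(q, max_batch=8)
```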

Monitoring and Drift Detection

# Example: data drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")

Monitoring standards:

  • Track feature distribution drift (KS test, PSI) on a daily schedule
  • Alert on prediction distribution shift (concept drift)
  • Log and sample model inputs/outputs for downstream evaluation
  • Implement shadow mode (run new model alongside production, compare outputs)
  • Define retraining triggers based on drift thresholds, not fixed schedules
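
The PSI from the first bullet is straightforward to compute directly; an illustrative implementation with equal-width bins and the commonly cited interpretation thresholds:

```python
import math

def psi(reference, current, n_bins=10):
    """Population Stability Index between a reference and a current sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo, hi = min(reference), max(reference)

    def frac(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / (hi - lo) * n_bins) if hi > lo else 0
            counts[max(0, min(idx, n_bins - 1))] += 1   # clamp out-of-range values
        # Smooth with a tiny constant so empty bins don't break the log
        return [(c + 1e-6) / (len(sample) + n_bins * 1e-6) for c in counts]

    ref, cur = frac(reference), frac(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

identical = [i / 100 for i in range(100)]
psi_same = psi(identical, identical)
psi_shifted = psi(identical, [x + 0.5 for x in identical])
```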

Data Preprocessing Standards

# Proper train/test split to avoid leakage
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify for classification
)

# Fit scaler ONLY on training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit_transform

Standards:

  • Separate preprocessing pipeline per data modality (text, image, tabular)
  • Validate schema and types before entering the pipeline
  • Handle missing values with domain-aware strategies (median, mode, forward-fill)
  • Detect and document outliers; do not silently remove them
  • Apply augmentation only to training data, never validation or test data

Iron Laws

  1. ALWAYS fix random seeds and log all hyperparameters before training — non-reproducible experiments cannot be shared, audited, or debugged; use torch.manual_seed(42), np.random.seed(42), random.seed(42) and log via MLflow/W&B.
  2. NEVER fit preprocessing transformers on test data — fit only on training data, then .transform() test; fitting on test causes data leakage and inflated performance estimates.
  3. ALWAYS evaluate with multiple metrics aligned to business goals — never report accuracy alone on imbalanced datasets; use F1, precision-recall curve, and ROC-AUC at minimum.
  4. NEVER tune hyperparameters on the test set — use a held-out validation set for tuning; the test set is a one-time final evaluation only.
  5. ALWAYS establish a simple baseline before reporting model results — a heuristic or random baseline is mandatory; without it, model quality cannot be assessed.
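
Iron Law 5 in code: a majority-class baseline that any model must beat (function name illustrative):

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label --
    the floor a real model has to clear."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(1 for y in y_test if y == majority) / len(y_test)

# On a 90/10 imbalanced dataset the do-nothing baseline already scores 0.9,
# which is why accuracy alone is misleading here
y_train = ["neg"] * 90 + ["pos"] * 10
y_test = ["neg"] * 45 + ["pos"] * 5
baseline = majority_baseline_accuracy(y_train, y_test)
```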

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Ignoring class imbalance | Model biased to majority class | Stratified sampling, class weights, SMOTE |
| No validation set | Overfitting undetected | Hold out 10-20% for validation |
| Optimizing a single metric | Missing failure modes | Multiple metrics (precision, recall, F1, AUC) |
| No baseline comparison | Cannot assess model quality | Establish heuristic baseline before ML |
| Accuracy on imbalanced data | Misleading performance estimate | Use F1, precision-recall curve, ROC-AUC |
| Data leakage (test in train) | Inflated performance estimates | Fit on train only; transform test with the fitted transformer |
| No error analysis | Cannot improve strategically | Analyze failure cases by error type |
| Training without checkpoints | Lost progress on failure | Save best model by validation metric |
| Mutable global random state | Non-reproducible experiments | Fix all seeds; log in experiment metadata |
| Embedding model in application | Cannot update model independently | Serve model via API (REST, gRPC) |
| No latency budget | Inference too slow for production | Profile and set SLO before deployment |

Training a Transformer classifier:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()

Minimal RAG pipeline:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(ChatOpenAI(model="gpt-4o"), retriever=retriever)
answer = qa.run("What is the refund policy?")

Assigned Agents

This skill is used by:

  • developer — Implements ML models, data pipelines, and LLM integrations
  • researcher — Investigates novel architectures and evaluates research papers
  • architect — Designs ML system architecture and deployment topology
  • security-architect — Reviews data privacy, model security, and inference safety

Related Skills

  • python-backend-expert — NumPy, Pandas, async Python patterns
  • code-analyzer — Static analysis and complexity metrics for ML code
  • debugging — Systematic debugging for training failures and inference errors

Memory Protocol (MANDATORY)

Before starting:

cat .claude/context/memory/learnings.md

Check for:

  • Previously solved ML patterns in this codebase
  • Known library version pinning requirements
  • Infrastructure constraints (GPU type, memory limits)

After completing:

  • New ML pattern or fix → .claude/context/memory/learnings.md
  • Training failure root cause → .claude/context/memory/issues.md
  • Architecture decision (framework choice, deployment strategy) → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.
