# AI/ML Expert
## Core Framework Guidelines

### PyTorch
When reviewing or writing PyTorch code, apply these guidelines:
- Use `torch.nn.Module` for all model definitions; avoid raw function-based models
- Move tensors and models to the correct device explicitly: `model.to(device)`, `tensor.to(device)`
- Switch between `model.train()` and `model.eval()` appropriately
- Clear accumulated gradients with `optimizer.zero_grad()` at the top of the training loop
- Use `torch.no_grad()` or `@torch.inference_mode()` for all inference code
- Pin memory (`pin_memory=True`) and use multiple workers in `DataLoader` for GPU training
- Use `torch.compile()` (PyTorch 2.x) for production inference speedups
- Prefer `F.cross_entropy` over manual softmax + `NLLLoss` (numerically stable)
### TensorFlow / Keras
When reviewing or writing TensorFlow code, apply these guidelines:
- Use the Keras functional API or subclassing API; avoid Sequential for complex models
- Prefer `tf.data.Dataset` pipelines over manual batching for scalability
- Use `tf.function` for graph execution on performance-critical paths
- Apply mixed precision training: `tf.keras.mixed_precision.set_global_policy('mixed_float16')`
- Use `tf.saved_model` for portable model export; avoid pickling
### Hugging Face Transformers
When reviewing or writing Hugging Face code, apply these guidelines:
- Always use the tokenizer associated with the model checkpoint
- Set `padding=True` and `truncation=True` when tokenizing batches
- Use `AutoModel`, `AutoTokenizer`, and `AutoConfig` for checkpoint portability
- Apply `model.gradient_checkpointing_enable()` to reduce memory for large models
- Use the `Trainer` API for standard fine-tuning; use custom loops only when `Trainer` is insufficient
- Cache models with the `TRANSFORMERS_CACHE` environment variable in CI/CD pipelines
### scikit-learn
When reviewing or writing scikit-learn code, apply these guidelines:
- Use `Pipeline` to chain preprocessing and model steps; this prevents data leakage
- Use `StratifiedKFold` for classification tasks with class imbalance
- Prefer `GridSearchCV` or `RandomizedSearchCV` for hyperparameter tuning
- Always call `.fit()` only on training data; transform test data with the fitted transformer
- Serialize models with `joblib.dump`/`joblib.load` (faster than pickle for large arrays)
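The leakage-safe pattern these bullets describe can be sketched as a single tunable object. This is an illustrative sketch, not the only valid layout; the estimator choice, grid values, and scoring metric are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training split only -- no leakage into the validation split.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # step-name prefix addresses pipeline params
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
)
```

Because the scaler is a pipeline step, `GridSearchCV` refits it inside every fold, so held-out rows never influence the scaling statistics.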
## LLM Integration Patterns

### Prompt Engineering
- Structure prompts with a clear system message, context, and user instruction
- Use few-shot examples in the system prompt for consistent output formatting
- Apply chain-of-thought prompting ("Think step by step...") for complex reasoning tasks
- Set `temperature=0` for deterministic, fact-based outputs; increase it for creative tasks
- Manage token budgets explicitly: estimate prompt tokens before sending
- Implement output parsing with structured formats (JSON mode, XML tags)
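Token budgeting can start with a rough characters-per-token heuristic before reaching for a real tokenizer. The ~4 characters/token figure below is a common rule of thumb for English text and is an assumption, not a property of any particular model; use the model's own tokenizer when precision matters:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (~4 chars/token heuristic for English).

    A budgeting heuristic only; swap in the model's real tokenizer
    (e.g. tiktoken for OpenAI models) for exact counts.
    """
    return max(1, round(len(text) / chars_per_token))

def fits_budget(system: str, context: str, user: str,
                max_prompt_tokens: int = 8000) -> bool:
    """Check an assembled prompt against a token budget before sending."""
    total = sum(estimate_tokens(part) for part in (system, context, user))
    return total <= max_prompt_tokens
```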
### RAG Pipelines
```python
# Standard RAG pipeline components
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS  # or Chroma, Pinecone, Weaviate
from langchain.chains import RetrievalQA

# 1. Embed and index documents
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 3. Generate with retrieved context
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
```
RAG best practices:
- Chunk documents at natural boundaries (paragraphs, sections), not fixed character counts
- Use hybrid retrieval: combine dense embeddings with sparse BM25 for better recall
- Implement semantic caching for repeated queries to reduce latency and cost
- Validate retrieved context relevance before passing to the LLM
- Store metadata alongside embeddings for filtering (date, source, author)
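The first best practice (chunk at natural boundaries) can be sketched without any library: split on blank lines and pack paragraphs into chunks. The `max_chars` knob is illustrative; production chunkers usually measure tokens, not characters:

```python
def chunk_by_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Split at paragraph boundaries, packing whole paragraphs into chunks
    of at most max_chars. A paragraph longer than max_chars becomes its
    own oversized chunk rather than being cut mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # flush: adding para would overflow
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```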
### LangChain / LangGraph
- Use LCEL (LangChain Expression Language) for composable chains
- Apply `RunnableParallel` for concurrent retrieval steps
- Use LangGraph for stateful multi-agent workflows with cycles
- Implement retry logic with `RunnableRetry` for unreliable external calls
- Trace and evaluate chains with LangSmith in development
## Training Loop Standards
```python
# Standard PyTorch training loop with best practices
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()

    # Validation loop
    model.eval()
    with torch.no_grad():
        for batch in val_dataloader:
            ...  # evaluate
```
Key standards:
- Proper train/validation/test splits: 80/10/10 or stratified for imbalanced datasets
- Gradient clipping (`max_norm=1.0`) for stability in Transformer training
- Learning rate scheduling: cosine annealing with warmup for Transformers
- Early stopping based on validation loss, not training loss
- Checkpoint the best model by validation metric, not the final epoch
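The last two standards (early stopping and best-model checkpointing, both keyed to the validation metric) reduce to a small tracker. This is a framework-agnostic sketch; the patience and delta values are illustrative defaults:

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience`
    consecutive epochs; `best` doubles as the checkpointing signal."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss       # improvement: checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```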
## Fine-Tuning Standards

### Full Fine-Tuning
- Reduce the learning rate 10-100x compared to training from scratch
- Freeze early layers; fine-tune the upper layers and task head first
- Use discriminative learning rates: lower LR for early pretrained layers, higher for new layers
- Apply label smoothing (`smoothing=0.1`) to reduce overconfidence
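One common formulation of label smoothing spreads the smoothing mass uniformly over the non-target classes (other variants put `smoothing / num_classes` on every class, target included); sketched here for a single hard label:

```python
def smooth_labels(label: int, num_classes: int, smoothing: float = 0.1) -> list[float]:
    """Turn a hard label into a smoothed target distribution: the true
    class gets 1 - smoothing, the rest share smoothing uniformly."""
    off_value = smoothing / (num_classes - 1)
    targets = [off_value] * num_classes
    targets[label] = 1.0 - smoothing
    return targets
```

In PyTorch the same effect is available directly via `F.cross_entropy(..., label_smoothing=0.1)`.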
### Parameter-Efficient Fine-Tuning (PEFT)
```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                  # LoRA rank
    lora_alpha=32,         # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # verify < 1% of parameters are trainable
```
PEFT guidelines:
- Use LoRA rank `r=8` to `r=64`; higher rank = more capacity, more memory
- Use QLoRA (4-bit quantization + LoRA) for fine-tuning 7B+ models on consumer GPUs
- Merge adapter weights before serving to eliminate inference overhead
- Prefer adapter-based methods over full fine-tuning for limited data (< 10K examples)
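The "< 1% trainable" check is easy to verify by hand. Assuming square `d_model x d_model` attention projections (true for many 7B-class decoder models, an assumption here), each LoRA-adapted matrix adds two low-rank factors, A (`r x d_model`) and B (`d_model x r`):

```python
def lora_trainable_params(d_model: int, r: int, num_target_matrices: int) -> int:
    """Parameters added by LoRA: 2 * d_model * r per adapted square matrix
    (factor A of shape r x d_model plus factor B of shape d_model x r)."""
    return 2 * d_model * r * num_target_matrices

# Illustrative 7B-class model: d_model=4096, 32 layers, q_proj + v_proj per layer
added = lora_trainable_params(d_model=4096, r=16, num_target_matrices=32 * 2)
fraction = added / 7e9  # share of a 7B-parameter base model
```

At rank 16 this works out to roughly 8.4M added parameters, about 0.12% of the base model, which is why `print_trainable_parameters()` should report well under 1%.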
## MLOps and Experiment Tracking

### MLflow
```python
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": lr, "batch_size": bs, "epochs": epochs})
    mlflow.log_metrics({"train_loss": loss, "val_accuracy": acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")
```
### Weights & Biases
```python
import wandb

wandb.init(project="my-project", config={"lr": 1e-4, "epochs": 10})
wandb.log({"train_loss": loss, "val_f1": f1_score})
wandb.finish()
```
MLOps standards:
- Log every hyperparameter and dataset version before training starts
- Track system metrics (GPU utilization, memory, throughput) alongside model metrics
- Version datasets with DVC or Delta Lake; never overwrite raw data
- Use reproducible seeds: `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)`
- Register production models in a model registry with stage gates (Staging → Production)
## Model Evaluation Standards

### Metrics by Task Type
| Task | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC, F1, Precision/Recall | Calibration (Brier Score) |
| Multi-class | Macro F1, Weighted F1, Cohen's Kappa | Confusion Matrix |
| Regression | RMSE, MAE, R² | Residual Analysis |
| NLP Generation | BLEU, ROUGE, BERTScore | Human Evaluation |
| Ranking/Retrieval | NDCG@k, MRR, MAP | Hit Rate@k |
| LLM Evaluation | LLM-as-judge, exact match, pass@k | Hallucination Rate |
### Evaluation Best Practices
- Never tune hyperparameters on the test set; use a held-out validation set
- Report confidence intervals (bootstrap or cross-validation) for all metrics
- Disaggregate metrics by subgroup for fairness analysis
- Use statistical significance tests (McNemar, paired t-test) when comparing models
- Establish a simple baseline before reporting model results
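A percentile bootstrap over per-example scores is one simple way to get the confidence intervals the second bullet asks for; the resample count, alpha, and seed below are conventional defaults, not requirements:

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of a
    per-example metric (e.g. per-sample correctness)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Per-example correctness of a classifier on 100 test examples (80% accurate)
correct = [1] * 80 + [0] * 20
lo, hi = bootstrap_ci(correct)
```

Reporting "accuracy 0.80 (95% CI [lo, hi])" makes small test-set comparisons honest: if two models' intervals overlap heavily, the difference may be noise.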
## Production ML Systems

### Model Deployment
- Export to ONNX for cross-platform inference: `torch.onnx.export(model, ...)`
- Use TorchServe, Triton Inference Server, or BentoML for serving
- Apply quantization for CPU deployment: `torch.quantization.quantize_dynamic(model, ...)`
- Set up batching with a maximum batch size and timeout to trade throughput against latency
- Use model warming (pre-load and dummy inference) to eliminate cold-start latency
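The batching bullet (maximum batch size plus timeout) reduces to a small drain loop. This single-threaded sketch ignores the locking and async plumbing a real serving queue needs; dedicated servers like Triton implement this natively:

```python
import time

def collect_batch(queue, max_batch_size=32, timeout_s=0.01):
    """Micro-batching sketch: drain up to max_batch_size requests, but
    flush early once timeout_s elapses so tail latency stays bounded."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size and time.monotonic() < deadline:
        if queue:
            batch.append(queue.pop(0))
        else:
            time.sleep(0.001)  # nothing waiting; poll briefly
    return batch
```

A larger `max_batch_size` raises GPU throughput; a smaller `timeout_s` caps the latency any single request can spend waiting for the batch to fill.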
### Monitoring and Drift Detection
```python
# Example: data drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
```
Monitoring standards:
- Track feature distribution drift (KS test, PSI) on a daily schedule
- Alert on prediction distribution shift (concept drift)
- Log and sample model inputs/outputs for downstream evaluation
- Implement shadow mode (run new model alongside production, compare outputs)
- Define retraining triggers based on drift thresholds, not fixed schedules
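The PSI mentioned in the first standard can also be computed without a monitoring library. Equal-width bins over the reference range are one simple binning choice (quantile bins are also common), and the 0.1/0.25 thresholds are conventional rules of thumb:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference (training) sample and
    a production sample of a numeric feature. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate drift, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(x > e for e in edges)  # bin index by edge count
            counts[idx] += 1
        n = len(sample)
        # floor at a small epsilon so empty bins don't produce log(0)
        return [max(c / n, 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```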
## Data Preprocessing Standards
```python
# Proper train/test split to avoid leakage
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify for classification
)

# Fit the scaler ONLY on training data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit_transform
```
Standards:
- Separate preprocessing pipeline per data modality (text, image, tabular)
- Validate schema and types before entering the pipeline
- Handle missing values with domain-aware strategies (median, mode, forward-fill)
- Detect and document outliers; do not silently remove them
- Apply augmentation only to training data, never validation or test data
## Iron Laws
- ALWAYS fix random seeds and log all hyperparameters before training — non-reproducible experiments cannot be shared, audited, or debugged; use `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)` and log via MLflow/W&B.
- NEVER fit preprocessing transformers on test data — fit only on training data, then `.transform()` the test set; fitting on test data causes leakage and inflated performance estimates.
- ALWAYS evaluate with multiple metrics aligned to business goals — never report accuracy alone on imbalanced datasets; use F1, the precision-recall curve, and ROC-AUC at minimum.
- NEVER tune hyperparameters on the test set — use a held-out validation set for tuning; the test set is a one-time final evaluation only.
- ALWAYS establish a simple baseline before reporting model results — a heuristic or random baseline is mandatory; without it, model quality cannot be assessed.
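The last law is cheap to enforce: a majority-class baseline takes a few lines and immediately shows why accuracy alone misleads on imbalanced data. This sketch assumes binary 0/1 labels:

```python
from collections import Counter

def majority_baseline(y_train, y_test):
    """Accuracy and positive-class F1 of always predicting the majority
    training class -- the floor any real model must beat."""
    majority = Counter(y_train).most_common(1)[0][0]
    acc = sum(y == majority for y in y_test) / len(y_test)
    tp = sum(y == 1 and majority == 1 for y in y_test)
    fp = sum(y == 0 and majority == 1 for y in y_test)
    fn = sum(y == 1 and majority == 0 for y in y_test)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return acc, f1
```

On a 95/5 split this baseline scores 95% accuracy with an F1 of zero, so any model reported on such data must beat both numbers, not just accuracy.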
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Ignoring class imbalance | Model biased to majority class | Stratified sampling, class weights, SMOTE |
| No validation set | Overfitting undetected | Hold out 10-20% for validation |
| Optimizing a single metric | Missing failure modes | Multiple metrics (precision, recall, F1, AUC) |
| No baseline comparison | Cannot assess model quality | Establish heuristic baseline before ML |
| Accuracy on imbalanced data | Misleading performance estimate | Use F1, precision-recall curve, ROC-AUC |
| Data leakage (test in train) | Inflated performance estimates | Fit on train only; transform test with fitted obj |
| No error analysis | Cannot improve strategically | Analyze failure cases by error type |
| Training without checkpoints | Lost progress on failure | Save best model by validation metric |
| Mutable global random state | Non-reproducible experiments | Fix all seeds; log in experiment metadata |
| Embedding model in application | Cannot update model independently | Serve model via API (REST, gRPC) |
| No latency budget | Inference too slow for production | Profile and set SLO before deployment |
Training a Transformer classifier:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
Minimal RAG pipeline:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(ChatOpenAI(model="gpt-4o"), retriever=retriever)
answer = qa.run("What is the refund policy?")
```
## Assigned Agents
This skill is used by:
- `developer` — Implements ML models, data pipelines, and LLM integrations
- `researcher` — Investigates novel architectures and evaluates research papers
- `architect` — Designs ML system architecture and deployment topology
- `security-architect` — Reviews data privacy, model security, and inference safety
## Related Skills
- `python-backend-expert` — NumPy, Pandas, async Python patterns
- `code-analyzer` — Static analysis and complexity metrics for ML code
- `debugging` — Systematic debugging for training failures and inference errors
## Memory Protocol (MANDATORY)
Before starting:

```shell
cat .claude/context/memory/learnings.md
```
Check for:
- Previously solved ML patterns in this codebase
- Known library version pinning requirements
- Infrastructure constraints (GPU type, memory limits)
After completing:

- New ML pattern or fix → `.claude/context/memory/learnings.md`
- Training failure root cause → `.claude/context/memory/issues.md`
- Architecture decision (framework choice, deployment strategy) → `.claude/context/memory/decisions.md`
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.