# LLM Training

Frameworks and techniques for training and finetuning large language models.

## Framework Comparison

| Framework | Best For | Multi-GPU | Memory Efficient |
|---|---|---|---|
| Accelerate | Simple distributed | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |

## Accelerate (HuggingFace)

Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.

Key concept: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, and call `accelerator.backward(loss)` in place of `loss.backward()`, as in the sketch below.
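A minimal sketch of the pattern, using a toy linear model and random data in place of an LLM:

```python
# Minimal Accelerate training loop (toy model and data for illustration).
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(16, 2)                          # stand-in for an LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

# prepare() places everything on the right device(s) and wraps for distributed runs
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    logits = model(inputs)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    accelerator.backward(loss)                          # instead of loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The same script runs unchanged on a single GPU or across multiple GPUs via `accelerate launch`.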


## DeepSpeed (Large Models)

Microsoft's optimization library for training massive models.

ZeRO stages:

- Stage 1: optimizer states partitioned across GPUs
- Stage 2: + gradients partitioned
- Stage 3: + parameters partitioned (for the largest models, 100B+)

Key concept: Configure via a JSON file; higher stages give more memory savings at the cost of more communication overhead (see the config sketch below).
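A minimal sketch of a ZeRO Stage 2 configuration wired into the HuggingFace Trainer; the batch-size and precision values are illustrative, and the same config can equally be written to a `ds_config.json` file and passed by path:

```python
# Sketch: ZeRO Stage 2 config passed to the HF Trainer (values are illustrative).
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                       # partition optimizer states + gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    bf16=True,
    deepspeed=ds_config,                  # accepts a dict or a path to a JSON file
)
```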


## TRL (RLHF/DPO)

HuggingFace library for reinforcement learning from human feedback.

Training types:

- SFT (Supervised Finetuning): standard instruction tuning
- DPO (Direct Preference Optimization): simpler than RLHF, uses preference pairs
- PPO: classic RLHF with a reward model

Key concept: DPO is often preferred over PPO: it is simpler, needs no reward model, and trains directly on chosen/rejected response pairs.
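A sketch of the DPO path, assuming a recent TRL release (argument names such as `processing_class` have shifted between versions); the model and dataset names are illustrative:

```python
# Sketch of a DPO run with TRL (model and dataset choices are illustrative).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"              # small model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects preference pairs: each row has "prompt", "chosen", "rejected"
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta controls KL strength
    train_dataset=dataset,
    processing_class=tokenizer,                      # "tokenizer=" in older TRL
)
trainer.train()
```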


## Unsloth (Fast LoRA)

Optimized LoRA finetuning: ~2x faster with ~60% less memory.

Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
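A minimal sketch of the Unsloth setup; the checkpoint name and LoRA ranks are illustrative:

```python
# Sketch of Unsloth LoRA setup (model name and hyperparameters are illustrative).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",   # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; Unsloth applies its speed/memory optimizations automatically.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

The returned model can then be handed to a standard trainer (e.g. TRL's `SFTTrainer`) like any other PEFT model.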


## Memory Optimization Techniques

| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires a compatible GPU |
| Gradient accumulation | Larger effective batch at no extra memory | More steps per optimizer update |
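Several of these can be combined through the HuggingFace Trainer API; a sketch, with an illustrative (gated) model name and hyperparameters:

```python
# Sketch: combining memory optimizations via transformers (values illustrative).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                         # QLoRA-style 4-bit base weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                # illustrative; may require Hub access
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",   # needs flash-attn + compatible GPU
)
model.gradient_checkpointing_enable()          # trade compute for activation memory

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,            # effective batch of 16 per device
    bf16=True,                                 # mixed precision
)
```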

## Decision Guide

| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
