# LLM Training

Frameworks and techniques for training and finetuning large language models.
## Framework Comparison

| Framework | Best For | Multi-GPU | Memory Efficiency |
|---|---|---|---|
| Accelerate | Simple distributed | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |
## Accelerate (HuggingFace)

Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.

Key concept: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, and call `accelerator.backward(loss)` in place of `loss.backward()`.
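A minimal sketch of that pattern, with a toy linear model and random data standing in for a real LM and dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up settings from `accelerate config`

# Toy stand-ins for a real model and dataset
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)

# prepare() moves everything to the right device(s) and wraps for distributed runs
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for inputs, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward(); handles scaling and sync
    optimizer.step()
```

The same script then runs unchanged on one GPU or many via `accelerate launch train.py`.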
## DeepSpeed (Large Models)

Microsoft's optimization library for training massive models.
ZeRO Stages:
- Stage 1: Optimizer states partitioned across GPUs
- Stage 2: + Gradients partitioned
- Stage 3: + Parameters partitioned (for largest models, 100B+)
Key concept: Configure via a JSON file; each higher stage saves more memory but adds more communication overhead.
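A minimal sketch of a ZeRO-3 config, written here as a Python dict handed to the HuggingFace `Trainer` (the same keys go in a standalone `ds_config.json`; the CPU offload and `"auto"` values are illustrative choices, not requirements):

```python
from transformers import TrainingArguments

# ZeRO-3 config as a dict (same schema as a ds_config.json file);
# "auto" values are filled in from TrainingArguments by the Trainer.
ds_config = {
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},  # optional: spill optimizer state to CPU
        "overlap_comm": True,                    # overlap communication with backward
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Trainer accepts the dict directly, or a path to the JSON file
args = TrainingArguments(output_dir="out", bf16=True, deepspeed=ds_config)
```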
## TRL (RLHF/DPO)

HuggingFace's library for reinforcement learning from human feedback and related alignment methods.
Training types:
- SFT (Supervised Finetuning): Standard instruction tuning
- DPO (Direct Preference Optimization): Simpler than PPO-based RLHF; trains directly on preference pairs
- PPO: Classic RLHF with a learned reward model
Key concept: DPO is often preferred over PPO: it is simpler, needs no separate reward model, and learns directly from chosen/rejected response pairs.
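A minimal DPO sketch with TRL; the model name and the three-column toy dataset are placeholders, and exact `DPOTrainer` keyword names vary a little between TRL versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder; use your own base model
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# Preference data: each row pairs a prompt with a chosen and a rejected response
train = Dataset.from_dict({
    "prompt":   ["What is 2+2?"],
    "chosen":   ["2+2 equals 4."],
    "rejected": ["2+2 equals 5."],
})

args = DPOConfig(output_dir="dpo-out", beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(model=model, args=args, train_dataset=train,
                     processing_class=tokenizer)
trainer.train()
```

Note there is no reward model anywhere: the chosen/rejected pairs are the only supervision signal.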
## Unsloth (Fast LoRA)

Optimized LoRA finetuning: roughly 2x faster with about 60% less memory.

Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
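A minimal sketch of Unsloth's loading path, assuming one of its pre-quantized 4-bit checkpoints (the model name and LoRA hyperparameters here are illustrative):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative pre-quantized checkpoint
    max_seq_length=2048,
    load_in_4bit=True,                         # QLoRA-style 4-bit base weights
)

# Attach LoRA adapters; only these low-rank matrices are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                      # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# `model` now plugs into a standard trainer, e.g. TRL's SFTTrainer
```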
## Memory Optimization Techniques

| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | Larger effective batch at fixed memory | More time per optimizer step |
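Several of these techniques compose directly through HuggingFace `TrainingArguments`; a sketch of a common combination (the specific values are illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch of 16 per GPU
    gradient_checkpointing=True,     # recompute activations during backward
    bf16=True,                       # mixed precision (use fp16=True on older GPUs)
)
```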
## Decision Guide

| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
## Resources

- Accelerate: https://huggingface.co/docs/accelerate
- DeepSpeed: https://www.deepspeed.ai/
- TRL: https://huggingface.co/docs/trl
- Unsloth: https://github.com/unslothai/unsloth