trl-training
TRL Training Skill
You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.
Overview
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
- SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
- DPO (Direct Preference Optimization): Align models using preference data
- GRPO (Group Relative Policy Optimization): Train models by scoring multiple sampled outputs relative to each other and optimizing on their group-relative rewards
- RLOO (REINFORCE Leave-One-Out): Online RL training with generation-based rewards
- Reward Model Training: Train reward models for RLHF
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
Core Commands
trl sft - Supervised Fine-Tuning
Fine-tune language models on instruction-following or conversational datasets.
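SFT expects datasets in either language-modeling format (a "text" column) or conversational format (a "messages" column of role/content turns); trl-lib/Capybara, used below, is conversational. A minimal sketch of what one row looks like (the content is illustrative, only the structure matters):
# Illustrative row of a conversational SFT dataset ("messages" column);
# the content is made up, only the structure matters.
example = {
    "messages": [
        {"role": "user", "content": "Explain gradient accumulation in one sentence."},
        {"role": "assistant", "content": "It accumulates gradients over several small batches before each optimizer step."},
    ]
}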
Full training:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
Train with LoRA adapters:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
trl dpo - Direct Preference Optimization
Align models using preference data (chosen/rejected pairs).
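Preference datasets such as trl-lib/ultrafeedback_binarized pair a preferred ("chosen") and a dispreferred ("rejected") completion for each prompt. A minimal sketch of one row in the conversational preference format (illustrative content):
# Illustrative preference row: the "chosen" completion is preferred over "rejected".
example = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "Probably 5."},
    ],
}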
Full training:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-7 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns
Train with LoRA adapters:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16
trl grpo - Group Relative Policy Optimization
Train models using reward functions or an LLM-as-a-judge to evaluate generations and provide rewards.
Basic usage:
trl grpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/gsm8k \
--reward_funcs accuracy_reward \
--output_dir Qwen2-0.5B-GRPO \
--push_to_hub
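Besides named reward functions on the CLI, GRPO can use custom Python reward functions through the GRPOTrainer API. A minimal sketch, assuming a standard (non-conversational) dataset where each completion is a plain string; the callable receives the sampled completions (plus dataset columns as keyword arguments) and returns one float per completion:
# Toy reward function: favor completions under 200 characters.
# Assumes completions are plain strings (standard, non-conversational dataset).
def brevity_reward(completions, **kwargs):
    return [1.0 if len(completion) < 200 else 0.0 for completion in completions]
When using the Python API, a function like this is passed via the reward_funcs argument of GRPOTrainer.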
trl rloo - REINFORCE Leave-One-Out
Online RL training where the model generates text and receives rewards based on custom criteria.
Basic usage:
trl rloo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/tldr \
--reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
--output_dir Qwen2-0.5B-RLOO \
--push_to_hub
trl reward - Reward Model Training
Train a reward model to score text quality for RLHF.
Full training:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-5 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048
Train with LoRA adapters:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-4 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_task_type SEQ_CLS \
--lora_r 32 \
--lora_alpha 16
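A trained reward model is a sequence-classification checkpoint with a single scalar output, so it can be loaded and used for scoring with plain Transformers. A minimal sketch, assuming the Qwen2-0.5B-Reward output directory from the full-training command above:
# Score a chat with the trained reward model; higher scores mean better responses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("Qwen2-0.5B-Reward", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen2-0.5B-Reward")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits[0].item()  # scalar reward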
Configuration Files
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
Launch with config:
trl sft --config sft_config.yaml
Override config values:
trl sft --config sft_config.yaml --learning_rate 1.0e-5
Distributed Training
TRL integrates with Accelerate for multi-GPU and multi-node training.
Multi-GPU training:
trl sft \
--config sft_config.yaml \
--num_processes 4
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
trl sft \
--config sft_config.yaml \
--accelerate_config zero2
Custom Accelerate config:
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
trl sft --config sft_config.yaml --accelerate_config zero3
Troubleshooting
CUDA Out of Memory
- Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps (see the worked example after this list)
- Enable --use_peft for LoRA training
- Use --gradient_checkpointing to save memory
- Try a smaller model or more aggressive sequence truncation
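The first bullet works because the effective batch size per optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs, so trading one for the other leaves it unchanged. A quick worked example:
# Effective batch size per optimizer step
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus  # 2 * 8 * 4 = 64

# Halving the per-device batch size and doubling accumulation keeps it at 64
assert 1 * 16 * num_gpus == effective_batch_size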
Dataset Loading Issues
- Verify dataset exists: check Hugging Face Hub or local path
- Check dataset format matches expected columns
- Use --dataset_config for multi-config datasets
- Inspect the dataset: from datasets import load_dataset; ds = load_dataset("trl-lib/Capybara")
Model Loading Issues
- Verify model exists on Hugging Face Hub
- Check whether a gated model requires authentication: hf auth login
- For local models, provide an absolute path
- Ensure sufficient disk space and memory
Slow Training
- Enable dataset --packing for short sequences
- Use a larger --per_device_train_batch_size if memory allows
- Enable --tf32 for faster computation on Ampere GPUs
- Use --bf16 on supported hardware
- Consider multi-GPU training with --num_processes
Generation Issues (GRPO/RLOO)
- Check prompt format in dataset
- Adjust --temperature and --top_p for generation
- Verify that the reward function (GRPO) or reward model (RLOO) returns sensible scores
Additional Resources
- Documentation: https://huggingface.co/docs/trl
- GitHub: https://github.com/huggingface/trl
- Examples: https://github.com/huggingface/trl/tree/main/examples
Best Practices
- Start with SFT: Always fine-tune base models with SFT before preference alignment
- Use LoRA for efficiency: Enable --use_peft for faster training and lower memory use
- Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
- Save checkpoints: TRL automatically saves checkpoints in --output_dir
- Test on small datasets first: Verify the pipeline works before full training
- Use configuration files: Create YAML configs for reproducibility
- Leverage Accelerate: Use multi-GPU training for faster iteration
When helping users with TRL:
- Always check which training method is appropriate for their use case
- Verify dataset format matches the expected schema
- Recommend starting with smaller models for testing
- Suggest LoRA for resource-constrained environments
- Point to specific documentation sections for advanced features