trl-training
TRL Training Skill
You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.
Overview
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
- SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
- DPO (Direct Preference Optimization): Align models using preference data
- GRPO (Group Relative Policy Optimization): Train models by scoring multiple sampled outputs relative to each other and optimizing on their group-relative rewards
- RLOO (REINFORCE Leave-One-Out): Online RL training with generation-based rewards
- Reward Model Training: Train reward models for RLHF
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
Core Commands
trl sft - Supervised Fine-Tuning
Fine-tune language models on instruction-following or conversational datasets.
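SFT expects datasets in either language-modeling format (a "text" column) or conversational format (a "messages" column of role/content turns); trl-lib/Capybara, used below, is conversational. A minimal sketch of what one row looks like (the content is illustrative, only the structure matters):
# Illustrative row of a conversational SFT dataset ("messages" column);
# the content is made up, only the structure matters.
example = {
    "messages": [
        {"role": "user", "content": "Explain gradient accumulation in one sentence."},
        {"role": "assistant", "content": "It accumulates gradients over several small batches before each optimizer step."},
    ]
}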
Full training:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
Train with LoRA adapters:
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
trl dpo - Direct Preference Optimization
Align models using preference data (chosen/rejected pairs).
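Preference datasets such as trl-lib/ultrafeedback_binarized pair a preferred ("chosen") and a dispreferred ("rejected") completion for each prompt. A minimal sketch of one row in the conversational preference format (illustrative content):
# Illustrative preference row: the "chosen" completion is preferred over "rejected".
example = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "Probably 5."},
    ],
}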
Full training:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-7 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns
Train with LoRA adapters:
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16
trl grpo - Group Relative Policy Optimization
Train models using reward functions or an LLM-as-a-judge to evaluate generations and provide rewards.
Basic usage:
trl grpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/gsm8k \
--reward_funcs accuracy_reward \
--output_dir Qwen2-0.5B-GRPO \
--push_to_hub
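Besides named reward functions on the CLI, GRPO can use custom Python reward functions through the GRPOTrainer API. A minimal sketch, assuming a standard (non-conversational) dataset where each completion is a plain string; the callable receives the sampled completions (plus dataset columns as keyword arguments) and returns one float per completion:
# Toy reward function: favor completions under 200 characters.
# Assumes completions are plain strings (standard, non-conversational dataset).
def brevity_reward(completions, **kwargs):
    return [1.0 if len(completion) < 200 else 0.0 for completion in completions]
When using the Python API, a function like this is passed via the reward_funcs argument of GRPOTrainer.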
trl rloo - REINFORCE Leave-One-Out
Online RL training where the model generates text and receives rewards based on custom criteria.
Basic usage:
trl rloo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/tldr \
--reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
--output_dir Qwen2-0.5B-RLOO \
--push_to_hub
trl reward - Reward Model Training
Train a reward model to score text quality for RLHF.
Full training:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-5 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048
Train with LoRA adapters:
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-4 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_task_type SEQ_CLS \
--lora_r 32 \
--lora_alpha 16
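A trained reward model is a sequence-classification checkpoint with a single scalar output, so it can be loaded and used for scoring with plain Transformers. A minimal sketch, assuming the Qwen2-0.5B-Reward output directory from the full-training command above:
# Score a chat with the trained reward model; higher scores mean better responses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("Qwen2-0.5B-Reward", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen2-0.5B-Reward")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits[0].item()  # scalar reward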
Configuration Files
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
Launch with config:
trl sft --config sft_config.yaml
Override config values:
trl sft --config sft_config.yaml --learning_rate 1.0e-5
Distributed Training
TRL integrates with Accelerate for multi-GPU and multi-node training.
Multi-GPU training:
trl sft \
--config sft_config.yaml \
--num_processes 4
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
trl sft \
--config sft_config.yaml \
--accelerate_config zero2
Custom Accelerate config:
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
trl sft --config sft_config.yaml --accelerate_config zero3
Troubleshooting
CUDA Out of Memory
- Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps (see the worked example after this list)
- Enable --use_peft for LoRA training
- Use --gradient_checkpointing to save memory
- Try a smaller model or more aggressive sequence truncation
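The first bullet works because the effective batch size per optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs, so trading one for the other leaves it unchanged. A quick worked example:
# Effective batch size per optimizer step
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus  # 2 * 8 * 4 = 64

# Halving the per-device batch size and doubling accumulation keeps it at 64
assert 1 * 16 * num_gpus == effective_batch_size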
Dataset Loading Issues
- Verify dataset exists: check Hugging Face Hub or local path
- Check dataset format matches expected columns
- Use --dataset_config for multi-config datasets
- Inspect the dataset: from datasets import load_dataset; ds = load_dataset("trl-lib/Capybara")
Model Loading Issues
- Verify model exists on Hugging Face Hub
- Check whether a gated model requires authentication: hf auth login
- For local models, provide an absolute path
- Ensure sufficient disk space and memory
Slow Training
- Enable dataset --packing for short sequences
- Use a larger --per_device_train_batch_size if memory allows
- Enable --tf32 for faster computation on Ampere GPUs
- Use --bf16 on supported hardware
- Consider multi-GPU training with --num_processes
Generation Issues (GRPO/RLOO)
- Check prompt format in dataset
- Adjust --temperature and --top_p for generation
- Verify that the reward function (GRPO) or reward model (RLOO) returns sensible scores
Additional Resources
- Documentation: https://huggingface.co/docs/trl
- GitHub: https://github.com/huggingface/trl
- Examples: https://github.com/huggingface/trl/tree/main/examples
Best Practices
- Start with SFT: Always fine-tune base models with SFT before preference alignment
- Use LoRA for efficiency: Enable --use_peft for faster training and lower memory use
- Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
- Save checkpoints: TRL automatically saves checkpoints in --output_dir
- Test on small datasets first: Verify the pipeline works before full training
- Use configuration files: Create YAML configs for reproducibility
- Leverage Accelerate: Use multi-GPU training for faster iteration
When helping users with TRL:
- Always check which training method is appropriate for their use case
- Verify dataset format matches the expected schema
- Recommend starting with smaller models for testing
- Suggest LoRA for resource-constrained environments
- Point to specific documentation sections for advanced features