TRL Training Skill

You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.

Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

  • SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
  • DPO (Direct Preference Optimization): Align models using preference data
  • GRPO (Group Relative Policy Optimization): Train models by comparing groups of sampled completions and optimizing on their relative rewards
  • RLOO (REINFORCE Leave-One-Out): Online RL training with generation-based rewards
  • Reward Model Training: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.

Core Commands

trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.

Full training:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub

Train with LoRA adapters:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
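
The same run can also be expressed through TRL's Python API. A minimal sketch, assuming the SFTTrainer/SFTConfig interface from recent TRL releases (argument names can differ slightly across versions):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="Qwen2-0.5B-SFT",
    learning_rate=2.0e-5,
    num_train_epochs=1,
    packing=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # a Hub model id works; a preloaded model object also works
    args=training_args,
    train_dataset=dataset,
)
trainer.train()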

trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).

Full training:

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns

Train with LoRA adapters:

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
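
Whichever variant you run, the dataset must be a preference dataset. As a rough illustration (assuming the standard and conversational layouts described in TRL's dataset format documentation), a single record looks like one of the following:

# Standard (explicit prompt) preference record:
standard_example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "I think it might be Lyon.",
}

# Conversational preference record (the layout used by datasets such as
# trl-lib/ultrafeedback_binarized, where chosen/rejected are message lists):
conversational_example = {
    "chosen": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ],
    "rejected": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Lyon."},
    ],
}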

trl grpo - Group Relative Policy Optimization

Train models using reward functions or an LLM-as-a-judge to score generated completions and provide rewards.

Basic usage:

trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/gsm8k \
  --reward_funcs accuracy_reward \
  --output_dir Qwen2-0.5B-GRPO \
  --push_to_hub
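
Reward functions can also be supplied from Python when using the GRPOTrainer API. A minimal sketch, assuming the recent GRPOTrainer interface in which a reward function receives the sampled completions and returns one score per completion (the length-based reward here is purely illustrative):

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="Qwen2-0.5B-GRPO"),
    train_dataset=dataset,
)
trainer.train()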

trl rloo - REINFORCE Leave-One-Out

Online RL training where the model generates text and receives rewards based on custom criteria.

Basic usage:

trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
  --output_dir Qwen2-0.5B-RLOO \
  --push_to_hub
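
The reward model in this example is an off-the-shelf sentiment classifier from the Hub. Before launching a long run, it can help to sanity-check it directly with a Transformers pipeline (a quick sketch; the trainer itself handles scoring during training):

from transformers import pipeline

reward_pipe = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)
print(reward_pipe("This summary is clear, concise, and accurate."))
# e.g. [{'label': '5 stars', 'score': ...}] -- higher-star labels indicate more positive text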

trl reward - Reward Model Training

Train a reward model to score text quality for RLHF.

Full training:

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-5 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048

Train with LoRA adapters:

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward-LoRA \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-4 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048 \
  --use_peft \
  --lora_task_type SEQ_CLS \
  --lora_r 32 \
  --lora_alpha 16
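
The same training can be driven from Python via RewardTrainer. A minimal sketch, assuming the RewardTrainer/RewardConfig interface from recent TRL releases (older releases pass the tokenizer as tokenizer= rather than processing_class=):

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A reward model is a sequence classifier with a single scalar output.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="Qwen2-0.5B-Reward", max_length=2048),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()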

Configuration Files

TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.

Example config (sft_config.yaml):

model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio

Launch with config:

trl sft --config sft_config.yaml

Override config values:

trl sft --config sft_config.yaml --learning_rate 1.0e-5

Distributed Training

TRL integrates with Accelerate for multi-GPU and multi-node training.

Multi-GPU training:

trl sft \
  --config sft_config.yaml \
  --num_processes 4

Use predefined Accelerate configs:

TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3

trl sft \
  --config sft_config.yaml \
  --accelerate_config zero2

Custom Accelerate config:

# Generate custom config
accelerate config

# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
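
For reference, the file written by accelerate config is plain YAML; a minimal single-node, 4-GPU setup might look roughly like this (a sketch, since the exact keys depend on your answers to the prompts):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 4
gpu_ids: all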

Fully Sharded Data Parallel (FSDP):

trl sft --config sft_config.yaml --accelerate_config fsdp2

DeepSpeed ZeRO:

trl sft --config sft_config.yaml --accelerate_config zero3

Troubleshooting

CUDA Out of Memory

  • Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps
  • Enable --use_peft for LoRA training
  • Use --gradient_checkpointing to save memory
  • Try a smaller model or a shorter maximum sequence length
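
Putting several of these options together, a memory-constrained variant of the earlier SFT command might look like this (a sketch; flag availability can vary slightly between TRL versions):

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing \
  --use_peft \
  --lora_r 16 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT-lowmem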

Dataset Loading Issues

  • Verify dataset exists: check Hugging Face Hub or local path
  • Check dataset format matches expected columns
  • Use --dataset_config for multi-config datasets
  • Inspect the dataset before training, e.g. with datasets.load_dataset (see the snippet below)
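
For example, a quick inspection of one of the datasets used above (a sketch; substitute your own dataset name):

from datasets import load_dataset

ds = load_dataset("trl-lib/Capybara")
print(ds)                          # available splits and row counts
print(ds["train"].column_names)    # columns the trainer expects, e.g. "messages" for SFT
print(ds["train"][0])              # first example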

Model Loading Issues

  • Verify model exists on Hugging Face Hub
  • Check if gated model requires authentication: hf auth login
  • For local models, provide absolute path
  • Ensure sufficient disk space and memory

Slow Training

  • Enable --packing so short sequences are packed into full-length training examples
  • Use larger --per_device_train_batch_size if memory allows
  • Enable --tf32 for faster computation on Ampere GPUs
  • Use --bf16 on supported hardware
  • Consider multi-GPU training with --num_processes
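
As a sketch, a throughput-oriented variant of the SFT command might combine these options (assuming bf16-capable hardware):

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --packing \
  --bf16 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 2 \
  --output_dir Qwen2-0.5B-SFT-fast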

Generation Issues (GRPO/RLOO)

  • Check prompt format in dataset
  • Adjust --temperature and --top_p for generation
  • Verify that the reward function or reward model returns sensible scores

Best Practices

  1. Start with SFT: Always fine-tune base models with SFT before preference alignment
  2. Use LoRA for efficiency: Enable --use_peft for faster training and lower memory
  3. Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
  4. Save checkpoints: TRL automatically saves checkpoints in --output_dir
  5. Test on small datasets first: Verify pipeline works before full training
  6. Use configuration files: Create YAML configs for reproducibility
  7. Leverage Accelerate: Use multi-GPU training for faster iteration

When helping users with TRL:

  • Always check which training method is appropriate for their use case
  • Verify dataset format matches the expected schema
  • Recommend starting with smaller models for testing
  • Suggest LoRA for resource-constrained environments
  • Point to specific documentation sections for advanced features