nlp-alignment

LLM Alignment Best Practices

Methods:

  • RLHF: Train a reward model on human preferences, then fine-tune the policy with PPO (complex but powerful)
  • DPO: Direct preference optimization; trains on preference pairs with no separate reward model (see the loss sketch after this list)
  • GRPO: Group relative policy optimization; normalizes rewards within a group of sampled responses instead of learning a value model
  • SFT: Supervised fine-tuning on instruction data as the alignment baseline
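
As a concrete illustration of the DPO objective, here is a minimal PyTorch sketch of the preference loss. The tensor names (policy_chosen_logps, etc.) and the default beta are illustrative assumptions, not the API of any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected responses under the policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO objective: -log sigmoid(beta * margin between chosen and rejected)
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, useful for logging reward margins and accuracy
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return loss, chosen_rewards, rejected_rewards
```

Because the reference log-probs are fixed, they can be precomputed once per batch; the only trainable quantities are the policy log-probs.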

Training recipe:

  • Start with SFT on high-quality instruction data
  • DPO: lr=5e-7, beta=0.1, batch_size=64
  • PPO: lr=1e-6, clip=0.2, KL coefficient=0.02 (see the sketch after this list)
  • Keep a frozen reference model (usually the SFT checkpoint) for the KL penalty
  • Evaluate on safety and truthfulness benchmarks (TruthfulQA, BBQ, etc.)
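
A minimal sketch of the PPO policy update using the hyperparameters above. It assumes per-token log-probabilities and advantages have already been computed; all names are illustrative.

```python
import torch

def ppo_step(logps, old_logps, ref_logps, advantages,
             clip_range=0.2, kl_coef=0.02):
    """Clipped-PPO policy loss with a KL penalty toward the reference model.

    logps / old_logps / ref_logps: per-token log-probabilities under the
    current policy, the policy that sampled the rollout, and the frozen
    reference model; advantages: per-token advantage estimates.
    """
    # Importance ratio between the current and the rollout policy
    ratio = torch.exp(logps - old_logps)

    # Clipped surrogate objective
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty keeps the policy close to the reference model
    kl_penalty = kl_coef * (logps - ref_logps).mean()
    return policy_loss + kl_penalty

# Optimizer setup matching the recipe above (values illustrative)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
```

Many RLHF implementations fold the KL term into the per-token reward rather than the loss; it is added to the loss here only for brevity.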

Common pitfalls:

  • Reward hacking: the model finds shortcuts that inflate reward without genuinely improving responses
  • Mode collapse: the model converges on repetitive, low-diversity outputs
  • Catastrophic forgetting: the model loses general capabilities acquired during pre-training
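
These failure modes are cheap to monitor during training. The sketch below (names hypothetical) tracks KL drift from the reference model, which often precedes reward hacking and forgetting, and a simple distinct-token ratio as a proxy for mode collapse.

```python
import torch

def alignment_diagnostics(policy_logps, ref_logps, generated_ids):
    """Cheap training-time diagnostics for common alignment failure modes.

    policy_logps / ref_logps: per-token log-probs of sampled responses under
    the policy and the frozen reference model; generated_ids: token ids of
    the sampled responses, shape (batch, seq_len).
    """
    # Large KL from the reference model is an early warning sign of
    # reward hacking and catastrophic forgetting.
    kl_from_ref = (policy_logps - ref_logps).mean().item()

    # A low distinct-token ratio is a simple signal of repetitive,
    # mode-collapsed generations.
    distinct_ratios = []
    for seq in generated_ids:
        distinct_ratios.append(len(set(seq.tolist())) / max(len(seq), 1))
    distinct_ratio = sum(distinct_ratios) / len(distinct_ratios)

    return {"kl_from_ref": kl_from_ref, "distinct_token_ratio": distinct_ratio}
```

Periodically re-running a small general-capability eval (e.g. a held-out instruction set) alongside these metrics helps catch forgetting that the KL term alone misses.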