rl-policy-optimization

RL Policy Optimization Best Practices

Algorithm selection:

  • Discrete actions: PPO, DQN, A2C
  • Continuous actions: SAC, TD3, PPO
  • Multi-agent: MAPPO, QMIX
  • Offline: CQL, IQL, Decision Transformer
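The selection table above can be sketched as a small dispatch helper. The function name and interface are hypothetical, purely illustrative of the mapping:

```python
def suggest_algorithms(action_space: str, setting: str = "online") -> list[str]:
    """Hypothetical helper: map problem setting to the candidate algorithms listed above."""
    if setting == "offline":
        return ["CQL", "IQL", "Decision Transformer"]
    if setting == "multi-agent":
        return ["MAPPO", "QMIX"]
    if action_space == "discrete":
        return ["PPO", "DQN", "A2C"]
    if action_space == "continuous":
        return ["SAC", "TD3", "PPO"]
    raise ValueError(f"unknown action space: {action_space!r}")
```

Note that PPO appears in both action-space branches: it handles discrete and continuous actions with the same objective, which is one reason it is a common default.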

Training recipe:

  • PPO: clip=0.2, lr=3e-4, gamma=0.99, GAE lambda=0.95
  • SAC: lr=3e-4, tau=0.005, auto-tune alpha
  • Use vectorized environments (e.g., gymnasium.vector)
  • Normalize observations and rewards
  • Log episode return, episode length, value loss, policy entropy
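Two of the PPO ingredients in the recipe above, GAE with lambda=0.95 and the clip=0.2 surrogate objective, can be written out directly. This is a minimal NumPy sketch of the standard formulas, not a full training loop; the function names are my own:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards/dones have length T; values has length T+1 (bootstrap value appended).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]               # zero out bootstrap past episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

def ppo_clip_loss(ratio, adv, clip=0.2):
    """PPO clipped surrogate: ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return -np.minimum(unclipped, clipped).mean()   # negated: minimize to ascend
```

In practice the advantages are also standardized (subtract mean, divide by std) per batch before the loss, which stabilizes updates across reward scales.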

Evaluation:

  • Report mean +/- std over 10+ evaluation episodes
  • Use deterministic policy for evaluation
  • Compare against random policy and simple baselines
  • Report sample efficiency (return vs. env steps)
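The evaluation protocol above can be sketched as a loop over the gymnasium 5-tuple step API. The `deterministic` flag on the policy callable is an assumed interface (greedy/mean action instead of sampling):

```python
import numpy as np

def evaluate(policy, env, n_episodes=10):
    """Run n_episodes with a deterministic policy; return (mean, std) of episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            # deterministic=True: greedy / mean action, no exploration noise
            action = policy(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_ret += reward
            done = terminated or truncated
        returns.append(ep_ret)
    r = np.asarray(returns)
    return r.mean(), r.std()
```

Report the same statistics for a random-action baseline so readers can judge whether the learned policy is actually above chance.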

Common pitfalls:

  • Reward shaping can bias the learned policy; potential-based shaping preserves the optimal policy
  • Seed sensitivity is high: train on 5+ seeds and report variance across them
  • Hyperparameter sensitivity: do a small sweep, at minimum over learning rate
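The multi-seed guidance above amounts to aggregating final returns across independent runs. A minimal sketch, assuming a `train_fn(seed)` callable that runs one full training and returns the final evaluation score (a hypothetical interface):

```python
import numpy as np

def multi_seed_report(train_fn, seeds=(0, 1, 2, 3, 4)):
    """Train once per seed and summarize final returns across seeds."""
    finals = np.array([train_fn(seed) for seed in seeds], dtype=float)
    return {
        "mean": finals.mean(),
        "std": finals.std(),          # population std across seeds
        "per_seed": finals.tolist(),  # keep raw values so outlier seeds are visible
    }
```

Reporting per-seed values alongside the aggregate matters: a single lucky seed can dominate the mean, and readers should be able to see that.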