# Reinforcement Learning Best Practices

## Overview
This skill provides comprehensive guidance for implementing reinforcement learning in Python using the modern ecosystem (2024-2025). Gymnasium has replaced OpenAI Gym as the standard environment interface. Stable-Baselines3 (SB3) is recommended for prototyping, RLlib for production/distributed training, and CleanRL for research.
## When to Use
- Building RL agents for discrete or continuous control tasks
- Creating custom simulation environments
- Tuning hyperparameters for RL algorithms
- Debugging training issues (reward curves, policy collapse, numerical instability)
- Deploying trained policies to production
## Library Selection
| Library | Best For | Ease | Flexibility | Production |
|---|---|---|---|---|
| Stable-Baselines3 | Prototyping, learning | High | Medium | Good |
| RLlib | Production, distributed | Medium | High | Excellent |
| CleanRL | Research, understanding | High | Low | Poor |
| TorchRL | Custom implementations | Low | Highest | Good |
## Algorithm Decision Tree

```text
Start
  |
  v
Action space type?
  |
  +-- Discrete ----> Sample efficiency critical?
  |                    |
  |                    +-- Yes --> DQN (or Double/Dueling DQN)
  |                    +-- No ---> Stability critical?
  |                                  |
  |                                  +-- Yes --> PPO
  |                                  +-- No ---> A2C (faster iterations)
  |
  +-- Continuous --> Sample efficiency critical?
                       |
                       +-- Yes --> SAC (auto entropy) or TD3
                       +-- No ---> PPO (more stable, less efficient)
```
Quick Selection Table:
| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, getting started | PPO | Stable, good defaults |
| Continuous control | SAC or TD3 | Sample efficient, handles continuous well |
| Sample efficiency critical | SAC, DQN | Off-policy, reuses experience |
| Stability critical | PPO | Trust region, consistent |
| High-dimensional obs (images) | PPO + CNN | Handles visual input well |
| Fast iteration needed | A2C | Simpler, faster per update |
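For continuous control, the table points to SAC. A minimal sketch on Gymnasium's `Pendulum-v1` (a standard continuous-action benchmark) might look like this; the timestep budget is illustrative, not a tuned value:

```python
from stable_baselines3 import SAC

# SAC with automatic entropy tuning (SB3's default, ent_coef="auto")
model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=20_000)
model.save("sac_pendulum")
```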
## Quick Start with Stable-Baselines3

### Basic Training
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create a vectorized environment (4 parallel envs)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize and train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

# Evaluate (VecEnv API: step() returns 4 values, not Gymnasium's 5,
# and finished sub-envs auto-reset)
obs = env.reset()
for _ in range(1000):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```
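Training and evaluating on the same environment instance is a common source of misleading reward curves (see the pitfalls list below). A sketch using SB3's `EvalCallback` with a separate eval env; the path and frequencies are placeholder values:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

train_env = make_vec_env("CartPole-v1", n_envs=4)
eval_env = make_vec_env("CartPole-v1", n_envs=1)  # separate instance for evaluation

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model",  # placeholder path
    eval_freq=5_000,   # in vec-env steps; multiply by n_envs for total timesteps
    n_eval_episodes=10,
    deterministic=True,
)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=100_000, callback=eval_callback)
```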
## Custom Environment Template
```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np


class CustomEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        return self.state.astype(np.float32), {}

    def step(self, action):
        # Implement environment dynamics here
        observation = self.state.astype(np.float32)
        reward = 1.0
        terminated = False  # Episode ended due to task completion/failure
        truncated = False   # Episode ended due to time limit
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        pass
```
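Before training on a custom environment, validate it against the API. SB3 ships an environment checker; a quick sketch:

```python
from stable_baselines3.common.env_checker import check_env

env = CustomEnv()
check_env(env)  # raises/warns if spaces, reset(), or step() violate the Gymnasium API
```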
## Hyperparameter Tuning with Optuna
```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)

    model = PPO(
        "MlpPolicy", "CartPole-v1",
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        verbose=0,
    )
    model.learn(total_timesteps=50_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"Best params: {study.best_params}")
```
## Core Workflow

1. Define the environment - Use the Gymnasium API, validate spaces
2. Select algorithm - Based on action space and requirements
3. Start simple - Default hyperparameters, short training
4. Monitor training - TensorBoard, check reward curves
5. Debug issues - Use the debugging playbook
6. Tune hyperparameters - Optuna for systematic search
7. Evaluate properly - Separate eval env, multiple seeds
8. Deploy - Export to ONNX/TorchScript (see the sketch below)
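For step 8, SB3 policies are plain `torch.nn.Module`s, so they can be traced once a tensor-in/tensor-out forward is exposed. A minimal TorchScript sketch — `PolicyWrapper` is an illustrative wrapper, and `_predict` is an SB3-internal method, so verify this against the SB3 export docs before relying on it:

```python
import torch as th
from stable_baselines3 import PPO

model = PPO.load("ppo_cartpole")

class PolicyWrapper(th.nn.Module):
    """Illustrative wrapper exposing deterministic action selection."""

    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, obs: th.Tensor) -> th.Tensor:
        # _predict is SB3-internal; assumed stable for tracing purposes
        return self.policy._predict(obs, deterministic=True)

dummy_obs = th.zeros(1, 4)  # CartPole-v1 observation shape
traced = th.jit.trace(PolicyWrapper(model.policy), dummy_obs)
traced.save("ppo_cartpole_policy.pt")
```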
## Reference Files

- `algorithms.md` - Deep dive on DQN, PPO, SAC, A2C, TD3
- `environments.md` - Gymnasium setup, custom envs, wrappers
- `training.md` - Hyperparameters, reward engineering, normalization
- `debugging.md` - Failure modes, diagnostics, sanity checks
- `evaluation.md` - Metrics, logging, reproducibility
- `deployment.md` - ONNX export, inference optimization, safety
## Essential Dependencies

```bash
pip install gymnasium stable-baselines3 tensorboard optuna

# For Atari environments
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]"

# For MuJoCo
pip install "gymnasium[mujoco]"
```
## Common Pitfalls to Avoid

- Not normalizing observations - Use the `VecNormalize` wrapper (see the sketch after this list)
- Wrong action space handling - Check discrete vs continuous
- Ignoring seed management - Set seeds for reproducibility
- Training and eval on same env - Use a separate eval environment
- Not monitoring entropy - Low entropy = policy collapse
- Sparse rewards without shaping - Add intermediate rewards
- Too large/small learning rate - Start with 3e-4 for most algorithms
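A sketch of the first pitfall's fix with `VecNormalize`. Note that the running statistics live in the wrapper, so they must be saved and reloaded alongside the model:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

model.save("ppo_pendulum")
env.save("vecnormalize.pkl")  # running mean/std; restore with VecNormalize.load()
```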