# Reinforcement Learning Best Practices

## Overview
This skill provides comprehensive guidance for implementing reinforcement learning in Python using the modern ecosystem (2024-2025). Gymnasium has replaced OpenAI Gym as the standard environment interface. Stable-Baselines3 (SB3) is recommended for prototyping, RLlib for production/distributed training, and CleanRL for research.
## When to Use
- Building RL agents for discrete or continuous control tasks
- Creating custom simulation environments
- Tuning hyperparameters for RL algorithms
- Debugging training issues (reward curves, policy collapse, numerical instability)
- Deploying trained policies to production
## Library Selection
| Library | Best For | Ease | Flexibility | Production |
|---|---|---|---|---|
| Stable-Baselines3 | Prototyping, learning | High | Medium | Good |
| RLlib | Production, distributed | Medium | High | Excellent |
| CleanRL | Research, understanding | High | Low | Poor |
| TorchRL | Custom implementations | Low | Highest | Good |
## Algorithm Decision Tree

```text
Start
  |
  v
Action space type?
  |
  +-- Discrete ----> Sample efficiency critical?
  |                    |
  |                    +-- Yes --> DQN (or Double/Dueling DQN)
  |                    +-- No ---> Stability critical?
  |                                  |
  |                                  +-- Yes --> PPO
  |                                  +-- No ---> A2C (faster iterations)
  |
  +-- Continuous --> Sample efficiency critical?
                       |
                       +-- Yes --> SAC (auto entropy) or TD3
                       +-- No ---> PPO (more stable, less efficient)
```
Quick Selection Table:
| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, getting started | PPO | Stable, good defaults |
| Continuous control | SAC or TD3 | Sample efficient, handles continuous well |
| Sample efficiency critical | SAC, DQN | Off-policy, reuses experience |
| Stability critical | PPO | Trust region, consistent |
| High-dimensional obs (images) | PPO + CNN | Handles visual input well |
| Fast iteration needed | A2C | Simpler, faster per update |
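For continuous control, the table points to SAC. A minimal sketch on Gymnasium's `Pendulum-v1` (a standard continuous-action benchmark) might look like this; the timestep budget is illustrative, not a tuned value:

```python
from stable_baselines3 import SAC

# SAC with automatic entropy tuning (SB3's default, ent_coef="auto")
model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=20_000)
model.save("sac_pendulum")
```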
## Quick Start with Stable-Baselines3

### Basic Training
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create a vectorized environment (4 parallel envs)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize and train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

# Evaluate (VecEnv API: step() returns 4 values, not Gymnasium's 5,
# and finished sub-envs auto-reset)
obs = env.reset()
for _ in range(1000):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```
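Training and evaluating on the same environment instance is a common source of misleading reward curves (see the pitfalls list below). A sketch using SB3's `EvalCallback` with a separate eval env; the path and frequencies are placeholder values:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

train_env = make_vec_env("CartPole-v1", n_envs=4)
eval_env = make_vec_env("CartPole-v1", n_envs=1)  # separate instance for evaluation

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model",  # placeholder path
    eval_freq=5_000,   # in vec-env steps; multiply by n_envs for total timesteps
    n_eval_episodes=10,
    deterministic=True,
)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=100_000, callback=eval_callback)
```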
## Custom Environment Template
```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np


class CustomEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        return self.state.astype(np.float32), {}

    def step(self, action):
        # Implement environment dynamics here
        observation = self.state.astype(np.float32)
        reward = 1.0
        terminated = False  # Episode ended due to task completion/failure
        truncated = False   # Episode ended due to time limit
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        pass
```
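Before training on a custom environment, validate it against the API. SB3 ships an environment checker; a quick sketch:

```python
from stable_baselines3.common.env_checker import check_env

env = CustomEnv()
check_env(env)  # raises/warns if spaces, reset(), or step() violate the Gymnasium API
```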
## Hyperparameter Tuning with Optuna
```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)

    model = PPO(
        "MlpPolicy", "CartPole-v1",
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        verbose=0,
    )
    model.learn(total_timesteps=50_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"Best params: {study.best_params}")
```
## Core Workflow

1. Define the environment - Use the Gymnasium API, validate spaces
2. Select algorithm - Based on action space and requirements
3. Start simple - Default hyperparameters, short training
4. Monitor training - TensorBoard, check reward curves
5. Debug issues - Use the debugging playbook
6. Tune hyperparameters - Optuna for systematic search
7. Evaluate properly - Separate eval env, multiple seeds
8. Deploy - Export to ONNX/TorchScript (see the sketch below)
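For step 8, SB3 policies are plain `torch.nn.Module`s, so they can be traced once a tensor-in/tensor-out forward is exposed. A minimal TorchScript sketch — `PolicyWrapper` is an illustrative wrapper, and `_predict` is an SB3-internal method, so verify this against the SB3 export docs before relying on it:

```python
import torch as th
from stable_baselines3 import PPO

model = PPO.load("ppo_cartpole")

class PolicyWrapper(th.nn.Module):
    """Illustrative wrapper exposing deterministic action selection."""

    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, obs: th.Tensor) -> th.Tensor:
        # _predict is SB3-internal; assumed stable for tracing purposes
        return self.policy._predict(obs, deterministic=True)

dummy_obs = th.zeros(1, 4)  # CartPole-v1 observation shape
traced = th.jit.trace(PolicyWrapper(model.policy), dummy_obs)
traced.save("ppo_cartpole_policy.pt")
```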
## Reference Files

- `algorithms.md` - Deep dive on DQN, PPO, SAC, A2C, TD3
- `environments.md` - Gymnasium setup, custom envs, wrappers
- `training.md` - Hyperparameters, reward engineering, normalization
- `debugging.md` - Failure modes, diagnostics, sanity checks
- `evaluation.md` - Metrics, logging, reproducibility
- `deployment.md` - ONNX export, inference optimization, safety
## Essential Dependencies

```bash
pip install gymnasium stable-baselines3 tensorboard optuna

# For Atari environments
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]"

# For MuJoCo
pip install "gymnasium[mujoco]"
```
## Common Pitfalls to Avoid

- Not normalizing observations - Use the `VecNormalize` wrapper (see the sketch after this list)
- Wrong action space handling - Check discrete vs continuous
- Ignoring seed management - Set seeds for reproducibility
- Training and eval on same env - Use a separate eval environment
- Not monitoring entropy - Low entropy = policy collapse
- Sparse rewards without shaping - Add intermediate rewards
- Too large/small learning rate - Start with 3e-4 for most algorithms
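A sketch of the first pitfall's fix with `VecNormalize`. Note that the running statistics live in the wrapper, so they must be saved and reloaded alongside the model:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

model.save("ppo_pendulum")
env.save("vecnormalize.pkl")  # running mean/std; restore with VecNormalize.load()
```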