Reinforcement Learning
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Reinforcement Learning (RL) is a paradigm of machine learning in which an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards and penalties. Unlike supervised learning, there are no labeled examples – the agent must discover which actions lead to long-term success through trial and error. RL underlies breakthrough systems like AlphaGo, ChatGPT's RLHF fine-tuning, and robotic control.
</div>
__TOC__
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Agent''' – The learner and decision-maker that interacts with the environment.
* '''Environment''' – Everything the agent interacts with; it receives actions and returns observations and rewards.
* '''State (s)''' – A representation of the current situation of the environment.
* '''Action (a)''' – A choice made by the agent at each time step.
* '''Reward (r)''' – A scalar signal provided by the environment indicating how good or bad an action was.
* '''Policy (π)''' – A mapping from states to actions, defining the agent's behavior.
* '''Value function (V)''' – An estimate of the expected cumulative future reward from a given state when following a policy.
* '''Q-function (Q)''' – An estimate of the expected cumulative reward from taking action a in state s, then following policy π.
* '''Episode''' – A sequence of states, actions, and rewards from an initial state to a terminal state.
* '''Discount factor (γ)''' – A value between 0 and 1 that reduces the weight of future rewards relative to immediate ones.
* '''Exploration vs. exploitation''' – The trade-off between trying new actions (exploration) and repeating known good actions (exploitation).
* '''Markov Decision Process (MDP)''' – The mathematical framework for RL problems, defined by states, actions, transitions, and rewards.
* '''Model-free RL''' – Methods that learn directly from interaction without building an explicit model of the environment.
* '''Model-based RL''' – Methods that learn a model of the environment's dynamics and use it to plan.
* '''PPO (Proximal Policy Optimization)''' – A widely used policy gradient algorithm known for stability and efficiency.
* '''DQN (Deep Q-Network)''' – A Q-learning algorithm that uses a neural network to approximate the Q-function.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The core RL loop is: observe state → choose action → receive reward → update policy. The agent's goal is to maximize the expected cumulative discounted reward over time, called the '''return''':

G<sub>t</sub> = r<sub>t+1</sub> + γ·r<sub>t+2</sub> + γ²·r<sub>t+3</sub> + ...

The discount factor γ controls how myopic the agent is. γ = 0 means only immediate rewards matter; γ → 1 means the agent considers the long-term future.

'''The exploration–exploitation dilemma''' is fundamental: if an agent only exploits what it knows, it may miss better strategies; if it only explores, it never uses what it learns. The ε-greedy strategy is a simple solution – with probability ε, take a random action (explore); otherwise, take the best known action (exploit). ε is typically annealed from high to low over training.

'''Policy gradient methods''' directly optimize the policy by adjusting its parameters to increase the probability of actions that led to high returns. Think of it as hill-climbing in policy space. '''Value-based methods''' learn the Q-function first, then derive the policy as "always take the action with the highest Q-value."
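The return formula and the ε-greedy rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not tied to any RL library; the function names are our own:

<syntaxhighlight lang="python">
import random

def discounted_return(rewards, gamma=0.99):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    by folding the reward list from the end backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore (random action);
    otherwise exploit (action with the highest Q-value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
</syntaxhighlight>

With γ = 0.5, the reward sequence [1, 1, 1] yields G = 1 + 0.5 + 0.25 = 1.75, showing how the discount shrinks each later reward's contribution.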
DQN famously stabilized this with two innovations: experience replay (sampling random past transitions to break correlation) and a target network (a frozen copy of the Q-network updated periodically).

'''Actor-Critic methods''' combine both: an actor (policy network) decides actions, while a critic (value network) evaluates them. The critic provides a baseline that reduces variance in learning – rather than waiting for sparse, delayed rewards, the agent gets a dense signal from the learned value estimate.
</div>
<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Setting up a basic RL training loop with Gymnasium and Stable-Baselines3:'''
<syntaxhighlight lang="python">
import gymnasium as gym
from stable_baselines3 import PPO

# Create environment
env = gym.make("CartPole-v1")

# Instantiate agent with the PPO algorithm
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1,
)

# Train for 100,000 steps
model.learn(total_timesteps=100_000)

# Evaluate the trained policy
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
</syntaxhighlight>
; Common RL algorithm selection guide
: '''Discrete actions + simple environments''' – DQN, Double DQN
: '''Continuous control (robotics, locomotion)''' – SAC (Soft Actor-Critic), TD3
: '''General purpose, on-policy''' – PPO, A3C
: '''RLHF for LLMs''' – PPO with a KL-divergence penalty from a reference model
: '''Game environments''' – AlphaZero-style MCTS + RL for perfect-information games
</div>
<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ RL Algorithm Comparison
! Algorithm !! On/Off Policy !! Action Space !! Sample Efficiency !! Stability
|-
| DQN || Off || Discrete || Medium || Moderate
|-
| PPO || On || Both || Low || High
|-
| SAC || Off || Continuous || High || High
|-
| TD3 || Off || Continuous || High || High
|-
| A3C || On || Both || Low || Moderate
|}
'''Failure modes and pitfalls:'''
* '''Reward hacking''' – The agent finds unintended ways to maximize the reward signal that violate the spirit of the task. Example: a boat-racing agent learned to spin in circles collecting bonuses rather than completing the race.
* '''Sparse rewards''' – If reward is only given at episode completion, learning is extremely slow. Mitigate with reward shaping, curriculum learning, or intrinsic motivation (curiosity).
* '''Sample inefficiency''' – Model-free RL requires enormous amounts of interaction data. AlphaGo needed millions of self-play games. Real-world robots cannot afford this – use simulation or model-based approaches.
* '''Catastrophic forgetting''' – As the agent improves, early experiences become less representative. Experience replay buffers and periodic re-evaluation mitigate this.
* '''Distribution shift''' – The policy changes during training, so data collected under an old policy becomes stale for the new policy.
</div>
<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Experts evaluate RL systems along dimensions that casual practitioners often overlook:

'''Sample efficiency vs. wall-clock time''': the number of environment interactions required to reach a target performance level. A method that converges in 1M steps may be preferred over one that converges in 500k if the latter requires a larger compute budget per step.

'''Stability and reproducibility''': RL training is notoriously sensitive to random seeds, hyperparameters, and implementation details.
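That seed sensitivity can be operationalized with a small helper. This is a sketch only; `train_and_eval` is a hypothetical stand-in for any training-plus-evaluation pipeline that takes a seed and returns a final mean episode return:

<syntaxhighlight lang="python">
import statistics

def evaluate_over_seeds(train_and_eval, seeds):
    """Run the same training pipeline under several random seeds and
    summarize performance as (mean, standard deviation) of final return."""
    returns = [train_and_eval(seed) for seed in seeds]
    return statistics.mean(returns), statistics.stdev(returns)
</syntaxhighlight>

Reporting the pair (mean, std) rather than a single number makes cherry-picked "best run" results immediately visible.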
Expert-level evaluation runs multiple seeds and reports mean ± standard deviation, not just the best run.

'''Policy interpretability''': for safety-critical applications, can you explain why the agent takes a given action? Experts use visualization, attention maps, or mechanistic analysis to build trust.

'''Transfer and generalization''': does the policy hold up in environments slightly different from training? Evaluate on held-out environment variants. Domain randomization during training is a key technique for robustness.

A common expert mistake is '''Goodhart's Law''' in reward design – "When a measure becomes a target, it ceases to be a good measure." The reward specification must be treated as rigorously as any other design document.
</div>
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an RL system from scratch requires careful specification of each MDP component:

'''1. State space design'''
* What information does the agent need to make good decisions?
* Raw pixels vs. hand-crafted features vs. learned embeddings?
* Partial observability? → Use a recurrent policy (LSTM-based) or frame stacking

'''2. Action space design'''
* Discrete (choose from N options) vs. continuous (output a vector)?
* Action parameterization matters: joint angles vs. end-effector positions in robotics

'''3. Reward function design'''
* Dense vs. sparse: prefer dense when possible
* Normalize rewards to [-1, 1] or [0, 1]
* Potential-based shaping guarantees policy invariance

'''4. Pipeline architecture for RLHF:'''
<syntaxhighlight lang="text">
[Human Preference Data]
  → [Train Reward Model (RM)]
  → [Freeze RM] + [Reference LLM (frozen)]
  → [PPO: Fine-tune LLM with RM signal + KL penalty]
  → [Evaluated LLM Policy]
</syntaxhighlight>

'''5. Infrastructure checklist'''
* Vectorized environments (run N envs in parallel) for throughput
* GPU for policy/value network updates
* Logging: episode return, episode length, entropy, value loss, policy loss
* Curriculum: start with easier versions of the task, increase difficulty as the agent improves

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Reinforcement Learning]]
</div>