Reinforcement Learning

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.

Reinforcement Learning (RL) is a paradigm of machine learning in which an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards and penalties. Unlike supervised learning, there are no labeled examples — the agent must discover which actions lead to long-term success through trial and error. RL underlies breakthrough systems like AlphaGo, ChatGPT's RLHF fine-tuning, and robotic control.

Remembering

  • Agent — The learner and decision-maker that interacts with the environment.
  • Environment — Everything the agent interacts with; it receives actions and returns observations and rewards.
  • State (s) — A representation of the current situation of the environment.
  • Action (a) — A choice made by the agent at each time step.
  • Reward (r) — A scalar signal provided by the environment indicating how good or bad an action was.
  • Policy (π) — A mapping from states to actions, defining the agent's behavior.
  • Value function (V) — An estimate of the expected cumulative future reward from a given state when following a policy.
  • Q-function (Q) — An estimate of the expected cumulative reward from taking action a in state s, then following policy π.
  • Episode — A sequence of states, actions, and rewards from an initial state to a terminal state.
  • Discount factor (γ) — A value between 0 and 1 that reduces the weight of future rewards relative to immediate ones.
  • Exploration vs. exploitation — The trade-off between trying new actions (exploration) and repeating known good actions (exploitation).
  • Markov Decision Process (MDP) — The mathematical framework for RL problems, defined by states, actions, transitions, and rewards.
  • Model-free RL — Methods that learn directly from interaction without building an explicit model of the environment.
  • Model-based RL — Methods that learn a model of the environment's dynamics and use it to plan.
  • PPO (Proximal Policy Optimization) — A widely-used policy gradient algorithm known for stability and efficiency.
  • DQN (Deep Q-Network) — A Q-learning algorithm using a neural network to approximate the Q-function.

Understanding

The core RL loop is: observe state → choose action → receive reward → update policy. The agent's goal is to maximize the expected cumulative discounted reward over time, called the return:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ...

The discount factor γ controls how myopic the agent is. γ=0 means only immediate rewards matter; γ→1 means the agent considers the long-term future.

The Exploration-Exploitation Dilemma is fundamental: if an agent only exploits what it knows, it may miss better strategies. If it only explores, it never uses what it learns. The ε-greedy strategy is a simple solution — with probability ε, take a random action (explore); otherwise, take the best known action (exploit). ε is typically annealed from high to low over training.
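A minimal sketch of ε-greedy selection with linear annealing; the Q-table size, schedule values, and step counts below are illustrative assumptions, not values from this article:

<syntaxhighlight lang="python">
import random

import numpy as np

# Hypothetical setup: a small Q-table over 10 states and 4 discrete actions.
n_states, n_actions = 10, 4
q_table = np.zeros((n_states, n_actions))

def epsilon_greedy(state, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit argmax Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)       # explore
    return int(np.argmax(q_table[state]))        # exploit

# Anneal epsilon from 1.0 down to 0.05 over the first 10,000 steps.
eps_start, eps_end, anneal_steps = 1.0, 0.05, 10_000

def epsilon_at(step):
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

action = epsilon_greedy(state=3, epsilon=epsilon_at(step=2_500))
</syntaxhighlight>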

Policy gradient methods directly optimize the policy by adjusting its parameters to increase the probability of actions that led to high returns. Think of it as hill-climbing in policy space.
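To make the hill-climbing picture concrete, here is a hedged NumPy sketch of a REINFORCE-style update for a softmax policy over discrete actions; the linear parameterization and toy dimensions are assumptions for the example, not a production implementation:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, lr = 4, 2, 0.01
W = np.zeros((n_actions, state_dim))     # policy parameters: logits = W @ s

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy(s):
    return softmax(W @ s)

def reinforce_update(s, a, G):
    """One REINFORCE step for a single (state, action, return) triple.

    For a softmax-linear policy, grad log pi(a|s) = (onehot(a) - pi(.|s)) outer s.
    """
    global W
    probs = policy(s)
    grad_log_pi = (np.eye(n_actions)[a] - probs)[:, None] * s[None, :]
    W += lr * G * grad_log_pi    # raise log pi(a|s) in proportion to the return

s = rng.normal(size=state_dim)
a = rng.choice(n_actions, p=policy(s))
reinforce_update(s, a, G=1.0)    # pretend this action earned a return of 1.0
</syntaxhighlight>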

Value-based methods learn the Q-function first, then derive the policy as "always take the action with the highest Q-value." DQN famously stabilized this with two innovations: experience replay (sampling random past transitions to break correlation) and a target network (a frozen copy of the Q-network updated periodically).
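Both stabilizers fit in a few lines; the buffer size, sync interval, and network shape below are illustrative assumptions rather than the original DQN settings:

<syntaxhighlight lang="python">
import random
from collections import deque

import torch
import torch.nn as nn

# Experience replay: store transitions, sample random minibatches to break correlation.
replay_buffer = deque(maxlen=100_000)

def sample_batch(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)
    return map(list, zip(*batch))    # states, actions, rewards, next_states, dones

# Target network: a frozen copy of the online Q-network, re-synced every N updates.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())

def maybe_sync_target(step, sync_every=1_000):
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
</syntaxhighlight>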

Actor-Critic methods combine both: an Actor (policy network) decides actions, while a Critic (value network) evaluates them. The Critic provides a baseline that reduces variance in learning — rather than waiting for sparse, delayed rewards, the agent gets a dense signal from the learned value estimate.
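A minimal sketch of the one-step Actor-Critic losses for a single transition; the scalar tensors stand in for network outputs and are assumptions made for the example:

<syntaxhighlight lang="python">
import torch

gamma = 0.99
# Stand-ins for network outputs on one transition (s, a, r, s').
log_prob   = torch.tensor(-0.7, requires_grad=True)   # log pi(a|s) from the Actor
value      = torch.tensor(0.5,  requires_grad=True)   # V(s) from the Critic
next_value = torch.tensor(0.6)                        # V(s'), treated as a constant target
reward     = torch.tensor(1.0)

td_target = reward + gamma * next_value
advantage = (td_target - value).detach()     # the Critic's estimate acts as a baseline

actor_loss  = -log_prob * advantage          # push up actions with positive advantage
critic_loss = (td_target - value).pow(2)     # regress V(s) toward the TD target
(actor_loss + critic_loss).backward()
</syntaxhighlight>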

Applying

Setting up a basic RL training loop with Gymnasium and Stable-Baselines3:

<syntaxhighlight lang="python">
import gymnasium as gym
from stable_baselines3 import PPO

# Create environment
env = gym.make("CartPole-v1")

# Instantiate agent with PPO algorithm
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1,
)

# Train for 100,000 steps
model.learn(total_timesteps=100_000)

# Evaluate
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, _ = env.step(action)
    if done or truncated:
        obs, _ = env.reset()
</syntaxhighlight>

Common RL algorithm selection guide:

  • Discrete actions + simple environments → DQN, Double DQN
  • Continuous control (robotics, locomotion) → SAC (Soft Actor-Critic), TD3
  • General purpose, on-policy → PPO, A3C
  • RLHF for LLMs → PPO with KL divergence penalty from reference model
  • Game environments → AlphaZero-style MCTS + RL for perfect-information games

Analyzing

{| class="wikitable"
|+ RL Algorithm Comparison
! Algorithm !! On/Off Policy !! Action Space !! Sample Efficiency !! Stability
|-
| DQN || Off || Discrete || Medium || Moderate
|-
| PPO || On || Both || Low || High
|-
| SAC || Off || Continuous || High || High
|-
| TD3 || Off || Continuous || High || High
|-
| A3C || On || Both || Low || Moderate
|}

Failure modes and pitfalls:

  • Reward hacking — The agent finds unintended ways to maximize the reward signal that violate the spirit of the task. Example: a boat-racing agent learned to spin in circles collecting bonuses rather than completing the race.
  • Sparse rewards — If reward is only given at episode completion, learning is extremely slow. Mitigate with reward shaping, curriculum learning, or intrinsic motivation (curiosity).
  • Sample inefficiency — Model-free RL requires enormous amounts of interaction data. AlphaGo needed millions of self-play games. Real-world robots can't afford this — use simulation or model-based approaches.
  • Catastrophic forgetting — As the agent improves, early experiences become less representative. Experience replay buffers and periodic re-evaluation mitigate this.
  • Distribution shift — The policy changes during training, meaning the data collected under an old policy becomes stale for the new policy.

Evaluating

Experts evaluate RL systems along dimensions that casual practitioners often overlook:

Sample efficiency vs. wall-clock time: The number of environment interactions required to reach a target performance level. A method that converges in 1M steps may be preferred over one that converges in 500k if the latter requires a larger compute budget per step.

Stability and reproducibility: RL training is notoriously sensitive to random seeds, hyperparameters, and implementation details. Expert-level evaluation runs multiple seeds and reports mean ± standard deviation, not just the best run.
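A small sketch of seed-aware reporting; the return values below are placeholder numbers, not real results:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical final episode returns from the same configuration run under 5 seeds.
returns_per_seed = np.array([412.0, 487.5, 398.2, 455.1, 430.9])

mean, std = returns_per_seed.mean(), returns_per_seed.std(ddof=1)
print(f"return: {mean:.1f} ± {std:.1f} over {len(returns_per_seed)} seeds")
</syntaxhighlight>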

Policy interpretability: For safety-critical applications, can you explain why the agent takes a given action? Experts use visualization, attention maps, or mechanistic analysis to build trust.

Transfer and generalization: Does the policy hold up in environments slightly different from training? Evaluate on held-out environment variants. Domain randomization during training is a key technique for robustness.

A common expert mistake is Goodhart's Law in reward design — "When a measure becomes a target, it ceases to be a good measure." The reward specification must be treated as rigorously as any other design document.

Creating

Designing an RL system from scratch requires careful specification of each MDP component:

1. State space design

  • What information does the agent need to make good decisions?
  • Raw pixels vs. hand-crafted features vs. learned embeddings?
  • Partial observability? → Use recurrent policy (LSTM-based) or frame stacking
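For the partial-observability case, a minimal hand-rolled frame-stacking wrapper might look like this sketch; the stack size and base environment are assumptions, and a complete wrapper would also update the observation space:

<syntaxhighlight lang="python">
from collections import deque

import gymnasium as gym
import numpy as np

class FrameStack(gym.Wrapper):
    """Stack the last k observations so the policy sees a short history."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames), reward, terminated, truncated, info

# Note: observation_space is left unchanged here for brevity.
env = FrameStack(gym.make("CartPole-v1"), k=4)
</syntaxhighlight>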

2. Action space design

  • Discrete (choose from N options) vs. continuous (output a vector)?
  • Action parameterization matters: joint angles vs. end-effector positions in robotics

3. Reward function design

  • Dense vs. sparse: prefer dense when possible
  • Normalize rewards to [-1, 1] or [0, 1]
  • Potential-based shaping guarantees policy invariance
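Potential-based shaping adds F(s, s') = γ·Φ(s') - Φ(s) to the environment reward, which leaves the optimal policy unchanged. A sketch with a hypothetical distance-to-goal potential:

<syntaxhighlight lang="python">
GAMMA = 0.99

def potential(state):
    """Hypothetical potential: negative distance to a goal at x = 1.0."""
    return -abs(1.0 - state[0])

def shaped_reward(reward, state, next_state):
    # F(s, s') = gamma * Phi(s') - Phi(s); adding it preserves the optimal policy.
    return reward + GAMMA * potential(next_state) - potential(state)

r = shaped_reward(0.0, state=[0.2], next_state=[0.4])
</syntaxhighlight>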

4. Pipeline architecture for RLHF:

<syntaxhighlight lang="text">
[Human Preference Data]
        ↓
[Train Reward Model (RM)]
        ↓
[Freeze RM] + [Reference LLM (frozen)]
        ↓
[PPO: Fine-tune LLM with RM signal + KL penalty]
        ↓
[Evaluated LLM Policy]
</syntaxhighlight>
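The PPO stage typically combines the reward model's score with a KL penalty against the frozen reference model. A hedged sketch of that combination; β and the log-probability values are illustrative assumptions:

<syntaxhighlight lang="python">
import torch

beta = 0.1   # KL penalty coefficient (assumed value)

# Stand-ins: per-token log-probs of the sampled response under the policy and the frozen reference.
logprobs_policy = torch.tensor([-1.2, -0.8, -2.1])
logprobs_ref    = torch.tensor([-1.0, -0.9, -1.5])
rm_score        = torch.tensor(0.7)    # scalar reward-model score for the whole response

kl_per_token = logprobs_policy - logprobs_ref      # approximate per-token KL contribution
reward = rm_score - beta * kl_per_token.sum()      # penalize drifting from the reference
</syntaxhighlight>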

5. Infrastructure checklist

  • Vectorized environments (run N envs in parallel) for throughput; see the sketch after this list
  • GPU for policy/value network updates
  • Logging: episode return, episode length, entropy, value loss, policy loss
  • Curriculum: start with easier versions of the task, increase difficulty as agent improves
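A sketch of the vectorized-environments item using Gymnasium's synchronous vector wrapper; the environment id and the count of 8 parallel copies are assumptions:

<syntaxhighlight lang="python">
import gymnasium as gym

# Run 8 copies of the environment in one process; observations, rewards,
# and termination flags come back batched along the first dimension.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])

obs, infos = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()           # batched random actions, one per env
    obs, rewards, terminated, truncated, infos = envs.step(actions)
envs.close()
</syntaxhighlight>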