
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ RL Algorithm Comparison
! Algorithm !! On/Off Policy !! Action Space !! Sample Efficiency !! Stability
|-
| DQN || Off || Discrete || Medium || Moderate
|-
| PPO || On || Both || Low || High
|-
| SAC || Off || Continuous || High || High
|-
| TD3 || Off || Continuous || High || High
|-
| A3C || On || Both || Low || Moderate
|}

'''Failure modes and pitfalls:'''
* '''Reward hacking''': The agent finds unintended ways to maximize the reward signal that violate the spirit of the task. Example: a boat-racing agent learned to spin in circles collecting bonuses rather than completing the race.
* '''Sparse rewards''': If reward is given only at episode completion, most transitions carry no learning signal and learning is extremely slow. Mitigate with reward shaping, curriculum learning, or intrinsic motivation (curiosity).
* '''Sample inefficiency''': Model-free RL requires enormous amounts of interaction data; AlphaGo, for example, needed millions of self-play games. Real-world robots cannot afford that much interaction, so train in simulation or use model-based approaches.
* '''Catastrophic forgetting''': As the policy improves, the agent trains mostly on recent experience, and its network can overwrite what it learned from earlier states, so performance on those states degrades. Experience replay buffers and periodic re-evaluation mitigate this.
* '''Distribution shift''': The policy changes during training, so data collected under an old policy becomes stale for the new one. On-policy methods such as PPO discard old data after each update, while off-policy methods such as DQN and SAC are designed to reuse it.
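
Two of the mitigations named in the list, reward shaping and experience replay, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names <code>shaped_reward</code>, <code>ReplayBuffer</code>, and <code>potential</code>, the five-state corridor, and the distance-based potential are all hypothetical choices made for the example. The shaping term uses the potential-based form (shaped reward = r + γ·φ(s′) − φ(s)), which is one common way to densify a sparse reward.

```python
import random
from collections import deque


def shaped_reward(reward, potential_s, potential_next, gamma=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential_next - potential_s


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently drops the oldest item once full
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experience
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def potential(state, goal=4):
    """Assumed potential: negative distance to the goal state."""
    return -abs(goal - state)


# Toy corridor with states 0..4; the sparse reward is 1.0 only on reaching 4.
buffer = ReplayBuffer(capacity=3, seed=0)
for s in range(4):
    s_next = s + 1
    sparse = 1.0 if s_next == 4 else 0.0
    r = shaped_reward(sparse, potential(s), potential(s_next), gamma=1.0)
    buffer.add((s, +1, r, s_next, s_next == 4))

batch = buffer.sample(2)  # a minibatch drawn from the stored transitions
```

With the shaping term, every step toward the goal earns a nonzero reward instead of only the final one. The small capacity is deliberate: the transition from state 0 is evicted once the buffer fills, which shows why buffer size matters when replay is used to retain older experience.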
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">