== Understanding ==

The core RL loop is: observe state → choose action → receive reward → update policy. The agent's goal is to maximize the expected cumulative discounted reward over time, called the '''return''':

<math>G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}</math>

The discount factor γ controls how myopic the agent is: γ = 0 means only immediate rewards matter, while γ → 1 means the agent weighs the long-term future almost as heavily as the present.

'''The exploration-exploitation dilemma''' is fundamental: an agent that only exploits what it knows may miss better strategies, while an agent that only explores never benefits from what it learns. The ε-greedy strategy is a simple compromise: with probability ε, take a random action (explore); otherwise, take the best known action (exploit). ε is typically annealed from high to low over the course of training.

'''Policy gradient methods''' directly optimize the policy by adjusting its parameters to increase the probability of actions that led to high returns. Think of it as hill-climbing in policy space.

'''Value-based methods''' learn the Q-function first, then derive the policy as "always take the action with the highest Q-value." DQN famously stabilized this approach with two innovations: experience replay (sampling random past transitions to break their correlation) and a target network (a frozen copy of the Q-network that is updated only periodically).

'''Actor-critic methods''' combine both: an actor (policy network) decides actions, while a critic (value network) evaluates them. The critic provides a baseline that reduces variance in learning; rather than waiting for sparse, delayed rewards, the agent gets a dense signal from the learned value estimate. Short code sketches of each of these ideas follow.
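The return lends itself to a one-line recursion, G_t = r_{t+1} + γ·G_{t+1}, evaluated backwards from the end of an episode. A minimal sketch, with the reward values and γ made up for illustration:

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    for a finite episode, by iterating backwards with
    G_t = r_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards received after time t, gamma = 0.9:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
</syntaxhighlight>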
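The ε-greedy rule and a typical annealing schedule are short enough to show directly. This sketch assumes Q-value estimates are already stored in a list; the linear decay over 10,000 steps is an arbitrary placeholder:

<syntaxhighlight lang="python">
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (uniform random action);
    otherwise exploit (argmax over the current Q-value estimates)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def annealed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly decay epsilon from `start` to `end` over `decay_steps`."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
</syntaxhighlight>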
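The policy-gradient idea can be sketched on a stateless problem (a multi-armed bandit), which keeps the code self-contained. This is a REINFORCE-style update for a softmax policy; the bandit's reward means and the learning rate are invented for the example:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)            # policy parameters: one preference per action
alpha = 0.1                    # learning rate (arbitrary)
true_means = [0.1, 0.5, 0.9]   # hidden reward means of a 3-armed bandit

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = rng.normal(true_means[a], 0.1)  # one-step episode, so reward = return
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # Hill-climb: raise the probability of actions that earned high return.
    theta += alpha * r * grad_log_pi

print(softmax(theta))  # mass should concentrate on the best arm (index 2)
</syntaxhighlight>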
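DQN's two stabilizers can be shown in isolation. To stay self-contained, this sketch substitutes a tabular Q-function for the neural network; the buffer capacity, batch size, and learning rate are placeholders:

<syntaxhighlight lang="python">
import random
from collections import deque

import numpy as np

capacity, batch_size, gamma = 10_000, 32, 0.99
replay = deque(maxlen=capacity)              # experience replay buffer

n_states, n_actions = 16, 4
q_online = np.zeros((n_states, n_actions))   # updated every training step
q_target = q_online.copy()                   # frozen copy, synced periodically

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def train_step(lr=0.1):
    if len(replay) < batch_size:
        return
    # Uniform random sampling breaks the temporal correlation
    # between consecutive transitions.
    for s, a, r, s_next, done in random.sample(replay, batch_size):
        # Bootstrapping from the *frozen* network keeps the regression
        # target from chasing the very values being updated.
        target = r if done else r + gamma * q_target[s_next].max()
        q_online[s, a] += lr * (target - q_online[s, a])

def sync_target():
    """Call every N steps to refresh the frozen copy."""
    q_target[:] = q_online
</syntaxhighlight>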
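Finally, a one-step actor-critic update in the same tabular style. The TD error acts as the advantage estimate: the critic's baseline v(s) is subtracted from the bootstrapped target, giving the actor a dense, lower-variance signal at every step. The state and action counts and the step sizes are again arbitrary:

<syntaxhighlight lang="python">
import numpy as np

n_states, n_actions, gamma = 8, 2, 0.99
theta = np.zeros((n_states, n_actions))  # actor: softmax preferences per state
v = np.zeros(n_states)                   # critic: state-value estimates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next, done,
                      alpha_actor=0.05, alpha_critic=0.1):
    # TD error = (bootstrapped target) - (critic's baseline v[s]):
    # a dense signal available every step, with no wait for episode end.
    td_target = r if done else r + gamma * v[s_next]
    td_error = td_target - v[s]
    v[s] += alpha_critic * td_error          # critic moves toward the target
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi  # actor follows the critic
</syntaxhighlight>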