
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Agent''' — The learner and decision-maker that interacts with the environment.
* '''Environment''' — Everything the agent interacts with; it receives actions and returns observations and rewards.
* '''State (s)''' — A representation of the current situation of the environment.
* '''Action (a)''' — A choice made by the agent at each time step.
* '''Reward (r)''' — A scalar signal provided by the environment indicating how good or bad an action was.
* '''Policy (π)''' — A mapping from states to actions (or to probability distributions over actions) that defines the agent's behavior.
* '''Value function (V)''' — An estimate of the expected cumulative (discounted) future reward from a given state when following a policy.
* '''Q-function (Q)''' — An estimate of the expected cumulative (discounted) future reward from taking action a in state s and then following policy π.
* '''Episode''' — A sequence of states, actions, and rewards from an initial state to a terminal state.
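* '''Discount factor (γ)''' — A value between 0 and 1 that reduces the weight of future rewards relative to immediate ones; values near 0 make the agent short-sighted, while values near 1 make it weigh long-term reward more heavily.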
* '''Exploration vs. exploitation''' — The trade-off between trying new actions (exploration) and repeating known good actions (exploitation).
* '''Markov Decision Process (MDP)''' — The mathematical framework for RL problems, defined by states, actions, transitions, and rewards.
* '''Model-free RL''' — Methods that learn directly from interaction without building an explicit model of the environment.
* '''Model-based RL''' — Methods that learn a model of the environment's dynamics and use it to plan.
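* '''PPO (Proximal Policy Optimization)''' — A widely used policy gradient algorithm that limits how far each update can move the policy (typically by clipping the probability ratio between the new and old policy), which tends to make training stable.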
* '''DQN (Deep Q-Network)''' — A Q-learning algorithm using a neural network to approximate the Q-function.
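As a rough sketch of how several of the terms above fit together, the quantities below are written in commonly used notation. The return symbol ''G'' and the time index ''t'' are notation assumed here for illustration and do not come from the list above; ''r'', γ, π, ''V'', and ''Q'' are used as defined there.

:<math>G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}</math>
:<math>V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right]</math>
:<math>Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s,\ a_t = a \,\right]</math>

For example, with γ = 0.9 and a reward of 1 on each of three consecutive steps, the return is 1 + 0.9 + 0.81 = 2.71.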
</div>
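The terms defined above can be seen working together in a small example. The following is a minimal sketch of tabular Q-learning (a model-free method) with an epsilon-greedy policy. The five-state corridor environment, the function names (<code>step</code>, <code>choose_action</code>, <code>train</code>), and the values chosen for the learning rate and exploration probability are all hypothetical and picked only for illustration.

```python
import random

# Hypothetical 5-state corridor: states 0..4, start at 0, state 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GAMMA = 0.9       # discount factor
ALPHA = 0.5       # learning rate (assumed value)
EPSILON = 0.1     # probability of exploring (assumed value)


def step(state, action):
    """Environment: return (next_state, reward, done) for one time step."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q, state, rng):
    """Epsilon-greedy policy: explore with probability EPSILON, else exploit."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties randomly so the untrained agent does not favor one action.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=200, seed=0):
    """Run tabular Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:  # one episode: initial state to terminal state
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Terminal states have no future reward to bootstrap from.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy: the highest-valued action in each non-terminal state.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this sketch the only reward comes from reaching the terminal state, so the learned greedy policy should pick action 1 (move right) in every non-terminal state, and the Q-values for moving right shrink by a factor of γ for each extra step away from the goal.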

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">