
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an RL system from scratch requires careful specification of each MDP component, plus the training pipeline and infrastructure around it:

'''1. State space design'''
* What information does the agent need to make good decisions?
* Raw pixels vs. hand-crafted features vs. learned embeddings?
* Partial observability? → Use a recurrent policy (e.g., an LSTM) or frame stacking so the agent can infer hidden state from recent observations
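
Frame stacking can be sketched in a few lines. The class below is a minimal illustration, not a library API: the name <code>FrameStack</code>, its methods, and the string observations are all assumptions made for the example.

<syntaxhighlight lang="python">
from collections import deque


class FrameStack:
    """Hypothetical helper: keeps the last k observations as one stacked state."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with copies of the first observation so the
        # stacked state has a fixed length from the first step onward.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(obs)
        return tuple(self.frames)


stack = FrameStack(k=3)
s0 = stack.reset("o0")
s1 = stack.step("o1")
s2 = stack.step("o2")
</syntaxhighlight>

In practice the observations would be image arrays or feature vectors, and the stacked tuple would be concatenated before being fed to the policy.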

'''2. Action space design'''
* Discrete (choose from N options) vs. continuous (output a vector)?
* Action parameterization matters: joint angles vs. end-effector positions in robotics
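
The discrete and continuous cases lead to different sampling code. The sketch below assumes a softmax policy for discrete actions and a tanh-squashed Gaussian for continuous ones; both function names and all parameters are illustrative choices, not a fixed API.

<syntaxhighlight lang="python">
import math
import random


def sample_discrete(logits, rng):
    """Sample an action index from a softmax distribution over logits."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_continuous(mean, log_std, low, high, rng):
    """Sample a Gaussian action vector, squash with tanh, rescale to [low, high]."""
    action = []
    for mu, ls in zip(mean, log_std):
        raw = rng.gauss(mu, math.exp(ls))
        squashed = math.tanh(raw)  # now in [-1, 1]
        action.append(low + 0.5 * (squashed + 1.0) * (high - low))
    return action


rng = random.Random(0)
a = sample_discrete([0.1, 2.0, -1.0], rng)
u = sample_continuous([0.0, 0.5], [-1.0, -1.0], -2.0, 2.0, rng)
</syntaxhighlight>

Squashing keeps every sampled action inside the actuator limits, which is one reason the choice of parameterization interacts with the policy's output layer.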

'''3. Reward function design'''
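* Dense vs. sparse: dense rewards usually speed up learning, but a poorly chosen dense signal can be exploited (reward hacking), so check that it still matches the true objective
* Normalize or clip rewards to a bounded range such as [-1, 1] or [0, 1] to keep value targets and gradients well scaled
* Potential-based shaping, F(s, a, s′) = γΦ(s′) − Φ(s), leaves the optimal policy unchanged (policy invariance)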
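
As a minimal sketch of potential-based shaping, assume a one-dimensional task with integer states and a goal at state 5. The potential function, <code>GOAL</code>, and <code>GAMMA</code> are hypothetical choices for illustration; the terminal potential is set to zero, a common convention.

<syntaxhighlight lang="python">
GAMMA = 0.99  # discount factor (assumed value)
GOAL = 5      # hypothetical goal state


def potential(state):
    """Phi(s): negative distance to the goal, so closer states score higher."""
    return -abs(GOAL - state)


def shaped_reward(reward, state, next_state, done, gamma=GAMMA):
    """Return r + gamma * Phi(s') - Phi(s), with Phi(terminal) taken as 0."""
    next_phi = 0.0 if done else potential(next_state)
    return reward + gamma * next_phi - potential(state)


toward = shaped_reward(0.0, 3, 4, done=False)  # step toward the goal
away = shaped_reward(0.0, 3, 2, done=False)    # step away from the goal
final = shaped_reward(1.0, 4, 5, done=True)    # reaching the goal
</syntaxhighlight>

Steps toward the goal receive a positive bonus and steps away a negative one, while the shaping terms telescope over a trajectory and so do not change which policy is optimal.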

'''4. Pipeline architecture for RLHF'''
<syntaxhighlight lang="text">
[Human Preference Data]
        ↓
[Train Reward Model (RM)]
        ↓
[Freeze RM] + [Reference LLM (frozen)]
        ↓
[PPO: Fine-tune LLM with RM signal + KL penalty]
        ↓
[Evaluated LLM Policy]
</syntaxhighlight>
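
The "RM signal + KL penalty" step can be illustrated with a sequence-level reward. This is a simplified sketch under stated assumptions: the function name and <code>beta</code> value are hypothetical, the KL term is a single-sample estimate from the log-probabilities of the sampled tokens, and real implementations often apply the penalty per token instead.

<syntaxhighlight lang="python">
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled estimate of the sequence KL.

    policy_logprobs / ref_logprobs: log-probabilities that the fine-tuned
    policy and the frozen reference model assign to each sampled token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # Sum of per-token log-ratios: an estimate of KL(policy || reference).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


reward = kl_penalized_reward(
    rm_score=1.0,
    policy_logprobs=[-1.0, -0.5],
    ref_logprobs=[-1.2, -0.9],
    beta=0.1,
)
</syntaxhighlight>

A larger <code>beta</code> keeps the fine-tuned model closer to the reference LLM at the cost of a lower reward-model score.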

'''5. Infrastructure checklist'''
* Vectorized environments (run N envs in parallel) for throughput
* GPU for policy/value network updates
* Logging: episode return, episode length, entropy, value loss, policy loss
* Curriculum: start with easier versions of the task, increase difficulty as agent improves
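
The curriculum item can be sketched as a small scheduler that raises the difficulty level once the rolling mean return clears a threshold. The class name, threshold, and window size are assumptions for illustration only.

<syntaxhighlight lang="python">
from collections import deque


class Curriculum:
    """Hypothetical scheduler: advance one level when recent returns are high enough."""

    def __init__(self, max_level, threshold, window):
        self.level = 0
        self.max_level = max_level
        self.threshold = threshold
        self.returns = deque(maxlen=window)

    def record(self, episode_return):
        """Log one episode return and return the (possibly updated) level."""
        self.returns.append(episode_return)
        window_full = len(self.returns) == self.returns.maxlen
        if window_full and self.level < self.max_level:
            mean_return = sum(self.returns) / len(self.returns)
            if mean_return >= self.threshold:
                self.level += 1
                # Start a fresh window so the new level is judged on its own.
                self.returns.clear()
        return self.level


curriculum = Curriculum(max_level=2, threshold=0.8, window=3)
levels = [curriculum.record(r) for r in [1.0, 1.0, 0.2, 1.0, 1.0, 1.0]]
</syntaxhighlight>

The environment would read <code>curriculum.level</code> at each reset to choose the task variant; logging the level alongside episode return makes jumps in the learning curve easier to interpret.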

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Reinforcement Learning]]
</div>