== Remembering ==
* '''RLHF (Reinforcement Learning from Human Feedback)''' – A training method that uses human preference comparisons to optimize model behavior via reinforcement learning.
* '''Supervised Fine-Tuning (SFT)''' – The first RLHF stage: fine-tune the base LLM on high-quality demonstration data.
* '''Reward model''' – A model trained on human preference comparisons to predict which outputs humans prefer; it produces scalar reward scores (see the loss sketch below).
* '''Preference comparison''' – Human labelers are shown two model outputs and asked which is better; these judgments are collected at scale for reward model training.
* '''Proximal Policy Optimization (PPO)''' – The RL algorithm used in the original RLHF recipe to update the LLM policy using reward model scores.
* '''KL divergence penalty''' – A term added to the PPO reward to prevent the policy from drifting too far from the SFT model, which helps avoid reward hacking (see the shaping sketch below).
* '''Reward hacking''' – When the model learns to maximize the reward model's score while violating the spirit of the objective.
* '''Constitutional AI (CAI)''' – Anthropic's variant: the AI critiques its own outputs against a constitution of principles, reducing the need for human feedback.
* '''DPO (Direct Preference Optimization)''' – A simpler alternative to RLHF that skips the reward model and optimizes directly on preference data; now widely used (see the loss sketch below).
* '''RLAIF (RL from AI Feedback)''' – Using an AI model instead of humans to provide preference feedback; scalable, but a weaker signal.
* '''Helpfulness, Harmlessness, Honesty (HHH)''' – Anthropic's framework for RLHF alignment goals.
* '''Sycophancy''' – A failure mode in which RLHF-trained models learn to tell users what they want to hear rather than what is true.
* '''PPO-clip''' – PPO's clipping mechanism, which prevents excessively large policy updates (see the objective sketch below).
* '''Best-of-N sampling''' – A simple RLHF-free alternative: generate N outputs and use the reward model to select the best (sketched below).
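The reward model is typically trained with a Bradley-Terry style pairwise loss over the preference comparisons described above. The following is a minimal PyTorch sketch, not code from any particular RLHF implementation; the argument names and tensor shapes are illustrative assumptions.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward model training.

    chosen_scores / rejected_scores: shape (batch,), scalar rewards the
    model assigned to the human-preferred and dispreferred responses.
    (Names are illustrative, not from a specific library.)
    """
    # Maximize the log-probability that the chosen response outranks
    # the rejected one: log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
</syntaxhighlight>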
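During the PPO stage, the reward model's score is commonly shaped with a KL penalty against the frozen SFT model. A minimal sketch, assuming per-token log-probabilities from both models are already available (all names here are illustrative):

<syntaxhighlight lang="python">
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  sft_logprobs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    """KL-penalized reward used as the PPO training signal.

    rm_score: scalar reward model score for the full response.
    policy_logprobs / sft_logprobs: per-token log-probs of the sampled
    response under the current policy and the frozen SFT model.
    """
    # Sample-based KL estimate: log pi(y|x) - log pi_sft(y|x), summed
    # over response tokens. Subtracting it penalizes drift from SFT.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    return rm_score - kl_coef * kl
</syntaxhighlight>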
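PPO-clip limits the size of each policy update by clipping the probability ratio between the new and old policies. A minimal sketch of the standard clipped surrogate objective, not tied to any specific RLHF codebase:

<syntaxhighlight lang="python">
import torch

def ppo_clip_objective(ratio: torch.Tensor,
                       advantage: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled action or token.
    advantage: advantage estimate for the same samples.
    """
    unclipped = ratio * advantage
    # Clipping removes any incentive to push the ratio outside
    # [1 - eps, 1 + eps], which caps the size of a single update.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()
</syntaxhighlight>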
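DPO replaces the reward model and RL loop with a single classification-style loss on preference pairs, computed from the policy's and a frozen reference model's log-probabilities. A minimal sketch of the DPO loss as published; the argument names are illustrative assumptions:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is the summed log-prob of the chosen/rejected response
    under the trained policy or the frozen reference (SFT) model.
    """
    # The implicit reward of a response is beta * log(pi / pi_ref);
    # DPO maximizes the log-sigmoid of the chosen-minus-rejected margin.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()
</syntaxhighlight>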
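Best-of-N needs no policy update at all: sample N completions and keep the one the reward model scores highest. A sketch assuming hypothetical <code>generate</code> and <code>score</code> callables (neither is a real library API):

<syntaxhighlight lang="python">
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Return the highest-reward completion out of n samples.

    generate: samples one completion for a prompt (hypothetical helper).
    score: reward model score for (prompt, completion) (hypothetical helper).
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
</syntaxhighlight>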