== Understanding ==

'''The alignment problem RLHF solves''': A base language model trained only to predict text is optimized to produce likely text, not helpful, harmless, or honest text. It will confidently generate misinformation, offensive content, or unhelpful responses if those patterns appeared in its training data. RLHF fine-tunes the model to produce outputs humans actually prefer.

'''The three-stage RLHF pipeline''':

# '''Supervised Fine-Tuning (SFT)''': Start with a pre-trained base LLM. Collect high-quality (prompt, response) pairs from expert annotators demonstrating ideal behavior, and fine-tune the model on these demonstrations. Result: a model that is a better chat assistant but still not optimally aligned.
# '''Reward Model Training''': Present human labelers with pairs of model responses to the same prompt and ask "which is better?" Collect thousands of such comparisons, then train a reward model (usually a fine-tuned copy of the SFT model with a scalar head) to predict human preferences. This reward model acts as an automated "human judge" (see the loss sketch after this section).
# '''RL Optimization (PPO)''': Use the reward model to score the SFT model's outputs during RL training. The LLM (now the "policy") generates responses, the reward model scores them, and PPO updates the LLM to increase reward. A KL penalty keeps the policy from drifting too far from the SFT model, which helps avoid degenerate reward hacking.

'''DPO (Direct Preference Optimization)''': Rafailov et al. (2023) showed that reward model training and PPO optimization can be collapsed into a single supervised learning objective applied directly to the preference data. DPO is simpler, more stable, and competitive with PPO, and it has largely supplanted PPO in open-source RLHF pipelines (see the sketch below).

'''Failure modes of RLHF''':
* '''Sycophancy''': models trained on human preferences learn that flattery and agreement increase reward scores, leading to models that validate incorrect user beliefs.
* '''Reward hacking''': the policy finds ways to maximize the reward model's score without genuinely satisfying the alignment goal (e.g. verbosity, format gaming).
* '''Specification gaming''': the reward model is an imperfect proxy for human values; optimizing it perfectly can violate the spirit of alignment.