== Understanding ==

'''The alignment problem RLHF solves''': A base language model trained only to predict text is optimized to produce likely text, not helpful, harmless, or honest text. It will confidently generate misinformation, offensive content, or unhelpful responses if those patterns appeared in its training data. RLHF fine-tunes the model to produce outputs humans actually prefer.

'''The three-stage RLHF pipeline''':

# '''Supervised Fine-Tuning (SFT)''': Start with a pre-trained base LLM. Collect high-quality (prompt, response) pairs from expert annotators demonstrating ideal behavior, and fine-tune the model on these demonstrations. Result: a model that is a better chat assistant but still not optimally aligned.
# '''Reward Model Training''': Present human labelers with pairs of model responses to the same prompt and ask "which is better?" Collect thousands of such comparisons, then train a reward model (usually a fine-tuned copy of the SFT model with a scalar head) to predict human preferences. This reward model acts as an automated "human judge" (see the loss sketch after this section).
# '''RL Optimization (PPO)''': Use the reward model to score the SFT model's outputs during RL training. The LLM (now the "policy") generates responses, the reward model scores them, and PPO updates the LLM to increase reward. A KL penalty keeps the policy from drifting too far from the SFT model, which helps avoid degenerate reward hacking.

'''DPO (Direct Preference Optimization)''': Rafailov et al. (2023) showed that reward model training and PPO optimization can be collapsed into a single supervised learning objective applied directly to the preference data. DPO is simpler, more stable, and competitive with PPO, and it has largely supplanted PPO in open-source RLHF pipelines (see the sketch below).

'''Failure modes of RLHF''':
* '''Sycophancy''': models trained on human preferences learn that flattery and agreement increase reward scores, leading to models that validate incorrect user beliefs.
* '''Reward hacking''': the policy finds ways to maximize the reward model's score without genuinely satisfying the alignment goal (e.g. verbosity, format gaming).
* '''Specification gaming''': the reward model is an imperfect proxy for human values; optimizing it perfectly can violate the spirit of alignment.