<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Reinforcement Learning from Human Feedback (RLHF) is the training technique that transformed large language models from raw text predictors into helpful, harmless, and honest conversational assistants. By collecting human preference data – asking people to compare model outputs and choose the better one – and using these preferences to train a reward model, RLHF aligns LLM behavior with human values and intentions. ChatGPT, Claude, and Gemini all rely on RLHF or closely related variants to achieve their conversational quality.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''RLHF (Reinforcement Learning from Human Feedback)''' – A training method using human preference comparisons to optimize model behavior via reinforcement learning.
* '''Supervised Fine-Tuning (SFT)''' – The first RLHF stage: fine-tune the base LLM on high-quality demonstration data.
* '''Reward model''' – A model trained on human preference comparisons to predict which outputs humans prefer; produces scalar reward scores.
* '''Preference comparison''' – Human labelers are shown two model outputs and asked which is better; collected at scale for reward model training.
* '''Proximal Policy Optimization (PPO)''' – The RL algorithm used in original RLHF to update the LLM policy using reward model scores.
* '''KL divergence penalty''' – Added to PPO to prevent the policy from drifting too far from the SFT model (avoids reward hacking).
* '''Reward hacking''' – When the model learns to maximize the reward model score while violating the spirit of the objective.
* '''Constitutional AI (CAI)''' – Anthropic's variant: the AI critiques its own outputs against a constitution of principles, reducing the need for human feedback.
* '''DPO (Direct Preference Optimization)''' – A simpler RLHF alternative that skips the reward model and directly optimizes preferences; now widely used.
* '''RLAIF (RL from AI Feedback)''' – Using an AI model instead of humans to provide preference feedback; scalable but weaker signal.
* '''Helpfulness, Harmlessness, Honesty (HHH)''' – Anthropic's framework for RLHF alignment goals.
* '''Sycophancy''' – A failure mode where RLHF-trained models learn to tell users what they want to hear rather than what is true.
* '''PPO-clip''' – PPO's clipping mechanism preventing excessively large policy updates.
* '''Best-of-N sampling''' – A simple RLHF-free alternative: generate N outputs, use a reward model to select the best.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
'''The alignment problem RLHF solves''': A base language model trained only to predict text is optimized to produce likely text – not helpful, harmless, or honest text. It will confidently generate misinformation, offensive content, or unhelpful responses if those patterns appeared in training data. RLHF fine-tunes the model to produce outputs humans actually prefer.

'''The three-stage RLHF pipeline''':
# '''Supervised Fine-Tuning (SFT)''': Start with a pre-trained base LLM. Collect high-quality (prompt, response) pairs from expert annotators demonstrating ideal behavior. Fine-tune the model on these demonstrations. Result: a model that's a better chat assistant but still not optimally aligned.
# '''Reward Model Training''': Present human labelers with pairs of model responses for the same prompt; ask "which is better?" Collect thousands of such comparisons. Train a reward model (usually a fine-tuned version of the SFT model with a scalar head) to predict human preferences. This reward model acts as an automated "human judge."
# '''RL Optimization (PPO)''': Use the reward model to score the SFT model's outputs during RL training. The LLM (now the "policy") generates responses; the reward model scores them; PPO updates the LLM to increase rewards. A KL penalty prevents the policy from drifting too far from SFT (avoiding degenerate reward hacking).

'''DPO (Direct Preference Optimization)''': Rafailov et al. (2023) showed that the reward model training and PPO steps can be combined into a single supervised learning objective directly on preference data. DPO is simpler, more stable, and competitive with PPO – it has largely supplanted PPO in open-source RLHF pipelines.

'''Failure modes of RLHF''': Sycophancy – models trained on human preferences learn that flattery and agreement increase reward scores, leading to models that validate incorrect user beliefs. Reward hacking – the policy finds ways to maximize the reward model score without genuinely satisfying the alignment goal (verbosity, format gaming). Specification gaming – the reward model is an imperfect proxy for human values; optimizing it perfectly can violate the spirit of alignment.
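The DPO objective described above reduces to a simple per-example loss on sequence log-probabilities. A minimal numeric sketch (the function names and toy log-prob values are illustrative, not from any library):

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    pi_*  -- log-prob of the response under the policy being trained
    ref_* -- log-prob under the frozen reference (SFT) model
    beta  -- weight of the implicit KL constraint
    """
    # Margin: how much more the policy (relative to the reference)
    # favors the chosen response over the rejected one
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2); raising the chosen log-prob lowers the loss.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log(2) ~ 0.693
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))   # lower than log(2)
```

Note that no reward model appears anywhere: the log-probability ratios against the frozen reference play the role of the reward, which is exactly why the reward-model stage can be skipped.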
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''DPO fine-tuning with TRL:'''
<syntaxhighlight lang="python">
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# Load base model + tokenizer
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # Frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset format: {prompt, chosen, rejected}
# "chosen" = preferred response, "rejected" = less preferred
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Example row:
# {"prompt": "What is...", "chosen": "Great answer...", "rejected": "Bad answer..."}

# DPO training config
config = DPOConfig(
    beta=0.1,  # KL penalty weight (higher = stay closer to reference)
    learning_rate=5e-7,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    max_length=1024,
    max_prompt_length=512,
    output_dir="./dpo_output",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Reference model for KL constraint
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./aligned_model")
</syntaxhighlight>

; RLHF variant comparison
: '''Original RLHF (PPO)''' – Best-studied; used in ChatGPT, Claude 1; complex to implement
: '''DPO''' – Simpler; no explicit reward model; now dominant in open-source
: '''RLAIF''' – AI-generated feedback at scale; reduces human annotation cost
: '''Constitutional AI''' – Rule-based self-critique; Anthropic's approach for safety
: '''GRPO (Group Relative Policy Optimization)''' – DeepSeek's variant; no value model needed
: '''Best-of-N''' – Simple baseline; generate N samples, pick highest reward model score
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ RLHF Method Comparison
! Method !! Complexity !! Human Labels Needed !! Reward Hacking Risk !! Open Source Support
|-
| PPO-RLHF || Very high || Many (comparisons) || Medium || TRL, OpenRLHF
|-
| DPO || Low || Many (comparisons) || Low || TRL, Axolotl
|-
| RLAIF || Medium || Few (AI generates) || Medium || Custom
|-
| Constitutional AI || Medium || Few || Low || Custom
|-
| SFT only || Very low || Demonstrations || None || Transformers
|}

'''Failure modes''': Sycophancy – the model tells users what they want to hear. Verbosity bias – human raters often prefer longer responses even when shorter is better; the model learns to pad. Refusal overuse – safety-trained models refuse benign requests. Distribution shift between raters and deployment users causes misaligned behavior. Reward model collapse on out-of-distribution prompts.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
RLHF evaluation:
# '''Win rate''': compare aligned model outputs to the SFT baseline; measure human preference win rate.
# '''MT-Bench''': GPT-4-judged multi-turn dialogue quality benchmark.
# '''LMSYS Chatbot Arena''': human preference tournament across many models.
# '''Safety benchmarks''': TruthfulQA, BBQ (bias), ToxiGen.
# '''Sycophancy probes''': present the model with incorrect user claims; measure whether it corrects or validates them.
# '''Refusal rate''': measure how often the model refuses benign requests – a key alignment failure mode.
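The win-rate metric above is simple to compute from pairwise judgments. A minimal sketch, assuming the common (but not universal) conventions of half credit for ties and a normal-approximation confidence interval:

```python
import math

def win_rate(outcomes):
    """Aggregate pairwise judgments of an aligned model vs. a baseline.

    outcomes -- list of "win", "loss", or "tie"; a tie counts as half a win.
    Returns (rate, (lo, hi)): the win rate and an approximate 95%
    normal-approximation confidence interval, clipped to [0, 1].
    """
    n = len(outcomes)
    score = sum({"win": 1.0, "tie": 0.5, "loss": 0.0}[o] for o in outcomes)
    p = score / n
    half = 1.96 * math.sqrt(p * (1.0 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))

# 6 wins, 2 losses, 2 ties out of 10 comparisons
rate, (lo, hi) = win_rate(["win"] * 6 + ["loss"] * 2 + ["tie"] * 2)
print(rate)  # 0.7
```

With only tens of comparisons the interval is wide, which is why serious win-rate studies collect hundreds of judgments per model pair.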
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Building an RLHF pipeline:
# SFT: collect 1,000–10,000 high-quality (prompt, response) demonstrations; fine-tune the base model.
# Preference data: collect 10,000–100,000 comparison pairs; diverse prompts; multiple annotators per pair.
# Reward model: fine-tune the SFT model with a scalar head on preference data; validate with held-out comparisons.
# DPO (preferred over PPO for simplicity): train directly on preference pairs with TRL's DPOTrainer.
# Evaluation: run MT-Bench and a human win-rate study vs. the SFT baseline.
# Iterate: collect new preference data on current model outputs (on-policy); repeat the alignment cycle.

[[Category:Artificial Intelligence]]
[[Category:Large Language Models]]
[[Category:AI Alignment]]
</div>
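The best-of-N baseline mentioned in the variant comparison takes only a few lines to express. A sketch with placeholder `generate` and `reward` callables standing in for the sampler and the reward model (both are hypothetical stubs, not library APIs):

```python
def best_of_n(prompt, generate, reward, n=8):
    """Generate n candidate responses and keep the one the reward model
    scores highest. No policy update happens; quality comes purely from
    selection at inference time."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy demo: canned "generations", with response length as a stand-in reward
def make_generator(responses):
    it = iter(responses)
    return lambda prompt: next(it)

gen = make_generator(["short", "a much longer answer", "mid-sized one"])
print(best_of_n("What is RLHF?", gen, reward=len, n=3))
# -> "a much longer answer"
```

Using length as the reward here also illustrates verbosity bias: any reward model correlated with length will systematically select padded outputs.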