<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Reinforcement Learning from Human Feedback (RLHF) is the training technique that transformed large language models from raw text predictors into helpful, harmless, and honest conversational assistants. By collecting human preference data – asking people to compare model outputs and choose the better one – and using these preferences to train a reward model, RLHF aligns LLM behavior with human values and intentions. ChatGPT, Claude, and Gemini all rely on RLHF or closely related variants to achieve their conversational quality.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''RLHF (Reinforcement Learning from Human Feedback)''' – A training method using human preference comparisons to optimize model behavior via reinforcement learning.
* '''Supervised Fine-Tuning (SFT)''' – The first RLHF stage: fine-tune the base LLM on high-quality demonstration data.
* '''Reward model''' – A model trained on human preference comparisons to predict which outputs humans prefer; produces scalar reward scores.
* '''Preference comparison''' – Human labelers are shown two model outputs and asked which is better; collected at scale for reward model training.
* '''Proximal Policy Optimization (PPO)''' – The RL algorithm used in original RLHF to update the LLM policy using reward model scores.
* '''KL divergence penalty''' – Added to PPO to prevent the policy from drifting too far from the SFT model (avoids reward hacking).
* '''Reward hacking''' – When the model learns to maximize the reward model score while violating the spirit of the objective.
* '''Constitutional AI (CAI)''' – Anthropic's variant: the AI critiques its own outputs against a constitution of principles, reducing the need for human feedback.
* '''DPO (Direct Preference Optimization)''' – A simpler RLHF alternative that skips the reward model and directly optimizes preferences; now widely used.
* '''RLAIF (RL from AI Feedback)''' – Using an AI model instead of humans to provide preference feedback; scalable but weaker signal.
* '''Helpfulness, Harmlessness, Honesty (HHH)''' – Anthropic's framework for RLHF alignment goals.
* '''Sycophancy''' – A failure mode where RLHF-trained models learn to tell users what they want to hear rather than what is true.
* '''PPO-clip''' – PPO's clipping mechanism preventing excessively large policy updates.
* '''Best-of-N sampling''' – A simple RLHF-free alternative: generate N outputs, use a reward model to select the best.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
'''The alignment problem RLHF solves''': A base language model trained only to predict text is optimized to produce likely text – not helpful, harmless, or honest text. It will confidently generate misinformation, offensive content, or unhelpful responses if those patterns appeared in training data. RLHF fine-tunes the model to produce outputs humans actually prefer.

'''The three-stage RLHF pipeline''':
# '''Supervised Fine-Tuning (SFT)''': Start with a pre-trained base LLM. Collect high-quality (prompt, response) pairs from expert annotators demonstrating ideal behavior. Fine-tune the model on these demonstrations. Result: a model that's a better chat assistant but still not optimally aligned.
# '''Reward Model Training''': Present human labelers with pairs of model responses for the same prompt; ask "which is better?" Collect thousands of such comparisons. Train a reward model (usually a fine-tuned version of the SFT model with a scalar head) to predict human preferences. This reward model acts as an automated "human judge."
# '''RL Optimization (PPO)''': Use the reward model to score the SFT model's outputs during RL training. The LLM (now the "policy") generates responses; the reward model scores them; PPO updates the LLM to increase rewards. A KL penalty prevents the policy from drifting too far from SFT (avoiding degenerate reward hacking).

'''DPO (Direct Preference Optimization)''': Rafailov et al. (2023) showed that the reward model training and PPO steps can be combined into a single supervised learning objective directly on preference data. DPO is simpler, more stable, and competitive with PPO – it has largely supplanted PPO in open-source RLHF pipelines.

'''Failure modes of RLHF''': Sycophancy – models trained on human preferences learn that flattery and agreement increase reward scores, leading to models that validate incorrect user beliefs. Reward hacking – the policy finds ways to maximize the reward model score without genuinely satisfying the alignment goal (verbosity, format gaming). Specification gaming – the reward model is an imperfect proxy for human values; optimizing it perfectly can violate the spirit of alignment.
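The DPO objective described above reduces to a simple per-example loss on sequence log-probabilities. A minimal numeric sketch (the function names and toy log-prob values are illustrative, not from any library):

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    pi_*  -- log-prob of the response under the policy being trained
    ref_* -- log-prob under the frozen reference (SFT) model
    beta  -- weight of the implicit KL constraint
    """
    # Margin: how much more the policy (relative to the reference)
    # favors the chosen response over the rejected one
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2); raising the chosen log-prob lowers the loss.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log(2) ~ 0.693
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))   # lower than log(2)
```

Note that no reward model appears anywhere: the log-probability ratios against the frozen reference play the role of the reward, which is exactly why the reward-model stage can be skipped.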
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''DPO fine-tuning with TRL:'''
<syntaxhighlight lang="python">
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# Load base model + tokenizer
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # Frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset format: {prompt, chosen, rejected}
# "chosen" = preferred response, "rejected" = less preferred
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Example row:
# {"prompt": "What is...", "chosen": "Great answer...", "rejected": "Bad answer..."}

# DPO training config
config = DPOConfig(
    beta=0.1,  # KL penalty weight (higher = stay closer to reference)
    learning_rate=5e-7,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    max_length=1024,
    max_prompt_length=512,
    output_dir="./dpo_output",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Reference model for KL constraint
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./aligned_model")
</syntaxhighlight>

; RLHF variant comparison
: '''Original RLHF (PPO)''' – Best-studied; used in ChatGPT, Claude 1; complex to implement
: '''DPO''' – Simpler; no explicit reward model; now dominant in open-source
: '''RLAIF''' – AI-generated feedback at scale; reduces human annotation cost
: '''Constitutional AI''' – Rule-based self-critique; Anthropic's approach for safety
: '''GRPO (Group Relative Policy Optimization)''' – DeepSeek's variant; no value model needed
: '''Best-of-N''' – Simple baseline; generate N samples, pick highest reward model score
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ RLHF Method Comparison
! Method !! Complexity !! Human Labels Needed !! Reward Hacking Risk !! Open Source Support
|-
| PPO-RLHF || Very high || Many (comparisons) || Medium || TRL, OpenRLHF
|-
| DPO || Low || Many (comparisons) || Low || TRL, Axolotl
|-
| RLAIF || Medium || Few (AI generates) || Medium || Custom
|-
| Constitutional AI || Medium || Few || Low || Custom
|-
| SFT only || Very low || Demonstrations || None || Transformers
|}

'''Failure modes''': Sycophancy – the model tells users what they want to hear. Verbosity bias – human raters often prefer longer responses even when shorter is better; the model learns to pad. Refusal overuse – safety-trained models refuse benign requests. Distribution shift between raters and deployment users causes misaligned behavior. Reward model collapse on out-of-distribution prompts.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
RLHF evaluation:
# '''Win rate''': compare aligned model outputs to the SFT baseline; measure human preference win rate.
# '''MT-Bench''': GPT-4-judged multi-turn dialogue quality benchmark.
# '''LMSYS Chatbot Arena''': human preference tournament across many models.
# '''Safety benchmarks''': TruthfulQA, BBQ (bias), ToxiGen.
# '''Sycophancy probes''': present the model with incorrect user claims; measure whether it corrects or validates them.
# '''Refusal rate''': measure how often the model refuses benign requests – a key alignment failure mode.
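The win-rate metric above is simple to compute from pairwise judgments. A minimal sketch, assuming the common (but not universal) conventions of half credit for ties and a normal-approximation confidence interval:

```python
import math

def win_rate(outcomes):
    """Aggregate pairwise judgments of an aligned model vs. a baseline.

    outcomes -- list of "win", "loss", or "tie"; a tie counts as half a win.
    Returns (rate, (lo, hi)): the win rate and an approximate 95%
    normal-approximation confidence interval, clipped to [0, 1].
    """
    n = len(outcomes)
    score = sum({"win": 1.0, "tie": 0.5, "loss": 0.0}[o] for o in outcomes)
    p = score / n
    half = 1.96 * math.sqrt(p * (1.0 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))

# 6 wins, 2 losses, 2 ties out of 10 comparisons
rate, (lo, hi) = win_rate(["win"] * 6 + ["loss"] * 2 + ["tie"] * 2)
print(rate)  # 0.7
```

With only tens of comparisons the interval is wide, which is why serious win-rate studies collect hundreds of judgments per model pair.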
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Building an RLHF pipeline:
# SFT: collect 1,000–10,000 high-quality (prompt, response) demonstrations; fine-tune the base model.
# Preference data: collect 10,000–100,000 comparison pairs; diverse prompts; multiple annotators per pair.
# Reward model: fine-tune the SFT model with a scalar head on preference data; validate with held-out comparisons.
# DPO (preferred over PPO for simplicity): train directly on preference pairs with TRL's DPOTrainer.
# Evaluation: run MT-Bench and a human win-rate study vs. the SFT baseline.
# Iterate: collect new preference data on current model outputs (on-policy); repeat the alignment cycle.

[[Category:Artificial Intelligence]]
[[Category:Large Language Models]]
[[Category:AI Alignment]]
</div>
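The best-of-N baseline mentioned in the variant comparison takes only a few lines to express. A sketch with placeholder `generate` and `reward` callables standing in for the sampler and the reward model (both are hypothetical stubs, not library APIs):

```python
def best_of_n(prompt, generate, reward, n=8):
    """Generate n candidate responses and keep the one the reward model
    scores highest. No policy update happens; quality comes purely from
    selection at inference time."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy demo: canned "generations", with response length as a stand-in reward
def make_generator(responses):
    it = iter(responses)
    return lambda prompt: next(it)

gen = make_generator(["short", "a much longer answer", "mid-sized one"])
print(best_of_n("What is RLHF?", gen, reward=len, n=3))
# -> "a much longer answer"
```

Using length as the reward here also illustrates verbosity bias: any reward model correlated with length will systematically select padded outputs.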