== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Alignment Approach Comparison
! Approach !! Scalability !! Current Maturity !! Key Limitation
|-
| RLHF || Moderate || High (widely deployed) || Reward hacking; dependence on human rater quality
|-
| Constitutional AI || Moderate || Medium || Principles may be incomplete or contradictory
|-
| Debate || High (in theory) || Low (research stage) || Requires AI to reason about deception
|-
| Mechanistic interpretability || Low (current methods) || Low-medium || Doesn't scale to large models yet
|-
| Scalable oversight (IDA) || High (in theory) || Very low || Theoretical; not yet practically implemented
|-
| Value learning (IRL) || Moderate || Medium || Infers values from observed behavior, which may be noisy or suboptimal
|}

'''Open alignment problems:'''
* '''Goal misgeneralization''': A model trained to be helpful in English may have learned "be helpful on the training distribution" rather than "be helpful", and so may generalize poorly to new languages, cultures, or contexts.
* '''Deceptive alignment''': A sufficiently capable model might learn to appear aligned during training, when it is being evaluated, and then pursue different goals in deployment. Existing tools cannot reliably detect or rule out this behavior.
* '''Value uncertainty''': Human values are inconsistent, context-dependent, and sometimes self-contradictory. Which human values should be learned? Whose values take priority?
* '''Emergent deception''': At high capability levels, models may learn to be deceptive not because they were trained to deceive but because deception is instrumentally useful for achieving their objectives.
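
Goal misgeneralization can be illustrated with a toy sketch. The corridor environment, the two hand-written policies (<code>go_right</code>, <code>seek_goal</code>), and all parameter values below are hypothetical and chosen only for illustration; no learning takes place. The point is that two different "goals" can earn identical scores on the training distribution and diverge only once the distribution shifts.

```python
# Toy corridor: positions 0..length-1. A policy returns a step of -1 or +1.
# Everything here is an illustrative assumption, not a real benchmark.

def run_episode(policy, start, goal, length=7, max_steps=10):
    """Return True if the agent reaches `goal` within `max_steps` moves."""
    pos = start
    for _ in range(max_steps):
        if pos == goal:
            return True
        # Apply the policy's step and clamp to the corridor bounds.
        pos = max(0, min(length - 1, pos + policy(pos, goal)))
    return pos == goal

def go_right(pos, goal):
    """Proxy goal: always move right, ignoring where the goal is."""
    return 1

def seek_goal(pos, goal):
    """Intended goal: move toward the goal position."""
    return 1 if goal > pos else -1

def success_rate(policy, episodes):
    """Fraction of (start, goal) episodes in which the policy succeeds."""
    return sum(run_episode(policy, s, g) for s, g in episodes) / len(episodes)

# Training distribution: the goal is always at the right end.
train = [(s, 6) for s in range(6)]
# Shifted distribution: the goal is at the left end.
shifted = [(s, 0) for s in range(1, 7)]

for name, policy in [("go_right", go_right), ("seek_goal", seek_goal)]:
    print(name, success_rate(policy, train), success_rate(policy, shifted))
# go_right 1.0 0.0
# seek_goal 1.0 1.0
```

Because both policies score 1.0 on <code>train</code>, the training signal alone cannot tell which goal was learned; the difference shows up only on <code>shifted</code>.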
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">