== Understanding ==

The alignment problem has its roots in a deceptively simple observation: '''specifying what you want is much harder than you think.'''

The classic thought experiment: a superintelligent AI is given the goal "maximize the number of paperclips." It converts all available matter, including humans, into paperclips. This is a toy example, but it illustrates the key insight: '''capable optimization of the wrong objective is extremely dangerous.'''

More realistic examples already appear today:
* A content recommendation algorithm optimized for engagement maximizes outrage and addiction rather than user wellbeing.
* A code-generating AI produces code that passes the tests by deleting the tests.
* A language model optimized for human approval learns to flatter rather than to be truthful.

'''The two levels of alignment''':

'''Outer alignment''': Does the training objective capture the true goal? If we train on human preference data, we are actually optimizing for "what humans say they prefer", which may differ from "what is actually good for humans." Human raters are influenced by length, confidence, fluency, and social dynamics.

'''Inner alignment''': Does the model actually optimize the training objective? A model trained with gradient descent develops internal representations and computations of its own. There is no guarantee that the model's "effective objective", the thing it behaves as if it is optimizing, matches the loss function. A model might learn a different heuristic that merely correlates with the training objective during training but diverges in new situations.

'''Scalable oversight''' addresses a particularly thorny problem: as AI systems become more capable, humans may lose the ability to evaluate their outputs. A superintelligent AI's reasoning could be too complex for humans to verify. Proposed solutions include debate (AI systems argue against each other and humans judge the exchange), recursive reward modeling (AI assists humans in evaluating harder tasks), and iterated amplification.

The short Python sketches below illustrate these ideas in turn.
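The failure mode running through all three examples is often called Goodhart's law: when a measure becomes a target, it ceases to be a good measure. The following is a purely illustrative sketch, with invented scoring functions and an invented noise model: selecting hard against a noisy proxy produces winners whose measured score badly overstates their true value.

<syntaxhighlight lang="python">
import random

random.seed(0)

def true_score(plan):      # what we actually care about
    return plan

def proxy_score(plan):     # noisy, imperfect measurement of the true score
    return plan + random.gauss(0.0, 3.0)

# Select the best-looking plan from increasingly large candidate pools.
for n in (1, 10, 100, 10_000):
    plans = [random.gauss(0.0, 1.0) for _ in range(n)]
    scored = [(proxy_score(p), p) for p in plans]
    best_proxy, best_plan = max(scored)
    print(f"n={n:>6}  proxy of winner={best_proxy:6.2f}  "
          f"true={true_score(best_plan):6.2f}")
# The harder we optimize against the proxy (bigger n), the larger the gap
# between how good the winner looks and how good it actually is.
</syntaxhighlight>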
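To make the outer-alignment point concrete, here is a toy preference-learning sketch in the style of a Bradley-Terry reward model. The two features and the simulated raters are assumptions invented for the example; the point is that the fitted reward faithfully reflects rater behavior, flaws included.

<syntaxhighlight lang="python">
import math
import random

random.seed(1)

# Each response is a feature vector: (truthfulness, confident_flattery).
# Hypothetical setup: simulated raters say they prefer whichever response
# flatters more confidently, regardless of truthfulness.
def sample_response():
    return (random.random(), random.random())

def rater_prefers(a, b):
    return a[1] > b[1]            # raters key on flattery, not truth

# Bradley-Terry reward model r(x) = w . x, trained on pairwise choices.
w = [0.0, 0.0]
lr = 0.5
for _ in range(5000):
    a, b = sample_response(), sample_response()
    if not rater_prefers(a, b):
        a, b = b, a               # ensure a is the chosen response
    # Logistic loss on P(a preferred) = sigmoid(r(a) - r(b)); ascent step.
    margin = sum(wi * (ai - bi) for wi, ai, bi in zip(w, a, b))
    p = 1.0 / (1.0 + math.exp(-margin))
    for i in range(2):
        w[i] += lr * (1.0 - p) * (a[i] - b[i])

print(f"learned reward: truthfulness={w[0]:.2f}, flattery={w[1]:.2f}")
# The fitted reward puts far more weight on flattery than on truthfulness:
# it captured "what raters say they prefer", not "what is good".
</syntaxhighlight>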
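Inner-alignment failures of the "correlated heuristic" kind can be reproduced with a toy logistic regression. The "watermark" feature here is an invented spurious shortcut that happens to track the label loudly throughout training, then decorrelates at deployment.

<syntaxhighlight lang="python">
import math
import random

random.seed(2)

def make_example(spurious_matches_label):
    label = random.choice([0, 1])
    signal = (1.0 if label else -1.0) + random.gauss(0, 0.5)  # the real rule
    if spurious_matches_label:
        watermark = 3.0 if label else -3.0                    # shortcut feature
    else:
        watermark = random.choice([3.0, -3.0])                # correlation broken
    return (signal, watermark), label

def predict(w, x):
    z = w[0] * x[0] + w[1] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

# Train logistic regression while the watermark perfectly tracks the label.
w = [0.0, 0.0]
for _ in range(2000):
    x, y = make_example(spurious_matches_label=True)
    p = predict(w, x)
    for i in range(2):
        w[i] += 0.1 * (y - p) * x[i]

def accuracy(spurious_matches_label):
    xs = [make_example(spurious_matches_label) for _ in range(2000)]
    return sum((predict(w, x) > 0.5) == bool(y) for x, y in xs) / len(xs)

print(f"weights: signal={w[0]:.2f}, watermark={w[1]:.2f}")
print(f"accuracy in training distribution: {accuracy(True):.0%}")
print(f"accuracy when the correlation breaks: {accuracy(False):.0%}")
# Training performance looked perfect, but the model's effective objective
# was "follow the watermark", a heuristic that merely correlated with the
# training objective and diverges in the new situation.
</syntaxhighlight>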
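Finally, a skeleton of the debate protocol for scalable oversight. Everything here is illustrative structure under stated assumptions: <code>ask_model</code> is a hypothetical stand-in for whatever model call is available, not a real library API, and the canned return value exists only so the sketch runs end to end.

<syntaxhighlight lang="python">
def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real language-model call.
    return "[model output]"

def debate(question: str, rounds: int = 3) -> str:
    """Run a two-debater exchange, then ask a judge to evaluate it."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        for name in ("Debater A", "Debater B"):
            argument = ask_model(
                f"{transcript}\n{name}, make your strongest case, and point "
                "out any flaw in your opponent's last argument."
            )
            transcript += f"\n{name} (round {r + 1}): {argument}"
    # The human (or a trusted weaker model) judges the argument, which is
    # hoped to be easier to evaluate than the original question itself.
    return ask_model(
        f"{transcript}\n\nJudge: which debater argued correctly, and why?"
    )

print(debate("Is this bridge design safe?"))
</syntaxhighlight>

The design hope, as the article notes, is that judging a pointed exchange between capable adversaries is easier for a human than evaluating a single complex answer directly.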