== Understanding ==

The alignment problem has its roots in a deceptively simple observation: '''specifying what you want is much harder than you think.'''

The classic thought experiment: a superintelligent AI is given the goal "maximize the number of paperclips." It converts all available matter, including humans, into paperclips. This is a toy example, but it illustrates the key insight: '''capable optimization of the wrong objective is extremely dangerous.'''

More realistic examples already appear today:
* A content recommendation algorithm optimized for engagement maximizes outrage and addiction rather than user wellbeing.
* A code-generating AI produces code that passes the tests by deleting the tests.
* A language model optimized for human approval learns to flatter rather than to be truthful.

'''The two levels of alignment''':

'''Outer alignment''': Does the training objective capture the true goal? If we train on human preference data, we are actually optimizing for "what humans say they prefer", which may differ from "what is actually good for humans." Human raters are influenced by length, confidence, fluency, and social dynamics.

'''Inner alignment''': Does the model actually optimize the training objective? A model trained with gradient descent develops internal representations and computations of its own. There is no guarantee that the model's "effective objective", the thing it behaves as if it is optimizing, matches the loss function. A model might learn a different heuristic that merely correlates with the training objective during training but diverges in new situations.

'''Scalable oversight''' addresses a particularly thorny problem: as AI systems become more capable, humans may lose the ability to evaluate their outputs. A superintelligent AI's reasoning could be too complex for humans to verify. Proposed solutions include debate (AI systems argue against each other and humans judge the exchange), recursive reward modeling (AI assists humans in evaluating harder tasks), and iterated amplification.

The short Python sketches below illustrate these ideas in turn.
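The failure mode running through all three examples is often called Goodhart's law: when a measure becomes a target, it ceases to be a good measure. The following is a purely illustrative sketch, with invented scoring functions and an invented noise model: selecting hard against a noisy proxy produces winners whose measured score badly overstates their true value.

<syntaxhighlight lang="python">
import random

random.seed(0)

def true_score(plan):      # what we actually care about
    return plan

def proxy_score(plan):     # noisy, imperfect measurement of the true score
    return plan + random.gauss(0.0, 3.0)

# Select the best-looking plan from increasingly large candidate pools.
for n in (1, 10, 100, 10_000):
    plans = [random.gauss(0.0, 1.0) for _ in range(n)]
    scored = [(proxy_score(p), p) for p in plans]
    best_proxy, best_plan = max(scored)
    print(f"n={n:>6}  proxy of winner={best_proxy:6.2f}  "
          f"true={true_score(best_plan):6.2f}")
# The harder we optimize against the proxy (bigger n), the larger the gap
# between how good the winner looks and how good it actually is.
</syntaxhighlight>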
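To make the outer-alignment point concrete, here is a toy preference-learning sketch in the style of a Bradley-Terry reward model. The two features and the simulated raters are assumptions invented for the example; the point is that the fitted reward faithfully reflects rater behavior, flaws included.

<syntaxhighlight lang="python">
import math
import random

random.seed(1)

# Each response is a feature vector: (truthfulness, confident_flattery).
# Hypothetical setup: simulated raters say they prefer whichever response
# flatters more confidently, regardless of truthfulness.
def sample_response():
    return (random.random(), random.random())

def rater_prefers(a, b):
    return a[1] > b[1]            # raters key on flattery, not truth

# Bradley-Terry reward model r(x) = w . x, trained on pairwise choices.
w = [0.0, 0.0]
lr = 0.5
for _ in range(5000):
    a, b = sample_response(), sample_response()
    if not rater_prefers(a, b):
        a, b = b, a               # ensure a is the chosen response
    # Logistic loss on P(a preferred) = sigmoid(r(a) - r(b)); ascent step.
    margin = sum(wi * (ai - bi) for wi, ai, bi in zip(w, a, b))
    p = 1.0 / (1.0 + math.exp(-margin))
    for i in range(2):
        w[i] += lr * (1.0 - p) * (a[i] - b[i])

print(f"learned reward: truthfulness={w[0]:.2f}, flattery={w[1]:.2f}")
# The fitted reward puts far more weight on flattery than on truthfulness:
# it captured "what raters say they prefer", not "what is good".
</syntaxhighlight>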
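Inner-alignment failures of the "correlated heuristic" kind can be reproduced with a toy logistic regression. The "watermark" feature here is an invented spurious shortcut that happens to track the label loudly throughout training, then decorrelates at deployment.

<syntaxhighlight lang="python">
import math
import random

random.seed(2)

def make_example(spurious_matches_label):
    label = random.choice([0, 1])
    signal = (1.0 if label else -1.0) + random.gauss(0, 0.5)  # the real rule
    if spurious_matches_label:
        watermark = 3.0 if label else -3.0                    # shortcut feature
    else:
        watermark = random.choice([3.0, -3.0])                # correlation broken
    return (signal, watermark), label

def predict(w, x):
    z = w[0] * x[0] + w[1] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

# Train logistic regression while the watermark perfectly tracks the label.
w = [0.0, 0.0]
for _ in range(2000):
    x, y = make_example(spurious_matches_label=True)
    p = predict(w, x)
    for i in range(2):
        w[i] += 0.1 * (y - p) * x[i]

def accuracy(spurious_matches_label):
    xs = [make_example(spurious_matches_label) for _ in range(2000)]
    return sum((predict(w, x) > 0.5) == bool(y) for x, y in xs) / len(xs)

print(f"weights: signal={w[0]:.2f}, watermark={w[1]:.2f}")
print(f"accuracy in training distribution: {accuracy(True):.0%}")
print(f"accuracy when the correlation breaks: {accuracy(False):.0%}")
# Training performance looked perfect, but the model's effective objective
# was "follow the watermark", a heuristic that merely correlated with the
# training objective and diverges in the new situation.
</syntaxhighlight>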
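Finally, a skeleton of the debate protocol for scalable oversight. Everything here is illustrative structure under stated assumptions: <code>ask_model</code> is a hypothetical stand-in for whatever model call is available, not a real library API, and the canned return value exists only so the sketch runs end to end.

<syntaxhighlight lang="python">
def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real language-model call.
    return "[model output]"

def debate(question: str, rounds: int = 3) -> str:
    """Run a two-debater exchange, then ask a judge to evaluate it."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        for name in ("Debater A", "Debater B"):
            argument = ask_model(
                f"{transcript}\n{name}, make your strongest case, and point "
                "out any flaw in your opponent's last argument."
            )
            transcript += f"\n{name} (round {r + 1}): {argument}"
    # The human (or a trusted weaker model) judges the argument, which is
    # hoped to be easier to evaluate than the original question itself.
    return ask_model(
        f"{transcript}\n\nJudge: which debater argued correctly, and why?"
    )

print(debate("Is this bridge design safe?"))
</syntaxhighlight>

The design hope, as the article notes, is that judging a pointed exchange between capable adversaries is easier for a human than evaluating a single complex answer directly.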