== Remembering ==

* '''Alignment''' – The degree to which an AI system's goals, values, and behaviors match what its developers and users actually intend.
* '''Misalignment''' – When an AI system pursues objectives that differ from the intended goals, potentially causing harm.
* '''Goal specification''' – The problem of precisely and completely describing what we want an AI system to do.
* '''Inner alignment''' – Ensuring the model produced by training actually optimizes the intended objective, rather than a proxy that happened to work during training.
* '''Outer alignment''' – Ensuring the training objective accurately captures the true goal we want the system to optimize.
* '''Mesa-optimizer''' – A learned optimizer that emerges inside a model as a result of optimization pressure; it may have different goals than the base optimizer.
* '''Deceptive alignment''' – A theoretical failure mode where a model appears aligned during training but pursues different goals once deployed.
* '''Reward hacking''' – When an AI exploits loopholes in a reward function to score highly without achieving the intended outcome (see the first sketch after this list).
* '''Goodhart's Law''' – "When a measure becomes a target, it ceases to be a good measure." Central to why reward hacking occurs.
* '''Instrumental convergence''' – The hypothesis that sufficiently capable agents pursuing almost any goal will develop certain sub-goals (self-preservation, resource acquisition, avoiding goal modification) because they are instrumentally useful.
* '''Value learning''' – Approaches in which AI systems learn human values from human behavior and feedback rather than having values hardcoded.
* '''RLHF''' – Reinforcement Learning from Human Feedback; a practical alignment technique that uses human preferences to shape model behavior (see the second sketch after this list).
* '''Constitutional AI''' – An alignment technique (Anthropic) in which the model critiques and revises its own outputs according to a set of principles.
* '''Scalable oversight''' – Research on maintaining effective human oversight of AI systems as they become more capable than humans at specific tasks.
* '''Interpretability''' – Understanding the computations that occur inside AI models, enabling verification of whether they are pursuing intended goals.
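The following is a minimal sketch, not from the original article, of how reward hacking plays out: a toy "cleaning robot" is rewarded on a dirt-sensor reading (the proxy), and covering the sensor beats actually cleaning. All names here (<code>proxy_reward</code>, <code>cover_sensor</code>) are hypothetical illustration.

<syntaxhighlight lang="python">
# Hypothetical toy example: a cleaning robot is rewarded when its dirt
# sensor reads low. The sensor reading is only a proxy for "room is clean".

def proxy_reward(sensor_reading: float) -> float:
    """Reward is high when the dirt sensor reads low (the proxy measure)."""
    return 1.0 - sensor_reading

def act(action: str, dirt_level: float) -> float:
    """Return the sensor reading each action produces; dirt_level is the
    true state of the world, which the reward function never sees directly."""
    if action == "clean":
        return max(0.0, dirt_level - 0.3)  # really reduces dirt, but slowly
    if action == "cover_sensor":
        return 0.0                          # sensor reads zero; dirt untouched
    return dirt_level

dirt = 0.9
for action in ("clean", "cover_sensor"):
    print(f"{action}: reward = {proxy_reward(act(action, dirt)):.1f}")
# cover_sensor earns the full reward without cleaning anything: the measure
# stopped being a good measure once it became the target (Goodhart's Law).
</syntaxhighlight>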
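Below is a second sketch, also not from the article, of the reward-modeling step that RLHF builds on: a linear reward model is fit to synthetic pairwise preferences with the Bradley–Terry loss, so that preferred responses score higher. The data, the linear model, and every name are assumptions made for illustration; real RLHF fits a neural reward model and then optimizes the policy against it.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hidden "human preference" direction, used only to generate synthetic labels.
true_w = rng.normal(size=4)

# Build preference pairs: whichever response scores higher under true_w
# plays the role of the human-preferred ("chosen") response.
pairs = []
for _ in range(200):
    a, b = rng.normal(size=4), rng.normal(size=4)
    chosen, rejected = (a, b) if a @ true_w > b @ true_w else (b, a)
    pairs.append((chosen, rejected))

# Linear reward model r(x) = w . x, trained with the Bradley-Terry loss
# -log sigmoid(r(chosen) - r(rejected)) via plain gradient descent.
w = np.zeros(4)
lr = 0.5
for _ in range(200):
    grad = np.zeros_like(w)
    for chosen, rejected in pairs:
        margin = (chosen - rejected) @ w
        p = 1.0 / (1.0 + np.exp(-margin))   # P(chosen preferred | w)
        grad += (p - 1.0) * (chosen - rejected)
    w -= lr * grad / len(pairs)

# The learned reward now ranks the preferred response higher on most pairs;
# a policy would then be trained to maximize this learned reward.
accuracy = np.mean([(c - r) @ w > 0 for c, r in pairs])
print(f"pairwise ranking accuracy: {accuracy:.2f}")
</syntaxhighlight>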