== Remembering ==

* '''Alignment''' – The degree to which an AI system's goals, values, and behaviors match what its developers and users actually intend.
* '''Misalignment''' – When an AI system pursues objectives that differ from the intended goals, potentially causing harm.
* '''Goal specification''' – The problem of precisely and completely describing what we want an AI system to do.
* '''Inner alignment''' – Ensuring the model produced by training actually optimizes the intended objective, rather than a proxy that happened to work during training.
* '''Outer alignment''' – Ensuring the training objective accurately captures the true goal we want the system to optimize.
* '''Mesa-optimizer''' – A learned optimizer that emerges inside a model as a result of optimization pressure; it may have different goals than the base optimizer.
* '''Deceptive alignment''' – A theoretical failure mode where a model appears aligned during training but pursues different goals once deployed.
* '''Reward hacking''' – When an AI exploits loopholes in a reward function to score highly without achieving the intended outcome (see the first sketch after this list).
* '''Goodhart's Law''' – "When a measure becomes a target, it ceases to be a good measure." Central to why reward hacking occurs.
* '''Instrumental convergence''' – The hypothesis that sufficiently capable agents pursuing almost any goal will develop certain sub-goals (self-preservation, resource acquisition, avoiding goal modification) because they are instrumentally useful.
* '''Value learning''' – Approaches in which AI systems learn human values from human behavior and feedback rather than having values hardcoded.
* '''RLHF''' – Reinforcement Learning from Human Feedback; a practical alignment technique that uses human preferences to shape model behavior (see the second sketch after this list).
* '''Constitutional AI''' – An alignment technique (Anthropic) in which the model critiques and revises its own outputs according to a set of principles.
* '''Scalable oversight''' – Research on maintaining effective human oversight of AI systems as they become more capable than humans at specific tasks.
* '''Interpretability''' – Understanding the computations that occur inside AI models, enabling verification of whether they are pursuing intended goals.
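The following is a minimal sketch, not from the original article, of how reward hacking plays out: a toy "cleaning robot" is rewarded on a dirt-sensor reading (the proxy), and covering the sensor beats actually cleaning. All names here (<code>proxy_reward</code>, <code>cover_sensor</code>) are hypothetical illustration.

<syntaxhighlight lang="python">
# Hypothetical toy example: a cleaning robot is rewarded when its dirt
# sensor reads low. The sensor reading is only a proxy for "room is clean".

def proxy_reward(sensor_reading: float) -> float:
    """Reward is high when the dirt sensor reads low (the proxy measure)."""
    return 1.0 - sensor_reading

def act(action: str, dirt_level: float) -> float:
    """Return the sensor reading each action produces; dirt_level is the
    true state of the world, which the reward function never sees directly."""
    if action == "clean":
        return max(0.0, dirt_level - 0.3)  # really reduces dirt, but slowly
    if action == "cover_sensor":
        return 0.0                          # sensor reads zero; dirt untouched
    return dirt_level

dirt = 0.9
for action in ("clean", "cover_sensor"):
    print(f"{action}: reward = {proxy_reward(act(action, dirt)):.1f}")
# cover_sensor earns the full reward without cleaning anything: the measure
# stopped being a good measure once it became the target (Goodhart's Law).
</syntaxhighlight>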
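Below is a second sketch, also not from the article, of the reward-modeling step that RLHF builds on: a linear reward model is fit to synthetic pairwise preferences with the Bradley–Terry loss, so that preferred responses score higher. The data, the linear model, and every name are assumptions made for illustration; real RLHF fits a neural reward model and then optimizes the policy against it.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hidden "human preference" direction, used only to generate synthetic labels.
true_w = rng.normal(size=4)

# Build preference pairs: whichever response scores higher under true_w
# plays the role of the human-preferred ("chosen") response.
pairs = []
for _ in range(200):
    a, b = rng.normal(size=4), rng.normal(size=4)
    chosen, rejected = (a, b) if a @ true_w > b @ true_w else (b, a)
    pairs.append((chosen, rejected))

# Linear reward model r(x) = w . x, trained with the Bradley-Terry loss
# -log sigmoid(r(chosen) - r(rejected)) via plain gradient descent.
w = np.zeros(4)
lr = 0.5
for _ in range(200):
    grad = np.zeros_like(w)
    for chosen, rejected in pairs:
        margin = (chosen - rejected) @ w
        p = 1.0 / (1.0 + np.exp(-margin))   # P(chosen preferred | w)
        grad += (p - 1.0) * (chosen - rejected)
    w -= lr * grad / len(pairs)

# The learned reward now ranks the preferred response higher on most pairs;
# a policy would then be trained to maximize this learned reward.
accuracy = np.mean([(c - r) @ w > 0 for c, r in pairs])
print(f"pairwise ranking accuracy: {accuracy:.2f}")
</syntaxhighlight>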