= Mechanistic Interpretability =
== Understanding ==

Standard AI interpretability asks "which input features are important?" Mechanistic interpretability asks "what algorithm is the network running, and where is it implemented?"

**The toy-model foundation**: Anthropic's transformer-circuits work (Elhage et al., 2021; Elhage et al., 2022) studied toy neural networks small enough to analyze completely. Key finding: networks learn to implement recognizable algorithms, such as "if the current token has appeared before, predict that the token which followed it will appear again" (induction heads). These findings motivate the hypothesis that real LLMs implement interpretable algorithms too, just in larger and more complex circuits.

**Superposition**: Networks have more features to represent than neurons available. Rather than assigning each concept to one neuron, the network encodes features as directions in activation space: overlapping, superimposed representations. This is efficient but makes direct neuron analysis unreliable, since most neurons are polysemantic. Sparse autoencoders trained on activations can recover the underlying monosemantic features.

**Induction heads**: A specific attention-head pattern discovered to implement a general in-context copying algorithm. When the model has seen the sequence "... A B ... A", induction heads enable it to predict that B follows A again. This mechanism is believed to contribute to in-context learning in LLMs.

**Causal tracing / ROME**: Meng et al. (2022) showed that factual associations ("Eiffel Tower → Paris") are stored in specific mid-layer MLPs of GPT-style models, identifiable by causal tracing. They then directly edited these associations to change the model's factual knowledge, demonstrating that specific knowledge can be located and modified.
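The superposition idea above can be sketched numerically: give each of several features its own direction in a smaller activation space, and note that a sparse active feature can still be read back out, even though every neuron carries a mixture of features. This is a minimal illustrative sketch in the spirit of the toy-models setup, not the actual training procedure; all numbers and shapes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_neurons = 8, 3  # more features than neurons

# Each feature is assigned a (random, unit-norm) direction in the
# smaller activation space. Rows of W are feature directions.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 2 is active.
x = np.zeros(n_features)
x[2] = 1.0

h = x @ W        # 3-dim activation vector: feature 2's direction
x_hat = h @ W.T  # naive linear readout of all 8 features

# The active feature has the largest readout (cosine 1 with itself),
# but every neuron mixes many features: polysemanticity.
print(int(np.argmax(x_hat)))  # -> 2
```

Because the feature directions overlap, the readout for inactive features is nonzero interference noise; sparsity is what keeps the encoding usable.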
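The induction-head algorithm described above can be written out directly in plain Python: look back for a previous occurrence of the current token and copy the token that followed it. This is a sketch of the algorithm the heads are thought to implement, not an attention-mechanism implementation.

```python
def induction_predict(tokens):
    """Predict the next token the way an induction head does:
    find the most recent earlier occurrence of the current token
    and copy the token that followed it there."""
    current = tokens[-1]
    # Scan earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> cat
```

Seen "... the cat ... the", the sketch predicts "cat": the "... A B ... A → B" pattern from the text.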
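The core move of causal tracing can be shown on a toy stand-in for a model: run a clean input and a corrupted input, then patch the clean activation from one layer into the corrupted run and check how much of the clean output comes back. The "model" here is just a chain of random matrices, invented for the example; Meng et al. (2022) apply this procedure to GPT-style transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(3)]  # toy 3-layer "model"

def run(x, patch_layer=None, patch_value=None):
    """Run the toy model, optionally overwriting one layer's output
    with a stored activation (the patching step of causal tracing)."""
    acts, h = [], x
    for i, W in enumerate(layers):
        h = h @ W
        if i == patch_layer:
            h = patch_value
        acts.append(h)
    return h, acts

clean = np.array([1.0, 0.0, 0.0, 0.0])
corrupt = np.array([0.0, 1.0, 0.0, 0.0])

clean_out, clean_acts = run(clean)

# Patch the clean layer-1 activation into the corrupted run: everything
# downstream of the patch now matches the clean run, localizing where
# the "information" that determines the output lives.
patched_out, _ = run(corrupt, patch_layer=1, patch_value=clean_acts[1])
print(np.allclose(patched_out, clean_out))  # -> True
```

In a real transformer the restoration is partial rather than exact, and the layers where restoring the activation recovers the most target probability are the ones implicated in storing the fact.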
**Why it matters for safety**: If we can identify the circuits implementing dangerous behaviors (deception, scheming), we might be able to detect when they activate, ablate them to remove the behavior, or monitor for concerning circuit activity as a safety signal.
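The ablation intervention mentioned above can be sketched with a toy decomposition: treat the model's output logits as a sum of per-component contributions (as in the transformer residual stream) and zero out one component's contribution. The component names and numbers below are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical per-component contributions to a 2-token logit vector.
# Suppose "circuit_a" is the circuit implementing the unwanted behavior
# (it pushes the logit for token index 1).
contributions = {
    "circuit_a":   np.array([0.0, 2.0]),
    "other_parts": np.array([0.5, 0.1]),
}

def logits(ablate=()):
    """Sum component contributions, zeroing any ablated components."""
    return sum(v for k, v in contributions.items() if k not in ablate)

print(int(np.argmax(logits())))                        # -> 1
print(int(np.argmax(logits(ablate={"circuit_a"}))))    # -> 0
```

With the circuit ablated, the behavior it implemented (preferring token 1) disappears; monitoring would instead watch the circuit's contribution at inference time and flag when it is unusually active.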