= Mechanistic Interpretability =
== Understanding ==

Standard AI interpretability asks "which input features are important?" Mechanistic interpretability asks "what algorithm is the network running, and where is it implemented?"

**The toy-model foundation**: Anthropic's transformer-circuits work (Elhage et al., 2021; Elhage et al., 2022) studied toy neural networks small enough to analyze completely. Key finding: networks learn to implement recognizable algorithms, such as "if the current token has appeared before, predict that the token which followed it will appear again" (induction heads). These findings motivate the hypothesis that real LLMs implement interpretable algorithms too, just in larger and more complex circuits.

**Superposition**: Networks have more features to represent than neurons available. Rather than assigning each concept to one neuron, the network encodes features as directions in activation space: overlapping, superimposed representations. This is efficient but makes direct neuron analysis unreliable, since most neurons are polysemantic. Sparse autoencoders trained on activations can recover the underlying monosemantic features.

**Induction heads**: A specific attention-head pattern discovered to implement a general in-context copying algorithm. When the model has seen the sequence "... A B ... A", induction heads enable it to predict that B follows A again. This mechanism is believed to contribute to in-context learning in LLMs.

**Causal tracing / ROME**: Meng et al. (2022) showed that factual associations ("Eiffel Tower → Paris") are stored in specific mid-layer MLPs of GPT-style models, identifiable by causal tracing. They then directly edited these associations to change the model's factual knowledge, demonstrating that specific knowledge can be located and modified.
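The superposition idea above can be sketched numerically: give each of several features its own direction in a smaller activation space, and note that a sparse active feature can still be read back out, even though every neuron carries a mixture of features. This is a minimal illustrative sketch in the spirit of the toy-models setup, not the actual training procedure; all numbers and shapes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_neurons = 8, 3  # more features than neurons

# Each feature is assigned a (random, unit-norm) direction in the
# smaller activation space. Rows of W are feature directions.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 2 is active.
x = np.zeros(n_features)
x[2] = 1.0

h = x @ W        # 3-dim activation vector: feature 2's direction
x_hat = h @ W.T  # naive linear readout of all 8 features

# The active feature has the largest readout (cosine 1 with itself),
# but every neuron mixes many features: polysemanticity.
print(int(np.argmax(x_hat)))  # -> 2
```

Because the feature directions overlap, the readout for inactive features is nonzero interference noise; sparsity is what keeps the encoding usable.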
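The induction-head algorithm described above can be written out directly in plain Python: look back for a previous occurrence of the current token and copy the token that followed it. This is a sketch of the algorithm the heads are thought to implement, not an attention-mechanism implementation.

```python
def induction_predict(tokens):
    """Predict the next token the way an induction head does:
    find the most recent earlier occurrence of the current token
    and copy the token that followed it there."""
    current = tokens[-1]
    # Scan earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> cat
```

Seen "... the cat ... the", the sketch predicts "cat": the "... A B ... A → B" pattern from the text.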
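The core move of causal tracing can be shown on a toy stand-in for a model: run a clean input and a corrupted input, then patch the clean activation from one layer into the corrupted run and check how much of the clean output comes back. The "model" here is just a chain of random matrices, invented for the example; Meng et al. (2022) apply this procedure to GPT-style transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(3)]  # toy 3-layer "model"

def run(x, patch_layer=None, patch_value=None):
    """Run the toy model, optionally overwriting one layer's output
    with a stored activation (the patching step of causal tracing)."""
    acts, h = [], x
    for i, W in enumerate(layers):
        h = h @ W
        if i == patch_layer:
            h = patch_value
        acts.append(h)
    return h, acts

clean = np.array([1.0, 0.0, 0.0, 0.0])
corrupt = np.array([0.0, 1.0, 0.0, 0.0])

clean_out, clean_acts = run(clean)

# Patch the clean layer-1 activation into the corrupted run: everything
# downstream of the patch now matches the clean run, localizing where
# the "information" that determines the output lives.
patched_out, _ = run(corrupt, patch_layer=1, patch_value=clean_acts[1])
print(np.allclose(patched_out, clean_out))  # -> True
```

In a real transformer the restoration is partial rather than exact, and the layers where restoring the activation recovers the most target probability are the ones implicated in storing the fact.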
**Why it matters for safety**: If we can identify the circuits implementing dangerous behaviors (deception, scheming), we might be able to detect when they activate, ablate them to remove the behavior, or monitor for concerning circuit activity as a safety signal.
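The ablation intervention mentioned above can be sketched with a toy decomposition: treat the model's output logits as a sum of per-component contributions (as in the transformer residual stream) and zero out one component's contribution. The component names and numbers below are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical per-component contributions to a 2-token logit vector.
# Suppose "circuit_a" is the circuit implementing the unwanted behavior
# (it pushes the logit for token index 1).
contributions = {
    "circuit_a":   np.array([0.0, 2.0]),
    "other_parts": np.array([0.5, 0.1]),
}

def logits(ablate=()):
    """Sum component contributions, zeroing any ablated components."""
    return sum(v for k, v in contributions.items() if k not in ablate)

print(int(np.argmax(logits())))                        # -> 1
print(int(np.argmax(logits(ablate={"circuit_a"}))))    # -> 0
```

With the circuit ablated, the behavior it implemented (preferring token 1) disappears; monitoring would instead watch the circuit's contribution at inference time and flag when it is unusually active.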