Mechanistic Interp

How to read this page: This article maps the topic from beginner to expert across six levels (Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating). Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Mechanistic interpretability is an emerging subfield of AI safety and alignment research that aims to reverse-engineer the internal computations of neural networks — understanding not just what a model does, but how it does it at the level of individual neurons, circuits, and attention heads. While standard explainability tools like SHAP explain which inputs influence outputs, mechanistic interpretability asks deeper questions: what algorithm is the network implementing? Which circuits are responsible for specific capabilities? Can we identify and edit dangerous behaviors at their computational source?

Remembering

  • Mechanistic interpretability — The study of neural networks at the level of internal computations, circuits, and representations.
  • Feature — A concept or property represented in the network's activation space (e.g., a specific neuron that activates for curved lines).
  • Circuit — A subgraph of neurons and weights implementing a specific computation or behavior.
  • Neuron — A single unit in a neural network; mechanistic interpretability analyzes what concepts individual neurons represent.
  • Superposition — The hypothesis that neural networks represent more features than they have neurons by using directions in activation space rather than individual neurons.
  • Polysemantic neuron — A neuron that activates for multiple unrelated concepts; evidence for superposition.
  • Monosemantic neuron — A neuron that activates for a single, interpretable concept; the "ideal" interpretable unit.
  • Sparse autoencoder (for mech interp) — A technique for decomposing polysemantic neuron activations into monosemantic features.
  • Induction head — A specific type of attention head implementing in-context copying ("if you saw A→B before, predict B when you see A again").
  • Logit lens — A technique for interpreting how a model's predictions evolve through its layers by projecting intermediate representations to vocabulary space.
  • Activation patching — Swapping activations between runs of a model on different inputs to identify which components are causally responsible for specific behaviors.
  • Causal tracing — Finding the computational path responsible for a model's factual recall by systematically patching activations.
  • TransformerLens — An open-source library by Neel Nanda for mechanistic interpretability of transformer models.
  • Anthropic's dictionary learning — A research direction using sparse autoencoders to find interpretable features in LLM activations.

Understanding

Standard AI interpretability asks "which input features are important?" Mechanistic interpretability asks "what algorithm is the network running, and where is it implemented?"

The toy model foundation: Anthropic's analyses of small attention-only transformers (Elhage et al., 2021) and toy models of superposition (Elhage et al., 2022) studied networks small enough to analyze completely. Key finding: networks learn to implement recognizable algorithms — things like "if the context already contains '... A B ... A', predict B again" (induction heads). These findings motivate the hypothesis that real LLMs implement interpretable algorithms too, just in larger and more complex circuits.

Superposition: Networks have more features to represent than neurons available. Rather than assigning each concept to one neuron, the network encodes features as directions in the activation space — overlapping, superimposed representations. This is efficient but makes direct neuron analysis unreliable (most neurons are polysemantic). Sparse autoencoders trained on activations can recover the underlying monosemantic features.
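
A minimal sketch of the sparse autoencoder idea, not any particular paper's exact recipe (the layer width, expansion factor, and L1 coefficient below are illustrative assumptions): activations are encoded into a wider, non-negative, sparsity-penalized feature basis and decoded back, so individual learned features can be monosemantic even when neurons are not.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional activations into n_features sparse features."""
    def __init__(self, d_model=768, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature codes
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity penalty strength (illustrative value)

def training_step(acts):
    recon, features = sae(acts)
    # Reconstruction loss keeps the features faithful; the L1 term keeps them sparse
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# In practice, `acts` would be batches of residual-stream or MLP activations
# cached from the model; random data here just shows the training loop shape.
loss = training_step(torch.randn(64, 768))
</syntaxhighlight>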

Induction heads: A specific attention head pattern discovered to implement a general in-context copying algorithm. When the model has seen sequence "...A B ... A", induction heads enable it to predict B follows A again. This mechanism is believed to contribute to in-context learning capabilities in LLMs.
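
A rough way to hunt for induction heads with TransformerLens, shown as a sketch (the sequence length, batch size, and 0.4 threshold are arbitrary illustrative choices): feed repeated random token sequences and measure how much attention each head places on the token that followed the previous occurrence of the current token.

<syntaxhighlight lang="python">
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

# Repeated random tokens: the second half repeats the first half exactly.
seq_len, batch = 50, 4
first_half = torch.randint(1000, 10000, (batch, seq_len))
tokens = torch.cat([first_half, first_half], dim=1)

_, cache = model.run_with_cache(tokens)

# An induction head attends from each token in the second half back to the token
# *after* that token's first occurrence, i.e. to key position q - (seq_len - 1).
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # (batch, n_heads, query_pos, key_pos)
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    # Keep only queries in the repeated second half, then average over batch and positions
    score = stripe[..., 1:].mean(dim=(0, -1))  # one induction score per head
    for head, s in enumerate(score):
        if s.item() > 0.4:
            print(f"Candidate induction head: L{layer}H{head} (score {s.item():.2f})")
</syntaxhighlight>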

Causal tracing / ROME: Meng et al. (2022) showed that factual associations ("Eiffel Tower → Paris") are stored in specific MLP layers of GPT-style models, identifiable by causal tracing. They then directly edited these associations to change factual knowledge — demonstrating that specific knowledge can be located and modified.
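
A simplified, ROME-inspired localization sweep as a sketch (the original paper corrupts the subject's embeddings with noise rather than swapping the prompt; the prompts, the ' Paris' answer token, and the subject-position index below are illustrative assumptions): patch each layer's MLP output at the subject's last token from a clean run into a corrupted run and see which layers restore the fact.

<syntaxhighlight lang="python">
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean = model.to_tokens("The Eiffel Tower is located in")
corrupt = model.to_tokens("The Colosseum is located in")  # assumed to tokenize to the same length
answer = model.to_single_token(" Paris")
subject_pos = 4  # index of the last subject token in the clean prompt (check the tokenization)

_, clean_cache = model.run_with_cache(clean)

def patch_mlp_out(mlp_out, hook, pos, clean_cache):
    # Restore the clean run's MLP output at one position inside the corrupted run
    mlp_out[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return mlp_out

for layer in range(model.cfg.n_layers):
    logits = model.run_with_hooks(
        corrupt,
        fwd_hooks=[(utils.get_act_name("mlp_out", layer),
                    partial(patch_mlp_out, pos=subject_pos, clean_cache=clean_cache))],
    )
    # Layers whose patched output pushes the ' Paris' logit back up are candidate fact-storage sites
    print(f"Layer {layer}: ' Paris' logit {logits[0, -1, answer].item():.2f}")
</syntaxhighlight>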

Why it matters for safety: If we can identify the circuits implementing dangerous behaviors (deception, scheming), we might be able to: detect when they activate, ablate them to remove the behavior, or monitor for concerning circuit activity as a safety signal.

Applying

Using TransformerLens for mechanistic analysis:

<syntaxhighlight lang="python">
import torch
from transformer_lens import HookedTransformer

# 1. Load a model with interpretability hooks
model = HookedTransformer.from_pretrained("gpt2-small")
model.eval()

# 2. Logit lens: see how the model's prediction evolves through layers
prompt = "The Eiffel Tower is located in"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Project each layer's residual stream to vocabulary space
def logit_lens(cache, model, layer_idx):
    resid = cache["resid_post", layer_idx]                # (1, seq_len, d_model)
    layer_logits = model.unembed(model.ln_final(resid))   # (1, seq_len, d_vocab)
    probs = layer_logits[0, -1].softmax(-1)
    top = probs.topk(5)
    return [(model.to_string(t.item()), p.item())
            for t, p in zip(top.indices, top.values)]

for layer in range(model.cfg.n_layers):
    print(f"Layer {layer}: {logit_lens(cache, model, layer)}")

# 3. Activation patching: find which components are responsible for a fact
def patch_head(activation, hook, stored_act=None):
    # Replace this run's activation with one stored from a different run
    return stored_act

# Recipe:
#   - run the model on a "clean" prompt and cache its activations
#   - run it on a "corrupted" prompt (e.g., with the entity swapped)
#   - patch each head's output in turn; the patches that restore the
#     correct answer mark the causally responsible components
</syntaxhighlight>
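
To make the patching recipe above concrete, here is a rough head-level sweep as a sketch rather than a canonical implementation; it reuses `model` from the block above, and the corrupted prompt, the ' Paris' answer token, and the assumption that both prompts tokenize to the same length are illustrative choices.

<syntaxhighlight lang="python">
from functools import partial
from transformer_lens import utils

# Continuing with `model` from the block above
clean_tokens = model.to_tokens("The Eiffel Tower is located in")
corrupt_tokens = model.to_tokens("The Colosseum is located in")  # assumed same token length
answer = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head_output(z, hook, head, clean_cache):
    # z: (batch, seq, n_heads, d_head); restore one head's output from the clean run
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(utils.get_act_name("z", layer),
                        partial(patch_head_output, head=head, clean_cache=clean_cache))],
        )
        # Heads whose patched output restores a high ' Paris' logit are causal candidates
        print(f"L{layer}H{head}: ' Paris' logit {patched_logits[0, -1, answer].item():.2f}")
</syntaxhighlight>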

Mechanistic interpretability tools
TransformerLens → Python library for GPT-style model analysis (Neel Nanda)
BauKit → David Bau's toolkit for network dissection and causal tracing
ROME/MEMIT → Locate and edit factual associations in transformer MLP layers
Sparse Autoencoders → Anthropic, EleutherAI — decompose polysemantic activations
Probing classifiers → Linear probes on activations to identify encoded concepts

Analyzing

{| class="wikitable"
|+ Mechanistic Interpretability vs. Standard XAI
! Dimension !! Standard XAI (SHAP/LIME) !! Mechanistic Interpretability
|-
| Question answered || Which inputs matter? || What algorithm is running?
|-
| Analysis level || Input/output || Internal circuits/neurons
|-
| Scalability || Scales to large models || Difficult at scale (GPT-4)
|-
| Safety application || Bias detection || Dangerous behavior detection/editing
|-
| Mathematical rigor || Moderate || High (causal)
|-
| Current maturity || Deployed in production || Research stage (small models)
|}

Failure modes: Feature identification without causal understanding — a neuron may "represent" a concept without being causally responsible for outputs involving that concept. Toy model findings may not transfer to large models operating in superposition. Sparse autoencoders may find spurious features. Attention patterns are hard to interpret (attention ≠ importance). The analysis is extraordinarily labor-intensive for large models.

Evaluating

Mechanistic interpretability evaluation:

  1. Causal validation: does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient. (A minimal ablation sketch follows this list.)
  2. Faithfulness: does the proposed mechanistic explanation predict model behavior on new inputs?
  3. Completeness: does the identified circuit account for all relevant behavior, or just part of it?
  4. Human interpretability: can a domain expert understand and verify the proposed algorithm?
  5. Generalization: does the circuit analysis transfer to similar models with different training runs?
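
As a concrete illustration of criterion 1, here is a minimal zero-ablation check with TransformerLens; the prompt, the ' Paris' answer token, and the layer/head indices are hypothetical placeholders, and in practice mean-ablation or patching is often preferred over zeroing.

<syntaxhighlight lang="python">
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The Eiffel Tower is located in")
answer = model.to_single_token(" Paris")

LAYER, HEAD = 9, 6  # hypothetical candidate head from a prior analysis

def zero_ablate_head(z, hook, head):
    # z: (batch, seq, n_heads, d_head); knock out one head's output entirely
    z[:, :, head, :] = 0.0
    return z

baseline = model(tokens)[0, -1, answer].item()
ablated = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER),
                partial(zero_ablate_head, head=HEAD))],
)[0, -1, answer].item()

# If the circuit hypothesis is right, the logit drop should match the predicted effect;
# a negligible change falsifies the claim that this head is causally necessary.
print(f"' Paris' logit: {baseline:.2f} baseline vs. {ablated:.2f} with L{LAYER}H{HEAD} ablated")
</syntaxhighlight>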

Creating

Starting mechanistic interpretability research:

  1. Begin with toy models (2-layer attention-only transformers) — fully analyzable. (A minimal configuration sketch follows this list.)
  2. Use TransformerLens for GPT-2 scale; analyze induction heads following Olsson et al. (2022).
  3. Formulate a falsifiable hypothesis about what a circuit does.
  4. Validate with activation patching — the gold standard causal test.
  5. Train sparse autoencoders on intermediate activations; analyze recovered features.
  6. Document findings rigorously, including negative results — the field needs honest failure reports as much as successes.
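
For step 1, a minimal sketch of a fully inspectable toy model using TransformerLens (the sizes and seed below are arbitrary illustrative choices; attn_only=True drops the MLP blocks so every computation runs through attention heads):

<syntaxhighlight lang="python">
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Hypothetical toy configuration: 2 layers, attention only, small enough to
# inspect every head and weight matrix directly
cfg = HookedTransformerConfig(
    n_layers=2,
    d_model=128,
    n_ctx=256,
    d_head=32,
    n_heads=4,
    d_vocab=1000,
    attn_only=True,  # no MLPs, only attention heads
    seed=0,
)
toy_model = HookedTransformer(cfg)
print(toy_model)
</syntaxhighlight>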