Mechanistic Interp - Revision history

Wordpad: BloomWiki: Mechanistic Interp

2026-04-25T01:53:43Z

BloomWiki: Mechanistic Interp

← Older revision		Revision as of 01:53, 25 April 2026
Line 1:		Line 1:
			<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
	{{BloomIntro}}		{{BloomIntro}}
	Mechanistic interpretability is an emerging subfield of AI safety and alignment research that aims to reverse-engineer the internal computations of neural networks — understanding not just what a model does, but how it does it at the level of individual neurons, circuits, and attention heads. While standard explainability tools like SHAP explain which inputs influence outputs, mechanistic interpretability asks deeper questions: what algorithm is the network implementing? Which circuits are responsible for specific capabilities? Can we identify and edit dangerous behaviors at their computational source?		Mechanistic interpretability is an emerging subfield of AI safety and alignment research that aims to reverse-engineer the internal computations of neural networks — understanding not just what a model does, but how it does it at the level of individual neurons, circuits, and attention heads. While standard explainability tools like SHAP explain which inputs influence outputs, mechanistic interpretability asks deeper questions: what algorithm is the network implementing? Which circuits are responsible for specific capabilities? Can we identify and edit dangerous behaviors at their computational source?
			</div>

	== Remembering ==		__TOC__

			<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Remembering</span> ==
	* '''Mechanistic interpretability''' — The study of neural networks at the level of internal computations, circuits, and representations.		* '''Mechanistic interpretability''' — The study of neural networks at the level of internal computations, circuits, and representations.
	* '''Feature''' — A concept or property represented in the network's activation space (e.g., a specific neuron that activates for curved lines).		* '''Feature''' — A concept or property represented in the network's activation space (e.g., a specific neuron that activates for curved lines).
Line 17:		Line 22:
	* '''TransformerLens''' — An open-source library by Neel Nanda for mechanistic interpretability of transformer models.		* '''TransformerLens''' — An open-source library by Neel Nanda for mechanistic interpretability of transformer models.
	* '''Anthropic's dictionary learning''' — A research direction using sparse autoencoders to find interpretable features in LLM activations.		* '''Anthropic's dictionary learning''' — A research direction using sparse autoencoders to find interpretable features in LLM activations.
			</div>

	== Understanding ==		<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Understanding</span> ==
	Standard AI interpretability asks "which input features are important?" Mechanistic interpretability asks "what algorithm is the network running, and where is it implemented?"		Standard AI interpretability asks "which input features are important?" Mechanistic interpretability asks "what algorithm is the network running, and where is it implemented?"

Line 30:		Line 37:

	'''Why it matters for safety''': If we can identify the circuits implementing dangerous behaviors (deception, scheming), we might be able to: detect when they activate, ablate them to remove the behavior, or monitor for concerning circuit activity as a safety signal.		'''Why it matters for safety''': If we can identify the circuits implementing dangerous behaviors (deception, scheming), we might be able to: detect when they activate, ablate them to remove the behavior, or monitor for concerning circuit activity as a safety signal.
			</div>

	== Applying ==		<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Applying</span> ==
	'''Using TransformerLens for mechanistic analysis:'''		'''Using TransformerLens for mechanistic analysis:'''
	<syntaxhighlight lang="python">		<syntaxhighlight lang="python">
Line 73:		Line 82:
	: '''Sparse Autoencoders''' → Anthropic, EleutherAI — decompose polysemantic activations		: '''Sparse Autoencoders''' → Anthropic, EleutherAI — decompose polysemantic activations
	: '''Probing classifiers''' → Linear probes on activations to identify encoded concepts		: '''Probing classifiers''' → Linear probes on activations to identify encoded concepts
			</div>

	== Analyzing ==		<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Analyzing</span> ==
	{| class="wikitable"		{| class="wikitable"
	|+ Mechanistic Interpretability vs. Standard XAI		|+ Mechanistic Interpretability vs. Standard XAI
Line 93:		Line 104:

	'''Failure modes''': Feature identification without causal understanding — a neuron may "represent" a concept without being causally responsible for outputs involving that concept. Toy model findings may not transfer to large models operating in superposition. Sparse autoencoders may find spurious features. Attention patterns are hard to interpret (attention ≠ importance). The analysis is extraordinarily labor-intensive for large models.		'''Failure modes''': Feature identification without causal understanding — a neuron may "represent" a concept without being causally responsible for outputs involving that concept. Toy model findings may not transfer to large models operating in superposition. Sparse autoencoders may find spurious features. Attention patterns are hard to interpret (attention ≠ importance). The analysis is extraordinarily labor-intensive for large models.
			</div>

	== Evaluating ==		<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Evaluating</span> ==
	Mechanistic interpretability evaluation:		Mechanistic interpretability evaluation:
	# '''Causal validation''': does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient.		# '''Causal validation''': does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient.
Line 101:		Line 114:
	# '''Human interpretability''': can a domain expert understand and verify the proposed algorithm?		# '''Human interpretability''': can a domain expert understand and verify the proposed algorithm?
	# '''Generalization''': does the circuit analysis transfer to similar models with different training runs?		# '''Generalization''': does the circuit analysis transfer to similar models with different training runs?
			</div>

	== Creating ==		<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Creating</span> ==
	Starting mechanistic interpretability research:		Starting mechanistic interpretability research:
	# Begin with toy models (2-layer attention-only transformers) — fully analyzable.		# Begin with toy models (2-layer attention-only transformers) — fully analyzable.
Line 115:		Line 130:
	[[Category:AI Safety]]		[[Category:AI Safety]]
	[[Category:Interpretability]]		[[Category:Interpretability]]
			</div>

Wordpad: BloomWiki: Mechanistic Interp

2026-04-23T14:36:01Z

BloomWiki: Mechanistic Interp

← Older revision		Revision as of 14:36, 23 April 2026
Line 95:		Line 95:

	== Evaluating ==		== Evaluating ==
	Mechanistic interpretability evaluation: ~~(1)~~ '''Causal validation''': does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient. ~~(2)~~ '''Faithfulness''': does the proposed mechanistic explanation predict model behavior on new inputs? ~~(3)~~ '''Completeness''': does the identified circuit account for all relevant behavior, or just part of it? ~~(4)~~ '''Human interpretability''': can a domain expert understand and verify the proposed algorithm? ~~(5)~~ '''Generalization''': does the circuit analysis transfer to similar models with different training runs?		Mechanistic interpretability evaluation:
			# '''Causal validation''': does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient.
			# '''Faithfulness''': does the proposed mechanistic explanation predict model behavior on new inputs?
			# '''Completeness''': does the identified circuit account for all relevant behavior, or just part of it?
			# '''Human interpretability''': can a domain expert understand and verify the proposed algorithm?
			# '''Generalization''': does the circuit analysis transfer to similar models with different training runs?

	== Creating ==		== Creating ==
	Starting mechanistic interpretability research: ~~(1)~~ Begin with toy models (2-layer attention-only transformers) — fully analyzable. ~~(2)~~ Use TransformerLens for GPT-2 scale; analyze induction heads following Olsson et al. ~~(2022)~~. ~~(3)~~ Formulate a falsifiable hypothesis about what a circuit does. ~~(4)~~ Validate with activation patching — the gold standard causal test. ~~(5)~~ Train sparse autoencoders on intermediate activations; analyze recovered features. ~~(6)~~ Document findings rigorously, including negative results — the field needs honest failure reports as much as successes.		Starting mechanistic interpretability research:
			# Begin with toy models (2-layer attention-only transformers) — fully analyzable.
			# Use TransformerLens for GPT-2 scale; analyze induction heads following Olsson et al. (2022).
			# Formulate a falsifiable hypothesis about what a circuit does.
			# Validate with activation patching — the gold standard causal test.
			# Train sparse autoencoders on intermediate activations; analyze recovered features.
			# Document findings rigorously, including negative results — the field needs honest failure reports as much as successes.

	[[Category:Artificial Intelligence]]		[[Category:Artificial Intelligence]]
	[[Category:AI Safety]]		[[Category:AI Safety]]
	[[Category:Interpretability]]		[[Category:Interpretability]]

Wordpad: BloomWiki: Mechanistic Interp

2026-04-23T14:20:40Z

BloomWiki: Mechanistic Interp

New page

{{BloomIntro}}
Mechanistic interpretability is an emerging subfield of AI safety and alignment research that aims to reverse-engineer the internal computations of neural networks — understanding not just what a model does, but how it does it at the level of individual neurons, circuits, and attention heads. While standard explainability tools like SHAP explain which inputs influence outputs, mechanistic interpretability asks deeper questions: what algorithm is the network implementing? Which circuits are responsible for specific capabilities? Can we identify and edit dangerous behaviors at their computational source?

== Remembering ==
* '''Mechanistic interpretability''' — The study of neural networks at the level of internal computations, circuits, and representations.
* '''Feature''' — A concept or property represented in the network's activation space (e.g., a specific neuron that activates for curved lines).
* '''Circuit''' — A subgraph of neurons and weights implementing a specific computation or behavior.
* '''Neuron''' — A single unit in a neural network; mechanistic interpretability analyzes what concepts individual neurons represent.
* '''Superposition''' — The hypothesis that neural networks represent more features than they have neurons by using directions in activation space rather than individual neurons.
* '''Polysemantic neuron''' — A neuron that activates for multiple unrelated concepts; evidence for superposition.
* '''Monosemantic neuron''' — A neuron that activates for a single, interpretable concept; the "ideal" interpretable unit.
* '''Sparse autoencoder (for mech interp)''' — A technique for decomposing polysemantic neuron activations into monosemantic features.
* '''Induction head''' — A specific type of attention head implementing in-context copying ("if you saw A→B before, predict B when you see A again").
* '''Logit lens''' — A technique for interpreting how a model's predictions evolve through its layers by projecting intermediate representations to vocabulary space.
* '''Activation patching''' — Swapping activations between runs of a model on different inputs to identify which components are causally responsible for specific behaviors.
* '''Causal tracing''' — Finding the computational path responsible for a model's factual recall by systematically patching activations.
* '''TransformerLens''' — An open-source library by Neel Nanda for mechanistic interpretability of transformer models.
* '''Anthropic's dictionary learning''' — A research direction using sparse autoencoders to find interpretable features in LLM activations.

== Understanding ==
Standard AI interpretability asks "which input features are important?" Mechanistic interpretability asks "what algorithm is the network running, and where is it implemented?"

'''The Toy Model foundation''': Elman (2004) and more recently Anthropic (Elhage et al., 2022) studied toy neural networks small enough to analyze completely. Key finding: small networks learn to implement recognizable algorithms, such as in-context copying by induction heads ("if A was followed by B earlier in the context, predict B when A appears again"). These findings motivate the hypothesis that real LLMs also implement interpretable algorithms, only in larger and more complex circuits.

'''Superposition''': Under the superposition hypothesis, a network has more features to represent than it has neurons. Rather than assigning each concept to one neuron, the network encodes features as directions in activation space, so that representations overlap and are superimposed. This is efficient when features are sparse, but it makes direct neuron analysis unreliable, because many neurons end up polysemantic. Sparse autoencoders trained on activations aim to recover the underlying features in a more monosemantic form.
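
The geometry can be sketched with a hand-built toy that stores three features in a two-dimensional activation space. The directions, the feature count, and the function names <code>encode</code> and <code>readout</code> are illustrative assumptions, not taken from any published model.

```python
import math

# Hypothetical toy: 3 features stored in a 2-dimensional activation space.
# Each feature gets a unit direction, spaced 120 degrees apart.
N_FEATURES = 3
DIRECTIONS = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]

def encode(feature_values):
    """Sum each feature's direction scaled by its value to get the activation."""
    x = sum(v * d[0] for v, d in zip(feature_values, DIRECTIONS))
    y = sum(v * d[1] for v, d in zip(feature_values, DIRECTIONS))
    return (x, y)

def readout(activation):
    """Project the activation back onto every feature direction."""
    return [activation[0] * d[0] + activation[1] * d[1] for d in DIRECTIONS]

# One active feature: it reads out at full strength, the others see
# negative interference that a ReLU would clip away.
sparse = readout(encode([1.0, 0.0, 0.0]))

# Two active features: both readouts are degraded by interference.
dense = readout(encode([1.0, 1.0, 0.0]))

print([round(v, 2) for v in sparse])
print([round(v, 2) for v in dense])
```

With a single active feature the readout is close to (1, -0.5, -0.5). With two active features both drop to about 0.5, which is why superposition is usually described as workable only when features are sparse.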

'''Induction heads''': A specific attention head pattern found to implement a general in-context copying algorithm. When the model has seen the sequence "...A B ... A", induction heads enable it to predict that B will follow the second A. This mechanism is believed to contribute to in-context learning capabilities in LLMs.
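
The copying rule itself can be written as a few lines of plain Python. This is a sketch of the behavior an induction head is believed to approximate, not of the attention computation that implements it. The function name <code>induction_predict</code> is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by in-context copying.

    Find the most recent earlier occurrence of the current (last) token
    and return the token that followed it. Return None if there is none.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the last token itself
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = ["the", "cat", "sat", "on", "the"]
prediction = induction_predict(context)
print(prediction)
```

Here the second "the" matches the first one, so the rule copies the token that came after it.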

'''Causal tracing / ROME''': Meng et al. (2022) showed that factual associations ("Eiffel Tower → Paris") are stored in specific MLP layers of GPT-style models, identifiable by causal tracing. They then directly edited these associations to change factual knowledge — demonstrating that specific knowledge can be located and modified.
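
The editing step can be illustrated with a rank-one update to a small linear associative memory. This is a deliberately simplified sketch: it treats the MLP as a plain matrix and ignores the key statistics that the published method takes into account. The matrix, the keys, and the helper names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return W' = W + (new_value - W key) key^T / (key . key).

    After the edit, W' maps `key` to `new_value`, and any key orthogonal
    to `key` is mapped exactly as before.
    """
    residual = [v - o for v, o in zip(new_value, matvec(W, key))]
    norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

# Hypothetical memory: two orthogonal keys, each mapped to a value vector
W = [[1.0, 0.0], [0.0, 1.0]]
key_subject = [1.0, 0.0]   # stands in for the subject being edited
key_other = [0.0, 1.0]     # an unrelated subject
new_value = [0.0, 2.0]     # the new association to store

W_edited = rank_one_edit(W, key_subject, new_value)
print(matvec(W_edited, key_subject))
print(matvec(W_edited, key_other))
```

The edited matrix returns the new value for the edited key while leaving the orthogonal key's output unchanged, which is the property that makes a targeted edit possible in this idealized setting.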

'''Why it matters for safety''': If we can identify the circuits implementing dangerous behaviors (deception, scheming), we might be able to: detect when they activate, ablate them to remove the behavior, or monitor for concerning circuit activity as a safety signal.

== Applying ==
'''Using TransformerLens for mechanistic analysis:'''
<syntaxhighlight lang="python">
import torch
from transformer_lens import HookedTransformer

# Load a model with interpretability hooks
model = HookedTransformer.from_pretrained("gpt2-small")
model.eval()

# Logit lens: see how the model's prediction evolves through the layers
prompt = "The Eiffel Tower is located in"
tokens = model.to_tokens(prompt)
with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)

# Project one layer's residual stream to vocabulary space
def logit_lens(cache, model, layer_idx, k=5):
    resid = cache["resid_post", layer_idx]  # (1, seq_len, d_model)
    layer_logits = model.unembed(model.ln_final(resid))  # (1, seq_len, vocab)
    # Softmax over the full vocabulary first, then take the top k,
    # so the reported values are true next-token probabilities
    probs = layer_logits[0, -1].softmax(dim=-1)
    top = probs.topk(k)
    return [(model.to_string([t.item()]), p.item())
            for t, p in zip(top.indices, top.values)]

for layer in range(model.cfg.n_layers):
    print(f"Layer {layer}: {logit_lens(cache, model, layer)}")

# Activation patching: find which components are responsible for a fact
# TransformerLens hook functions receive (activation, hook); bind stored_act
# with functools.partial before passing the function to run_with_hooks
def patch_head(activation, hook, stored_act):
    return stored_act  # Replace with the stored activation from a different run

# Run model on "clean" prompt, save activations
# Run model on "corrupted" prompt (patched entity)
# Systematically patch each head's output; find which patching "restores" correct answer

; Mechanistic interpretability tools
: '''TransformerLens''' → Python library for GPT-style model analysis (Neel Nanda)
: '''BauKit''' → David Bau's toolkit for network dissection and causal tracing
: '''ROME/MEMIT''' → Locate and edit factual associations in transformer MLP layers
: '''Sparse Autoencoders''' → Anthropic, EleutherAI — decompose polysemantic activations
: '''Probing classifiers''' → Linear probes on activations to identify encoded concepts

== Analyzing ==
{| class="wikitable"
|+ Mechanistic Interpretability vs. Standard XAI
! Dimension !! Standard XAI (SHAP/LIME) !! Mechanistic Interpretability
|-
| Question answered || Which inputs matter? || What algorithm is running?
|-
| Analysis level || Input/output || Internal circuits/neurons
|-
| Scalability || Scales to large models || Difficult at scale (GPT-4)
|-
| Safety application || Bias detection || Dangerous behavior detection/editing
|-
| Mathematical rigor || Moderate || High (causal)
|-
| Current maturity || Deployed in production || Research stage (small models)
|}

'''Failure modes''': Feature identification without causal understanding — a neuron may "represent" a concept without being causally responsible for outputs involving that concept. Toy model findings may not transfer to large models operating in superposition. Sparse autoencoders may find spurious features. Attention patterns are hard to interpret (attention ≠ importance). The analysis is extraordinarily labor-intensive for large models.
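
The "attention ≠ importance" point can be shown with one attention step over two tokens. The scores and values below are made-up numbers, and the values are scalars rather than vectors to keep the sketch short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores and (scalar) value outputs for two tokens
scores = [3.0, 0.0]    # token 0 receives almost all of the attention
values = [0.01, 5.0]   # but token 0 carries a near-zero value

weights = softmax(scores)
contributions = [w * v for w, v in zip(weights, values)]

for i, (w, c) in enumerate(zip(weights, contributions)):
    print(f"token {i}: attention={w:.3f}, contribution to output={c:.3f}")
```

Token 0 gets over 90% of the attention weight, yet token 1 contributes far more to the head's output, so reading importance off the attention pattern alone would be misleading here.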

== Evaluating ==
Mechanistic interpretability evaluation:
# '''Causal validation''': does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient.
# '''Faithfulness''': does the proposed mechanistic explanation predict model behavior on new inputs?
# '''Completeness''': does the identified circuit account for all relevant behavior, or just part of it?
# '''Human interpretability''': can a domain expert understand and verify the proposed algorithm?
# '''Generalization''': does the circuit analysis transfer to similar models with different training runs?
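
One simple way to put a number on the faithfulness criterion is an agreement rate between the model and the proposed explanation on held-out inputs. The sketch below uses hypothetical stand-ins for both. A real evaluation would compare a network's outputs with the predictions of the proposed circuit-level algorithm.

```python
def faithfulness(model_fn, explanation_fn, held_out_inputs):
    """Fraction of held-out inputs where the explanation matches the model."""
    inputs = list(held_out_inputs)
    matches = sum(model_fn(x) == explanation_fn(x) for x in inputs)
    return matches / len(inputs)

# Hypothetical stand-in for the model's behavior
def model_fn(x):
    return x % 3 == 0 or x == 7   # includes one quirk

# Hypothetical proposed explanation, which misses the quirk
def explanation_fn(x):
    return x % 3 == 0

score = faithfulness(model_fn, explanation_fn, range(10))
print(f"faithfulness on held-out inputs: {score:.2f}")
```

A score below 1.0 signals that the explanation is incomplete: it accounts for most of the behavior but not the input on which the model deviates.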

== Creating ==
Starting mechanistic interpretability research:
# Begin with toy models (2-layer attention-only transformers), which are fully analyzable.
# Use TransformerLens for GPT-2 scale; analyze induction heads following Olsson et al. (2022).
# Formulate a falsifiable hypothesis about what a circuit does.
# Validate with activation patching, the gold standard causal test.
# Train sparse autoencoders on intermediate activations; analyze recovered features.
# Document findings rigorously, including negative results: the field needs honest failure reports as much as successes.
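
For the sparse autoencoder step, the forward pass and training objective can be sketched in plain Python. The weights below are set by hand rather than trained, and the shapes, the L1 coefficient, and the function names are illustrative assumptions. A real run would learn the weights by gradient descent on many cached activations.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative latents, then decode back.

    W_enc has one row per latent; W_dec[j] is the decoder direction
    of latent j in activation space.
    """
    latents = relu([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(W_enc, b_enc)])
    recon = [sum(latents[j] * W_dec[j][i] for j in range(len(latents))) + b_dec[i]
             for i in range(len(x))]
    return latents, recon

def sae_loss(x, latents, recon, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on the latents."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    return mse + l1_coeff * sum(abs(f) for f in latents)

# Hand-set weights: 3 latents over a 2-dimensional activation, with
# encoder and decoder directions spaced 120 degrees apart
dirs = [[math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3)]
        for i in range(3)]

x = dirs[0]  # an activation lying along the first feature direction
latents, recon = sae_forward(x, dirs, [0.0] * 3, dirs, [0.0, 0.0])
loss = sae_loss(x, latents, recon, l1_coeff=0.1)

print([round(f, 2) for f in latents])
print(round(loss, 3))
```

Only the first latent fires and the input is reconstructed exactly, so the loss reduces to the sparsity penalty. Analyzing recovered features then means inspecting which inputs make each latent fire.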

[[Category:Artificial Intelligence]]
[[Category:AI Safety]]
[[Category:Interpretability]]