
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Mechanistic interpretability''' — The study of neural networks at the level of internal computations, circuits, and representations.
* '''Feature''' — A concept or property represented in the network's activation space, typically as a direction (e.g., a curve detector, which may or may not line up with a single neuron).
* '''Circuit''' — A subgraph of neurons and weights implementing a specific computation or behavior.
* '''Neuron''' — A single unit in a neural network; mechanistic interpretability analyzes what concepts individual neurons represent.
* '''Superposition''' — The hypothesis that neural networks represent more features than they have neurons by assigning features to (generally non-orthogonal) directions in activation space rather than to individual neurons.
* '''Polysemantic neuron''' — A neuron that activates for multiple unrelated concepts; often taken as evidence for superposition.
* '''Monosemantic neuron''' — A neuron that activates for a single, interpretable concept; the "ideal" interpretable unit.
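* '''Sparse autoencoder (for mech interp)''' — An autoencoder trained with a sparsity penalty to decompose a layer's activations, including those of polysemantic neurons, into a larger set of sparsely active features that tend to be more monosemantic.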
* '''Induction head''' — A specific type of attention head implementing in-context copying ("if you saw A→B before, predict B when you see A again").
* '''Logit lens''' — A technique for interpreting how a model's predictions evolve through its layers by projecting intermediate representations to vocabulary space, typically by applying the model's unembedding to them.
* '''Activation patching''' — Swapping activations between runs of a model on different inputs to identify which components are causally responsible for specific behaviors.
* '''Causal tracing''' — Finding the computational path responsible for a model's factual recall by systematically patching activations.
* '''TransformerLens''' — An open-source library by Neel Nanda for mechanistic interpretability of transformer models.
* '''Anthropic's dictionary learning''' — A research direction using sparse autoencoders to find interpretable features in LLM activations.
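
The superposition and polysemantic-neuron entries above can be illustrated with a small toy example. The sketch below is only an illustration under stated assumptions: the three feature directions, the two-neuron activation space, and the helper names <code>encode</code> and <code>readout</code> are all hypothetical and not taken from any real model or library. It packs three features into two neurons and shows that each neuron responds to several features, while a sparse input can still be read back out along the feature directions.

```python
import numpy as np

# Three hypothetical feature directions packed into a 2-neuron activation
# space, spaced 120 degrees apart (an illustrative choice, not a real model).
angles = np.deg2rad([90.0, 210.0, 330.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3 features, 2 neurons)


def encode(feature_values):
    """Map feature strengths (length 3) to neuron activations (length 2)."""
    return feature_values @ directions


def readout(activations):
    """Project onto each feature direction, then clip negative interference with a ReLU."""
    return np.maximum(activations @ directions.T, 0.0)


# Sparse case: switch on one feature at a time.
one_hot = np.eye(3)
neuron_acts = encode(one_hot)     # row i: neuron activations when only feature i is on
recovered = readout(neuron_acts)  # row i: feature strengths read back out

# Count how many features move each neuron away from zero.
features_per_neuron = (np.abs(neuron_acts) > 1e-6).sum(axis=0)

print("features each neuron responds to:", features_per_neuron)
print("sparse inputs recovered:", np.allclose(recovered, np.eye(3)))
```

In this toy setting both neurons are polysemantic, yet each feature is recovered when it is the only one active. When several features are active at once, the read-out degrades because the directions are not orthogonal, which is the interference cost that superposition trades against capacity.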
</div>
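
Activation patching, as defined in the glossary above, can be sketched on a tiny hand-written network. This is a minimal illustration, not how any particular library implements it: the weights <code>W1</code> and <code>W2</code>, the two inputs, and the <code>forward</code> helper are assumptions made up for the example. Each hidden unit's activation from a "clean" run is copied into a "corrupted" run, and the change in output measures that unit's causal effect.

```python
import numpy as np

# Hypothetical weights for a tiny 2-input, 3-hidden-unit, 1-output ReLU network.
W1 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]])
W2 = np.array([2.0, 0.0, 0.5])


def forward(x, patch=None):
    """Run the network; optionally overwrite one hidden unit with a given value."""
    hidden = np.maximum(x @ W1, 0.0)
    if patch is not None:
        index, value = patch
        hidden = hidden.copy()
        hidden[index] = value
    return hidden, float(hidden @ W2)


clean_x = np.array([1.0, 0.0])
corrupt_x = np.array([0.0, 1.0])

clean_hidden, clean_out = forward(clean_x)
_, corrupt_out = forward(corrupt_x)

# Patch each hidden unit from the clean run into the corrupted run and
# record how far the output moves from the corrupted baseline.
effects = [
    forward(corrupt_x, patch=(i, clean_hidden[i]))[1] - corrupt_out
    for i in range(W1.shape[1])
]

print("clean output:", clean_out)
print("corrupted output:", corrupt_out)
print("effect of patching each hidden unit:", effects)
```

Only hidden unit 0 changes the output when patched. Hidden unit 1 also differs between the two runs, but the output layer ignores it, so patching it has no effect: a difference in activation alone does not show causal responsibility.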

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">