
== <span style="color: #FFFFFF;">Applying</span> ==
'''Using TransformerLens for mechanistic analysis:'''
<syntaxhighlight lang="python">
import transformer_lens
from transformer_lens import HookedTransformer
import torch

# Load a model with interpretability hooks
model = HookedTransformer.from_pretrained("gpt2-small")
model.eval()

# Logit lens: see how the model's prediction evolves through the layers
prompt = "The Eiffel Tower is located in"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Project one layer's residual stream onto the vocabulary
def logit_lens(cache, model, layer_idx, k=5):
    resid = cache["resid_post", layer_idx]  # (1, seq_len, d_model)
    layer_logits = model.unembed(model.ln_final(resid))  # (1, seq_len, d_vocab)
    # Softmax over the full vocabulary first, then take the top k, so the values
    # are true probabilities rather than a renormalization of the top k logits
    top = layer_logits[0, -1].softmax(dim=-1).topk(k)
    return [(model.to_string([t.item()]), p.item())
            for t, p in zip(top.indices, top.values)]

for layer in range(model.cfg.n_layers):
    print(f"Layer {layer}: {logit_lens(cache, model, layer)}")

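# Activation patching: find which attention heads carry the fact.
# TransformerLens hook functions take (activation, hook) and return the new activation.
def make_head_patch(clean_cache, head_idx):
    def patch_head(z, hook):
        # z: (batch, seq_len, n_heads, d_head). Overwrite one head at the final
        # position with its activation from the clean run.
        z[:, -1, head_idx] = clean_cache[hook.name][:, -1, head_idx]
        return z
    return patch_head

# Clean run: `cache` from run_with_cache above.
# Corrupted run: same template with the entity swapped (an illustrative choice).
corrupt_tokens = model.to_tokens("The Colosseum is located in")
answer = model.to_single_token(" Paris")

# Patch each head's output in turn; the heads whose patch raises the " Paris"
# logit the most are the ones that "restore" the correct answer.
restored = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
with torch.no_grad():
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            patched_logits = model.run_with_hooks(
                corrupt_tokens,
                fwd_hooks=[(f"blocks.{layer}.attn.hook_z", make_head_patch(cache, head))],
            )
            restored[layer, head] = patched_logits[0, -1, answer].item()
print(restored)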
</syntaxhighlight>

; Mechanistic interpretability tools
: '''TransformerLens''' → Python library for GPT-style model analysis (Neel Nanda)
: '''BauKit''' → David Bau's toolkit for network dissection and causal tracing
: '''ROME/MEMIT''' → Locate and edit factual associations in transformer MLP layers
: '''Sparse Autoencoders''' → Decompose polysemantic activations into sparser, more interpretable features (used by Anthropic and EleutherAI, among others)
: '''Probing classifiers''' → Linear probes on activations to identify encoded concepts
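
'''A minimal linear-probe sketch:'''

The probing idea in the list above can be illustrated without loading a model. The sketch below trains a logistic-regression probe on synthetic "activations" in which a binary concept is encoded along one direction. The data and the helper names (<code>make_synthetic_activations</code>, <code>train_linear_probe</code>, <code>probe_accuracy</code>) are hypothetical and exist only for this illustration; in a real analysis the inputs would be activations read from a model, for example a residual-stream slice from a TransformerLens cache.
<syntaxhighlight lang="python">
import math
import random

def make_synthetic_activations(n=200, d_model=8, seed=0):
    """Stand-in for real activations (an assumption for this sketch):
    Gaussian noise plus a shift along dimension 0 whose sign encodes the label."""
    rng = random.Random(seed)
    acts, labels = [], []
    for i in range(n):
        label = i % 2
        shift = 3.0 if label == 1 else -3.0
        acts.append([rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0)
                     for j in range(d_model)])
        labels.append(label)
    return acts, labels

def train_linear_probe(acts, labels, lr=0.1, epochs=200):
    """Logistic regression fitted by full-batch gradient descent."""
    n, d = len(acts), len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y))
        for x, y in zip(acts, labels)
    )
    return correct / len(labels)

# Fit on one synthetic sample and evaluate on a second, independently drawn one
train_acts, train_labels = make_synthetic_activations(seed=0)
test_acts, test_labels = make_synthetic_activations(seed=1)
w, b = train_linear_probe(train_acts, train_labels)
print(f"held-out probe accuracy: {probe_accuracy(w, b, test_acts, test_labels):.2f}")
</syntaxhighlight>
High held-out accuracy indicates only that the concept is linearly decodable from the activations. It does not show that the model uses that direction, which is why probing is usually paired with causal methods such as activation patching.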
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">