Editing Mechanistic Interp (section)

== <span style="color: #FFFFFF;">Evaluating</span> ==
Mechanistic interpretability evaluation:
# '''Causal validation''': does ablating/patching the identified circuit actually change behavior as predicted? Pure correlation is insufficient.
# '''Faithfulness''': does the proposed mechanistic explanation predict model behavior on new inputs?
# '''Completeness''': does the identified circuit account for all relevant behavior, or just part of it?
# '''Human interpretability''': can a domain expert understand and verify the proposed algorithm?
# '''Generalization''': does the circuit analysis transfer to similar models with different training runs?
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">