Longterm Wiki

Mechanistic Interpretability

Interpretability (active)

Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior.

Organizations: 2
Key Papers: 2
First Proposed: 2020 (Olah et al.)
Cluster: Interpretability
Parent Area: Interpretability

Tags

function:assurance, scope:sub-field

Organizations (2)

| Organization | Role |
| --- | --- |
| Anthropic | pioneer |
| Google DeepMind | active |

Key Papers & Resources (2)


Sub-Areas (3)

| Name | Description | Status | Orgs | Papers |
| --- | --- | --- | --- | --- |
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | emerging | 0 | 0 |
| Sparse Autoencoders | Using sparse dictionary learning to extract interpretable features from model activations at scale. | active | 0 | 1 |
| Toy Models for Interpretability | Small simplified model proxies that capture key deep learning dynamics for interpretability research. | active | 0 | 0 |
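To make the sparse dictionary learning idea concrete, below is a minimal NumPy sketch of a sparse autoencoder of the kind the table describes: activations are encoded into an overcomplete set of ReLU features under an L1 sparsity penalty, then linearly decoded back. All dimensions, the penalty weight `l1_coeff`, the learning rate, and the plain-SGD training loop are illustrative assumptions, not details from this entry; production SAE training uses Adam, learned decoder norms, and other tricks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64            # activation dim, dictionary size (overcomplete: m > d)
l1_coeff = 1e-3          # sparsity penalty weight (hypothetical value)
lr = 1e-2                # SGD learning rate (hypothetical value)

# Encoder/decoder parameters: x (d) -> features f (m) -> reconstruction (d).
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))

def sae_step(x):
    """One SGD step on reconstruction error + L1 sparsity for a batch x."""
    global W_enc, b_enc, W_dec
    n = x.shape[0]
    f = np.maximum(x @ W_enc + b_enc, 0.0)    # ReLU-encoded sparse features
    x_hat = f @ W_dec                          # linear reconstruction
    err = x_hat - x
    loss = np.mean(np.sum(err**2, axis=1)) + l1_coeff * np.mean(np.sum(f, axis=1))
    # Manual gradients; the (f > 0) mask backpropagates through the ReLU.
    grad_f = (2.0 * err @ W_dec.T + l1_coeff) * (f > 0)
    W_dec -= lr * 2.0 * (f.T @ err) / n
    W_enc -= lr * (x.T @ grad_f) / n
    b_enc -= lr * grad_f.mean(axis=0)
    return loss

x = rng.normal(size=(32, d))                   # stand-in for model activations
losses = [sae_step(x) for _ in range(200)]
```

The L1 term pushes most features to zero on any given input, so each surviving feature can, ideally, be inspected as a candidate interpretable direction in activation space.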