Sparse Autoencoders
Interpretability · active
Using sparse dictionary learning to extract interpretable features from model activations at scale.
Key Papers: 1
First Proposed: 2023 (Cunningham et al.)
Cluster: Interpretability
Parent Area: Mechanistic Interpretability
Tags
function:assurance, scope:technique
Key Papers & Resources (1)
SEMINAL
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken et al., 2023
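
Illustrative sketch
The summary above describes the technique as sparse dictionary learning over model activations. A minimal sketch of that idea follows, assuming a one-hidden-layer ReLU autoencoder trained with a reconstruction loss plus an L1 sparsity penalty. The function names, dimensions, L1 coefficient, and random stand-in activations are all hypothetical choices made for illustration, not taken from any listed paper's code.

```python
import numpy as np


def init_sae(d_model, d_dict, seed=0):
    """Initialise a sparse autoencoder with an overcomplete dictionary (d_dict > d_model)."""
    rng = np.random.default_rng(seed)
    W_dec = rng.normal(size=(d_dict, d_model))
    # Normalise each dictionary row (one row per feature direction) to unit length.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_dec.T.copy(),   # (d_model, d_dict); tied initialisation is an assumption
        "b_enc": np.zeros(d_dict),
        "W_dec": W_dec,            # (d_dict, d_model)
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, x):
    """Encode activations into non-negative feature coefficients, then reconstruct."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    f = np.maximum(0.0, pre)                       # ReLU keeps coefficients >= 0
    x_hat = f @ params["W_dec"] + params["b_dec"]  # linear combination of dictionary rows
    return f, x_hat


def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty that encourages sparsity."""
    f, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity


# Random data standing in for real model activations (batch of 32, width 16).
acts = np.random.default_rng(1).normal(size=(32, 16))
params = init_sae(d_model=16, d_dict=64)
features, recon = sae_forward(params, acts)
loss = sae_loss(params, acts)
```

In practice the parameters would be optimised by gradient descent on this loss over activations collected from a real model; that training loop is omitted here to keep the sketch short.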