Longterm Wiki

Sparse Autoencoders

Interpretability | active

Using sparse dictionary learning to extract interpretable features from model activations at scale.
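As a rough illustration of the idea, the sketch below shows the forward pass and training objective of a small sparse autoencoder in NumPy: activations are encoded into a wider, non-negative feature vector, decoded back, and scored by reconstruction error plus an L1 sparsity penalty. The function names, dimensions, initialisation, and the `l1_coeff` value are all assumptions chosen for illustration, and this is not claimed to be the exact setup of any particular paper.

```python
import numpy as np


def init_sae(d_model, d_dict, seed=0):
    """Create parameters for a toy sparse autoencoder (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(size=(d_dict, d_model))
    # Normalise each dictionary direction (row) to unit length.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "W_enc": w_dec.T.copy(),      # (d_model, d_dict), initialised from the decoder
        "b_enc": np.zeros(d_dict),
        "W_dec": w_dec,               # (d_dict, d_model)
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, x):
    """Encode activations into non-negative features, then reconstruct them."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    features = np.maximum(0.0, pre)   # ReLU keeps feature activations non-negative
    x_hat = features @ params["W_dec"] + params["b_dec"]
    return features, x_hat


def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    features, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon + l1_coeff * sparsity


# Toy stand-in for model activations: 8 samples of a 16-dimensional vector.
acts = np.random.default_rng(1).normal(size=(8, 16))
params = init_sae(d_model=16, d_dict=64)
features, recon = sae_forward(params, acts)
loss = sae_loss(params, acts)
```

In practice the parameters would be trained by gradient descent on this loss over activations collected from a real model; the dictionary is typically wider than the activation space, as in the 16-to-64 expansion assumed here.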

Key Papers: 1
First Proposed: 2023 (Cunningham et al.)
Cluster: Interpretability

Tags

function:assurance, scope:technique

Key Papers & Resources (1)