Sparse Autoencoders

Interpretability · active

Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research.
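
As a rough illustration of the decomposition described above, here is a minimal NumPy sketch of a sparse autoencoder's forward pass and training objective. It assumes a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The function names, dimensions, and the `l1_coeff` value are hypothetical choices for this sketch and are not taken from any of the papers or grants listed here.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative feature coefficients, then reconstruct."""
    # ReLU keeps feature activations non-negative; most should end up at zero after training.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Each row of W_dec acts as one dictionary direction in activation space.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Illustrative sizes only: the dictionary is wider than the activation space (overcomplete).
rng = np.random.default_rng(0)
d_model, d_dict, batch = 16, 64, 8
x = rng.normal(size=(batch, d_model))  # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

In practice the weights would be trained by gradient descent on this loss over a large set of recorded activations; the sketch only evaluates the objective once on random inputs.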

Total Funding: $687K

First Proposed: 2023 (Cunningham et al.; Anthropic)

Cluster: Interpretability

Grants (9)

Name	Recipient	Amount	Funder	Date
UC Berkeley — Study on Frontier Model Behavior	University of California, Berkeley	$500K	Coefficient Giving	2025-06
Exploring the feasibility of circuit-style analysis on the level of SAE features (MATS extension)	Lucy Farnik	$41K	Long-Term Future Fund (LTFF)	2024-01
6 month salary for further pursuing sparse autoencoders for automatic feature finding	Logan Smith	$40K	Long-Term Future Fund (LTFF)	2023-07
6-month stipend for Sparse Autoencoder Mech Interp projects	Logan Smith	$40K	Long-Term Future Fund (LTFF)	2024-01
6 month stipend for SAE-circuits	Logan Smith	$40K	Long-Term Future Fund (LTFF)	2024-07
Understanding SAE features using Sparse Feature Circuits	Lovis Heindrich	$11K	Manifund	2024-06-28
Exploring feature interactions in transformer LLMs through sparse autoencoders	Kunvar Thaman	$8.5K	Manifund	2023-12-01
Train great open-source sparse autoencoders	Tom McGrath	$4K	Manifund	2024-05-09
Neuronpedia - Open Interpretability Platform	Johnny Lin	$2.5K	Manifund	2023-07-26

SEMINAL

Cunningham et al. 2023