Longterm Wiki

Mechanistic Interpretability

Interpretability (active)

Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior.

Organizations: 2
Key Papers: 2
First Proposed: 2020 (Olah et al.)
Cluster: Interpretability
Parent Area: Interpretability

Tags

function:assurance, scope:sub-field

Organizations (2)

| Organization | Role |
| --- | --- |
| Anthropic | pioneer |
| Google DeepMind | active |

Key Papers & Resources (2)


Sub-Areas (3)

| Name | Description | Status | Orgs | Papers |
| --- | --- | --- | --- | --- |
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | emerging | 0 | 0 |
| Sparse Autoencoders | Using sparse dictionary learning to extract interpretable features from model activations at scale. | active | 0 | 1 |
| Toy Models for Interpretability | Small simplified model proxies that capture key deep learning dynamics for interpretability research. | active | 0 | 0 |
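To make the sparse dictionary learning idea concrete, below is a minimal NumPy sketch of a sparse autoencoder of the kind the table describes: activations are encoded into an overcomplete set of ReLU features under an L1 sparsity penalty, then linearly decoded back. All dimensions, the penalty weight `l1_coeff`, the learning rate, and the plain-SGD training loop are illustrative assumptions, not details from this entry; production SAE training uses Adam, learned decoder norms, and other tricks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64            # activation dim, dictionary size (overcomplete: m > d)
l1_coeff = 1e-3          # sparsity penalty weight (hypothetical value)
lr = 1e-2                # SGD learning rate (hypothetical value)

# Encoder/decoder parameters: x (d) -> features f (m) -> reconstruction (d).
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))

def sae_step(x):
    """One SGD step on reconstruction error + L1 sparsity for a batch x."""
    global W_enc, b_enc, W_dec
    n = x.shape[0]
    f = np.maximum(x @ W_enc + b_enc, 0.0)    # ReLU-encoded sparse features
    x_hat = f @ W_dec                          # linear reconstruction
    err = x_hat - x
    loss = np.mean(np.sum(err**2, axis=1)) + l1_coeff * np.mean(np.sum(f, axis=1))
    # Manual gradients; the (f > 0) mask backpropagates through the ReLU.
    grad_f = (2.0 * err @ W_dec.T + l1_coeff) * (f > 0)
    W_dec -= lr * 2.0 * (f.T @ err) / n
    W_enc -= lr * (x.T @ grad_f) / n
    b_enc -= lr * grad_f.mean(axis=0)
    return loss

x = rng.normal(size=(32, d))                   # stand-in for model activations
losses = [sae_step(x) for _ in range(200)]
```

The L1 term pushes most features to zero on any given input, so each surviving feature can, ideally, be inspected as a candidate interpretable direction in activation space.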