Longterm Wiki

Interpretability (status: active)

Umbrella field of research aimed at understanding how AI models work internally.

Organizations: 3
Risks Addressed: 3
Cluster: Interpretability

Tags

function:assurance, scope:field

Organizations (3)

| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Redwood Research | active |

Sub-Areas (7)

| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Activation Monitoring | Using probes on internal activations during inference to catch misaligned actions in real time. | emerging | 0 | 0 |
| Externalizing Reasoning | Training models to reason through extended, readable chains of thought rather than opaque internal computation. | emerging | 0 | 0 |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | emerging | 0 | 0 |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features. | active | 0 | 0 |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify the circuits, features, and algorithms that explain behavior. | active | 2 | 2 |
| Representation Engineering | Intervening on model representations to steer behavior (e.g., activation addition, representation reading). | active | 0 | 1 |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable. | emerging | 0 | 0 |
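To make the Linear Probing sub-area concrete, here is a minimal, self-contained sketch: a logistic-regression probe trained on synthetic activations in which a planted direction encodes a binary feature. Everything here (dimensions, the planted feature direction, the training setup) is an illustrative assumption; in practice the activations would be hidden states extracted from a real model.

```python
import numpy as np

# Hypothetical sketch of linear probing: synthetic "activations" stand in
# for a real model's hidden states. When the feature is present (label 1),
# activations are shifted along a fixed direction; the rest is noise.
rng = np.random.default_rng(0)

d = 32   # activation dimensionality (assumed)
n = 400  # number of examples (assumed)
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)

labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 4.0 * labels[:, None] * feature_dir

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train the linear probe (logistic regression) with plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(2000):
    p = sigmoid(acts @ w + b)
    w -= 0.1 * (acts.T @ (p - labels) / n)
    b -= 0.1 * float(np.mean(p - labels))

preds = (sigmoid(acts @ w + b) > 0.5).astype(int)
accuracy = float(np.mean(preds == labels))
print(f"probe accuracy: {accuracy:.2f}")
```

Because the probe is linear, its learned weight vector can itself be inspected: on this synthetic data it ends up aligned with the planted feature direction, which is the sense in which probes "detect features" in activation space.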