Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
One of the roadblocks to a better understanding of neural networks' internals is *polysemanticity*, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is *superposition*, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
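The recipe the abstract describes can be illustrated compactly: train an autoencoder with an overcomplete hidden layer to reconstruct a language model's internal activations, while an L1 penalty pushes the hidden features toward sparse activation. Below is a minimal PyTorch sketch of that idea; the ReLU encoder, untied decoder, `dict_ratio` expansion factor, and `l1_coeff` value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder intended to pull features out of superposition."""

    def __init__(self, d_model: int, dict_ratio: int = 8):
        super().__init__()
        d_dict = d_model * dict_ratio            # more dictionary features than neurons
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))              # sparse, non-negative feature activations
        x_hat = self.decoder(f)                  # reconstruction of the LM activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage: fit to a batch of cached LM activations of width d_model.
sae = SparseAutoencoder(d_model=512)
acts = torch.randn(64, 512)                      # stand-in for real residual-stream activations
x_hat, feats = sae(acts)
loss = sae_loss(acts, x_hat, feats)
loss.backward()
```

After training, each dictionary feature's decoder column gives a candidate direction in activation space; these learned directions are what the paper's interpretability and ablation analyses operate on.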
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Interpretability | Safety Agenda | 66.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |
Cached Content Preview
# Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham¹,², Aidan Ewart\*¹,³, Logan Riggs\*¹, Robert Huben, Lee Sharkey⁴
1EleutherAI, 2MATS, 3Bristol AI Safety Centre, 4Apollo Research
{hoagycunningham, aidanprattewart, logansmith5}@gmail.com
\*Equal contribution
###### Abstract
One of the roadblocks to a better understanding of neural networks’ internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib26 "")) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
## 1 Introduction
Advances in artificial intelligence (AI) have resulted in the development of highly capable AI systems that make decisions for reasons we do not understand. This has caused concern that AI systems that we cannot trust are being widely deployed in the economy and in our lives, introducing a number of novel risks (Hendrycks et al., [2023](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib15 "")), including potential future risks that AIs might deceive humans in order to accomplish undesirable goals (Ngo et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib19 "")). Mechanistic interpretability seeks to mitigate such risks through understanding how neural networks calculate their outputs, allowing us to reverse engineer parts of their internal processes and make targeted changes to them (Cammarata et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib3 ""); Wang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib26 ""); Elhage et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib7 "")).
Footnote: Code to replicate experiments can be found at [https://github.com/HoagyC/sparse\_cod
... (truncated, 71 KB total)