Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
One of the roadblocks to a better understanding of neural networks' internals is *polysemanticity*, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is *superposition*, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
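The recipe the abstract describes can be illustrated compactly: train an autoencoder with an overcomplete hidden layer to reconstruct a language model's internal activations, while an L1 penalty pushes the hidden features toward sparse activation. Below is a minimal PyTorch sketch of that idea; the ReLU encoder, untied decoder, `dict_ratio` expansion factor, and `l1_coeff` value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder intended to pull features out of superposition."""

    def __init__(self, d_model: int, dict_ratio: int = 8):
        super().__init__()
        d_dict = d_model * dict_ratio            # more dictionary features than neurons
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))              # sparse, non-negative feature activations
        x_hat = self.decoder(f)                  # reconstruction of the LM activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage: fit to a batch of cached LM activations of width d_model.
sae = SparseAutoencoder(d_model=512)
acts = torch.randn(64, 512)                      # stand-in for real residual-stream activations
x_hat, feats = sae(acts)
loss = sae_loss(acts, x_hat, feats)
loss.backward()
```

After training, each dictionary feature's decoder column gives a candidate direction in activation space; these learned directions are what the paper's interpretability and ablation analyses operate on.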
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Interpretability | Safety Agenda | 66.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |
Cached Content Preview
# Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham¹,², Aidan Ewart\*¹,³, Logan Riggs\*¹, Robert Huben, Lee Sharkey⁴
1EleutherAI, 2MATS, 3Bristol AI Safety Centre, 4Apollo Research
{hoagycunningham, aidanprattewart, logansmith5}@gmail.com
\*Equal contribution
###### Abstract
One of the roadblocks to a better understanding of neural networks’ internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib26 "")) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
## 1 Introduction
Advances in artificial intelligence (AI) have resulted in the development of highly capable AI systems that make decisions for reasons we do not understand. This has caused concern that AI systems that we cannot trust are being widely deployed in the economy and in our lives, introducing a number of novel risks (Hendrycks et al., [2023](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib15 "")), including potential future risks that AIs might deceive humans in order to accomplish undesirable goals (Ngo et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib19 "")). Mechanistic interpretability seeks to mitigate such risks through understanding how neural networks calculate their outputs, allowing us to reverse engineer parts of their internal processes and make targeted changes to them (Cammarata et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib3 ""); Wang et al., [2022](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib26 ""); Elhage et al., [2021](https://ar5iv.labs.arxiv.org/html/2309.08600#bib.bib7 "")).
Footnote: Code to replicate experiments can be found at [https://github.com/HoagyC/sparse\_cod
... (truncated, 71 KB total)