[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2501.18823v1 \[cs.LG\] 31 Jan 2025
# Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo
Stepan Shabalin
Nora Belrose
###### Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher-dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose _skip transcoders_, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
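The skip-transcoder forward pass described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions, the TopK-after-ReLU activation, and the identity initialization of the skip matrix are all illustrative assumptions. The key difference from an SAE is that the target is the component's *output* `y` rather than the input `x`, and the skip transcoder adds the affine term `W_skip @ x`.

```python
import numpy as np

def topk_relu(h, k):
    """Keep only the k largest post-ReLU latents; zero the rest (illustrative sparsity choice)."""
    h = np.maximum(h, 0.0)
    out = h.copy()
    out[np.argsort(h)[:-k]] = 0.0  # zero all but the top-k entries
    return out

rng = np.random.default_rng(0)
d_model, d_latent, k = 16, 64, 4          # toy sizes, not from the paper

# Encoder / decoder parameters (random init for illustration)
W_enc = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_latent)
W_dec = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_latent)
b_dec = np.zeros(d_model)
W_skip = np.eye(d_model)                   # affine skip connection (the "skip" in skip transcoder)

x = rng.normal(size=d_model)               # input to the network component (e.g. an MLP block)
z = topk_relu(W_enc @ x + b_enc, k)        # sparse latent code
y_hat = W_dec @ z + W_skip @ x + b_dec     # prediction of the component's OUTPUT, not of x

assert (z != 0).sum() <= k                 # latent code is k-sparse
```

A plain transcoder is the same computation with the `W_skip @ x` term removed; a plain SAE instead trains `y_hat` to reconstruct `x` itself.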
Machine Learning, ICML
## 1 Introduction
Recently, large language models have achieved human-level reasoning performance in many tasks (Guo et al., [2025](https://arxiv.org/html/2501.18823v1#bib.bib22 "")). Interpretability aims to improve the safety and reliability of these systems by understanding their internal mechanisms and representations. While early research attempted to produce natural language explanations of individual neurons (Olah et al., [2020](https://arxiv.org/html/2501.18823v1#bib.bib33 ""); Gurnee et al., [2023](https://arxiv.org/html/2501.18823v1#bib.bib23 ""), [2024](https://arxiv.org/html/2501.18823v1#bib.bib24 "")), it is now widely recognized that most neurons are “polysemantic”, activating in semantically diverse contexts (Arora et al., [2018](https://arxiv.org/html/2501.18823v1#bib.bib1 ""); Elhage et al., [2022](https://arxiv.org/html/2501.18823v1#bib.bib17 "")).
Sparse autoencoders (SAEs) have emerged as a promising tool for partially overcoming polysemanticity, by decomposing activations into interpretable features (Bricken et al., [2023a](https://arxiv.org/html/2501.18823v1#bib.bib8 ""); Templeton et al., [2024b](https://arxiv.org/html/2501.18823v1#bib.bib39 ""); Gao et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib20 "")). SAEs are single hidden layer neural networks trained with the objective of reconstructing activations with a sparsity penalty (Bricken et al., [2023a](https://arxiv.org/html/2501.18823v1#bib.bib8 ""); Rajamanoharan et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib36 "")), sparsity constraint (Gao et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib20 ""); Bussmann et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib11 "")), or an information bottleneck (Ayonrinde et al., [2024](https://ar
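The single-hidden-layer reconstruction objective with an L1 sparsity penalty can be sketched as follows. This is a hedged NumPy illustration of the generic SAE loss, not any specific paper's training code; the dimensions, tied-style initialization, and `l1_coef` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 8, 32                 # toy sizes for illustration
W_enc = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_latent)
W_dec = W_enc.T.copy()                    # tied-style init (an illustrative choice)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Single hidden layer: encode to sparse latents, then decode back to activation space."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU latents
    x_hat = W_dec @ z + b_dec               # reconstruction of the input activations
    return z, x_hat

def sae_loss(x, l1_coef=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse latent codes."""
    z, x_hat = sae_forward(x)
    recon = np.sum((x - x_hat) ** 2)        # reconstruction term
    sparsity = l1_coef * np.sum(np.abs(z))  # L1 sparsity penalty
    return recon + sparsity

x = rng.normal(size=d_model)
loss = sae_loss(x)
assert loss >= 0.0
```

The sparsity-constraint variants cited above (e.g. TopK) replace the L1 penalty with a hard cap on the number of nonzero latents, as in the skip-transcoder sketch's activation function.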
... (truncated, 41 KB total)