[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2501.18823v1 \[cs.LG\] 31 Jan 2025
# Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo
Stepan Shabalin
Nora Belrose
###### Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher-dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose _skip transcoders_, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
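The skip-transcoder forward pass described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions, the TopK-after-ReLU activation, and the identity initialization of the skip matrix are all illustrative assumptions. The key difference from an SAE is that the target is the component's *output* `y` rather than the input `x`, and the skip transcoder adds the affine term `W_skip @ x`.

```python
import numpy as np

def topk_relu(h, k):
    """Keep only the k largest post-ReLU latents; zero the rest (illustrative sparsity choice)."""
    h = np.maximum(h, 0.0)
    out = h.copy()
    out[np.argsort(h)[:-k]] = 0.0  # zero all but the top-k entries
    return out

rng = np.random.default_rng(0)
d_model, d_latent, k = 16, 64, 4          # toy sizes, not from the paper

# Encoder / decoder parameters (random init for illustration)
W_enc = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_latent)
W_dec = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_latent)
b_dec = np.zeros(d_model)
W_skip = np.eye(d_model)                   # affine skip connection (the "skip" in skip transcoder)

x = rng.normal(size=d_model)               # input to the network component (e.g. an MLP block)
z = topk_relu(W_enc @ x + b_enc, k)        # sparse latent code
y_hat = W_dec @ z + W_skip @ x + b_dec     # prediction of the component's OUTPUT, not of x

assert (z != 0).sum() <= k                 # latent code is k-sparse
```

A plain transcoder is the same computation with the `W_skip @ x` term removed; a plain SAE instead trains `y_hat` to reconstruct `x` itself.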
Machine Learning, ICML
## 1 Introduction
Recently, large language models have achieved human-level reasoning performance in many tasks (Guo et al., [2025](https://arxiv.org/html/2501.18823v1#bib.bib22 "")). Interpretability aims to improve the safety and reliability of these systems by understanding their internal mechanisms and representations. While early research attempted to produce natural language explanations of individual neurons (Olah et al., [2020](https://arxiv.org/html/2501.18823v1#bib.bib33 ""); Gurnee et al., [2023](https://arxiv.org/html/2501.18823v1#bib.bib23 ""), [2024](https://arxiv.org/html/2501.18823v1#bib.bib24 "")), it is now widely recognized that most neurons are “polysemantic”, activating in semantically diverse contexts (Arora et al., [2018](https://arxiv.org/html/2501.18823v1#bib.bib1 ""); Elhage et al., [2022](https://arxiv.org/html/2501.18823v1#bib.bib17 "")).
Sparse autoencoders (SAEs) have emerged as a promising tool for partially overcoming polysemanticity, by decomposing activations into interpretable features (Bricken et al., [2023a](https://arxiv.org/html/2501.18823v1#bib.bib8 ""); Templeton et al., [2024b](https://arxiv.org/html/2501.18823v1#bib.bib39 ""); Gao et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib20 "")). SAEs are single hidden layer neural networks trained with the objective of reconstructing activations with a sparsity penalty (Bricken et al., [2023a](https://arxiv.org/html/2501.18823v1#bib.bib8 ""); Rajamanoharan et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib36 "")), sparsity constraint (Gao et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib20 ""); Bussmann et al., [2024](https://arxiv.org/html/2501.18823v1#bib.bib11 "")), or an information bottleneck (Ayonrinde et al., [2024](https://ar
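The single-hidden-layer reconstruction objective with an L1 sparsity penalty can be sketched as follows. This is a hedged NumPy illustration of the generic SAE loss, not any specific paper's training code; the dimensions, tied-style initialization, and `l1_coef` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 8, 32                 # toy sizes for illustration
W_enc = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_latent)
W_dec = W_enc.T.copy()                    # tied-style init (an illustrative choice)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Single hidden layer: encode to sparse latents, then decode back to activation space."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU latents
    x_hat = W_dec @ z + b_dec               # reconstruction of the input activations
    return z, x_hat

def sae_loss(x, l1_coef=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse latent codes."""
    z, x_hat = sae_forward(x)
    recon = np.sum((x - x_hat) ** 2)        # reconstruction term
    sparsity = l1_coef * np.sum(np.abs(z))  # L1 sparsity penalty
    return recon + sparsity

x = rng.normal(size=d_model)
loss = sae_loss(x)
assert loss >= 0.0
```

The sparsity-constraint variants cited above (e.g. TopK) replace the L1 penalty with a hard cap on the number of nonzero latents, as in the skip-transcoder sketch's activation function.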
... (truncated, 41 KB total)