Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo, Stepan Shabalin, Nora Belrose
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant to researchers using SAEs for mechanistic interpretability; challenges the dominance of SAEs as the go-to tool for understanding MLP computations in transformers, proposing transcoders as a superior alternative for circuit analysis.
Paper Details
Metadata
Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
Summary
This paper argues that transcoders—which learn to decompose MLP computations into interpretable features by mapping inputs to outputs—outperform sparse autoencoders (SAEs) for mechanistic interpretability tasks. The authors demonstrate that transcoders better capture the functional role of neurons in circuits, enabling cleaner circuit analysis. The work suggests transcoders should be preferred over SAEs when the goal is understanding how computations are performed rather than just representing activations.
Key Points
- Transcoders map MLP layer inputs to outputs using sparse, interpretable features, capturing functional computation rather than just activation patterns.
- Empirical comparisons show transcoders outperform SAEs on circuit-level interpretability benchmarks, producing cleaner and more faithful circuit reconstructions.
- SAEs reconstruct activations but may miss the computational role of features; transcoders directly model the input-output function of MLP blocks.
- The paper provides evidence that the choice of interpretability tool significantly affects downstream circuit analysis quality.
- Results suggest the field should revisit assumptions about SAEs as the default tool for mechanistic interpretability of transformer MLPs.
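The contrast in the points above comes down to what each model is trained to reconstruct. Below is a minimal NumPy sketch of the two objectives; all names, dimensions, the random "MLP" stand-in, and the TopK activation are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, k = 16, 64, 4  # toy sizes; real dictionaries are far larger

def topk_sparsify(z, k):
    """Keep only the k largest pre-activations per row, zeroing the rest (then ReLU)."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)

class SparseCoder:
    """Shared encoder/decoder skeleton used by both an SAE and a transcoder."""
    def __init__(self):
        self.W_enc = rng.normal(0, 0.1, (d_model, d_latent))
        self.W_dec = rng.normal(0, 0.1, (d_latent, d_model))
        self.b_enc = np.zeros(d_latent)
        self.b_dec = np.zeros(d_model)

    def forward(self, x):
        latents = topk_sparsify(x @ self.W_enc + self.b_enc, k)
        return latents @ self.W_dec + self.b_dec

# Hypothetical stand-in for an MLP block inside the model being interpreted.
W_mlp = rng.normal(0, 0.3, (d_model, d_model))
mlp = lambda x: np.maximum(x @ W_mlp, 0.0)

x = rng.normal(size=(8, d_model))   # activations entering the MLP
y = mlp(x)                          # the MLP's output

sae, transcoder = SparseCoder(), SparseCoder()

# SAE objective: reconstruct the activations themselves.
sae_loss = np.mean((sae.forward(x) - x) ** 2)
# Transcoder objective: predict the MLP's *output* from its *input*.
tc_loss = np.mean((transcoder.forward(x) - y) ** 2)
```

The two models share an architecture; only the regression target differs, which is why the transcoder's features end up describing what the MLP *does* rather than what its inputs look like.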
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo
Stepan Shabalin
Nora Belrose
Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
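The "affine skip connection" in the abstract can be read as a learned linear map from the component's input added directly to the decoder output. A minimal NumPy sketch under that assumption (random weights stand in for trained parameters; dimensions and the TopK activation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, k = 16, 64, 4  # toy dimensions, not the paper's

def topk_relu(z, k):
    """TopK activation: keep the k largest entries per row, zero the rest, then ReLU."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)

# Random stand-ins for trained parameters.
W_enc = rng.normal(0, 0.1, (d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0, 0.1, (d_latent, d_model))
b_dec = np.zeros(d_model)
W_skip = rng.normal(0, 0.1, (d_model, d_model))  # the affine skip path

def skip_transcoder(x):
    # Sparse feature path, as in a plain transcoder...
    latents = topk_relu(x @ W_enc + b_enc, k)
    sparse_out = latents @ W_dec + b_dec
    # ...plus an affine skip connection straight from input to output.
    return sparse_out + x @ W_skip

x = rng.normal(size=(8, d_model))
out = skip_transcoder(x)
```

With `W_skip` set to zero this reduces to a plain transcoder; the skip path absorbs any roughly linear part of the component's behavior, leaving the sparse features to explain the nonlinear remainder.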
Machine Learning, ICML
1 Introduction
Recently, large language models have achieved human-level reasoning performance in many tasks (Guo et al., 2025). Interpretability aims to improve the safety and reliability of these systems by understanding their internal mechanisms and representations. While early research attempted to produce natural language explanations of individual neurons (Olah et al., 2020; Gurnee et al., 2023, 2024), it is now widely recognized that most neurons are "polysemantic", activating in semantically diverse contexts (Arora et al., 2018; Elhage et al., 2022).
Sparse autoencoders (SAEs) have emerged as a promising tool for partially overcoming polysemanticity, by decomposing activations into interpretable features (Bricken et al., 2023a; Templeton et al., 2024b; Gao et al., 2024). SAEs are single hidden layer neural networks trained with the objective of reconstructing activations with a sparsity penalty (Bricken et al., 2023a; Rajamanoharan et al., 2024), sparsity constraint (Gao et al., 2024; Bussmann et al., 2024), or an information bottleneck (Ayonrinde et al., 2024). They consist of two parts: an encoder that projects activations into a sparse, high-dimensional latent space, and a decoder that reconstructs the original activations from the latents.
Bricken et al. (2023a) introduced a technique for evaluating the interpretability of SAEs by simulating them with an LLM-based scorer, similar to what had been done on neurons (Bills et al., 2023).
This approach is commonly called automated interpretability, or autointerp. SAE features perform much better on this benchmark compared to neurons, even when neurons are "sparsified" by selecting only the top-k most active neurons in a layer for analysis (Paulo et al., 2024). One problem with SAEs is that they focus on compressing intermediate activations rather than modeling the functional behavior of network components (e.g., feedforward modules).
Figure 1: Skip
... (truncated, 34 KB total)