paper
transformer-circuits.pub/2024/scaling-monosemanticity/
Data Status
Not fetched
Cited by 13 pages
| Page | Type | Quality |
|---|---|---|
| Is Interpretability Sufficient for Safety? | Crux | 49.0 |
| Why Alignment Might Be Easy | Argument | 53.0 |
| Dense Transformers | Concept | 58.0 |
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| AI Scaling Laws | Concept | 92.0 |
| Anthropic | Organization | 74.0 |
| Chris Olah | Person | 27.0 |
| Anthropic Core Views | Safety Agenda | 62.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |
| Sleeper Agent Detection | Approach | 66.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 50 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

### Authors
Adly Templeton\*, Tom Conerly\*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
May 21, 2024
\* Core Contributor; Correspondence to [henighan@anthropic.com](mailto:henighan@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2024/scaling-monosemanticity/#appendix-author-contributions).
Eight months ago, we [demonstrated](https://transformer-circuits.pub/2023/monosemantic-features/index.html) that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet,[^1] Anthropic's medium-sized production model.

[^1]: For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model (although our method also works on the base model).
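For context on the method: a sparse autoencoder here is a one-hidden-layer reconstruction model trained on a transformer's internal activations, with an L1 penalty encouraging only a few features to fire at once. Below is a minimal PyTorch sketch; the dimensions, sparsity coefficient, and loss weighting are illustrative assumptions, not the paper's actual training recipe (that is documented in the linked write-ups).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct model activations through an
    overcomplete, non-negative feature basis (dictionary learning)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> features
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    """Reconstruction error plus an L1 penalty pushing features toward sparsity.
    The coefficient here is an arbitrary placeholder, not the paper's value."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity
```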
We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
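As a rough illustration of what "multilingual" means operationally, one could compare a single feature's activation on translations of the same sentence. The sketch below assumes the `SparseAutoencoder` above, plus a hypothetical `get_activations(text)` helper returning the model's activations for a prompt with shape `(tokens, d_model)`; the helper and the feature index are both assumptions for illustration, not the paper's tooling.

```python
# Hypothetical probe: does one feature fire on the same concept across languages?
prompts = {
    "en": "The bridge spans the strait.",
    "fr": "Le pont enjambe le détroit.",
    "de": "Die Brücke überspannt die Meerenge.",
}
feature_idx = 0  # hypothetical index of a feature of interest

for lang, text in prompts.items():
    _, f = sae(get_activations(text))            # per-token feature activations
    print(lang, f[:, feature_idx].max().item())  # peak activation of that feature
```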

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to [security vulnerabilities and backdoors in code](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-code); [bias](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-bias) (including
... (truncated, 50 KB total)

Resource ID: e724db341d6e0065 | Stable ID: ZWFmNWM5Mz