[Transformer Circuits Thread](https://transformer-circuits.pub/)

# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

### Authors

Adly Templeton\*, Tom Conerly\*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

### Affiliations

[Anthropic](https://www.anthropic.com/)

### Published

May 21, 2024

\\* Core Contributor; Correspondence to [henighan@anthropic.com](mailto:henighan@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2024/scaling-monosemanticity/#appendix-author-contributions).

Eight months ago, we [demonstrated](https://transformer-circuits.pub/2023/monosemantic-features/index.html) that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet,¹ Anthropic's medium-sized production model.

¹ For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper, and it is the finetuned model, not the base pretrained model (although our method also works on the base model).
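The sparse autoencoder approach described above can be sketched minimally: a one-hidden-layer autoencoder is trained to reconstruct model activations under an L1 sparsity penalty, so that each hidden unit (a "feature") fires only for a narrow, interpretable pattern. The dimensions, the NumPy implementation, and the loss weighting below are illustrative assumptions for exposition, not the paper's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper trains SAEs with up to millions of features.
d_model, d_features = 8, 32
W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0.0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: nonnegative feature activations, most of which
    # should be zero once the sparsity penalty takes effect.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear decoder reconstructs the original activation vector
    # as a sparse combination of learned feature directions.
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparsity.
    f = encode(x)
    x_hat = decode(f)
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()

# Stand-in for a residual-stream activation from the model.
x = rng.normal(size=d_model)
loss = sae_loss(x)
```

In practice the loss is minimized over a large dataset of activations, and the resulting feature activations (the output of `encode`) are what get inspected for interpretability.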

We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to [security vulnerabilities and backdoors in code](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-code); [bias](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-bias) (including

... (truncated, 50 KB total)