paper
transformer-circuits.pub/2024/scaling-monosemanticity/
Data Status
Not fetched
Cited by 13 pages
| Page | Type | Quality |
|---|---|---|
| Is Interpretability Sufficient for Safety? | Crux | 49.0 |
| Why Alignment Might Be Easy | Argument | 53.0 |
| Dense Transformers | Concept | 58.0 |
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| AI Scaling Laws | Concept | 92.0 |
| Anthropic | Organization | 74.0 |
| Chris Olah | Person | 27.0 |
| Anthropic Core Views | Safety Agenda | 62.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |
| Sleeper Agent Detection | Approach | 66.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 50 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

### Authors
Adly Templeton\*, Tom Conerly\*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
May 21, 2024
\* Core Contributor; Correspondence to [henighan@anthropic.com](mailto:henighan@anthropic.com); [Author contributions statement below](https://transformer-circuits.pub/2024/scaling-monosemanticity/#appendix-author-contributions).
Eight months ago, we [demonstrated](https://transformer-circuits.pub/2023/monosemantic-features/index.html) that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet,[^1] Anthropic's medium-sized production model.

[^1]: For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model (although our method also works on the base model).
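For context on the method: a sparse autoencoder here is a one-hidden-layer reconstruction model trained on a transformer's internal activations, with an L1 penalty encouraging only a few features to fire at once. Below is a minimal PyTorch sketch; the dimensions, sparsity coefficient, and loss weighting are illustrative assumptions, not the paper's actual training recipe (that is documented in the linked write-ups).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct model activations through an
    overcomplete, non-negative feature basis (dictionary learning)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> features
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    """Reconstruction error plus an L1 penalty pushing features toward sparsity.
    The coefficient here is an arbitrary placeholder, not the paper's value."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity
```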
We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
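As a rough illustration of what "multilingual" means operationally, one could compare a single feature's activation on translations of the same sentence. The sketch below assumes the `SparseAutoencoder` above, plus a hypothetical `get_activations(text)` helper returning the model's activations for a prompt with shape `(tokens, d_model)`; the helper and the feature index are both assumptions for illustration, not the paper's tooling.

```python
# Hypothetical probe: does one feature fire on the same concept across languages?
prompts = {
    "en": "The bridge spans the strait.",
    "fr": "Le pont enjambe le détroit.",
    "de": "Die Brücke überspannt die Meerenge.",
}
feature_idx = 0  # hypothetical index of a feature of interest

for lang, text in prompts.items():
    _, f = sae(get_activations(text))            # per-token feature activations
    print(lang, f[:, feature_idx].max().item())  # peak activation of that feature
```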

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to [security vulnerabilities and backdoors in code](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-code); [bias](https://transformer-circuits.pub/2024/scaling-monosemanticity/#safety-relevant-bias) (including
... (truncated, 50 KB total)

Resource ID: e724db341d6e0065 | Stable ID: ZWFmNWM5Mz