[Transformer Circuits Thread](https://transformer-circuits.pub/)
# On the Biology of a Large Language Model
## We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
### Authors
Jack Lindsey†, Wes Gurnee\*, Emmanuel Ameisen\*, Brian Chen\*, Adam Pearce\*, Nicholas L. Turner\*, Craig Citro\*,
David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton,
Trenton Bricken, Callum McDougall◊, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson,
Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson\*‡
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
March 27, 2025
† Lead Contributor; \* Core Contributor; ‡ Correspondence to [joshb@anthropic.com](mailto:joshb@anthropic.com); ◊ Work performed while at Anthropic; [Author contributions statement below](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#appendix-author-contributions).
* * *
## § 1 [Introduction](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#introduction)
Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.
The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g.
- **Sparse Autoencoders Find Highly Interpretable Model Directions** [\[link\]](https://arxiv.org/pdf/2309.08600)
H. Cunningham, A. Ewart, L. Smith, R. Huben, L. Sharkey.
arXiv preprint arXiv:2309.08600. 2023.
- **Towards Monosemanticity: Decomposing Language Models With Dictionary Learning** [\[HTML\]](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn
... (truncated, 98 KB total)
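The sparse autoencoder approach cited above can be sketched in a few lines: a ReLU encoder maps a model activation into an overcomplete dictionary of candidate "features", a linear decoder reconstructs the activation, and an L1 penalty encourages only a few features to fire at once. The shapes, names, and plain-numpy forward pass below are illustrative assumptions for exposition, not the implementation used in the cited papers.

```python
# Minimal sparse-autoencoder (SAE) sketch: all dimensions and initializations
# here are assumed for illustration, not taken from the cited work.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 16, 64  # activation dim, dictionary size (both assumed)
W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))

def encode(x):
    # ReLU encoder: each unit is a candidate interpretable feature.
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    # Linear decoder: reconstruct the activation from active features.
    return W_dec @ f

def loss(x, l1=0.01):
    # Training objective: reconstruction error plus an L1 sparsity
    # penalty on the feature activations.
    f = encode(x)
    return np.sum((x - decode(f)) ** 2) + l1 * np.sum(np.abs(f))

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
f = encode(x)
print("active features:", int(np.sum(f > 0)), "of", d_dict)
print("loss:", float(loss(x)))
```

In practice such autoencoders are trained on millions of activations sampled from the model; the ReLU plus L1 combination is what pushes most dictionary entries to zero on any given input, which is the property that makes the surviving features candidates for interpretation.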