Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

Data Status

Not fetched

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Interpretability | Safety Agenda | 66.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 98 KB

[Transformer Circuits Thread](https://transformer-circuits.pub/)

# On the Biology of a Large Language Model

## We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.

### Authors

Jack Lindsey†, Wes Gurnee\*, Emmanuel Ameisen\*, Brian Chen\*, Adam Pearce\*, Nicholas L. Turner\*, Craig Citro\*,

David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton,

Trenton Bricken, Callum McDougall◊, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson,

Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson\*‡

### Affiliations

[Anthropic](https://www.anthropic.com/)

### Published

March 27, 2025

† Lead Contributor; \* Core Contributor; ‡ Correspondence to [joshb@anthropic.com](mailto:joshb@anthropic.com); ◊ Work performed while at Anthropic; [Author contributions statement below](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#appendix-author-contributions).


* * *

## § 1 [Introduction](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#introduction)

Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.

The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.

Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g.

- **Sparse Autoencoders Find Highly Interpretable Model Directions** [\[link\]](https://arxiv.org/pdf/2309.08600)

  H. Cunningham, A. Ewart, L. Smith, R. Huben, L. Sharkey.

  arXiv preprint arXiv:2309.08600. 2023.
- **Towards Monosemanticity: Decomposing Language Models With Dictionary Learning** [\[HTML\]](https://transformer-circuits.pub/2023/monosemantic-features/index.html)

  T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn.

  Transformer Circuits Thread. 2023.

... (truncated, 98 KB total)
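The sparse autoencoder work cited above learns an overcomplete dictionary of directions whose sparse, non-negative combinations reconstruct a model's internal activations, so that individual dictionary features can be inspected as candidate interpretable units. As a rough, self-contained sketch of that idea (not the cited papers' code; the layer sizes, L1 coefficient, and random stand-in activations below are illustrative assumptions):

```python
# Minimal sparse-autoencoder sketch: reconstruct activations through an
# overcomplete ReLU bottleneck while penalizing feature density.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model (overcomplete)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activations
        return x_hat, f

# Hypothetical sizes and penalty; a real run would train on cached model activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

acts = torch.randn(4096, 512)  # stand-in for activations collected from a model
for _ in range(100):
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The L1 term trades reconstruction fidelity against sparsity: lowering it yields a denser code, while raising it pushes most features to zero on any given input, which is what makes the surviving features easier to interpret.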
Resource ID: fbc2b9d822be9900 | Stable ID: ZWU0NWFhNz