Transformer Circuits Thread
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
This is the canonical landing page for Anthropic's mechanistic interpretability research program. It serves as an index to all Transformer Circuits papers and updates, and is essential reading for anyone studying AI internals for safety purposes.
Metadata
Summary
The Transformer Circuits Thread is Anthropic's primary publication hub for mechanistic interpretability research on large language models. It hosts foundational and ongoing research aimed at understanding the internal workings of transformer models, including work on circuits, features, sparse autoencoders, and attribution graphs. The thread represents a sustained research program toward making AI systems more understandable and safer.
Key Points
- Central repository for Anthropic's mechanistic interpretability research, including landmark papers on circuits, superposition, and sparse autoencoders.
- Hosts both major papers (e.g., 'On the Biology of a Large Language Model') and shorter monthly updates on incremental findings.
- Research covers attention mechanisms, feature geometry, introspection in LLMs, transcoders, and attribution graphs.
- Directly motivated by AI safety: the team aims to understand model internals to better plan for safe AI development.
- Publishes accessible formats, including toy models, cross-posts with the alignment blog, and research updates.
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| Anthropic | Organization | 74.0 |
| Chris Olah | Person | 27.0 |
| Mechanistic Interpretability | Research Area | 59.0 |
| Probing / Linear Probes | Approach | 55.0 |
Cached Content Preview
Transformer Circuits Thread
Anthropic’s Interpretability Research
A surprising fact about modern large language models is that nobody really knows how they work internally.
The Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.
April 2026
Emotion Concepts and their Function in a Large Language Model
Sofroniew et al., 2026
We find representations of emotion concepts in Claude Sonnet 4.5 and show that they causally influence its outputs.
December 2025
Circuits Cross-Post — Activation Oracles
We train language models to answer questions about their own activations in natural language.
November 2025
Circuits Updates — November 2025
A short update on harm pressure.
October 2025
Emergent Introspective Awareness in Large Language Models
Lindsey, 2025
We find evidence that language models can introspect on their internal states.
Circuits Updates — October 2025
Small updates on visual features and dictionary initialization.
When Models Manipulate Manifolds: The Geometry of a Counting Task
Gurnee et al., 2025
We find geometric structure underlying the mechanisms of a fundamental language model behavior.
September 2025
Circuits Updates — September 2025
A small update on features and in-context learning.
August 2025
Circuits Updates — August 2025
A small update: How does a persona modify the assistant’s response?
July 2025
A Toy Model of Mechanistic (Un)Faithfulness
When transcoders go awry.
Tracing Attention Computation Through Feature Interactions
Kamath et al., 2025
We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.
A Toy Model of Interference Weights
Unpacking "interference weights" in some more depth.
Sparse mixtures of linear transforms
We investigate sparse mixture of linear transforms (MOLT), a new approach to transcoders.
Circuits Updates — July 2025
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.
Automated Auditing
A note on using agents to perform automated alignment audits, including using interpretability tools.
April 2025
Circuits Updates — April 2025
A collection of small updates: jailbreaks, dense features, and spinning up on interpretability.
Progress on Attention
An update on our progress studying attention.
March 2025
On the Biology of a Large Language Model
Lindsey et al., 2025
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts.
Circuit Tracing: Revealing Computational Graphs in Language Models
Ameisen et
... (truncated, 10 KB total)