Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

This is a July 2025 progress update from Anthropic's Transformer Circuits research thread, a leading mechanistic interpretability research program; best read alongside earlier foundational papers from the same series.

Metadata

Importance: 55/100 · working paper · news

Summary

A research update from Anthropic's Transformer Circuits team summarizing recent progress in mechanistic interpretability, including advances in sparse autoencoders, feature analysis, and circuit-level understanding of transformer models. The July 2025 installment opens with two posts: revisiting A Mathematical Framework for Transformer Circuits in the language of features, and applications of interpretability to biology.

Key Points

  • Provides periodic research updates from Anthropic's mechanistic interpretability team working on transformer circuits
  • Covers ongoing work on sparse autoencoders (SAEs) as tools for decomposing neural network representations into interpretable features (an illustrative sketch follows this list)
  • Updates on circuit-level analysis tracking how information flows and transforms through attention and MLP layers
  • Part of a series of incremental research communications from the transformer-circuits.pub research thread
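
The SAE decomposition mentioned above can be illustrated with a minimal, hypothetical sketch. This is a standard sparse-autoencoder recipe, not Anthropic's implementation; the dimensions, batch size, and sparsity coefficient are made up for illustration.

```python
import torch
import torch.nn as nn

# Minimal illustrative sparse autoencoder (not Anthropic's code): decompose an
# activation vector x into non-negative feature activations f such that x is
# approximately reconstructed from f, with an L1 penalty encouraging sparsity.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x))   # feature activations (sparse after training)
        x_hat = self.dec(f)           # reconstruction of the original activation
        return x_hat, f

# Hypothetical usage: in practice the activations would come from the model under study.
sae = SparseAutoencoder(d_model=512, n_features=4096)
x = torch.randn(8, 512)                                    # batch of residual-stream activations
x_hat, f = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * f.abs().mean()   # reconstruction error + L1 sparsity penalty
```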

Cited by 3 pages

Page | Type | Quality
Dense Transformers | Concept | 58.0
Interpretability | Research Area | 66.0
Mechanistic Interpretability | Research Area | 59.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 24 KB
Transformer Circuits Thread

Circuits Updates - July 2025

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

 We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper. 

 
New Posts

• Revisiting A Mathematical Framework with the Language of Features
• Applications of Interpretability to Biology

Revisiting A Mathematical Framework with the Language of Features

 Chris Olah; edited by Adam Jermyn 
When we wrote A Mathematical Framework for Transformer Circuits, we had no way to extract features from superposition. As a result, many ways of thinking about transformers which might most naturally be described in terms of features were instead described in terms of the eigenvalues of different matrix multiplies. Here we revisit some of these ideas in the language of features, and hopefully make the work clearer as a result.

 Every attention head can be understood in terms of two matrices: the OV circuit (which describes what information the attention head reads and writes) and the QK circuit (which describes where to attend). We can describe both of these matrices in terms of features.
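
As a concrete illustration of these two matrices, here is a minimal sketch with random placeholder weights. It is not code from the post; it follows the column-vector convention of the Framework paper, so transposes may differ in other implementations.

```python
import numpy as np

# Illustrative sketch: form one attention head's OV and QK circuit matrices
# from its weight matrices (shapes and weights here are placeholders).
d_model, d_head = 512, 64
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_head, d_model))   # residual stream -> query
W_K = rng.normal(size=(d_head, d_model))   # residual stream -> key
W_V = rng.normal(size=(d_head, d_model))   # residual stream -> value
W_O = rng.normal(size=(d_model, d_head))   # head output -> residual stream

# QK circuit: where to attend. The attention score between a destination
# activation x_dst and a source activation x_src is x_dst^T W_QK x_src.
W_QK = W_Q.T @ W_K                         # (d_model, d_model)

# OV circuit: what is read from the source and written to the destination.
W_OV = W_O @ W_V                           # (d_model, d_model)

# Given a feature direction f_X (e.g. the direction for feature X), the head
# writes roughly W_OV @ f_X into the residual stream when it attends to a
# token where X is active.
f_X = rng.normal(size=d_model)
written = W_OV @ f_X
```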

 In describing these, we'll make heavy use of transformed sets of features. For example, we might have a feature X which detects some property, a feature prev(X) which detects that X was present at the previous token, and a feature say(X) which causes the model to produce an output that triggers X. These transformed features have a 1:1 correspondence with the corresponding original feature, and can be thought of as that feature with some relation applied. (In practice, we think the directions corresponding to these original features will often be a linearly transformed version of the original features. For motivation around this, see Wattenberg & Viegas 2024  on matrix binding and "echo" features.)
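
The "linearly transformed direction" picture above can be sketched as follows; the maps M_prev and M_say are hypothetical stand-ins for whatever fixed transformation relates each transformed family to the original features.

```python
import numpy as np

# Illustrative sketch (assumed, not from the post): transformed feature families
# as a single fixed linear map applied to every original feature direction.
d_model = 512
rng = np.random.default_rng(0)

d_X = rng.normal(size=d_model)                  # direction for feature X
M_prev = rng.normal(size=(d_model, d_model))    # hypothetical map: X -> prev(X)
M_say = rng.normal(size=(d_model, d_model))     # hypothetical map: X -> say(X)

d_prev_X = M_prev @ d_X   # "X was present at the previous token"
d_say_X = M_say @ d_X     # "produce an output that triggers X"
# The 1:1 correspondence comes from reusing the same map for every feature,
# in the spirit of the matrix-binding / "echo" features cited above.
```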

 Copy Heads

To start, let's consider the OV circuit of a copy head. We expect it to look something like this, converting seeing X to saying X:

[Figure not included in the cached preview: the copy head's OV circuit mapping each X feature to the corresponding say(X) feature.]
In Framework, we used positive eigenvalues of W_U W_{OV} W_E (that is, embed tokens, put them through OV, and then through the token unembeddings) as a sign that roughly this was going on. Why was that a reasonable thing to do? Well, the embeddings give a basis for something like very simple features corresponding to each token in very early layers, while the unembeddings give the corresponding "say(X)" features in very late layers. So, especially in small models, we can use them as a kind of basis for both of these s

... (truncated, 24 KB total)
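
The eigenvalue diagnostic described in the preview can be made concrete with a minimal sketch. Shapes and weights below are placeholders rather than the original analysis code; a real copy head's matrices would be loaded from a trained model.

```python
import numpy as np

# Illustrative sketch of the copy-head diagnostic: eigenvalues of the
# token-to-token OV circuit W_U W_OV W_E.
d_model, d_head, n_vocab = 512, 64, 1000
rng = np.random.default_rng(0)

W_E = rng.normal(size=(d_model, n_vocab))   # token embedding (columns are tokens)
W_U = rng.normal(size=(n_vocab, d_model))   # unembedding (rows produce token logits)
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))
W_OV = W_O @ W_V                            # the head's OV circuit

full_OV = W_U @ W_OV @ W_E                  # (n_vocab, n_vocab): token in -> token-logit out

# Copying shows up as eigenvalues with positive real part: attending to token t
# tends to increase the logit of t itself. One simple summary statistic:
eigvals = np.linalg.eigvals(full_OV)
positivity = eigvals.real.sum() / np.abs(eigvals).sum()   # near 1 for a strongly copying head
print(f"eigenvalue positivity: {positivity:.2f}")
```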
Resource ID: 0a2ab4f291c4a773 | Stable ID: sid_aTSChGdYz0