Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

Data Status

Not fetched

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| Dense Transformers | Concept | 58.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 30 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)

# Circuits Updates - July 2025

### Authors

### Affiliations

### Published

_Not published yet._

### DOI

_No DOI yet._

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

New Posts

- [Revisiting A Mathematical Framework with the Language of Features](https://transformer-circuits.pub/2025/july-update/index.html#math)
- [Applications of Interpretability to Biology](https://transformer-circuits.pub/2025/july-update/index.html#bio)

* * *

## [Revisiting A Mathematical Framework with the Language of Features](https://transformer-circuits.pub/2025/july-update/index.html\#math)

Chris Olah; edited by Adam Jermyn

When we wrote [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html), we had no way to extract features from superposition. As a result, many ways of thinking about transformers which might most naturally be described in terms of features were instead described in terms of the eigenvalues of different matrix multiplies. Here we revisit some of these ideas in the language of features, and hopefully make the work clearer as a result.

Every attention head can be understood in terms of two matrices: the OV circuit (which describes what information the attention head reads and writes) and the QK circuit (which describes where to attend). We can describe both of these matrices in terms of features.
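The factorization described above can be sketched in a few lines of numpy. This is a hedged illustration, not code from the post: the shapes and weight names are assumptions chosen for clarity, and the key point is that both circuits are `d_model × d_model` matrices of rank at most `d_head`.

```python
import numpy as np

# Illustrative sketch: an attention head's behavior factors into two
# low-rank matrices (per "A Mathematical Framework for Transformer
# Circuits"). Shapes below are assumptions, not taken from the post.
d_model, d_head = 64, 8
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_model, d_head))  # query projection
W_K = rng.normal(size=(d_model, d_head))  # key projection
W_V = rng.normal(size=(d_model, d_head))  # value projection
W_O = rng.normal(size=(d_head, d_model))  # output projection

# OV circuit: what the head reads from the residual stream and writes back.
W_OV = W_V @ W_O      # shape (d_model, d_model), rank <= d_head
# QK circuit: bilinear form scoring where the head attends.
W_QK = W_Q @ W_K.T    # shape (d_model, d_model), rank <= d_head

assert np.linalg.matrix_rank(W_OV) <= d_head
assert np.linalg.matrix_rank(W_QK) <= d_head
```

Because both circuits are products of low-rank projections, analyzing them directly (rather than the four weight matrices separately) is what makes the head's read/write and attention behavior tractable to describe.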

In describing these, we'll make heavy use of transformed sets of features. For example, we might have a feature X which detects some property, a feature prev(X) which detects that X was present at the previous token, and a feature say(X) which causes the model to produce an output that triggers X. Each transformed feature has a 1:1 correspondence with its original feature, and can be thought of as that feature with some relation applied. (In practice, we think the directions corresponding to these transformed features will often be a linearly transformed version of the original feature directions. For motivation around this, see [Wattenberg & Viégas 2024](https://arxiv.org/pdf/2407.14662) on matrix binding and "echo" features.)
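The idea that a relation like prev(·) acts as one linear map applied uniformly to every feature direction can be made concrete in a short sketch. The relation matrix `R_prev` and all shapes here are hypothetical assumptions for illustration, not details from the post:

```python
import numpy as np

# Hedged sketch of "transformed features": a single hypothetical linear
# relation R_prev maps each original feature direction X_i to its
# transformed direction prev(X_i), preserving the 1:1 correspondence.
d_model, n_features = 64, 10
rng = np.random.default_rng(1)

features = rng.normal(size=(n_features, d_model))  # rows: original directions
R_prev = rng.normal(size=(d_model, d_model))       # hypothetical "prev" relation

# Apply the same relation to every feature direction at once.
prev_features = features @ R_prev.T

# One transformed direction per original feature: the correspondence is 1:1.
assert prev_features.shape == features.shape
```

Under this picture, describing an attention head's circuits "in the language of features" amounts to asking which (possibly transformed) feature directions its OV and QK matrices read and write.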

#### Copy Heads

To start, let's consider the OV circuit of a [copy head](https://transformer-circuits.pub/2021/framework/index.html#copying--primitive-in-context-learning). We expect it to lo

... (truncated, 30 KB total)
Resource ID: 0a2ab4f291c4a773 | Stable ID: NDJiODgxOT