paper
transformer-circuits.pub/
Data Status: Not fetched
Cited by 10 pages
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Anthropic | Organization | 74.0 |
| Conjecture | Organization | 37.0 |
| Chris Olah | Person | 27.0 |
| Dario Amodei | Person | 41.0 |
| AI Alignment | Approach | 91.0 |
| Anthropic Core Views | Safety Agenda | 62.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |
| Probing / Linear Probes | Approach | 55.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 12 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Anthropic’s Interpretability Research
A surprising fact about modern large language models is that nobody really knows how they work internally. The Interpretability team strives to change that — to understand these models so that we can better plan for a future of safe AI.
December 2025
[**Circuits Cross-Post — Activation Oracles**](https://alignment.anthropic.com/2025/activation-oracles)
We train language models to answer questions about their own activations in natural language.
November 2025
[**Circuits Updates — November 2025**](https://transformer-circuits.pub/2025/november-update/index.html)
A short update on harm pressure.
October 2025
[**Emergent Introspective Awareness in Large Language Models**](https://transformer-circuits.pub/2025/introspection/index.html)
Lindsey, 2025
We find evidence that language models can introspect on their internal states.

[**Circuits Updates — October 2025**](https://transformer-circuits.pub/2025/october-update/index.html)
Small updates on visual features and dictionary initialization.

[**When Models Manipulate Manifolds: The Geometry of a Counting Task**](https://transformer-circuits.pub/2025/linebreaks/index.html)
Gurnee et al., 2025
We find geometric structure underlying the mechanisms of a fundamental language model behavior.
September 2025
[**Circuits Updates — September 2025**](https://transformer-circuits.pub/2025/september-update/index.html)
A small update on features and in-context learning.
August 2025
[**Circuits Updates — August 2025**](https://transformer-circuits.pub/2025/august-update/index.html)
A small update: How does a persona modify the assistant’s response?
July 2025
[**A Toy Model of Mechanistic (Un)Faithfulness**](https://transformer-circuits.pub/2025/faithfulness-toy-model/index.html)
When transcoders go awry.

[**Tracing Attention Computation Through Feature Interactions**](https://transformer-circuits.pub/2025/attention-qk/index.html)
Kamath et al., 2025
We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.

[**A Toy Model of Interference Weights**](https://transformer-circuits.pub/2025/interference-weights/index.html)
Unpacking "interference weights" in some more depth.

[**Sparse mixtures of linear transforms**](https://transformer-circuits.pub/2025/bulk-update/index.html)
We investigate sparse mixture of linear transforms (MOLT), a new approach to transcoders.

[**Circuits Updates — July 2025**
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.](https://transformer-circuits.pub/
... (truncated, 12 KB total)
Resource ID: 5083d746c2728ff2 | Stable ID: MGJhYTk1MW