Longterm Wiki

paper
transformer-circuits.pub/

Data Status

Not fetched

Cited by 10 pages

| Page | Type | Quality |
| --- | --- | --- |
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Anthropic | Organization | 74.0 |
| Conjecture | Organization | 37.0 |
| Chris Olah | Person | 27.0 |
| Dario Amodei | Person | 41.0 |
| AI Alignment | Approach | 91.0 |
| Anthropic Core Views | Safety Agenda | 62.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |
| Probing / Linear Probes | Approach | 55.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 12 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)

# Anthropic’s Interpretability Research

A surprising fact about modern large language models is that nobody really knows how they work internally. The Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.

December 2025

[**Circuits Cross-Post — Activation Oracles**
We train language models to answer questions about their own activations in natural language.](https://alignment.anthropic.com/2025/activation-oracles)

November 2025

[**Circuits Updates — November 2025**
A short update on harm pressure.](https://transformer-circuits.pub/2025/november-update/index.html)

October 2025

[![](https://transformer-circuits.pub/images/introspection.png)**Emergent Introspective Awareness in Large Language Models**
Lindsey, 2025
We find evidence that language models can introspect on their internal states.](https://transformer-circuits.pub/2025/introspection/index.html)

[**Circuits Updates — October 2025**
Small updates on visual features and dictionary initialization.](https://transformer-circuits.pub/2025/october-update/index.html)

[![](https://transformer-circuits.pub/images/linebreaks.png)**When Models Manipulate Manifolds: The Geometry of a Counting Task**
Gurnee et al., 2025
We find geometric structure underlying the mechanisms of a fundamental language model behavior.](https://transformer-circuits.pub/2025/linebreaks/index.html)

September 2025

[**Circuits Updates — September 2025**
A small update on features and in-context learning.](https://transformer-circuits.pub/2025/september-update/index.html)

August 2025

[**Circuits Updates — August 2025**
A small update: How does a persona modify the assistant’s response?](https://transformer-circuits.pub/2025/august-update/index.html)

July 2025

[**A Toy Model of Mechanistic (Un)Faithfulness**
When transcoders go awry.](https://transformer-circuits.pub/2025/faithfulness-toy-model/index.html)

[![](https://transformer-circuits.pub/images/attention-qk.png)**Tracing Attention Computation Through Feature Interactions**
Kamath et al., 2025
We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.](https://transformer-circuits.pub/2025/attention-qk/index.html)

[**A Toy Model of Interference Weights**
Unpacking "interference weights" in some more depth.](https://transformer-circuits.pub/2025/interference-weights/index.html)

[**Sparse mixtures of linear transforms**
We investigate sparse mixtures of linear transforms (MOLT), a new approach to transcoders.](https://transformer-circuits.pub/2025/bulk-update/index.html)

[**Circuits Updates — July 2025**
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.](https://transformer-circuits.pub/

... (truncated, 12 KB total)
Resource ID: 5083d746c2728ff2 | Stable ID: MGJhYTk1MW