paper
transformer-circuits.pub/
Data Status: Not fetched
Cited by 10 pages
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Anthropic | Organization | 74.0 |
| Conjecture | Organization | 37.0 |
| Chris Olah | Person | 27.0 |
| Dario Amodei | Person | 41.0 |
| AI Alignment | Approach | 91.0 |
| Anthropic Core Views | Safety Agenda | 62.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |
| Probing / Linear Probes | Approach | 55.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 23, 2026 · 12 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Anthropic’s Interpretability Research
A surprising fact about modern large language models is that nobody really knows how they work internally. The Interpretability team strives to change that — to understand these models so that we can better plan for a future of safe AI.
December 2025
[**Circuits Cross-Post — Activation Oracles**](https://alignment.anthropic.com/2025/activation-oracles)
We train language models to answer questions about their own activations in natural language.
November 2025
[**Circuits Updates — November 2025**](https://transformer-circuits.pub/2025/november-update/index.html)
A short update on harm pressure.
October 2025
[**Emergent Introspective Awareness in Large Language Models**](https://transformer-circuits.pub/2025/introspection/index.html)
Lindsey, 2025
We find evidence that language models can introspect on their internal states.

[**Circuits Updates — October 2025**](https://transformer-circuits.pub/2025/october-update/index.html)
Small updates on visual features and dictionary initialization.

[**When Models Manipulate Manifolds: The Geometry of a Counting Task**](https://transformer-circuits.pub/2025/linebreaks/index.html)
Gurnee et al., 2025
We find geometric structure underlying the mechanisms of a fundamental language model behavior.
September 2025
[**Circuits Updates — September 2025**](https://transformer-circuits.pub/2025/september-update/index.html)
A small update on features and in-context learning.
August 2025
[**Circuits Updates — August 2025**](https://transformer-circuits.pub/2025/august-update/index.html)
A small update: How does a persona modify the assistant’s response?
July 2025
[**A Toy Model of Mechanistic (Un)Faithfulness**](https://transformer-circuits.pub/2025/faithfulness-toy-model/index.html)
When transcoders go awry.

[**Tracing Attention Computation Through Feature Interactions**](https://transformer-circuits.pub/2025/attention-qk/index.html)
Kamath et al., 2025
We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.

[**A Toy Model of Interference Weights**](https://transformer-circuits.pub/2025/interference-weights/index.html)
Unpacking "interference weights" in some more depth.

[**Sparse mixtures of linear transforms**](https://transformer-circuits.pub/2025/bulk-update/index.html)
We investigate sparse mixture of linear transforms (MOLT), a new approach to transcoders.

[**Circuits Updates — July 2025**
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.](https://transformer-circuits.pub/
... (truncated, 12 KB total)
Resource ID: 5083d746c2728ff2 | Stable ID: MGJhYTk1MW