Longterm Wiki

Transformer Circuits Thread

paper

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

This is the canonical landing page for Anthropic's mechanistic interpretability research program; it serves as an index to all Transformer Circuits papers and updates, and is essential reading for anyone studying AI internals for safety purposes.

Metadata

Importance: 88/100 · blog post · homepage

Summary

The Transformer Circuits Thread is Anthropic's primary publication hub for mechanistic interpretability research on large language models. It hosts foundational and ongoing research aimed at understanding the internal workings of transformer models, including work on circuits, features, sparse autoencoders, and attribution graphs. The thread represents a sustained research program toward making AI systems more understandable and safer.
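To make the "sparse autoencoders" mentioned above concrete, here is a minimal, hypothetical sketch in PyTorch of the core idea: an overcomplete autoencoder trained to reconstruct a transformer's internal activations while an L1 penalty pushes most feature activations to zero. This is an illustrative toy, not Anthropic's implementation; the class name, dimensions, and the l1_coeff value are assumptions for the example.

```python
import torch
import torch.nn as nn

class ToySparseAutoencoder(nn.Module):
    """Hypothetical toy SAE: decomposes d_model-dimensional activations
    into d_features >> d_model sparse, potentially interpretable features."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)           # approximate original activations
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

Trained over large numbers of activations, individual feature directions often align with human-interpretable concepts; this dictionary-learning move underlies much of the features and attribution-graph work indexed here.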

Key Points

  • Central repository for Anthropic's mechanistic interpretability research, including landmark papers on circuits, superposition, and sparse autoencoders.
  • Hosts both major papers (e.g., 'On the Biology of a Large Language Model') and shorter monthly updates on incremental findings.
  • Research covers attention mechanisms, feature geometry, introspection in LLMs, transcoders, and attribution graphs.
  • Directly motivated by AI safety: the team aims to understand model internals to better plan for safe AI development.
  • Publishes in a range of accessible formats, including toy-model studies, cross-posts with Anthropic's alignment blog, and research updates.

Cited by 4 pages

Page                          Type            Quality
Anthropic                     Organization    74.0
Chris Olah                    Person          27.0
Mechanistic Interpretability  Research Area   59.0
Probing / Linear Probes       Approach        55.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 10 KB
Transformer Circuits Thread

Anthropic’s Interpretability Research

A surprising fact about modern large language models is that nobody really knows how they work internally. The Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.

April 2026

Emotion Concepts and their Function in a Large Language Model
Sofroniew et al., 2026
We find representations of emotion concepts in Claude Sonnet 4.5 and show that they causally influence its outputs.

December 2025

Circuits Cross-Post — Activation Oracles
We train language models to answer questions about their own activations in natural language.

November 2025

Circuits Updates — November 2025
A short update on harm pressure.

October 2025

Emergent Introspective Awareness in Large Language Models
Lindsey, 2025
We find evidence that language models can introspect on their internal states.

Circuits Updates — October 2025
Small updates on visual features and dictionary initialization.

When Models Manipulate Manifolds: The Geometry of a Counting Task
Gurnee et al., 2025
We find geometric structure underlying the mechanisms of a fundamental language model behavior.

September 2025

Circuits Updates — September 2025
A small update on features and in-context learning.

August 2025

Circuits Updates — August 2025
A small update: How does a persona modify the assistant’s response?

July 2025

A Toy Model of Mechanistic (Un)Faithfulness
When transcoders go awry.

Tracing Attention Computation Through Feature Interactions
Kamath et al., 2025
We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.

A Toy Model of Interference Weights
Unpacking "interference weights" in some more depth.

Sparse mixtures of linear transforms
We investigate sparse mixtures of linear transforms (MOLT), a new approach to transcoders.

Circuits Updates — July 2025
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.

Automated Auditing
A note on using agents to perform automated alignment audits, including using interpretability tools.


April 2025

Circuits Updates — April 2025
A collection of small updates: jailbreaks, dense features, and spinning up on interpretability.

Progress on Attention
An update on our progress studying attention.

March 2025

On the Biology of a Large Language Model
Lindsey et al., 2025
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts.

Circuit Tracing: Revealing Computational Graphs in Language Models
Ameisen et
... (truncated, 10 KB total)
Resource ID: 5083d746c2728ff2 | Stable ID: sid_Voqkpakv6h