
Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Transformer Circuits

This 2021 Anthropic paper is considered foundational for mechanistic interpretability: it introduced induction heads and the residual-stream framework, and helped articulate the superposition hypothesis, concepts that underpin much subsequent interpretability research.

Metadata

Importance: 92/100 · blog post · primary source

Summary

This foundational paper from Anthropic's interpretability team develops a mathematical framework for understanding transformer neural networks as compositions of circuits. It introduces key concepts like attention heads as independent computations, the residual stream as a communication channel, and the superposition hypothesis, providing tools to reverse-engineer how transformers implement algorithms.
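In the paper's own notation (reconstructed here, so treat it as a sketch: layer norm, biases, and positional terms are omitted; $W_E$ is the embedding, $W_U$ the unembedding, $A^h$ the attention pattern of head $h$, and $W_{OV}^h = W_O^h W_V^h$), the central one-layer result expresses the logits as a direct path plus one independent term per attention head:

$$T = \mathrm{Id} \otimes W_U W_E \;+\; \sum_{h} A^h \otimes \left( W_U W_{OV}^h W_E \right)$$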

Key Points

  • Introduces the 'residual stream' perspective: each layer reads from and writes to a shared residual stream rather than operating sequentially
  • Decomposes attention heads into independent query-key-value operations, enabling circuit-level analysis of multi-head attention (see the sketch after this list)
  • Articulates the superposition hypothesis: models represent more features than they have dimensions by encoding them in overlapping, near-orthogonal directions
  • Identifies induction heads as a concrete mechanistic circuit for in-context learning, demonstrating how circuits implement algorithms
  • Establishes foundational vocabulary and methodology for mechanistic interpretability research on transformer models
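The first two points can be made concrete in code. Below is a minimal NumPy sketch (hypothetical code, not from the paper; shapes and names are invented for illustration) of the residual-stream view: each head computes where to attend via its QK circuit and what to move via its OV circuit, and all heads write their outputs additively back into the shared stream.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_layer(x, heads, mask):
    """One attention-only layer in the residual-stream view.

    x: [seq, d_model] residual stream; each head reads from it
    and writes back into it independently.
    """
    out = np.zeros_like(x)
    for h in heads:
        # QK circuit: a bilinear form on the stream decides where to attend.
        scores = (x @ h["W_Q"]) @ (x @ h["W_K"]).T / np.sqrt(h["W_Q"].shape[1])
        A = softmax(np.where(mask, scores, -1e9))        # [seq, seq] pattern
        # OV circuit: decides what information each head copies.
        out += A @ x @ h["W_V"] @ h["W_O"]               # [seq, d_model]
    return x + out  # heads act independently and add into the stream

rng = np.random.default_rng(0)
seq, d_model, d_head, n_heads = 8, 32, 8, 4

def make_head():
    return {
        "W_Q": rng.normal(0, 0.1, (d_model, d_head)),
        "W_K": rng.normal(0, 0.1, (d_model, d_head)),
        "W_V": rng.normal(0, 0.1, (d_model, d_head)),
        "W_O": rng.normal(0, 0.1, (d_head, d_model)),
    }

x = rng.normal(size=(seq, d_model))                 # embedded tokens
causal = np.tril(np.ones((seq, seq), dtype=bool))   # causal mask
for _ in range(2):   # a two-layer attention-only model, as in the paper
    x = attn_layer(x, [make_head() for _ in range(n_heads)], causal)
```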

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
Transformer Circuits Thread

A Mathematical Framework for Transformer Circuits
 
Authors: Nelson Elhage∗†, Neel Nanda∗, Catherine Olsson∗, Tom Henighan†, Nicholas Joseph†, Ben Mann†, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah‡

Affiliation: Anthropic

Published: Dec 22, 2021

∗ Core Research Contributor; † Core Infrastructure Contributor; ‡ Correspondence to colah@anthropic.com; Author contributions statement below.

Transformer language models are an emerging technology that is gaining increasingly broad real-world use, for example in systems like GPT-3, LaMDA, Codex, Meena, Gopher, and similar models. However, as these models scale, their open-endedness and high capacity create an increasing scope for unexpected and sometimes harmful behaviors. Even years after a large model is trained, both creators and users routinely discover model capabilities, including problematic behaviors, that they were previously unaware of.

One avenue for addressing these issues is mechanistic interpretability: attempting to reverse engineer the detailed computations performed by transformers, similar to how a programmer might try to reverse engineer complicated binaries into human-readable source code. If this were possible, it could potentially provide a more systematic approach to explaining current safety problems, identifying new ones, and perhaps even anticipating the safety problems of powerful future models that have not yet been built. A previous project, the Distill Circuits thread, has attempted to reverse engineer vision models, but so far there hasn't been a comparable project for transformers or language models.

In this paper, we attempt to take initial, very preliminary steps towards reverse-engineering transformers. Given the incredible complexity and size of modern language models, we have found it most fruitful to start with the simplest possible models and work our way up from there. Our aim is to discover simple algorithmic patterns, motifs, or frameworks that can subsequently be applied to larger and more complex models. Specifically, in this paper we will study transformers with two layers or fewer which have only attention blocks; this is in contrast to a large, modern transformer like GPT-3, which has 96 layers and alternates attention blocks with MLP blocks.
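Expanding a two-layer attention-only model this way is what gives the framework its leverage: composition terms appear in which one head's output feeds another head's input, behaving like "virtual attention heads". Reconstructed in the paper's notation (a sketch, keeping only V-composition terms and again omitting layer norm and positional pieces), the second-order terms have the form:

$$\sum_{h_2,\, h_1} \left( A^{h_2} A^{h_1} \right) \otimes \left( W_U\, W_{OV}^{h_2} W_{OV}^{h_1} W_E \right)$$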

We find that by conceptualizing the operation of transformers in a new but mathematically equivalent way, we are able to make sense of these small models and gain significant understanding of how they operate internally. Of particular note, we find that specific attention heads that we term "induction heads" can explain
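To make the truncated claim concrete: an induction head implements a prefix-matching-and-copy rule over the context, roughly "if the current token occurred earlier, attend to the token that followed it and predict that". A toy behavioral illustration (hypothetical code, not the paper's; the real mechanism arises from two attention layers composing):

```python
def induction_prediction(tokens):
    """Predict the next token by copying what followed the most
    recent earlier occurrence of the current token."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for a match
        if tokens[i] == last:
            return tokens[i + 1]               # copy the token after it
    return None

# Character-level example: the final 'a' in "...the ca" matches the 'a'
# in "sat", so the rule predicts the following 't'.
assert induction_prediction(list("the cat sat. the ca")) == "t"
```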

... (truncated, 98 KB total)
Resource ID: b948d6282416b586 | Stable ID: sid_5Bz9781P9P