TransformerLens: A Library for Mechanistic Interpretability of Language Models
Web Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: GitHub
TransformerLens is a widely-used open-source library for mechanistic interpretability research on GPT-style language models, enabling researchers to inspect and manipulate internal activations to reverse-engineer learned algorithms — a core technique in AI safety interpretability work.
Metadata
Summary
TransformerLens is a Python library created by Neel Nanda that enables mechanistic interpretability research on GPT-2 style language models. It supports 50+ open-source models and allows researchers to cache, edit, and analyze internal activations. It has been used in numerous influential interpretability papers including work on grokking, circuit discovery, and neuron representations.
Key Points
- Supports loading and inspecting internals of 50+ open-source language models including GPT-2
- Allows caching any internal activation and adding hooks to edit, remove, or replace activations at runtime
- Used in foundational mechanistic interpretability research (grokking, induction heads, circuit discovery, etc.)
- Maintained by Bryce Meyer, originally created by Neel Nanda; actively developed with 3.4k+ GitHub stars
- Provides tutorials and demos to lower the barrier to entry for mechanistic interpretability research
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Probing / Linear Probes | Approach | 55.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |
Cached Content Preview
TransformerLens
A Library for Mechanistic Interpretability of Generative Language Models. Maintained by Bryce Meyer and created by Neel Nanda
This is a library for doing mechanistic interpretability of GPT-2 style language models. The goal of mechanistic interpretability is to take a trained model and reverse-engineer, from its weights, the algorithms the model learned during training.

TransformerLens lets you load in 50+ different open-source language models and exposes the internal activations of the model to you. You can cache any internal activation in the model, and add in functions to edit, remove, or replace these activations as the model runs.
Quick Start

Install

```shell
pip install transformer_lens
```

For Python 3.8 or 3.9:

```shell
pip install 'transformer_lens~=2.0'
```
Use

```python
from transformer_lens.model_bridge import TransformerBridge

# Load a model (eg GPT-2 Small)
bridge = TransformerBridge.boot_transformers("gpt2", device="cpu")

# Run the model and get logits and activations
logits, activations = bridge.run_with_cache("Hello World")
```
TransformerBridge is the recommended 3.0 path and supports 50+ architectures. The legacy HookedTransformer.from_pretrained API is still available through a compatibility layer but is deprecated; see the Migrating to TransformerLens 3 guide for conversion recipes.
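The cache-and-hook pattern described above can be illustrated with a minimal, library-free sketch: every intermediate value passes through a named hook point where callbacks can record or overwrite it as the model runs. All names here (`HookPoint`, `TinyModel`) are illustrative stand-ins, not the TransformerLens API, and the "model" is just two doubling layers.

```python
# Minimal sketch of the cache-and-hook idea: each layer's output flows
# through a named HookPoint, and callbacks attached to that point can
# record the value (caching) or replace it (activation patching/ablation).
# HookPoint and TinyModel are hypothetical names for illustration only.

class HookPoint:
    def __init__(self, name):
        self.name = name
        self.hooks = []  # callbacks: fn(value, name) -> possibly-new value

    def __call__(self, value):
        for fn in self.hooks:
            value = fn(value, self.name)
        return value


class TinyModel:
    """Two 'layers', each of which doubles its input, wrapped in hook points."""

    def __init__(self):
        self.hook_points = {
            f"layer{i}.out": HookPoint(f"layer{i}.out") for i in range(2)
        }

    def forward(self, x):
        for i in range(2):
            x = self.hook_points[f"layer{i}.out"](x * 2)
        return x

    def run_with_cache(self, x):
        # Temporarily attach a recording hook to every hook point,
        # run the model, then detach the hooks.
        cache = {}

        def record(value, name):
            cache[name] = value
            return value

        for hp in self.hook_points.values():
            hp.hooks.append(record)
        try:
            out = self.forward(x)
        finally:
            for hp in self.hook_points.values():
                hp.hooks.remove(record)
        return out, cache


model = TinyModel()
out, cache = model.run_with_cache(1)
print(out, cache)  # 4 {'layer0.out': 2, 'layer1.out': 4}

# "Ablate" layer0 by replacing its activation with 0, mirroring the
# edit/remove/replace workflow the README describes.
model.hook_points["layer0.out"].hooks.append(lambda value, name: 0)
print(model.forward(1))  # 0 (layer1 doubles the zeroed activation)
```

In TransformerLens itself the same idea applies at scale: hook points are attached to real transformer activations (attention patterns, MLP outputs, residual stream), and the cache returned by `run_with_cache` maps their names to tensors.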
Key Tutorials
Introduction to the Library and Mech Interp
Demo of Main TransformerLens Features
Gallery
Research done involving TransformerLens:
- Progress Measures for Grokking via Mechanistic Interpretability (ICLR Spotlight, 2023) by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
- Finding Neurons in a Haystack: Case Studies with Sparse Probing by Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
- Towards Automated Circuit Discovery for Mechanistic Interpretability by Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
- Actually, Othel
... (truncated, 10 KB total)