Neel Nanda's Mechanistic Interpretability Research Hub
web · neelnanda.io · neelnanda.io/mechanistic-interpretability
Neel Nanda's mechanistic interpretability page aggregates his research posts, guides, and walkthroughs on understanding transformer internals, making it a central hub for researchers entering the field.
Metadata
Importance: 82/100 · homepage
Summary
This page serves as a curated index of Neel Nanda's mechanistic interpretability work, including research posts on superposition, induction heads, attribution patching, and Othello-GPT, as well as introductory guides and paper walkthroughs. It covers both original research contributions and educational resources for newcomers to the field. The content spans from foundational explainers to cutting-edge empirical findings about how transformers represent and process information.
Key Points
- Hosts original research including attribution patching, emergent positional embeddings, and linear world representations in Othello-GPT.
- Provides comprehensive educational resources: quickstart guides, prerequisites, glossaries, and annotated reading lists for mechanistic interpretability.
- Includes walkthroughs of key papers like 'A Mathematical Framework for Transformer Circuits' and 'Toy Models of Superposition'.
- Covers practical research methodology including activation patching, circuit analysis, and reverse-engineering model computations (see the sketch after this list).
- Serves as a primary entry point for researchers wanting to contribute to mechanistic interpretability as an AI safety research agenda.
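As a concrete illustration of the activation-patching methodology mentioned in the key points, here is a minimal sketch using TransformerLens (Neel Nanda's interpretability library). The model choice, prompt pair, patched layer and position, and the logit-difference metric are illustrative assumptions rather than details taken from the hub itself.

```python
# Minimal activation-patching sketch with TransformerLens.
# Prompts, layer, position, and metric are illustrative assumptions.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

# Cache every activation from the clean run so pieces can be spliced into the corrupted run.
_, clean_cache = model.run_with_cache(clean_tokens)

def logit_diff(logits):
    # How strongly the model prefers " Mary" over " John" as the next token.
    mary = model.to_single_token(" Mary")
    john = model.to_single_token(" John")
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

LAYER, POS = 6, 2  # which residual-stream slot to patch (assumed for illustration)
hook_name = utils.get_act_name("resid_pre", LAYER)

def patch_resid(resid, hook):
    # Overwrite one position of the corrupted run's residual stream with its clean value.
    resid[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid

corrupt_logits = model(corrupt_tokens)
patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)])
print("corrupted logit diff:", logit_diff(corrupt_logits))
print("patched logit diff:  ", logit_diff(patched_logits))
```
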
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Probing / Linear Probes | Approach | 55.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 28, 2026 · 4 KB
Neel Nanda · 7/18/23
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words
A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent e.g. "this token is the second name in the sentence".

Neel Nanda · 3/28/23
Actually, Othello-GPT Has A Linear Emergent World Representation
A write-up of work extending and building on the paper Emergent World Representations.

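To make the post's "linear emergent world representation" claim concrete, here is a rough linear-probe sketch in its spirit. The activation and label tensors below are random stand-ins (a real run would cache Othello-GPT's residual-stream activations and the true board states), and the 0/1/2 "empty / mine / theirs" square coding follows the post's framing; everything else is assumed for illustration.

```python
# Linear-probe sketch: is the board state linearly decodable from activations?
import torch
import torch.nn as nn

d_model, n_moves = 512, 10_000
resid_acts = torch.randn(n_moves, d_model)          # stand-in for cached Othello-GPT activations
board_labels = torch.randint(0, 3, (n_moves, 64))   # stand-in: 0=empty, 1=mine, 2=theirs per square

probe = nn.Linear(d_model, 64 * 3)                  # one 3-way linear classifier per board square
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    logits = probe(resid_acts).view(n_moves, 64, 3)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 3), board_labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the representation is linear, per-square accuracy should be high on held-out moves.
with torch.no_grad():
    preds = probe(resid_acts).view(n_moves, 64, 3).argmax(dim=-1)
print("probe accuracy:", (preds == board_labels).float().mean().item())
```
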
Neel Nanda · 3/12/23
Paper Replication Walkthrough: Reverse-Engineering Modular Addition

Neel Nanda · 2/4/23
Attribution Patching: Activation Patching At Industrial Scale
A write-up of an incomplete project I worked on at Anthropic in early 2022, using gradient-based approximation to make activation patching far more scalable.

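The approximation the post describes (one clean run, one corrupted run, and one backward pass, then a linear estimate of every patch's effect) can be sketched in plain PyTorch. The `attribution_patch` helper below, its arguments, and the assumption that the hooked layer returns a single tensor are hypothetical illustrations, not the post's actual code.

```python
# Attribution-patching sketch: approximate activation patching with gradients.
import torch

def attribution_patch(model, layer, clean_input, corrupt_input, metric):
    """Linearly estimate, per position, how much patching `layer`'s output from the
    corrupted run toward its clean value would change `metric`, using one backward pass.
    Assumes `layer` is an nn.Module whose forward returns a single tensor."""
    store = {}

    def save_clean(module, inputs, output):
        store["clean"] = output.detach()

    def save_corrupt(module, inputs, output):
        output.retain_grad()                 # keep d(metric)/d(activation) after backward()
        store["corrupt"] = output

    # Clean run: record the activation only.
    handle = layer.register_forward_hook(save_clean)
    with torch.no_grad():
        model(clean_input)
    handle.remove()

    # Corrupted run: record the activation and backprop the metric through it.
    handle = layer.register_forward_hook(save_corrupt)
    metric(model(corrupt_input)).backward()
    handle.remove()

    clean_act, corrupt_act = store["clean"], store["corrupt"]
    # Attribution-patching approximation:
    #   delta_metric ≈ (clean_act - corrupt_act) · d(metric)/d(corrupt_act)
    return ((clean_act - corrupt_act) * corrupt_act.grad).sum(dim=-1)
```
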
Neel Nanda · 2/4/23
Mech Interp Project Advising Call: Memorisation in GPT-2 Small

Neel Nanda · 1/31/23
Mechanistic Interpretability Quickstart Guide
An intro guide to a mechanistic interpretability weekend hackathon.

Neel Nanda · 12/27/22
A Walkthrough of Toy Models of Superposition

Neel Nanda · 12/26/22
Analogies between Software Reverse Engineering and Mechanistic Interpretability

Neel Nanda · 12/25/22
Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda · 12/21/22
A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda · 11/22/22
A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda · 11/7/22
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda · 11/1/22
Re
... (truncated, 4 KB total)
Resource ID: 46841681c285ec4c | Stable ID: sid_UI8NLzDXIA