200 Concrete Open Problems in Mechanistic Interpretability: Introduction
Credibility Rating
3/5 (Good): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
A foundational resource for anyone entering mechanistic interpretability research; widely cited in the AI safety community as a practical starting point for identifying research directions and open problems.
Forum Post Details
Karma
108
Comments
0
Forum
lesswrong
Forum Tags
Interpretability (ML & AI), Open Problems, Transformer Circuits, AI
Part of sequence: 200 Concrete Open Problems in Mechanistic Interpretability
Metadata
Importance: 82/100 (reference)
Summary
This LessWrong post by Neel Nanda introduces a structured research agenda outlining 200 concrete open problems in mechanistic interpretability, aimed at helping researchers—especially newcomers—find tractable and impactful projects. It serves as a roadmap for advancing the scientific understanding of neural network internals as a path toward AI safety.
Key Points
- Provides 200 specific, tractable research problems to guide mechanistic interpretability work, lowering the barrier to entry for new researchers.
- Frames mechanistic interpretability as a core AI safety research direction by building tools to understand what neural networks are actually doing internally.
- Organizes problems by difficulty and sub-topic, making it a practical reference for researchers at varying experience levels.
- Emphasizes the importance of replication, extension, and systematization of existing interpretability findings before tackling frontier models.
- Reflects Neel Nanda's broader mission to grow the mechanistic interpretability research community within the AI safety ecosystem.
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 23 KB
# 200 Concrete Open Problems in Mechanistic Interpretability: Introduction
By Neel Nanda
Published: 2022-12-28
### **EDIT 19/7/24**: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend it as a reliable source of problems worth working on, e.g. it doesn't at all discuss Sparse Autoencoders, which I think are one of the more interesting areas around today. Hopefully one day I'll have the time to make a v2!
* * *
*This is the first post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. If you want to learn the basics before you think about open problems, check out [my post on getting started](https://neelnanda.io/getting-started).*
**Skip to** [**the final section of this post**](https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability#Overview_of_Sequence) **for an overview of the posts in the sequence**
Introduction
------------
Mechanistic Interpretability (MI) is the study of reverse engineering neural networks: taking an inscrutable stack of matrices where we know *that* it works, and trying to reverse engineer *how* it works. And often this inscrutable stack of matrices can be decompiled to a human interpretable algorithm! In my (highly biased) opinion, this is one of the most exciting research areas in ML.
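To make "looking inside the stack of matrices" concrete, here is a minimal sketch using TransformerLens, Nanda's open-source interpretability library. The library is not named in this excerpt, and the model, prompt, and printed quantities below are illustrative assumptions rather than anything from the original post. The sketch loads GPT-2, caches every intermediate activation on a prompt, and reads off one attention pattern:

```python
# Minimal illustrative sketch (assumes `pip install transformer_lens`);
# not from the original post.
from transformer_lens import HookedTransformer

# Load a small pretrained model wrapped with interpretability hooks.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "The Eiffel Tower is in the city of"
tokens = model.to_tokens(prompt)  # [batch=1, seq] token ids, BOS prepended

# Run the model and cache every intermediate activation.
logits, cache = model.run_with_cache(tokens)

# Attention pattern of layer 0: shape [batch, n_heads, query_pos, key_pos].
layer0_attn = cache["pattern", 0]
print("layer-0 attention shape:", tuple(layer0_attn.shape))

# Greedy next-token prediction, decoded back to text.
next_id = logits[0, -1].argmax().item()
print("next token:", model.tokenizer.decode([next_id]))
```

Cached activations like these are the raw material for the kind of circuit-level analysis the sequence's open problems build on.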
There are a lot of reasons to care about mechanistic interpretability research happening. First and foremost, I think that mechanistic interpretability done right can be highly relevant for alignment. In particular, can we tell whether a model is doing a task well because it's deceiving us or because it genuinely wants to be helpful? Without being able to look at *how* a task is being done, these are essentially indistinguishable when facing a sufficiently capable model. But the field also raises a lot of fascinating scientific questions: how do models actually work? Are there fundamental principles and laws underlying them, or is it all an inscrutable mess?
It is a fact about today’s world that there exist computer programs like GPT-3 that can essentially speak English at a human level, but we have no idea how to write these programs in normal code. It offends me that this is the case, and I see part of the goal of mechanistic interpretability as solving this! And I think that this would be a profound scientific accomplishment.
### Purpose
In addition to being very important, mechanistic interpretability is also a very *young* field, full of low-hanging fruit. There are many fascinating open research questions that might have really impactful results! The point of this sequence is to put my money where my mouth is, and make this concrete. Each post in this sequence is a different category where I think there's room for significant progress, and a brainstorm of concrete open problems in that area.
Further, you don’t need a ton of experience to start getting traction
... (truncated, 23 KB total)
Resource ID: 1b05b61e615eaf6c | Stable ID: sid_F1nCjT4rTy