# Assessing skeptical views of interpretability research

Credit: [Tom Brink](https://radtom.com/Severance-MDR-Terminal)
By [Christopher Potts](https://web.stanford.edu/~cgpotts/) – August 8, 2025
[Goodfire](https://www.goodfire.ai/) and [Anthropic](https://www.anthropic.com/) have jointly organized a meet-up of academic and industry researchers called “Interpretability: the next 5 years”, to be held later this month. Participants have been invited to contribute short discussion documents. This is a draft of my document, which I am posting publicly to try to stimulate discussion in the broader community.
It’s an awkward time for interpretability research in AI. On the one hand, the pace of technical innovation has been incredible, and the number of success stories is rising fast. On the other hand, I think all of us in the field are accustomed to responses to our work that range from skepticism to dismissal.
The dismissals still surprise me. I assure you I am not that old, but I have still seen many areas shift from being perceived as obscure, irrelevant dead-ends to _the thing_ defining the future. The most prominent example is neural network research itself – once mocked as a dead-end, now the lifeblood of the field. The most extreme and sudden shift is natural language generation; this used to be a niche topic, and now it practically defines AI itself. The tasks grouped under the heading of “Natural Language Understanding” similarly shifted from the periphery to the mainstream over the last 10 years. The most recent sea change is reinforcement learning, which went from deeply unfashionable to the hottest thing seemingly overnight [in 2022](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf).
Having lived through all these transitions, I am really shy about dismissing anything out of hand. I worry about the damage I could do by steering people away from what turn out to be significant topics. It seems much wiser to proceed claim by claim than to brush aside entire areas.
So, outright dismissal of interpretability research seems ill-considered. Skepticism is healthy, though, and it will pay to understand the driving forces behind this skepticism. Here I will try to articulate the skeptical positions I often encounter and provide my current responses to them.
## Interpretability cannot be achieved in any meaningful sense
This claim usually stems from the position that there is an inherent conflict between quality and interpretability, for neural networks in particular or as a fact about learning and complexity in general. It follows from this assumption that our best models will be uninterpretable. A similar position holds that, as systems become more complex, their faithful explanations lawfully become more complex as well, at such a rate that the explanations become useless to us. These positions (and some persuasive replies) are pr
... (truncated, 14 KB total)