
Anthropic: Interpretability Info Sheet (PDF)

web

Data status: Not fetched
Cited by: 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Anthropic | Organization | 74.0 |

Cached Content Preview

HTTP 200 | Fetched Feb 25, 2026 | 5 KB

# Unlocking AI understanding: Advancing interpretability at Anthropic

A fundamental problem for AI safety is that nobody understands how large language models work. Think of it like the human brain—we know it’s capable of incredible feats, but neuroscientists are nowhere near fully cracking its code.

Anthropic’s Interpretability team pioneered the use of a method called “Dictionary Learning” that throws light on the inner workings of AI models. The method uncovers the way that the model represents different concepts—ideas like, say, “friendship”, “screwdrivers”, or “Paris”—within its neural network.
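
As a rough picture of what dictionary learning produces, the sketch below (plain NumPy, with made-up sizes and feature indices) treats an activation vector as a sparse mix of "feature" directions and shows that the active ones can be read back out. It is an illustration of the idea, not Anthropic's code.

```python
# Illustrative sketch (not Anthropic's actual code): dictionary learning treats
# an activation vector as a sparse combination of learned "feature" directions.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 256, 1024                    # far more features than dimensions
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)  # unit-norm feature directions

# Pretend only a handful of concepts ("friendship", "screwdrivers", "Paris", ...)
# are active at once: a sparse coefficient vector over the dictionary.
coeffs = np.zeros(n_features)
coeffs[[3, 87, 410]] = [1.5, 1.0, 0.8]

activation = coeffs @ dictionary                   # the dense vector the network actually holds

# Dictionary learning tries to recover sparse coefficients like `coeffs` from
# activations like `activation`, along with the dictionary itself. With a known
# dictionary, even a naive readout surfaces the active features:
scores = dictionary @ activation
print(np.argsort(-scores)[:3])                     # the truly active features come out on top
```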

Knowing how AIs organize concepts helps to make them more interpretable: we can, to some limited degree, work out what they’re “thinking”, which has big implications for how we use them for work and elsewhere. And as we’ll detail below, it might also help make them safer.

# Overcoming the Challenge of Superposition

One key obstacle to understanding AI models is the phenomenon of “superposition.” Unlike traditional computer programs where each component has a clear, singular purpose, the neurons inside AI models don’t correspond to individual concepts. Instead, information is distributed across the network in complex, overlapping patterns. In this respect it’s similar to the English alphabet: outside of exceptions like “I”, a single character doesn’t mean anything on its own; only in combination with other characters does it take on meaning. And with AI models, we don’t know how that alphabet fits together: even when we look inside the “black box” of an AI model, we don’t immediately understand what we’re seeing.
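
A toy way to see the problem, using random vectors in place of a real network (all numbers below are made up): when there are far more concepts than neurons, every concept's direction spills across many neurons, so inspecting one neuron at a time tells you little.

```python
# Illustrative sketch of superposition, using random vectors rather than a real
# model: with far more concepts than neurons, each concept is stored as a
# direction that overlaps many neurons, so no single neuron maps to one idea.
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_concepts = 32, 200
concept_dirs = rng.normal(size=(n_concepts, n_neurons))
concept_dirs /= np.linalg.norm(concept_dirs, axis=1, keepdims=True)

neuron_weights = concept_dirs[:, 7]                 # how much each concept loads on neuron 7
touched = int(np.sum(np.abs(neuron_weights) > 0.05))
print(f"neuron 7 participates in {touched} of {n_concepts} concept directions")
# Most concepts load on this neuron at least a little, so reading the network
# neuron-by-neuron tells us very little about which concept is active.
```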

To truly understand AI models, we need specialized methods to break down this superposition, much like neuroscientists use various techniques (MRI scans, EEG, and so on) to understand the human brain. This is where our Dictionary Learning technique comes in: it allows us to decipher the features that are represented inside a model. In the future, we might be able to manipulate these features—amplifying or dampening them—to change, in a very precise way, how the model behaves.
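
Anthropic's published dictionary-learning work uses sparse autoencoders trained on a model's internal activations; the miniature sketch below follows that general recipe, but the activations here are random stand-ins and the layer sizes, sparsity penalty, and steered feature index are all illustrative.

```python
# Minimal sparse-autoencoder sketch; random tensors stand in for real model
# activations, and all sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

d_model, n_features = 512, 4096            # dictionary much wider than the activation space

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))          # sparse, non-negative feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                      # sparsity penalty pushes toward one concept per feature

for step in range(200):                              # toy training loop
    acts = torch.randn(64, d_model)                  # stand-in for activations collected from the model
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Steering" a hypothetical feature: amplify it in feature space, then decode
# back to an edited activation that could, in principle, be patched into the model.
with torch.no_grad():
    acts = torch.randn(1, d_model)
    _, feats = sae(acts)
    feats[0, 123] *= 5.0                             # feature index 123 is arbitrary here
    steered_activation = sae.decoder(feats)
```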

# Mapping the mind of a large language model

Our research uncovered many millions of features that are represented inside Claude 3 Sonnet, from concrete objects to abstract concepts. For instance, in the figure below you can see a map of features that relate to the abstract idea of “inner conflict”: you can see how features that are more closely related in their meaning can be grouped together. You can also see how specific these features are: for example, the model understands the concepts of “hesitation detection” and “competing tradeoffs”.
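
One simple way to build such a map, sketched below with random stand-in directions, is to compare features by the similarity of their dictionary (decoder) directions and group the nearest ones together; the feature index and sizes are arbitrary.

```python
# Illustrative sketch: group features by the similarity of their decoder directions.
# The directions here are random stand-ins for a trained dictionary.
import numpy as np

rng = np.random.default_rng(2)
decoder_dirs = rng.normal(size=(1000, 512))
decoder_dirs /= np.linalg.norm(decoder_dirs, axis=1, keepdims=True)

sims = decoder_dirs @ decoder_dirs.T                 # cosine similarities between features
neighbours = np.argsort(-sims[42])[1:6]              # the five features closest to feature 42
print("nearest neighbours of feature 42:", neighbours)
```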

But just like in the human brain, the concepts can also be much more concrete. We found, for example, that there was a specific feature for the Golden Gate Bridge—it activated when users asked Claude about famous bridges in San Francisco, about red suspension bridges, or in response to many similar prompts.

As a de

... (truncated, 5 KB total)
Resource ID: c3f0e3c4b12ff103 | Stable ID: MDIxNjYyNj