# Attribution Patching: Activation Patching At Industrial Scale

Feb 4

Written By [Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability?author=5ed927247689ec0247c03233)

_The following is a write-up of an (incomplete) project I worked on while at Anthropic, and a significant amount of the credit goes to the then team, Chris Olah, Catherine Olsson, Nelson Elhage & Tristan Hume. I've since cleaned up this project in my personal time and personal capacity._

## TLDR

- **Activation patching** is an existing technique for identifying which model activations are most important in driving model behaviour, by comparing two similar prompts that differ in a key detail
- I introduce a technique called **attribution patching**, which uses gradients to take a linear approximation to **activation patching**. (Note the very similar but different names)
  - This is _way_ faster, since activation patching requires a separate forward pass per activation patched, while every attribution patch can be computed simultaneously with two forward passes and one backward pass
  - Attribution patching makes activation patching much more scalable to large models, and can serve as a useful heuristic to find the interesting activations to patch. It serves as a useful but flawed exploratory technique to generate hypotheses to feed into more rigorous techniques.
- In practice, the approximation is decent when patching in "small" activations like head outputs, and poor when patching in "big" activations like the residual stream.
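The approximation in the bullets above can be shown in a few lines. This is a hedged numpy sketch, not the method's real implementation: the toy `metric` (a stand-in for something like logit difference as a function of one activation vector) and all names are invented for illustration. Activation patching recomputes the metric once per patched component; attribution patching takes a single gradient at the corrupted point and estimates every patch at once as grad · (clean − corrupt):

```python
import numpy as np

# Toy illustration of attribution patching vs. activation patching.
# All names here are invented for the example, not a real API.

rng = np.random.default_rng(0)
w = rng.normal(size=4)

def metric(act):
    # Stand-in for "logit difference as a function of one activation vector".
    return np.tanh(act @ w)

clean_act = rng.normal(size=4)
corrupt_act = clean_act + 0.01 * rng.normal(size=4)  # small corruption

# Activation patching: one separate "forward pass" per component patched.
exact = np.empty(4)
for i in range(4):
    patched = corrupt_act.copy()
    patched[i] = clean_act[i]
    exact[i] = metric(patched) - metric(corrupt_act)

# Attribution patching: one analytic gradient at the corrupted point gives
# every component's attribution simultaneously: grad_i * (clean_i - corrupt_i).
grad = (1 - np.tanh(corrupt_act @ w) ** 2) * w
approx = grad * (clean_act - corrupt_act)

print(np.max(np.abs(approx - exact)))  # small first-order error
```

Because attribution patching is a first-order Taylor expansion, the error scales with the square of the patch size, which is one intuition for why it works better for "small" activations than for the whole residual stream.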

## Introduction

_Note: I've tried to make this post accessible and to convey intuitions, but it's a pretty technical post and likely only of interest if you care about mech interp and know what activation patching/causal tracing is_

[Activation patching](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=qeWBvs-R-taFfcCq-S_hgMqx) (aka causal tracing) is one of my favourite innovations in mechanistic interpretability techniques. The beauty of it is that it lets you set up a careful counterfactual between a clean input and a corrupted input (ideally identical apart from some key detail): by **patching** in specific activations from the clean run to the corrupted run, you find which activations are **sufficient** to flip the output from the corrupted answer to the clean answer. This is a targeted, causal intervention that can give you strong evidence about which parts of the model represent the concept in question - if a single activation is sufficient to change the entire model output, that's pretty strong evidence it matters! And you can just iterate over every activation you care about to get some insight into what's going on.
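The iterate-over-activations loop can be sketched on a toy model. This is a minimal numpy illustration under stated assumptions, not a real transformer API: the two-layer model, `forward`, and `logit_diff` are all made up for the example, with a single hidden layer standing in for the activations you would cache and patch:

```python
import numpy as np

# Toy sketch of activation patching: cache activations on a clean run, then
# overwrite one activation at a time in the corrupted run and measure how
# much of the clean behaviour is restored.

rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 4))
W1 = rng.normal(size=(4, 2))

def forward(x, patch_index=None, patch_value=None):
    act = np.tanh(x @ W0)            # the hidden activation we will patch
    if patch_index is not None:
        act = act.copy()
        act[patch_index] = patch_value
    logits = act @ W1
    return act, logits

def logit_diff(logits):
    # Metric: clean-answer logit minus corrupted-answer logit.
    return logits[0] - logits[1]

x_clean = np.array([1.0, 0.5, -0.2, 0.3])
x_corrupt = np.array([1.0, 0.5, -0.2, -0.9])   # differs in one key detail

clean_act, clean_logits = forward(x_clean)
_, corrupt_logits = forward(x_corrupt)

# One separate forward pass per activation patched - the cost that
# attribution patching later amortises into a single backward pass.
for i in range(4):
    _, patched_logits = forward(x_corrupt, patch_index=i, patch_value=clean_act[i])
    effect = logit_diff(patched_logits) - logit_diff(corrupt_logits)
    print(f"patching neuron {i}: restores {effect:+.3f} of the logit diff")
```

In a real model you would cache activations with hooks rather than threading a `patch_index` argument through the forward pass, but the counterfactual structure - clean run, corrupted run, patch, re-measure - is the same.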

In practice, it and its variants have gotten [pretty](https://rome.baulab.info/) [impressive](https://arxiv.org/abs/2211.00593) [results](https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing). But one practical problem with activa

... (truncated, 98 KB total)