Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
A Redwood Research technical contribution that attempts to bring rigor and quantitative evaluation to mechanistic interpretability, addressing the challenge of verifying whether a proposed circuit explanation truly captures a model's computation.
Forum Post Details
Metadata
Summary
Redwood Research introduces causal scrubbing, a principled algorithmic method for evaluating the quality of mechanistic interpretability hypotheses about neural networks. The approach works by replacing activations in a model with those from different inputs to test whether a proposed computational graph explanation is sufficient to account for model behavior. It provides a quantitative metric for how well an interpretability hypothesis explains a model's computations.
Key Points
- Causal scrubbing formalizes interpretability hypothesis testing by checking if a proposed causal graph can fully account for a model's input-output behavior.
- The method replaces internal activations with counterfactual values to test which components are causally relevant to model outputs.
- Provides a quantitative 'scrubbing loss' metric that measures how much model performance degrades when irrelevant activations are randomized.
- Addresses a key weakness in mechanistic interpretability: the lack of rigorous, falsifiable standards for evaluating circuit-level explanations.
- Applied to case studies including induction heads and indirect object identification circuits to validate the methodology.
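The resampling-ablation idea behind the scrubbing metric can be illustrated with a toy sketch. This is not Redwood's implementation; the two-feature "model", the activation functions, and the loss comparison are all invented for illustration. The point is only the shape of the test: resample the activations a hypothesis calls irrelevant from other inputs and measure how much the loss degrades relative to the unablated model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": output = a1 + a2, where activation a1 = 2*x0 carries the
# behavior and a2 = 0.1*x1 is (nearly) irrelevant noise.
def h1(X):
    return 2.0 * X[:, 0]

def h2(X):
    return 0.1 * X[:, 1]

def loss(pred, target):
    return float(np.mean((pred - target) ** 2))

X = rng.normal(size=(512, 2))
target = 2.0 * X[:, 0]  # the behavior we want the model to exhibit

# Baseline: run the model with its own activations.
base = loss(h1(X) + h2(X), target)

perm = rng.permutation(len(X))
# Correct hypothesis ("h2 is irrelevant"): resample h2 from other inputs.
scrub_ok = loss(h1(X) + h2(X[perm]), target)
# Wrong hypothesis ("h1 is irrelevant"): resample h1 instead.
scrub_bad = loss(h1(X[perm]) + h2(X), target)

# Scrubbing under the correct hypothesis barely moves the loss;
# scrubbing the genuinely relevant path degrades it badly.
```

Here `scrub_ok` stays close to `base` while `scrub_bad` blows up, which is the qualitative signature the scrubbing metric looks for: a good hypothesis licenses ablations that preserve behavior.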
Cached Content Preview
# Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
By LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck, Nate Thomas
Published: 2022-12-03
*\* Authors sorted alphabetically.*
Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via *behavior-preserving resampling ablations*. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced.
1 Introduction
==============
A question that all mechanistic interpretability work must answer is, “how well does this interpretation explain the phenomenon being studied?”. In the [many](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) [recent](https://rome.baulab.info/) [papers](https://arxiv.org/abs/2211.00593) [in mechanistic interpretability](https://openreview.net/forum?id=9XFSbDPmdW), researchers have generally relied on ad-hoc methods to evaluate the quality of interpretations.[^dtq6l2laqcp]
This *ad hoc* nature of existing evaluation methods poses a serious challenge for scaling up mechanistic interpretability. Currently, to evaluate the quality of a particular research result, we need to deeply understand both the interpretation and the phenomenon being explained, and then apply researcher judgment. Ideally, we’d like to find the interpretability equivalent of [property-based testing](https://en.wikipedia.org/wiki/Software_testing%23Property_testing)—automatically checking the correctness of interpretations, instead of relying on grit and researcher judgment. More systematic procedures would also help us scale up interpretability efforts to larger models, to behaviors with subtler effects, and to larger teams of researchers. To help with these efforts, we want a procedure that is both powerful enough to finely distinguish better interpretations from worse ones, and general enough to be applied to complex interpretations.
In this work, we propose **causal scrubbing**, a systematic ablation method for testing precisely stated hypotheses about how a particular neural network[^bwu0kfb3tw] implements a behavior on a dataset. Specifically, given an informal hypothesis about which parts of a model implement the intermediate calculations required for a behavior, we convert this to a formal correspondence between a computational graph for the model and a human-interpretable computational graph. Then, causal scrubbing starts from the output and recursively finds all of the invariances of parts of the neural network that are implied by the hypothesis, and then replaces the activations of the neural network with the *maximum entropy*[^g10ehlzhmhl] distribution subject to certain natur
... (truncated, 47 KB total)
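The constraint that drives the recursive resampling can also be sketched concretely. In the toy below (everything here is hypothetical and invented for illustration, not the post's formalism), a hypothesis assigns each model node a feature it supposedly computes; causal scrubbing may then splice in the activation from any *other* input that agrees with the original on that feature. If the hypothesis is exactly right, the scrubbed model's outputs match the unablated model's outputs on every input.

```python
import random

random.seed(0)

# Toy dataset of character pairs; behavior: output 1 iff x[0] == x[1].
data = [(a, b) for a in "abc" for b in "abc"]

# Toy model graph: node "left" reads x[0], node "right" reads x[1],
# and the output compares the two node activations.
def act_left(x):
    return x[0]

def act_right(x):
    return x[1]

def model_out(l, r):
    return int(l == r)

# Hypothesis: each node only matters via a single claimed feature.
features = {"left": lambda x: x[0], "right": lambda x: x[1]}

def scrubbed_output(x):
    # For each node, resample a source input that agrees with x on the
    # feature the hypothesis assigns to that node, then splice in the
    # source input's activation at that node.
    src_l = random.choice(
        [d for d in data if features["left"](d) == features["left"](x)])
    src_r = random.choice(
        [d for d in data if features["right"](d) == features["right"](x)])
    return model_out(act_left(src_l), act_right(src_r))

# Because this hypothesis is exactly right, scrubbing preserves behavior
# on every input in the dataset.
agree = all(
    scrubbed_output(x) == model_out(act_left(x), act_right(x)) for x in data)
```

A wrong hypothesis (say, claiming "right" is insensitive to `x[1]`) would permit resamples that break agreement, and the gap would show up as degraded behavior, which is what the scrubbing metric quantifies.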