Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces InterpBench, a benchmark of semi-synthetic transformers with known circuits for validating mechanistic interpretability methods, addressing a critical gap in evaluating neural network interpretability techniques essential for AI safety.
Paper Details
Metadata
Abstract
Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
Summary
This paper introduces InterpBench, a benchmark of semi-synthetic transformers with known internal circuits designed to evaluate mechanistic interpretability methods. The authors develop Strict Interchange Intervention Training (SIIT), an improved training technique that aligns neural network computations with specified causal models while preventing non-circuit components from influencing outputs. They demonstrate that SIIT can produce realistic transformers with known circuits—including complex ones like Indirect Object Identification—and use this benchmark to evaluate existing circuit discovery techniques, addressing a key validation challenge in mechanistic interpretability research.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta (cybershiptrooper@gmail.com)
Iván Arcuschin (University of Buenos Aires, iarcuschin@dc.uba.ar)
Thomas Kwa (kwathomas0@gmail.com)
Adrià Garriga-Alonso (FAR AI, adria@far.ai)
Rohan Gupta and Iván Arcuschin contributed equally.
Abstract
Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output.
We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr’s original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
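The interchange interventions at the heart of IIT and SIIT can be illustrated with a toy example. The sketch below is not the paper's implementation; the two-stage "network" and the high-level causal model are hypothetical stand-ins. The idea: run the model on a base input, but overwrite one intermediate activation with the activation computed from a separate source input, then check whether the output changes exactly as the high-level causal model predicts.

```python
# Toy two-stage "network": h = f(x), y = g(h).
# An interchange intervention runs the model on a base input but
# patches in the intermediate activation h from a source input,
# then compares the result against the high-level model's prediction.
# All functions and values here are illustrative, not from the paper.

def f(x):
    # Low-level component, assumed aligned with a high-level variable H = 2x.
    return x * 2.0

def g(h):
    # Downstream component producing the output Y = H + 1.
    return h + 1.0

def interchange(base_x, source_x):
    h_source = f(source_x)   # activation taken from the source run
    return g(h_source)       # patched into the base forward pass

# High-level causal model: Y = 2*X + 1. If the alignment holds, patching
# H from the source input makes the output depend only on source_x.
base, source = 3.0, 5.0
patched = interchange(base, source)
predicted = 2.0 * source + 1.0
consistent = (patched == predicted)
print(consistent)
```

IIT trains the network so that such interventions match the high-level model; SIIT additionally samples nodes *outside* the circuit and penalizes the network when interventions on them change the output.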
1 Introduction
The field of mechanistic interpretability (MI) aims to reverse-engineer the algorithm implemented by a neural network [14]. The current MI paradigm holds that the neural network (NN) represents concepts as features, which may have their dedicated subspace [31, 8] or be in superposition with other features [32, 15, 16]. The NN arrives at its output by composing many circuits, which are subcomponents that implement particular functions on the features [32, 9, 20].
To date, the field has been very successful at reverse-engineering toy models on simple tasks [30, 47, 10, 11, 7]. For larger models, researchers have discovered circuits that perform clearly defined subtasks [43, 22, 23, 27].
How confident can we be that the NNs implement the claimed circuits? The central piece of evidence for many circuit papers is causal consistency: if we intervene on the network's internal activations, does the circuit correctly predict changes in the output? There are several competing formalizations of consistency [10, 43, 20, 25] and many ways to ablate NNs, each yielding different results [35, 12, 46]. This problem is especially dire for automatic circuit discovery methods, which search for subgraphs with the highest consistency [21, 45] or faithfulness [12, 39] measurements.¹

¹ Faithfulness is a weaker form of consistency: if we ablate every part of the NN that is not part of the circuit, does the NN still perform the task?
... (truncated, 96 KB total)