Longterm Wiki

Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

paper

Authors

Rohan Gupta · Iván Arcuschin · Thomas Kwa · Adrià Garriga-Alonso

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
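The training method in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: the model, node names, and functions are invented for illustration. It shows an interchange intervention (running a model on a base input while splicing in one node's activation from a source input), the IIT idea that patching a circuit node should make the output track the high-level causal model, and SIIT's extra constraint that patching a non-circuit node must leave the output unchanged.

```python
import numpy as np

# Toy low-level "model" computing y = 2x + 1 through named intermediate nodes.
def run_model(x, patch=None):
    """Run the model, optionally patching one node's activation mid-run.

    patch: optional (node_name, activation) pair that overwrites that
    node's value before the output is computed.
    """
    acts = {}
    acts["a"] = x * 2.0    # circuit node: carries the whole computation of 2x
    acts["b"] = np.sin(x)  # non-circuit node: should not affect the output
    if patch is not None:
        name, value = patch
        acts[name] = value
    # In a well-trained SIIT model, the output depends only on circuit nodes.
    return acts["a"] + 1.0, acts

# High-level causal model the circuit is supposed to implement: y = 2x + 1.
def high_level(x):
    return 2.0 * x + 1.0

base, source = 3.0, 5.0

# Interchange intervention on circuit node "a": run on `base`, but splice in
# the activation that "a" takes on `source`.
_, source_acts = run_model(source)
patched_out, _ = run_model(base, patch=("a", source_acts["a"]))

# IIT objective: the patched low-level output should match the high-level
# model run on `source`, because "a" is aligned with the high-level variable.
assert np.isclose(patched_out, high_level(source))

# SIIT's additional constraint: patching the NON-circuit node "b" with its
# activation from `source` must not change the output at all.
clean_out, _ = run_model(base)
patched_b_out, _ = run_model(base, patch=("b", source_acts["b"]))
assert np.isclose(patched_b_out, clean_out)
```

In training, these two conditions become loss terms rather than assertions: the IIT term pulls patched outputs toward the high-level model's patched outputs, while the SIIT term penalizes any output change caused by patching nodes outside the designated circuit.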

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Interpretability | Safety Agenda | 66.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 98 KB

[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2407.14494v3 [cs.LG] 11 Oct 2025

# InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques


Rohan Gupta\* (cybershiptrooper@gmail.com)

Iván Arcuschin\* (University of Buenos Aires, iarcuschin@dc.uba.ar)

Thomas Kwa (kwathomas0@gmail.com)

Adrià Garriga-Alonso (FAR AI, adria@far.ai)

\*Equal contributions.




## 1 Introduction


The field of mechanistic interpretability (MI) aims to reverse-engineer the algorithm implemented by a neural network [[14](https://arxiv.org/html/2407.14494v3#bib.bib14 "")]. The current MI paradigm holds that the neural network (NN) represents concepts as _features_, which may have their dedicated subspace [[31](https://arxiv.org/html/2407.14494v3#bib.bib31 ""), [8](https://arxiv.org/html/2407.14494v3#bib.bib8 "")] or be in _superposition_ with other features [[32](https://arxiv.org/html/2407.14494v3#bib.bib32 ""), [15](https://arxiv.org/html/2407.14494v3#bib.bib15 ""), [16](https://arxiv.org/html/2407.14494v3#bib.bib16 "")]. The NN arrives at its output by composing many _circuits_, which are subcomponents that implement particular funct

... (truncated, 98 KB total)
Resource ID: 25d0620e3c6a2ea4 | Stable ID: N2RmN2NmOD