Longterm Wiki

open-source automated interpretability

web

This EleutherAI blog post and associated codebase provide an open-source alternative to closed-lab automated interpretability pipelines, relevant for researchers studying how to understand the internal representations of large language models via sparse autoencoders.

Metadata

Importance: 62/100 · blog post · tool

Summary

EleutherAI introduces an open-source pipeline for automated interpretability of neural network features, particularly targeting sparse autoencoder (SAE) features. The project automates the process of generating natural language explanations for model internals and scoring their quality, making mechanistic interpretability research more scalable and accessible. It builds on prior work like OpenAI's automated interpretability but releases tooling publicly.

Key Points

  • Releases an open-source library for automated generation and evaluation of natural language explanations for SAE features in language models.
  • Automates the interpretability pipeline: detecting active features, generating explanations via LLMs, and scoring explanation quality.
  • Aims to democratize mechanistic interpretability research by providing scalable, reproducible tooling outside of closed labs.
  • Benchmarks explanation quality using detection and fuzzing scoring methods to assess whether explanations accurately capture feature behavior.
  • Supports integration with sparse autoencoders trained on open models, enabling community-driven interpretability research.
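
The overall loop described in these points is straightforward to sketch. Below is a minimal illustration of the generate-then-score pattern, assuming a generic call_llm helper and an Example container (both hypothetical, not sae-auto-interp's actual API); the real library differs in prompts, batching, and scoring details.

```python
# A minimal sketch of the generate-then-score loop described above.
# `call_llm` stands in for whatever completion API you use; the prompts and
# the Example container are illustrative assumptions, not sae-auto-interp's API.
from dataclasses import dataclass


@dataclass
class Example:
    text: str         # token sequence rendered as text
    activates: bool   # whether the SAE feature fires on this sequence


def call_llm(prompt: str) -> str:
    """Placeholder for a call to an explainer/scorer model."""
    raise NotImplementedError


def explain_feature(activating_texts: list[str]) -> str:
    """Ask an explainer LLM for a one-sentence description of a feature."""
    prompt = (
        "These snippets all activate the same latent feature of a language model:\n"
        + "\n".join(f"- {t}" for t in activating_texts)
        + "\nDescribe in one sentence what the feature responds to."
    )
    return call_llm(prompt).strip()


def detection_score(explanation: str, held_out: list[Example]) -> float:
    """Detection-style score: how often a scorer model, given only the
    explanation, correctly judges whether the feature fires on a snippet."""
    correct = 0
    for ex in held_out:
        answer = call_llm(
            f"Feature description: {explanation}\n"
            f"Text: {ex.text}\n"
            "Would this feature be active on the text? Answer YES or NO."
        )
        correct += answer.strip().upper().startswith("YES") == ex.activates
    return correct / len(held_out)
```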

Cited by 2 pages

Page | Type | Quality
Interpretability | Research Area | 66.0
Sparse Autoencoders (SAEs) | Approach | 91.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 28 KB
Open Source Automated Interpretability for Sparse Autoencoder Features | EleutherAI Blog

Table of Contents

  • Background
  • Key Findings
  • Generating Explanations
  • Scoring explanations
  • Results
    • Explainers
      • How does the explainer model size affect explanation quality?
      • Providing more information to the explainer
      • Giving the explainer different samples of top activating examples
      • Visualizing activation distributions
    • Scorers
      • How do methods correlate with simulation?
      • How does scorer model size affect scores?
      • How much more scalable is detection/fuzzing?
  • Filtering with known heuristics
    • Positional Features
    • Unigram features
  • Sparse Feature Circuits
  • Future Directions
  • Appendix


 Background

 Sparse autoencoders recover a diversity of interpretable, monosemantic features, but their sheer number makes manual labeling intractable. We investigate different techniques for generating and scoring arbitrary text explanations of SAE features, and release an open-source library to allow people to do research on auto-interpreted features.

 Key Findings

 
 
 Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.

 
 Explanations found by LLMs are similar to explanations found by humans.

 
 Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1,300 in API calls to Llama 3.1 or $8,500 with Claude 3.5 Sonnet. Prior methods cost ~$200k.
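
 Taken at face value, those totals work out to roughly the following per-feature costs (simple division of the quoted figures, rounded):

```python
# Back-of-the-envelope per-feature cost implied by the quoted totals.
n_features = 1_500_000

for label, total_usd in [
    ("Llama 3.1", 1_300),
    ("Claude 3.5 Sonnet", 8_500),
    ("prior methods", 200_000),
]:
    print(f"{label}: ${total_usd / n_features:.4f} per feature")
# Llama 3.1: $0.0009 per feature
# Claude 3.5 Sonnet: $0.0057 per feature
# prior methods: $0.1333 per feature
```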

 
 Code can be found at https://github.com/EleutherAI/sae-auto-interp.

 
 We built a small dashboard to explore explanations and their scores: https://cadentj.github.io/demo/ 

 
 Generating Explanations

 Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to generate explanations for activating text examples. Prior work prompts language models with token sequences that activate MLP neurons (Bills et al. 2023), showing the model a list of tokens followed by their respective activations, separated by a tab and listed one per line.
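
 In that earlier format, each prompt line pairs a token with its activation, for example (tab-separated; activation values here are made up for illustration):

 and	0
 he	0
 was	0
 over	7
 the	8
 moon	9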

 We instead highlight max-activating tokens in each example with a set of <<delimiters>>. Optionally, we choose a threshold of the example's max activation above which tokens are highlighted. This helps the model distinguish the important information for some densely activating features.

 Example 1:  and he was <<over the moon>> to find

Example 2:  we'll be laughing <<till the cows come home>>! Pro

Example 3:  thought Scotland was boring, but really there's more <<than meets the eye>>! I'd
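
 A minimal sketch of this delimiter-highlighting step, assuming per-token activations are available as floats (the function name, signature, and 0.6 default are illustrative, not the library's exact interface):

```python
# Wrap tokens whose activation exceeds a chosen fraction of the example's
# maximum in <<...>> delimiters, merging adjacent highlighted tokens.
# Tokens are assumed to carry their own leading spaces (BPE-style).
def highlight(tokens: list[str], activations: list[float], threshold: float = 0.6) -> str:
    cutoff = threshold * max(activations)
    out, run = [], []
    for tok, act in zip(tokens, activations):
        if act >= cutoff and act > 0:
            run.append(tok)           # extend the current highlighted span
        else:
            if run:
                out.append("<<" + "".join(run) + ">>")
                run = []
            out.append(tok)
    if run:
        out.append("<<" + "".join(run) + ">>")
    return "".join(out)

# e.g. tokens from " and he was over the moon to find" with high activations
# on " over", " the", " moon" render as " and he was<< over the moon>> to find"
```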
 
 We experiment with several methods for augmenting the explanation.

 Chain of thought improves general reasoning capabilities in language models. We few-shot the model with several examples of

... (truncated, 28 KB total)
Resource ID: daaf778f7ff52bc2 | Stable ID: sid_wY9SixBFcP