# Open-source automated interpretability

blog.eleuther.ai/autointerp/
Table of Contents
- [Background](https://blog.eleuther.ai/autointerp/#background)
- [Key Findings](https://blog.eleuther.ai/autointerp/#key-findings)
- [Generating Explanations](https://blog.eleuther.ai/autointerp/#generating-explanations)
- [Scoring explanations](https://blog.eleuther.ai/autointerp/#scoring-explanations)
- [Results](https://blog.eleuther.ai/autointerp/#results)
  - [Explainers](https://blog.eleuther.ai/autointerp/#explainers)
    - [How does the explainer model size affect explanation quality?](https://blog.eleuther.ai/autointerp/#how-does-the-explainer-model-size-affect-explanation-quality)
    - [Providing more information to the explainer](https://blog.eleuther.ai/autointerp/#providing-more-information-to-the-explainer)
    - [Giving the explainer different samples of top activating examples](https://blog.eleuther.ai/autointerp/#giving-the-explainer-different-samples-of-top-activating-examples)
    - [Visualizing activation distributions](https://blog.eleuther.ai/autointerp/#visualizing-activation-distributions)
  - [Scorers](https://blog.eleuther.ai/autointerp/#scorers)
    - [How do methods correlate with simulation?](https://blog.eleuther.ai/autointerp/#how-do-methods-correlate-with-simulation)
    - [How does scorer model size affect scores?](https://blog.eleuther.ai/autointerp/#how-does-scorer-model-size-affect-scores)
    - [How much more scalable is detection/fuzzing?](https://blog.eleuther.ai/autointerp/#how-much-more-scalable-is-detectionfuzzing)
- [Filtering with known heuristics](https://blog.eleuther.ai/autointerp/#filtering-with-known-heuristics)
  - [Positional Features](https://blog.eleuther.ai/autointerp/#positional-features)
  - [Unigram features](https://blog.eleuther.ai/autointerp/#unigram-features)
- [Sparse Feature Circuits](https://blog.eleuther.ai/autointerp/#sparse-feature-circuits)
- [Future Directions](https://blog.eleuther.ai/autointerp/#future-directions)
- [Appendix](https://blog.eleuther.ai/autointerp/#appendix)
## Background
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but labeling them by hand is intractable at scale. We investigate techniques for generating and scoring arbitrary text explanations of SAE features, and release an open-source library to support research on automatically interpreted features.
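The generate-then-score loop described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual API: the function names, prompts, and the simple detection-scoring scheme are all assumptions for exposition.

```python
# Hypothetical sketch of an auto-interp pipeline: an explainer LLM writes a
# text explanation from top-activating examples, then a scorer LLM is tested
# on whether the explanation predicts which texts activate the feature.


def explain_feature(llm, examples):
    """Ask an explainer LLM for a one-sentence explanation of a feature,
    given text snippets on which the feature activates strongly."""
    prompt = "These snippets all activate the same latent feature:\n"
    prompt += "\n".join(f"- {ex}" for ex in examples)
    prompt += "\nDescribe in one sentence what the feature detects."
    return llm(prompt)


def score_detection(llm, explanation, snippets, labels):
    """Detection-style scoring: ask a scorer LLM whether each snippet
    matches the explanation, then compare its yes/no answers against the
    ground-truth activation labels. Returns accuracy in [0, 1]."""
    correct = 0
    for snippet, activates in zip(snippets, labels):
        answer = llm(
            f"Explanation: {explanation}\nText: {snippet}\n"
            "Does this text match the explanation? Answer yes or no."
        )
        predicted = answer.strip().lower().startswith("yes")
        correct += predicted == activates
    return correct / len(snippets)
```

In practice the scorer sees a mix of genuinely activating snippets and random distractors, so chance accuracy is well-defined and explanations can be compared quantitatively.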
## Key Findings
- Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.
- Explanations found by LLMs are similar to explanations found by humans.
- Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k.
- Code can be found at [https://github.com/EleutherAI/sae-auto-interp](https://github.com/EleutherAI/sae-auto-interp)
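As a quick sanity check on the cost figures above, the per-feature costs and the savings factor work out as follows, assuming cost scales linearly with the number of features:

```python
# Back-of-the-envelope check of the quoted pipeline costs (USD).
n_features = 1_500_000

cost_llama = 1_300    # Llama 3.1 total, as quoted in the post
cost_claude = 8_500   # Claude 3.5 Sonnet total, as quoted
cost_prior = 200_000  # approximate cost of prior simulation-based methods

per_feature_llama = cost_llama / n_features    # under a tenth of a cent
per_feature_claude = cost_claude / n_features  # just over half a cent

# Rough savings factor of the new pipeline over prior methods.
savings = cost_prior / cost_llama              # roughly 150x cheaper
```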