Understanding SAE Features Using Sparse Feature Circuits

web

manifund.org·manifund.org/projects/understanding-sae-features-using-sp...

A Manifund grant project page for research combining sparse autoencoders (SAEs) with circuit discovery methods to better understand SAE features in language models, conducted at Oxford's Torr Vision Group — relevant to mechanistic interpretability in AI safety.

Metadata

Importance: 42/100 · other · primary source

Summary

This project, led by Lovis Heindrich at the University of Oxford, aims to create circuit-based explanations of sparse autoencoder (SAE) features to overcome limitations of existing activation-based explanation methods. The research uses circuit discovery techniques to better understand what causes SAE features to activate, including safety-relevant features. The project received $11,000 in funding for a 3-month full-time research effort.

Key Points

• Combines sparse autoencoders (SAEs) with circuit discovery methods to generate more precise, causally grounded feature explanations.
• Addresses limitations of current approaches (e.g., overly broad explanations, interpretability illusions) in SAE feature understanding.
• Conducted at Oxford's Torr Vision Group, mentored by Fazl Barez and Philip Torr, in collaboration with MIT-IBM Watson Lab.
• Researcher has prior MATS experience with Neel Nanda and published work on MLP circuits in Pythia-70M.
• Key risk: feature circuits may be too distributed or rely on uninterpretable features, limiting explanatory power.

Cached Content Preview

HTTP 200 · Fetched Apr 11, 2026 · 10 KB

Understanding SAE features using Sparse Feature Circuits | Manifund

 

 
 
 
 

The Wayback Machine - https://web.archive.org/web/20251229191004/https://manifund.org/projects/understanding-sae-features-using-sparse-feature-circuits

 


Understanding SAE features using Sparse Feature Circuits

Technical AI safety


Lovis Heindrich

Complete

Grant

$11,000 raised

Project summary

I, Lovis Heindrich, plan to research the use of sparse autoencoder circuits to better understand SAE features. The project will be carried out during a research visit to the Torr Vision Group at the University of Oxford, mentored by Fazl Barez and Prof Philip Torr, in collaboration with Veronika Thost (MIT-IBM Watson Lab). I am seeking $8000 to cover my salary for 3 months of full-time work on the project, plus $3000 for additional compute. If the compute budget is not fully used, the remainder will go toward conference fees.

What are this project's goals and how will you achieve them?

Recent work on SAEs [Anthropic 2024] has demonstrated that SAE feature discovery is feasible in larger models and has found safety-relevant features that are causally important for the models’ behavior. Understanding what causes these features to activate is an important open research question. Our project’s goal is to create circuit-based explanations of such SAE features. Current approaches [Anthropic 2024, OpenAI 2023] generate feature explanations from activating dataset examples, which can produce overly broad explanations or interpretability illusions [Bolukbasi et al. 2021]. We plan to make progress on this problem using circuit discovery methods [Syed et al. 2023, Marks et al. 2024, Dunefsky & Chlenski 2024]. We will explore several ways circuit-based explanations can improve our understanding of sparse autoencoder features and make them more useful.
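
To make the idea of a circuit-based explanation concrete, here is a minimal toy sketch in plain Python. It is not the project's method or any cited paper's implementation: all weights are random, the sizes are arbitrary, and the names (`w_enc_up`, `w_dec_up`, `layer`, `w_enc_down`, `feature_attributions`) are hypothetical. It assumes a purely linear path from upstream SAE features to one downstream feature's pre-activation, so that "activation times gradient" attribution matches the effect of zero-ablating each upstream feature. In a real language model the path is nonlinear and such attributions are only approximations.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, N_UP = 4, 6  # hypothetical toy sizes

# Hypothetical random weights, not taken from any trained model.
w_enc_up = random_matrix(N_UP, D_MODEL, rng)   # upstream SAE encoder
w_dec_up = random_matrix(D_MODEL, N_UP, rng)   # upstream SAE decoder
layer = random_matrix(D_MODEL, D_MODEL, rng)   # linear stand-in for the model between the SAEs
w_enc_down = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]  # encoder row of one downstream feature


def upstream_features(x):
    # Sparse, non-negative upstream feature activations (biases omitted).
    return relu(matvec(w_enc_up, x))


def downstream_preact(f_up):
    # Decode upstream features, apply the toy layer, read off the downstream feature.
    x_hat = matvec(w_dec_up, f_up)
    return dot(w_enc_down, matvec(layer, x_hat))


def feature_attributions(f_up):
    # Attribution of upstream feature i = activation_i * d(preact) / d(activation_i).
    attributions = []
    for i, act in enumerate(f_up):
        decoder_column = [row[i] for row in w_dec_up]
        grad = dot(w_enc_down, matvec(layer, decoder_column))
        attributions.append(act * grad)
    return attributions


x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
f_up = upstream_features(x)
attr = feature_attributions(f_up)

for i, (act, a) in enumerate(zip(f_up, attr)):
    print(f"upstream feature {i}: activation={act:.3f} attribution={a:.3f}")
```

The upstream features with the largest absolute attribution would form the candidate "circuit" explaining why the downstream feature fires on this input, in contrast to an explanation inferred only from which dataset examples activate it.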

How will this funding be used?

$8000 will cover my salary to work on the project full-time for 3 months. The remaining $3000 will be used for compute and/or conference fees.

Who is on your team and what's your track record on similar projects?

Lovis Heindrich: I’m a past MATS scholar, where I worked with Neel Nanda, and I have published relevant work analyzing MLP circuits in Pythia-70M. Additionally, I have experience training and evaluating sparse autoencoder

... (truncated, 10 KB total)

Resource ID: b0d83a5cb7e838ee | Stable ID: sid_fKViU9O6Uq