Exploring Feature Interactions in Transformer LLMs Through Sparse Autoencoders
A Manifund grant project by Kunvar Thaman exploring how sparse autoencoders (SAEs) can reveal feature interactions and circuits in transformer LLMs, relevant to mechanistic interpretability research in AI safety.
Metadata
Importance: 42/100 · other · primary source
Summary
This project aims to use sparse autoencoders to map feature interactions and high-level behaviors in transformer language models. It proposes developing circuit search algorithms based on feature ablation, learning feature manifolds via unsupervised clustering, and connecting SAE dictionaries across different transformer components. The work seeks to advance mechanistic interpretability by making feature decompositions more predictive and scalable.
Key Points
- Proposes using SAE feature ablation to identify causal links between features across transformer layers, improving on existing circuit search methods like ACDC.
- Plans to learn feature manifolds via unsupervised clustering and cosine similarity to group related features and model end-to-end behaviors.
- Investigates how features evolve across transformer components (MLPs, attention heads, residual stream) and what metrics capture these relationships.
- Explores feature splitting within manifolds and the conditions under which features emerge or propagate through subsequent layers.
- Partially funded ($8,500 of a $15,000 goal), indicating community interest but incomplete support for this interpretability research.
Cached Content Preview
HTTP 200 · Fetched Apr 12, 2026 · 8 KB
Exploring feature interactions in transformer LLMs through sparse autoencoders | Manifund
Archived capture (Wayback Machine, Common Crawl collection):
http://web.archive.org/web/20250917212149/https://manifund.org/projects/exploring-feature-interactions-in-transformer-llms-through-sparse-autoencoders
Exploring feature interactions in transformer LLMs through sparse autoencoders
Technical AI safety
Kunvar Thaman
Active
Grant
$8,500 raised
$15,000 funding goal
Project summary
Sparse autoencoders (SAEs) are good at extracting distinct, largely monosemantic features from transformer language models. An effective dictionary of feature decompositions for a transformer component roughly sketches out the set of features that component has learned.
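As a purely illustrative picture of such a decomposition, the sketch below shows the standard SAE encode/decode step; the weights, shapes, and names are random stand-ins, not the project's actual setup:

```python
import torch

# Minimal sketch of an SAE feature decomposition. All weights here are
# random stand-ins for a trained dictionary; shapes are illustrative.
d_model, d_dict = 512, 4096  # dictionaries are typically several times wider than d_model

W_enc = torch.randn(d_model, d_dict) * 0.02
W_dec = torch.randn(d_dict, d_model) * 0.02
b_enc = torch.zeros(d_dict)
b_dec = torch.zeros(d_model)

def sae_decompose(acts: torch.Tensor) -> torch.Tensor:
    """Encode component activations [batch, d_model] into sparse feature
    activations [batch, d_dict]; each nonzero entry flags one learned,
    roughly monosemantic feature."""
    return torch.relu((acts - b_dec) @ W_enc + b_enc)

def sae_reconstruct(feats: torch.Tensor) -> torch.Tensor:
    """Map sparse feature activations back into activation space."""
    return feats @ W_dec + b_dec
```

The rows of `W_dec` act as the dictionary: each row is one feature's direction in the component's activation space.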
The goal of this project is to provide a more intuitive and structured way of understanding how features interact to form complex high-level behaviors in transformer language models. To do that, we want to define and explore the feature manifold, understand how features evolve across different transformer components, and use these feature decompositions to find interesting circuits.
What are this project's goals and how will you achieve them?
SAE Feature Decomposition for Circuit Search Algorithm Development
A major challenge in using sparse autoencoders for future interpretability work is turning feature decompositions into effective, predictive circuits. Existing algorithms, such as ACDC, are based on pruning computation graphs and do not scale easily to larger models.
Cunningham et al. demonstrate that causal links can be identified by modifying features in an earlier layer and observing the impact on features in subsequent layers.
A promising approach for a circuit search algorithm would be to observe changes in feature activations upon ablating features in a previous layer. We could focus on a subset of the input distribution to simplify the analysis and find more interpretable features.
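A minimal, self-contained sketch of how this ablate-and-measure loop could be scored is given below; the toy weights, the `layer_fn` stand-in for the intervening transformer block, and the `features_downstream` helper are all hypothetical, not the project's implementation:

```python
import torch

# Toy setup: an upstream SAE, a stand-in for the transformer computation
# between the two SAE sites, and a downstream SAE. All weights are random
# placeholders for trained ones.
torch.manual_seed(0)
d_model, d_dict = 64, 256
W_enc_up = torch.randn(d_model, d_dict) * 0.1
W_dec_up = torch.randn(d_dict, d_model) * 0.1
W_enc_dn = torch.randn(d_model, d_dict) * 0.1
layer_fn = torch.nn.Linear(d_model, d_model)  # stand-in for the block between SAEs

def features_downstream(acts_up: torch.Tensor, ablate: int | None = None) -> torch.Tensor:
    """Encode upstream activations, optionally zero-ablate one feature,
    reconstruct, run the intervening layer, and encode downstream features."""
    f_up = torch.relu(acts_up @ W_enc_up)
    if ablate is not None:
        f_up = f_up.clone()
        f_up[:, ablate] = 0.0  # the feature ablation
    acts_dn = layer_fn(f_up @ W_dec_up)
    return torch.relu(acts_dn @ W_enc_dn)

# Score candidate causal edges on a small, focused input subset: large
# changes in a downstream feature's activation when an upstream feature
# is ablated suggest an edge in the circuit graph.
acts = torch.randn(32, d_model)
base = features_downstream(acts)
effects = torch.stack([
    (base - features_downstream(acts, ablate=i)).abs().mean(0)
    for i in range(8)  # first 8 upstream features, for brevity
])
print(effects.topk(3, dim=1).indices)  # top-3 affected downstream features each
```

Restricting the inputs to a narrow slice of the distribution, as suggested above, keeps both the active feature set and the resulting edge list small enough to interpret.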
For modeling end-to-end behaviors, we would use an unsupervised learning algorithm to learn clusters of features and identify similar ones (i.e., learn feature manifolds). We would then use a similarity metric (such as cosine similarity) to group features and run ACDC over the resulting feature decompositions.
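One way this clustering step could look, assuming a trained decoder matrix whose rows are feature directions (the random `W_dec`, the 0.3 threshold, and all names below are illustrative choices only):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in dictionary: one row per feature, giving its direction in
# activation space. A trained SAE decoder would be used in practice.
rng = np.random.default_rng(0)
d_dict, d_model = 512, 64
W_dec = rng.normal(size=(d_dict, d_model))

# Pairwise cosine similarity between feature directions, turned into a distance.
unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos_dist = np.clip(1.0 - unit @ unit.T, 0.0, None)

# Unsupervised agglomerative clustering over the precomputed cosine
# distances; each resulting cluster is a candidate feature manifold.
labels = AgglomerativeClustering(
    n_clusters=None,
    metric="precomputed",
    linkage="average",
    distance_threshold=0.3,  # illustrative, not a tuned value
).fit_predict(cos_dist)

print(f"{labels.max() + 1} candidate feature manifolds")
```

Treating each cluster as a single node would then let an ACDC-style search run over a much smaller graph of grouped feature decompositions.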
Further, investigate how feature splitting occurs in the context of th
... (truncated, 8 KB total)
Resource ID: 504a433b2e834aa1