Exploring Feature Interactions in Transformer LLMs Through Sparse Autoencoders
A Manifund grant project by Kunvar Thaman exploring how sparse autoencoders (SAEs) can reveal feature interactions and circuits in transformer LLMs, relevant to mechanistic interpretability research in AI safety.
Metadata
Importance: 42/100 · other · primary source
Summary
This project aims to use sparse autoencoders to map feature interactions and high-level behaviors in transformer language models. It proposes developing circuit search algorithms based on feature ablation, learning feature manifolds via unsupervised clustering, and connecting SAE dictionaries across different transformer components. The work seeks to advance mechanistic interpretability by making feature decompositions more predictive and scalable.
Key Points
- Proposes using SAE feature ablation to identify causal links between features across transformer layers, improving on existing circuit search methods like ACDC.
- Plans to learn feature manifolds via unsupervised clustering and cosine similarity to group related features and model end-to-end behaviors.
- Investigates how features evolve across transformer components (MLPs, attention heads, residual stream) and what metrics capture these relationships.
- Explores feature splitting within manifolds and the conditions under which features emerge or propagate through subsequent layers.
- Partially funded ($8,500 of a $15,000 goal), indicating community interest but incomplete support for this interpretability research.
Cached Content Preview
HTTP 200 · Fetched Apr 12, 2026 · 8 KB
Exploring feature interactions in transformer LLMs through sparse autoencoders | Manifund
Archived capture (Wayback Machine, Common Crawl collection):
http://web.archive.org/web/20250917212149/https://manifund.org/projects/exploring-feature-interactions-in-transformer-llms-through-sparse-autoencoders
Exploring feature interactions in transformer LLMs through sparse autoencoders
Technical AI safety
Kunvar Thaman
Active
Grant
$8,500 raised
$15,000 funding goal
Project summary
Sparse autoencoders (SAEs) are good at extracting distinct, largely monosemantic features from transformer language models. An effective dictionary of feature decompositions for a transformer component roughly sketches out the set of features that component has learned.
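As a purely illustrative picture of such a decomposition, the sketch below shows the standard SAE encode/decode step; the weights, shapes, and names are random stand-ins, not the project's actual setup:

```python
import torch

# Minimal sketch of an SAE feature decomposition. All weights here are
# random stand-ins for a trained dictionary; shapes are illustrative.
d_model, d_dict = 512, 4096  # dictionaries are typically several times wider than d_model

W_enc = torch.randn(d_model, d_dict) * 0.02
W_dec = torch.randn(d_dict, d_model) * 0.02
b_enc = torch.zeros(d_dict)
b_dec = torch.zeros(d_model)

def sae_decompose(acts: torch.Tensor) -> torch.Tensor:
    """Encode component activations [batch, d_model] into sparse feature
    activations [batch, d_dict]; each nonzero entry flags one learned,
    roughly monosemantic feature."""
    return torch.relu((acts - b_dec) @ W_enc + b_enc)

def sae_reconstruct(feats: torch.Tensor) -> torch.Tensor:
    """Map sparse feature activations back into activation space."""
    return feats @ W_dec + b_dec
```

The rows of `W_dec` act as the dictionary: each row is one feature's direction in the component's activation space.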
The goal of this project is to provide a more intuitive and structured way of understanding how features interact to form complex high-level behaviors in transformer language models. To do that, we want to define and explore the feature manifold, understand how features evolve across different transformer components, and use these feature decompositions to find interesting circuits.
What are this project's goals and how will you achieve them?
SAE Feature Decomposition for Circuit Search Algorithm Development
A major challenge in using sparse autoencoders for future interpretability work is turning feature decompositions into effective, predictive circuits. Existing algorithms, such as ACDC, are based on pruning computation graphs and do not scale easily to larger models.
Cunningham et al. demonstrate that causal links can be identified by modifying features in an earlier layer and observing the impact on features in subsequent layers.
A promising approach for a circuit search algorithm would be to observe changes in feature activations upon ablating features in a previous layer. We could focus on a subset of the input distribution to simplify the analysis and find more interpretable features.
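A minimal, self-contained sketch of how this ablate-and-measure loop could be scored is given below; the toy weights, the `layer_fn` stand-in for the intervening transformer block, and the `features_downstream` helper are all hypothetical, not the project's implementation:

```python
import torch

# Toy setup: an upstream SAE, a stand-in for the transformer computation
# between the two SAE sites, and a downstream SAE. All weights are random
# placeholders for trained ones.
torch.manual_seed(0)
d_model, d_dict = 64, 256
W_enc_up = torch.randn(d_model, d_dict) * 0.1
W_dec_up = torch.randn(d_dict, d_model) * 0.1
W_enc_dn = torch.randn(d_model, d_dict) * 0.1
layer_fn = torch.nn.Linear(d_model, d_model)  # stand-in for the block between SAEs

def features_downstream(acts_up: torch.Tensor, ablate: int | None = None) -> torch.Tensor:
    """Encode upstream activations, optionally zero-ablate one feature,
    reconstruct, run the intervening layer, and encode downstream features."""
    f_up = torch.relu(acts_up @ W_enc_up)
    if ablate is not None:
        f_up = f_up.clone()
        f_up[:, ablate] = 0.0  # the feature ablation
    acts_dn = layer_fn(f_up @ W_dec_up)
    return torch.relu(acts_dn @ W_enc_dn)

# Score candidate causal edges on a small, focused input subset: large
# changes in a downstream feature's activation when an upstream feature
# is ablated suggest an edge in the circuit graph.
acts = torch.randn(32, d_model)
base = features_downstream(acts)
effects = torch.stack([
    (base - features_downstream(acts, ablate=i)).abs().mean(0)
    for i in range(8)  # first 8 upstream features, for brevity
])
print(effects.topk(3, dim=1).indices)  # top-3 affected downstream features each
```

Restricting the inputs to a narrow slice of the distribution, as suggested above, keeps both the active feature set and the resulting edge list small enough to interpret.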
For modeling end-to-end behaviors, we would use an unsupervised learning algorithm to learn clusters of features and identify similar ones (i.e., learn feature manifolds). We would then use a similarity metric (such as cosine similarity) to group features and run ACDC over the resulting feature decompositions.
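One way this clustering step could look, assuming a trained decoder matrix whose rows are feature directions (the random `W_dec`, the 0.3 threshold, and all names below are illustrative choices only):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in dictionary: one row per feature, giving its direction in
# activation space. A trained SAE decoder would be used in practice.
rng = np.random.default_rng(0)
d_dict, d_model = 512, 64
W_dec = rng.normal(size=(d_dict, d_model))

# Pairwise cosine similarity between feature directions, turned into a distance.
unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos_dist = np.clip(1.0 - unit @ unit.T, 0.0, None)

# Unsupervised agglomerative clustering over the precomputed cosine
# distances; each resulting cluster is a candidate feature manifold.
labels = AgglomerativeClustering(
    n_clusters=None,
    metric="precomputed",
    linkage="average",
    distance_threshold=0.3,  # illustrative, not a tuned value
).fit_predict(cos_dist)

print(f"{labels.max() + 1} candidate feature manifolds")
```

Treating each cluster as a single node would then let an ACDC-style search run over a much smaller graph of grouped feature decompositions.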
Further, investigate how feature splitting occurs in the context of th
... (truncated, 8 KB total)
Resource ID: 504a433b2e834aa1