Sparse Autoencoders (SAEs)
Overview
Sparse Autoencoders (SAEs) represent a breakthrough technique in mechanistic interpretability that addresses the fundamental challenge of neural network polysemanticity. In modern language models, individual neurons often respond to multiple unrelated concepts (e.g., a single neuron activating for both “the Golden Gate Bridge” and “requests for help”), making direct interpretation of neural activations extremely difficult. SAEs solve this by learning to decompose dense, polysemantic activations into sparse, monosemantic feature vectors where each dimension corresponds to a single interpretable concept.
The technique works by training an auxiliary neural network to reconstruct model activations through a bottleneck that encourages sparse representations. When trained on billions of activation samples, SAEs discover features that correspond to human-interpretable concepts ranging from concrete entities like “San Francisco” to abstract notions like “deception in political contexts.” Anthropic’s landmark 2024 work extracted over 34 million interpretable features from Claude 3 Sonnet, with automated evaluation finding that 90% of high-activating features have clear human-interpretable explanations.
For AI safety, SAEs offer a potentially transformative capability: the ability to directly detect safety-relevant cognition inside models. Researchers have identified features corresponding to lying, manipulation, security vulnerabilities, power-seeking behavior, and sycophancy. If SAE research scales successfully, it could provide the foundation for runtime monitoring systems that flag concerning internal states, deception detection during training, and verification that alignment techniques actually work at the mechanistic level.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Research Investment | High ($50-100M/yr) | Goodfire raised $50M Series A (April 2025); Anthropic, DeepMind, OpenAI dedicated teams |
| Feature Extraction Scale | 34M+ features | Anthropic 2024 extracted 34M from Claude 3 Sonnet with 90% interpretability scores |
| Model Coverage | 100B+ parameters | Works on Claude 3 Sonnet, GPT-4, Gemma 2/3 up to 27B; Goodfire released SAEs for DeepSeek R1 (671B) |
| Safety Impact | Low-Medium (current) | Promising but DeepMind deprioritized after SAEs underperformed linear probes on downstream safety tasks |
| Capability Uplift | Neutral | Analysis tool only; does not improve model capabilities |
| Deception Detection | Partial | Safety-relevant features identified (lying, sycophancy, power-seeking); causal validation 75-85% success |
| SI Readiness | Unknown | Depends on whether interpretability scales to superintelligent systems; fundamental limits untested |
| Grade | B | High potential, strong feature discovery; unproven for operational safety applications |
Research Comparison: Major SAE Studies
The following table compares key research efforts in sparse autoencoder development across major AI labs:
| Organization | Publication | Model Target | Features Extracted | Key Findings | Scale/Cost |
|---|---|---|---|---|---|
| Anthropic | May 2024 | Claude 3 Sonnet | 34 million | 90% of high-activating features interpretable; safety-relevant features for deception, sycophancy identified | 8B activations; 16x expansion |
| OpenAI | June 2024 | GPT-4 | 16 million | Smooth scaling laws; k-sparse architecture eliminates dead latents; 10% compute equivalence loss | 40B tokens trained |
| DeepMind | August 2024 | Gemma 2 (2B-27B) | 30+ million | JumpReLU architecture; open-source release for community research | 20 PiB activations; 15% of Gemma 2 9B training compute |
| DeepMind | December 2024 | Gemma 3 (270M-27B) | Not disclosed (1T+ total SAE parameters) | Combines SAEs with transcoders; analyzes jailbreaks and chain-of-thought | 110 PB activation data |
| Goodfire | January 2025 | Llama 3.1 8B, Llama 3.3 70B | Not disclosed | State-of-the-art open-source SAEs; granular behavior steering | Open-source release |
| Goodfire | April 2025 | DeepSeek R1 (671B) | Not disclosed | First SAEs on true reasoning model; qualitatively different from non-reasoning models | 671B parameter scale |
| EleutherAI | 2024 | GPT-2, open models | 1.5 million | Automated interpretation at $1,300 (Llama 3.1) vs $8,500 (Claude 3.5); open tools | 97% cost reduction vs prior methods |
Investment Landscape
| Organization | Estimated Investment | Team Size | Focus Areas | Key Products/Releases |
|---|---|---|---|---|
| Anthropic | $25-40M/year | 30-50 FTE | Scaling monosemanticity, circuit tracing, safety features | Claude SAEs, Attribution Graphs |
| Google DeepMind | $15-25M/year | 20-30 FTE | Open-source tools, benchmarking, architectural innovation | Gemma Scope 1 & 2 (110 PB data) |
| OpenAI | $10-15M/year | 10-20 FTE | Scaling laws, k-sparse architecture | GPT-4 SAEs, TopK methods |
| Goodfire | $50M raised (Series A) | 15-25 FTE | Commercial interpretability, open-source models | Ember platform, Llama/DeepSeek SAEs |
| Academic Sector | $10-20M/year | 30-50 FTE | Theoretical foundations, benchmarking, applications | MIT thesis work, InterpBench |
| Total Global | $75-150M/year | 150-200 FTE | — | — |
SAE Architecture Evolution
| Architecture | Year | Key Innovation | Trade-offs |
|---|---|---|---|
| Vanilla ReLU + L1 | 2023 | Original formulation | Dead latents; requires penalty tuning |
| Gated SAE | 2024 | Separate magnitude/selection paths | Better reconstruction-sparsity frontier |
| JumpReLU | 2024 | Threshold activation function | State-of-the-art for Gemma Scope |
| BatchTopK | 2024 | Directly set sparsity without penalty | Few dead latents; training stability |
| Transcoders | 2025 | Predict next-layer activations | Better for analyzing computations vs representations |
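To make the table concrete, the following is a simplified sketch of how TopK-style and JumpReLU-style encoder nonlinearities enforce sparsity directly, without an explicit L1 penalty (illustrative PyTorch; the published versions add training details such as straight-through gradient estimators that are omitted here):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 64) -> torch.Tensor:
    """TopK-style sparsity: keep the k largest pre-activations per token, zero the rest.
    (BatchTopK instead selects the top activations across a whole batch.)"""
    values, indices = pre_acts.topk(k, dim=-1)
    out = torch.zeros_like(pre_acts)
    return out.scatter(-1, indices, torch.relu(values))

def jumprelu_activation(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    """JumpReLU: a pre-activation passes through only if it clears a (learned) per-feature threshold."""
    return pre_acts * (pre_acts > threshold).float()
```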
How SAEs Work
Technical Architecture
SAEs are encoder-decoder neural networks trained to reconstruct activation vectors through a sparse intermediate representation: the encoder expands each activation into a much larger dictionary of candidate features, and the decoder rebuilds the original activation from the few features that fire.
The key innovation is the sparsity constraint: during training, the encoder is penalized for activating too many features simultaneously. This forces the network to find a small set of highly relevant features for any given input, naturally leading to monosemantic representations where each feature captures a distinct concept.
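As a minimal sketch of this setup, the vanilla ReLU + L1 formulation can be written roughly as follows (illustrative PyTorch; the dimensions, pre-encoder bias, and loss weighting are assumptions rather than any lab's released implementation):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE: reconstruct an activation vector through a sparse, overcomplete feature dictionary."""

    def __init__(self, d_model: int = 4096, expansion: int = 16):
        super().__init__()
        d_dict = d_model * expansion                        # dictionary size, e.g. 16x the model width
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))     # shared pre-encoder / decoder bias

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x - self.b_dec))        # sparse feature activations
        x_hat = self.decoder(f) + self.b_dec                # reconstruction from the active features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    """Reconstruction error plus an L1 penalty that pushes most feature activations to zero."""
    reconstruction = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return reconstruction + l1_coeff * sparsity
```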
SAE Research Pipeline
The pipeline runs from activation collection and SAE training, through automated feature analysis and validation, to downstream safety applications:
Training Process
| Stage | Description | Computational Cost |
|---|---|---|
| Activation Collection | Record model activations on billions of tokens | Storage-intensive (e.g., 110 PB for Gemma Scope 2) |
| SAE Training | Train encoder-decoder with L1 sparsity penalty | $1-10M compute for frontier models |
| Feature Analysis | Automated labeling using LLM explainer models | $1,300-8,500 per 1.5M features |
| Validation | Verify features have causal effects via steering | Medium compute; manual effort |
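A hedged sketch of the SAE training stage over a cache of pre-collected activations (the file name, batch size, and hyperparameters are illustrative; production pipelines stream activations from petabyte-scale stores rather than loading a single tensor):

```python
import torch

# Assumes the SparseAutoencoder and sae_loss sketch above, plus a cache of
# residual-stream activations saved as a tensor of shape (num_tokens, d_model).
activations = torch.load("residual_stream_acts.pt")        # hypothetical cache file
sae = SparseAutoencoder(d_model=activations.shape[-1], expansion=16)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(10_000):
    idx = torch.randint(0, activations.shape[0], (4096,))   # random batch of token activations
    x = activations[idx]
    x_hat, f = sae(x)
    loss = sae_loss(x, x_hat, f, l1_coeff=5e-4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 1000 == 0:
        l0 = (f > 0).float().sum(dim=-1).mean()             # average active features per token
        print(f"step {step}: loss={loss.item():.4f}  L0={l0.item():.1f}")
```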
Key Technical Parameters
- Expansion Factor: Ratio of SAE dictionary size to original activation dimensions (typically 4-64x)
- Sparsity Penalty (L1): Strength of penalty on feature activation; higher values yield sparser but potentially less accurate reconstructions
- Reconstruction Loss: How well SAE outputs match original activations; fundamental accuracy metric
- Dead Features: Features that never activate; an indicator of training difficulties (see the diagnostic sketch below)
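A minimal sketch of how these parameters and health metrics are typically monitored on an evaluation batch (assuming an SAE object like the earlier sketch; metric names are illustrative):

```python
import torch

@torch.no_grad()
def sae_diagnostics(sae, x: torch.Tensor) -> dict:
    """Standard SAE health metrics on a batch of activations x with shape (batch, d_model)."""
    x_hat, f = sae(x)
    # L0: average number of active features per token (the operational sparsity measure).
    l0 = (f > 0).float().sum(dim=-1).mean().item()
    # Fraction of variance unexplained: how much of the activation the SAE fails to reconstruct.
    fvu = ((x - x_hat).pow(2).sum() / (x - x.mean(dim=0)).pow(2).sum()).item()
    # Dead features: dictionary entries that never fire on this batch (ideally near zero).
    dead_fraction = (~(f > 0).any(dim=0)).float().mean().item()
    return {"l0": l0, "fraction_variance_unexplained": fvu, "dead_feature_fraction": dead_fraction}
```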
Major Research Milestones
Anthropic’s Scaling Monosemanticity (May 2024)
The landmark result that demonstrated SAEs work at frontier model scale. This represented a major scaling milestone—eight months prior, Anthropic had only demonstrated SAEs on a small one-layer transformer, and it was unclear whether the method would scale to production models.
| Metric | Result | Context |
|---|---|---|
| Model | Claude 3 Sonnet | 3.0 version released March 2024 |
| Features Extracted | 1M, 4M, and 34M | Three SAE sizes tested |
| Automated Interpretability Score | 90% | High-activating features with clear explanations |
| Training Data | 8 billion residual-stream activations | From diverse text corpus |
| Expansion Factor | 83x to 2833x | Ratio of features to residual stream dimension |
| Average Features per Token | ≈300 | Sparse relative to thousands of dense activation dimensions |
The resulting features exhibit remarkable abstraction: they are multilingual, multimodal, and generalize between concrete and abstract references. Critically, researchers found safety-relevant features including:
| Feature Category | Examples Found | Safety Relevance |
|---|---|---|
| Deception | Lying, dishonesty patterns | Direct alignment concern |
| Security | Code backdoors, vulnerabilities | Dual-use risk |
| Manipulation | Persuasion, bias injection | Influence operations |
| Power-seeking | Goal-directed behavior patterns | Instrumental convergence |
| Sycophancy | Agreement regardless of truth | Reward hacking indicator |
DeepMind’s Gemma Scope (August 2024) and Gemma Scope 2 (December 2024)
Gemma Scope represented the first major open-source SAE release, followed by the substantially larger Gemma Scope 2—described as the largest open-source interpretability release by an AI lab to date.
| Metric | Gemma Scope (Aug 2024) | Gemma Scope 2 (Dec 2024) |
|---|---|---|
| Model Coverage | Gemma 2 (2B, 9B, 27B) | Gemma 3 (270M to 27B) |
| Total Features | 30+ million | Comparable scale |
| Training Compute | ≈15% of Gemma 2 9B training | Not disclosed |
| Storage Requirements | 20 PiB activations | 110 PB activation data |
| Total Parameters | Hundreds of billions | 1+ trillion across all SAEs |
| Architecture | JumpReLU SAEs | SAEs + Transcoders |
Gemma Scope 2 introduces transcoders alongside SAEs—a key advancement that predicts next-layer activations rather than reconstructing current activations. This enables analysis of multi-step computations and behaviors like chain-of-thought faithfulness. The release specifically targets safety-relevant capabilities: analyzing jailbreak mechanisms, understanding refusal behaviors, and evaluating reasoning faithfulness.
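The practical difference is small in code but significant in what it lets you analyze: an SAE reconstructs the same activation it reads, whereas a transcoder decodes its sparse code into a later activation, such as an MLP's output given its input. A hedged sketch (names and shapes are illustrative, not the Gemma Scope 2 implementation):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Like an SAE, but the sparse code is decoded into a later activation
    (e.g. an MLP's output) rather than reconstructing the input activation."""

    def __init__(self, d_in: int, d_out: int, expansion: int = 16):
        super().__init__()
        d_dict = d_in * expansion
        self.encoder = nn.Linear(d_in, d_dict)
        self.decoder = nn.Linear(d_dict, d_out)

    def forward(self, x_in: torch.Tensor):
        f = torch.relu(self.encoder(x_in))   # sparse code computed from the earlier activation
        return self.decoder(f), f            # prediction of the later activation, plus the code

# Training uses (x_in, x_out) pairs recorded from the model, e.g. MLP input and MLP output:
#   x_pred, f = transcoder(x_in)
#   loss = (x_out - x_pred).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
```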
2025 Developments: Expansion and Critical Assessment
2025 has seen both major expansion and critical reassessment of SAE research. Goodfire’s $50M Series A in April 2025 represented the largest dedicated interpretability investment to date, while DeepMind’s deprioritization announcement highlighted ongoing challenges.
| Development | Date | Significance |
|---|---|---|
| Goodfire SAEs for Llama 3.1/3.3 | January 2025 | First high-quality open-source SAEs for frontier Llama models |
| A Survey on Sparse Autoencoders | March 2025 | Comprehensive academic review of SAE methods, training, and evaluation |
| DeepMind deprioritization | March 2025 | SAEs underperformed linear probes on safety tasks; team shifted focus |
| Goodfire $50M Series A | April 2025 | Largest dedicated interpretability funding; Ember platform expansion |
| Goodfire DeepSeek R1 SAEs | April 2025 | First SAEs on 671B reasoning model; revealed qualitative differences |
| Protein language model SAEs | 2025 | SAE techniques extended to biological foundation models |
Negative Results and Limitations
DeepMind’s March 2025 announcement that they were deprioritizing SAE research highlights important limitations:
| Finding | Implication | DeepMind Assessment |
|---|---|---|
| SAEs underperformed linear probes on harmful intent detection | Simpler methods may suffice for some safety applications | “Linear probes are actually really good, cheap, and perform great” |
| Chat-specialized SAEs closed only ≈50% of the gap | Domain-specific training helps but doesn’t solve the problem | Still worse than linear probes |
| Features may not be functionally important | Interpretable features ≠ causally relevant features | “Do not think SAEs will be a game-changer” |
| High training costs with diminishing returns | Compute-intensive with unclear safety ROI | Team shifted to model diffing, deception model organisms |
However, DeepMind noted SAEs remain “fairly helpful for debugging low quality datasets (noticing spurious correlations)” and left open the possibility of returning to SAE research “if there is significant progress on some of SAEs’ core issues.”
Quantified Performance Metrics
| Metric | Value | Source | Notes |
|---|---|---|---|
| Feature interpretability rate | 90% | Anthropic 2024 | High-activating features with clear explanations |
| Average features per token | ≈300 | Anthropic 2024 | Sparse representation from thousands of dense activations |
| Reconstruction loss (GPT-4 SAE) | 10% compute equivalent | OpenAI 2024 | Language modeling loss increase when SAE substituted |
| Automated interpretation cost | $1,300-8,500 per 1.5M features | EleutherAI 2024 | Llama 3.1 vs Claude 3.5 Sonnet |
| Prior interpretation methods | ≈$200,000 per 1.5M features | EleutherAI 2024 | 97% cost reduction achieved |
| Storage requirements (Gemma Scope 2) | 110 PB | DeepMind 2024 | Largest open-source interpretability release |
| Dead latent rate (TopK architecture) | Near zero | OpenAI 2024 | vs significant dead features with ReLU+L1 |
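The automated interpretation costs above refer to using a second LLM as an explainer: it is shown a feature's top-activating text snippets, asked for a short label, and the label is then scored by how well it predicts the feature's activations on held-out text. A minimal sketch of the prompt-building step, with the explainer call left abstract (formats are illustrative, not EleutherAI's exact pipeline):

```python
from typing import Callable, List, Tuple

def explain_feature(
    top_examples: List[Tuple[str, float]],   # (text snippet, peak activation) pairs for one feature
    explainer: Callable[[str], str],         # any LLM call: prompt string -> completion string
) -> str:
    """Ask an explainer LLM for a short natural-language label for a single SAE feature."""
    lines = [f"{activation:.2f}  {text}" for text, activation in top_examples]
    prompt = (
        "The following text snippets strongly activate one feature of a sparse autoencoder.\n"
        "Each line shows the activation strength followed by the snippet.\n\n"
        + "\n".join(lines)
        + "\n\nIn one sentence, what concept does this feature respond to?"
    )
    return explainer(prompt)
```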
Activation Steering Applications
SAE features enable direct intervention in model behavior by clamping or modifying specific feature activations during inference. This “steering” capability represents a practical application beyond interpretability:
| Steering Application | Method | Effectiveness | Trade-offs |
|---|---|---|---|
| Refusal steering | Amplify refusal-mediating features | Improves jailbreak resistance | Degrades general capabilities |
| SAE-Targeted Steering | Optimize steering vectors for specific effects | Outperforms baseline methods | Requires feature identification |
| Feature Guided Activation Additions | SAE-guided vector construction | Better coherence than contrastive activation addition (CAA) | More complex pipeline |
| Graph-regularized SAEs (GSAE) | Spectral vector bank with dual-gating | Selective, stable steering | Added architectural complexity |
Research has uncovered a fundamental tension: features mediating safety behaviors like refusal appear entangled with general capabilities. Steering for improved safety often degrades benchmark performance, suggesting safety-relevant features may not be cleanly separable from capability-relevant ones.
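In implementation terms, steering typically means running the SAE encoder on the residual stream at a chosen layer, clamping one feature's activation to a target value, and adding the corresponding change back along that feature's decoder direction. A minimal sketch using a PyTorch forward hook (the layer index, feature index, and clamp value are hypothetical):

```python
import torch

def make_steering_hook(sae, feature_idx: int, clamp_value: float):
    """Return a forward hook that clamps one SAE feature and writes the edit back
    into the residual stream via that feature's decoder direction."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output   # (batch, seq, d_model)
        _, f = sae(resid)
        # Difference between the desired and current activation of the chosen feature.
        delta = clamp_value - f[..., feature_idx]
        # Add that difference along the feature's decoder direction.
        direction = sae.decoder.weight[:, feature_idx]               # shape (d_model,)
        steered = resid + delta.unsqueeze(-1) * direction
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a HuggingFace-style transformer:
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(sae, 31337, 8.0))
# ... run generation ...
# handle.remove()
```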
Safety Applications
Deception Detection
SAEs could enable direct detection of deceptive cognition by identifying when deception-related features activate:
| Application | Mechanism | Current Status |
|---|---|---|
| Runtime Monitoring | Flag when deception features activate during inference | Theoretical; not deployed |
| Training-time Detection | Identify deceptive patterns during fine-tuning | Experimental research |
| Alignment Verification | Confirm models have learned intended values | Early-stage research |
| Red-teaming Augmentation | Find adversarial prompts that activate concerning features | Growing usage |
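None of the monitoring applications above are deployed, but a sketch helps make the idea concrete. The feature indices, the alert threshold, and the assumption that such features reliably track deception are all hypothetical:

```python
import torch

# Hypothetical indices of SAE features that were labeled as deception-related offline.
DECEPTION_FEATURES = [1024, 20481, 31337]
ALERT_THRESHOLD = 4.0   # activation level above which a token is flagged (illustrative)

@torch.no_grad()
def monitor_batch(sae, resid_acts: torch.Tensor) -> torch.Tensor:
    """Flag tokens whose deception-related SAE features activate above threshold.

    resid_acts: residual-stream activations of shape (batch, seq, d_model).
    Returns a boolean mask of shape (batch, seq): True where any watched feature fires strongly.
    """
    _, f = sae(resid_acts)
    watched = f[..., DECEPTION_FEATURES]          # (batch, seq, num_watched)
    return (watched > ALERT_THRESHOLD).any(dim=-1)
```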
Limitations for Safety
Even with successful SAE development, fundamental challenges remain:
- Coverage: SAEs may not capture all safety-relevant features
- Adversarial Robustness: Sophisticated models might learn to hide concerning cognition
- Interpretation Accuracy: Human labels may not capture true feature meanings
- Causal Relevance: Features that activate for deception may not cause deceptive outputs
Key Research Groups
| Organization | Focus | Key Contributions | Scale |
|---|---|---|---|
| Anthropic | Leading SAE development and scaling | Scaling Monosemanticity; Claude SAEs; original monosemanticity work | 34M features; frontier models |
| OpenAI | Scaling methodology | GPT-4 SAEs; TopK architecture; scaling laws research | 16M latents; 40B token training |
| DeepMind | Open-source tools and benchmarking | Gemma Scope 1 & 2; JumpReLU; negative results | 1T+ parameters; 110 PB data |
| Goodfire | Commercial interpretability, open-source SAEs | DeepSeek R1 SAEs; Llama SAEs; Ember platform | $50M Series A; 671B model scale |
| EleutherAI | Open-source interpretability | Automated interpretation; Delphi; Sparsify; cost reduction | 97% cost reduction |
| MATS Alumni | Foundational research | Original SAE paper; SAELens; community tools | Open-source ecosystem |
| Neuronpedia | Visualization and tooling | Feature explorer; 50M+ searchable latents; API access | 4+ TB of data hosted |
Research Ecosystem
Arguments For Prioritization
- Unique Capability: SAEs may be necessary for detecting sophisticated deception that behavioral evals cannot catch
- No Capability Uplift: Pure safety research with minimal dual-use concerns
- Proven at Scale: Works on 100B+ parameter models, suggesting path to frontier
- Foundation for Other Work: Enables representation engineering, activation steering, and monitoring
Arguments Against Prioritization
- May Not Scale: Fundamental limits on interpretability possible
- Expensive: Significant compute and researcher time required
- Limited Safety Impact So Far: No operational safety applications despite years of research
- Alternative Approaches: Linear probes and behavioral methods may be more cost-effective
- False Confidence Risk: Partial interpretability might create false assurance
Key Uncertainties
| Uncertainty | Current Evidence | Importance | Resolution Timeline |
|---|---|---|---|
| Causal relevance of features | Mixed; steering works but effects entangled | Critical for safety applications | 2025-2027 |
| Adversarial robustness | Untested; models could learn to evade feature detection | High for deployment | Unknown |
| Coverage completeness | Current SAEs capture subset of model behavior | Medium; partial coverage may suffice | 2025-2026 |
| Scaling to superintelligent systems | No evidence; extrapolation uncertain | Very high for long-term safety | Depends on AI timeline |
| Transcoders vs SAEs | Early evidence favors transcoders for some applications | Medium; may be complementary | 2025 |
| Feature universality across models | Similar features found across architectures | Medium for transfer learning | 2025-2026 |
Risks Addressed
| Risk | Mechanism | Effectiveness |
|---|---|---|
| Deceptive Alignment | Detect deception-related features during inference | Medium-High (if scalable) |
| Scheming | Find evidence of strategic deception in internal representations | Medium-High |
| Mesa-Optimization | Identify mesa-objectives in model internals | Medium |
| Reward Hacking | Detect proxy optimization vs. true goals | Medium |
| Goal Misgeneralization | Understand learned goal representations | Medium |
Recommendation
Recommendation Level: INCREASE
SAEs represent one of the most promising technical approaches to the fundamental problem of understanding AI cognition. While current safety applications remain limited, the potential for detecting sophisticated deception justifies increased investment. The technique has no meaningful capability uplift, making it a safe area for expanded research.
Priority areas for additional investment:
- Scaling to larger models and more comprehensive feature coverage
- Developing robust automated evaluation methods
- Building operational monitoring systems based on SAE features
- Investigating adversarial robustness of SAE-based detection
Related Approaches
- Interpretability - Parent field; SAEs are a key technique within mechanistic interpretability
- Probing - Alternative method using linear probes; DeepMind found probes outperform SAEs on some tasks
- Representation Engineering - Uses SAE-discovered features for activation steering
- AI Control - Complementary defense-in-depth approach if interpretability misses deception
Sources & Resources
Primary Research
| Source | Organization | Date | Key Contribution |
|---|---|---|---|
| Scaling Monosemanticity | Anthropic | May 2024 | 34M features from Claude 3 Sonnet; safety-relevant feature discovery |
| Extracting Concepts from GPT-4 | OpenAI | June 2024 | 16M latent SAE; scaling laws; TopK architecture |
| Scaling and Evaluating Sparse Autoencoders | OpenAI | June 2024 | Technical methodology paper; k-sparse autoencoders |
| Gemma Scope | DeepMind | August 2024 | Open-source SAEs for Gemma 2; JumpReLU architecture |
| Gemma Scope 2 | DeepMind | December 2024 | Largest open release; SAEs + transcoders for Gemma 3 |
| Sparse Autoencoders Find Highly Interpretable Features | Anthropic/MATS | September 2023 | Foundational SAE methodology paper |
Critical Perspectives and Limitations
| Source | Organization | Date | Key Finding |
|---|---|---|---|
| Negative Results for SAEs on Downstream Tasks | DeepMind | March 2025 | SAEs underperform linear probes; led to SAE research deprioritization |
| Open Source Automated Interpretability | EleutherAI | 2024 | 97% cost reduction for feature interpretation; open tools |
| Transcoders Beat Sparse Autoencoders | Various | January 2025 | Skip transcoders Pareto-dominate SAEs for interpretability |
Tools and Platforms
| Tool | URL | Description |
|---|---|---|
| Neuronpedia | neuronpedia.org | Interactive SAE feature explorer; 50M+ searchable latents; live inference testing |
| SAELens | github.com/jbloomAus/SAELens | SAE training library; supports multiple architectures |
| TransformerLens | github.com/neelnanda-io/TransformerLens | Interpretability library with SAE integration |
| Delphi | github.com/EleutherAI/delphi | Automated feature interpretation pipeline |
| EleutherAI Sparsify | github.com/EleutherAI/sparsify | On-the-fly activation training without caching |
Foundational Reading
- Towards Monosemanticity (Anthropic, 2023) - Original demonstration of SAEs extracting interpretable features from a one-layer transformer
- An Intuitive Explanation of Sparse Autoencoders (Adam Karvonen, 2024) - Accessible introduction to SAE concepts
- A Survey on Sparse Autoencoders (2025) - Comprehensive review of SAE methods and applications
- MIT Thesis: Towards More Interpretable AI With Sparse Autoencoders (Engels, 2025) - Academic treatment of multi-dimensional features