
Sparse Autoencoders (SAEs)


Sparse Autoencoders (SAEs) represent a breakthrough technique in mechanistic interpretability that addresses the fundamental challenge of neural network polysemanticity. In modern language models, individual neurons often respond to multiple unrelated concepts (e.g., a single neuron activating for both “the Golden Gate Bridge” and “requests for help”), making direct interpretation of neural activations extremely difficult. SAEs solve this by learning to decompose dense, polysemantic activations into sparse, monosemantic feature vectors where each dimension corresponds to a single interpretable concept.

The technique works by training an auxiliary neural network to reconstruct model activations through a bottleneck that encourages sparse representations. When trained on billions of activation samples, SAEs discover features that correspond to human-interpretable concepts ranging from concrete entities like “San Francisco” to abstract notions like “deception in political contexts.” Anthropic’s landmark 2024 work extracted over 34 million interpretable features from Claude 3 Sonnet, with automated evaluation finding that 90% of high-activating features have clear human-interpretable explanations.

For AI safety, SAEs offer a potentially transformative capability: the ability to directly detect safety-relevant cognition inside models. Researchers have identified features corresponding to lying, manipulation, security vulnerabilities, power-seeking behavior, and sycophancy. If SAE research scales successfully, it could provide the foundation for runtime monitoring systems that flag concerning internal states, deception detection during training, and verification that alignment techniques actually work at the mechanistic level.

| Dimension | Assessment | Evidence |
|---|---|---|
| Research Investment | High ($50-100M/yr) | Goodfire raised $50M Series A (April 2025); Anthropic, DeepMind, OpenAI dedicated teams |
| Feature Extraction Scale | 34M+ features | Anthropic 2024 extracted 34M from Claude 3 Sonnet with 90% interpretability scores |
| Model Coverage | 100B+ parameters | Works on Claude 3 Sonnet, GPT-4, Gemma 2/3 up to 27B; Goodfire released SAEs for DeepSeek R1 (671B) |
| Safety Impact | Low-Medium (current) | Promising, but DeepMind deprioritized after SAEs underperformed linear probes on downstream safety tasks |
| Capability Uplift | Neutral | Analysis tool only; does not improve model capabilities |
| Deception Detection | Partial | Safety-relevant features identified (lying, sycophancy, power-seeking); causal validation 75-85% success |
| SI Readiness | Unknown | Depends on whether interpretability scales to superintelligent systems; fundamental limits untested |
| Grade | B | High potential, strong feature discovery; unproven for operational safety applications |

The following table compares key research efforts in sparse autoencoder development across major AI labs:

| Organization | Publication | Model Target | Features Extracted | Key Findings | Scale/Cost |
|---|---|---|---|---|---|
| Anthropic | May 2024 | Claude 3 Sonnet | 34 million | 90% of high-activating features interpretable; safety-relevant features for deception, sycophancy identified | 8B activations; 16x expansion |
| OpenAI | June 2024 | GPT-4 | 16 million | Smooth scaling laws; k-sparse architecture eliminates dead latents; 10% compute-equivalence loss | 40B tokens trained |
| DeepMind | August 2024 | Gemma 2 (2B-27B) | 30+ million | JumpReLU architecture; open-source release for community research | 20 PiB activations; 15% of Gemma 2 9B training compute |
| DeepMind | December 2024 | Gemma 3 (270M-27B) | 1+ trillion params | Combines SAEs with transcoders; analyzes jailbreaks and chain-of-thought | 110 PB activation data |
| Goodfire | January 2025 | Llama 3.1 8B, Llama 3.3 70B | Not disclosed | State-of-the-art open-source SAEs; granular behavior steering | Open-source release |
| Goodfire | April 2025 | DeepSeek R1 (671B) | Not disclosed | First SAEs on a true reasoning model; qualitatively different from non-reasoning models | 671B parameter scale |
| EleutherAI | 2024 | GPT-2, open models | 1.5 million | Automated interpretation at $1,300 (Llama 3.1) vs $8,500 (Claude 3.5); open tools | 97% cost reduction vs prior methods |
| Organization | Estimated Investment | Team Size | Focus Areas | Key Products/Releases |
|---|---|---|---|---|
| Anthropic | $25-40M/year | 30-50 FTE | Scaling monosemanticity, circuit tracing, safety features | Claude SAEs, Attribution Graphs |
| Google DeepMind | $15-25M/year | 20-30 FTE | Open-source tools, benchmarking, architectural innovation | Gemma Scope 1 & 2 (110 PB data) |
| OpenAI | $10-15M/year | 10-20 FTE | Scaling laws, k-sparse architecture | GPT-4 SAEs, TopK methods |
| Goodfire | $50M raised (Series A) | 15-25 FTE | Commercial interpretability, open-source models | Ember platform, Llama/DeepSeek SAEs |
| Academic Sector | $10-20M/year | 30-50 FTE | Theoretical foundations, benchmarking, applications | MIT thesis work, InterpBench |
| Total Global | $75-150M/year | 150-200 FTE | | |
| Architecture | Year | Key Innovation | Trade-offs / Notes |
|---|---|---|---|
| Vanilla ReLU + L1 | 2023 | Original formulation | Dead latents; requires penalty tuning |
| Gated SAE | 2024 | Separate magnitude/selection paths | Better reconstruction-sparsity frontier |
| JumpReLU | 2024 | Threshold activation function | State-of-the-art for Gemma Scope |
| BatchTopK | 2024 | Directly set sparsity without penalty | Few dead latents; training stability |
| Transcoders | 2025 | Predict next-layer activations | Better for analyzing computations vs representations |
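The differences above largely come down to how sparsity is enforced at the feature layer. As a rough sketch (not drawn from any specific release), the three main activation schemes can be written as follows; the `k` and `threshold` values are illustrative placeholders, JumpReLU thresholds are learned per feature in practice, and BatchTopK applies the TopK idea across a batch rather than per sample:

```python
import torch

def relu_features(pre_acts: torch.Tensor) -> torch.Tensor:
    # Vanilla SAE: plain ReLU; sparsity comes from an L1 penalty in the loss.
    return torch.relu(pre_acts)

def topk_features(pre_acts: torch.Tensor, k: int = 32) -> torch.Tensor:
    # k-sparse / TopK SAE: keep only the k largest pre-activations per sample,
    # zeroing the rest, so sparsity is set directly rather than via a penalty.
    values, indices = torch.topk(pre_acts, k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    return sparse.scatter(-1, indices, torch.relu(values))

def jumprelu_features(pre_acts: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # JumpReLU: pass values through only where they exceed a threshold
    # (a fixed scalar here; learned per feature in real implementations).
    return pre_acts * (pre_acts > threshold)
```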

SAEs are encoder-decoder neural networks trained to reconstruct activation vectors through a sparse intermediate representation.


The key innovation is the sparsity constraint: during training, the encoder is penalized for activating too many features simultaneously. This forces the network to find a small set of highly relevant features for any given input, naturally leading to monosemantic representations where each feature captures a distinct concept.
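A minimal sketch of such a vanilla SAE in PyTorch, assuming pre-collected activation vectors of dimension `d_model`; the dimensions, expansion factor, and L1 coefficient are illustrative placeholders, not values from any published setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs model activations through a sparse, overcomplete feature layer."""

    def __init__(self, d_model: int = 4096, expansion: int = 16):
        super().__init__()
        d_features = d_model * expansion           # overcomplete feature dictionary
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 5e-4):
    # Reconstruction term keeps the SAE faithful to the model's activations;
    # the L1 term pushes most feature activations to zero for any given input.
    recon_loss = (recon - acts).pow(2).sum(dim=-1).mean()
    sparsity_loss = features.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity_loss
```

Published implementations add refinements omitted here, such as normalizing decoder weight columns, but the core objective is this reconstruction-plus-sparsity trade-off.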

The complete pipeline from training SAEs to safety applications proceeds through the following stages:

| Stage | Description | Computational Cost |
|---|---|---|
| Activation Collection | Record model activations on billions of tokens | Storage-intensive (110 PB for Gemma Scope 2) |
| SAE Training | Train encoder-decoder with L1 sparsity penalty | $1-10M compute for frontier models |
| Feature Analysis | Automated feature labeling by LLM interpreters | $1,300-8,500 per 1.5M features |
| Validation | Verify features have causal effects via steering | Medium compute; manual effort |
  • Expansion Factor: Ratio of SAE dictionary size to original activation dimensions (typically 4-64x)
  • Sparsity Penalty (L1): Strength of penalty on feature activation; higher values yield sparser but potentially less accurate reconstructions
  • Reconstruction Loss: How well SAE outputs match original activations; fundamental accuracy metric
  • Dead Features: Features that never activate; an indicator of training difficulties (tracked in the training sketch below)
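A sketch of how these hyperparameters interact during training, reusing the `SparseAutoencoder` and `sae_loss` sketched above; `activation_batches` stands in for a cached activation stream, and the learning rate, L1 coefficient, and dead-feature window are placeholders rather than values from any published setup:

```python
import torch

def train_sae(sae, activation_batches, l1_coeff=5e-4, lr=1e-4, dead_window=1000):
    """Train an SAE on a stream of cached activation batches, tracking dead features."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    d_features = sae.encoder.out_features
    steps_since_fired = torch.zeros(d_features)    # how long each feature has been silent

    for step, acts in enumerate(activation_batches):
        recon, features = sae(acts)
        loss = sae_loss(acts, recon, features, l1_coeff=l1_coeff)
        opt.zero_grad()
        loss.backward()
        opt.step()

        # A feature "fires" if it activates on any token in the batch; features that
        # stay silent for a long window are counted as dead.
        fired = (features > 0).any(dim=0)
        steps_since_fired = torch.where(
            fired, torch.zeros_like(steps_since_fired), steps_since_fired + 1
        )
        dead = (steps_since_fired > dead_window).sum().item()
        if step % 100 == 0:
            print(f"step {step}: loss={loss.item():.4f}, dead features={dead}")
```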

Anthropic’s Scaling Monosemanticity (May 2024)


The landmark result that demonstrated SAEs work at frontier model scale. This represented a major scaling milestone—eight months prior, Anthropic had only demonstrated SAEs on a small one-layer transformer, and it was unclear whether the method would scale to production models.

| Metric | Result | Context |
|---|---|---|
| Model | Claude 3 Sonnet | 3.0 version released March 2024 |
| Features Extracted | 1M, 4M, and 34M | Three SAE sizes tested |
| Automated Interpretability Score | 90% | High-activating features with clear explanations |
| Training Data | 8 billion residual-stream activations | From diverse text corpus |
| Expansion Factor | 83x to 2833x | Ratio of features to residual stream dimension |
| Average Features per Token | ≈300 | Sparse, from thousands of dense activations |

The resulting features exhibit remarkable abstraction: they are multilingual, multimodal, and generalize between concrete and abstract references. Critically, researchers found safety-relevant features including:

| Feature Category | Examples Found | Safety Relevance |
|---|---|---|
| Deception | Lying, dishonesty patterns | Direct alignment concern |
| Security | Code backdoors, vulnerabilities | Dual-use risk |
| Manipulation | Persuasion, bias injection | Influence operations |
| Power-seeking | Goal-directed behavior patterns | Instrumental convergence |
| Sycophancy | Agreement regardless of truth | Reward hacking indicator |

DeepMind’s Gemma Scope (August 2024) and Gemma Scope 2 (December 2024)


Gemma Scope represented the first major open-source SAE release, followed by the substantially larger Gemma Scope 2—described as the largest open-source interpretability release by an AI lab to date.

| Metric | Gemma Scope (Aug 2024) | Gemma Scope 2 (Dec 2024) |
|---|---|---|
| Model Coverage | Gemma 2 (2B, 9B, 27B) | Gemma 3 (270M to 27B) |
| Total Features | 30+ million | Comparable scale |
| Training Compute | ≈15% of Gemma 2 9B training | Not disclosed |
| Storage Requirements | 20 PiB activations | 110 PB activation data |
| Total Parameters | Hundreds of billions | 1+ trillion across all SAEs |
| Architecture | JumpReLU SAEs | SAEs + Transcoders |

Gemma Scope 2 introduces transcoders alongside SAEs, a key advance: rather than reconstructing a layer's current activations, transcoders predict the next layer's activations. This enables analysis of multi-step computations and behaviors such as chain-of-thought faithfulness. The release specifically targets safety-relevant capabilities: analyzing jailbreak mechanisms, understanding refusal behaviors, and evaluating reasoning faithfulness.
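To make the distinction concrete, a minimal sketch of the transcoder objective relative to the SAE objective, assuming paired input/output activations for a single layer have already been cached; the class name, dimensions, and expansion factor are illustrative, not taken from the Gemma Scope 2 release:

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Like an SAE, but trained to predict a layer's *output* from its input
    through a sparse feature layer, so features describe the computation itself."""

    def __init__(self, d_in: int, d_out: int, expansion: int = 16):
        super().__init__()
        d_features = d_in * expansion
        self.encoder = nn.Linear(d_in, d_features)
        self.decoder = nn.Linear(d_features, d_out)

    def forward(self, layer_input: torch.Tensor):
        features = torch.relu(self.encoder(layer_input))
        predicted_output = self.decoder(features)
        return predicted_output, features

# SAE objective:        || decode(encode(x)) - x ||^2             (reconstruct the same activations)
# Transcoder objective: || decode(encode(mlp_in)) - mlp_out ||^2  (predict the next computation step)
```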

2025 Developments: Expansion and Critical Assessment


2025 has seen both major expansion and critical reassessment of SAE research. Goodfire’s $50M Series A in April 2025 represented the largest dedicated interpretability investment to date, while DeepMind’s deprioritization announcement highlighted ongoing challenges.

| Development | Date | Significance |
|---|---|---|
| Goodfire SAEs for Llama 3.1/3.3 | January 2025 | First high-quality open-source SAEs for frontier Llama models |
| A Survey on Sparse Autoencoders | March 2025 | Comprehensive academic review of SAE methods, training, and evaluation |
| DeepMind deprioritization | March 2025 | SAEs underperformed linear probes on safety tasks; team shifted focus |
| Goodfire $50M Series A | April 2025 | Largest dedicated interpretability funding; Ember platform expansion |
| Goodfire DeepSeek R1 SAEs | April 2025 | First SAEs on 671B reasoning model; revealed qualitative differences |
| Protein language model SAEs | 2025 | SAE techniques extended to biological foundation models |

DeepMind’s March 2025 announcement that they were deprioritizing SAE research highlights important limitations:

| Finding | Implication | DeepMind Assessment |
|---|---|---|
| SAEs underperformed linear probes on harmful-intent detection | Simpler methods may suffice for some safety applications | "Linear probes are actually really good, cheap, and perform great" |
| Chat-specialized SAEs closed only ≈50% of the gap | Domain-specific training helps but doesn't solve the problem | Still worse than linear probes |
| Features may not be functionally important | Interpretable features ≠ causally relevant features | "Do not think SAEs will be a game-changer" |
| High training costs with diminishing returns | Compute-intensive with unclear safety ROI | Team shifted to model diffing, deception model organisms |

However, DeepMind noted SAEs remain “fairly helpful for debugging low quality datasets (noticing spurious correlations)” and left open the possibility of returning to SAE research “if there is significant progress on some of SAEs’ core issues.”

| Metric | Value | Source | Notes |
|---|---|---|---|
| Feature interpretability rate | 90% | Anthropic 2024 | High-activating features with clear explanations |
| Average features per token | ≈300 | Anthropic 2024 | Sparse representation from thousands of dense activations |
| Reconstruction loss (GPT-4 SAE) | 10% compute equivalent | OpenAI 2024 | Language-modeling loss increase when SAE substituted |
| Automated interpretation cost | $1,300-8,500 per 1.5M features | EleutherAI 2024 | Llama 3.1 vs Claude 3.5 Sonnet |
| Prior interpretation methods | ≈$200,000 per 1.5M features | EleutherAI 2024 | 97% cost reduction achieved |
| Storage requirements (Gemma Scope 2) | 110 PB | DeepMind 2024 | Largest open-source interpretability release |
| Dead latent rate (TopK architecture) | Near zero | OpenAI 2024 | vs significant dead features with ReLU+L1 |

SAE features enable direct intervention in model behavior by clamping or modifying specific feature activations during inference. This “steering” capability represents a practical application beyond interpretability:

| Steering Application | Method | Effectiveness | Trade-offs |
|---|---|---|---|
| Refusal steering | Amplify refusal-mediating features | Improves jailbreak resistance | Degrades general capabilities |
| SAE-Targeted Steering | Optimize steering vectors for specific effects | Outperforms baseline methods | Requires feature identification |
| Feature Guided Activation Additions | SAE-guided vector construction | Better coherence than CAA | More complex pipeline |
| Graph-regularized SAEs (GSAE) | Spectral vector bank with dual-gating | Selective, stable steering | Added architectural complexity |

Research has uncovered a fundamental tension: features mediating safety behaviors like refusal appear entangled with general capabilities. Steering for improved safety often degrades benchmark performance, suggesting safety-relevant features may not be cleanly separable from capability-relevant ones.
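As an illustration of the basic steering recipe (not any lab's exact method), one common approach adds a multiple of a chosen feature's decoder direction to the residual stream via a PyTorch forward hook. The module path, feature index, and scale below are hypothetical placeholders, and the hook assumes the hooked module returns the residual-stream tensor directly:

```python
import torch

def make_steering_hook(sae, feature_idx: int, scale: float = 8.0):
    """Return a forward hook that adds `scale` times one SAE feature's decoder
    direction to the hooked module's output (the residual stream)."""
    direction = sae.decoder.weight[:, feature_idx].detach()  # (d_model,) decoder column

    def hook(module, inputs, output):
        # Assumes `output` is the residual-stream tensor itself, not a tuple.
        return output + scale * direction                    # steer every token position
    return hook

# Hypothetical usage on a HuggingFace-style model (module path is a placeholder):
# layer = model.model.layers[20]
# handle = layer.register_forward_hook(make_steering_hook(sae, feature_idx=12345))
# ... run generation ...
# handle.remove()
```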

SAEs could enable direct detection of deceptive cognition by identifying when deception-related features activate:

| Application | Mechanism | Current Status |
|---|---|---|
| Runtime Monitoring | Flag when deception features activate during inference | Theoretical; not deployed |
| Training-time Detection | Identify deceptive patterns during fine-tuning | Experimental research |
| Alignment Verification | Confirm models have learned intended values | Early-stage research |
| Red-teaming Augmentation | Find adversarial prompts that activate concerning features | Growing usage |
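Since runtime monitoring is not yet deployed, the following is only a sketch of what a feature-based monitor could look like, assuming a set of safety-relevant feature indices has already been identified and validated; the indices, labels, and activation threshold are hypothetical:

```python
import torch

def monitor_activations(sae, acts: torch.Tensor, watch_features: dict, threshold: float = 4.0):
    """Flag tokens whose activations strongly express any watched SAE feature.

    acts:           (n_tokens, d_model) residual-stream activations
    watch_features: {feature_index: label}, e.g. {12345: "deception"}  (placeholders)
    """
    features = torch.relu(sae.encoder(acts))                  # (n_tokens, d_features)
    alerts = []
    for idx, label in watch_features.items():
        hot_tokens = (features[:, idx] > threshold).nonzero(as_tuple=True)[0]
        for t in hot_tokens.tolist():
            alerts.append({"token": t, "feature": label,
                           "activation": features[t, idx].item()})
    return alerts
```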

Even with successful SAE development, fundamental challenges remain:

  • Coverage: SAEs may not capture all safety-relevant features
  • Adversarial Robustness: Sophisticated models might learn to hide concerning cognition
  • Interpretation Accuracy: Human labels may not capture true feature meanings
  • Causal Relevance: Features that activate for deception may not cause deceptive outputs
| Organization | Focus | Key Contributions | Scale |
|---|---|---|---|
| Anthropic | Leading SAE development and scaling | Scaling Monosemanticity; Claude SAEs; original monosemanticity work | 34M features; frontier models |
| OpenAI | Scaling methodology | GPT-4 SAEs; TopK architecture; scaling laws research | 16M latents; 40B token training |
| DeepMind | Open-source tools and benchmarking | Gemma Scope 1 & 2; JumpReLU; negative results | 1T+ parameters; 110 PB data |
| Goodfire | Commercial interpretability, open-source SAEs | DeepSeek R1 SAEs; Llama SAEs; Ember platform | $50M Series A; 671B model scale |
| EleutherAI | Open-source interpretability | Automated interpretation; Delphi; Sparsify; cost reduction | 97% cost reduction |
| MATS Alumni | Foundational research | Original SAE paper; SAELens; community tools | Open-source ecosystem |
| Neuronpedia | Visualization and tooling | Feature explorer; 50M+ searchable latents; API access | 4+ TB of data hosted |
Arguments for increased investment:

  1. Unique Capability: SAEs may be necessary for detecting sophisticated deception that behavioral evals cannot catch
  2. No Capability Uplift: Pure safety research with minimal dual-use concerns
  3. Proven at Scale: Works on 100B+ parameter models, suggesting a path to frontier systems
  4. Foundation for Other Work: Enables representation engineering, activation steering, and monitoring
Arguments for caution:

  1. May Not Scale: Fundamental limits on interpretability possible
  2. Expensive: Significant compute and researcher time required
  3. Limited Safety Impact So Far: No operational safety applications despite years of research
  4. Alternative Approaches: Linear probes and behavioral methods may be more cost-effective
  5. False Confidence Risk: Partial interpretability might create false assurance
| Uncertainty | Current Evidence | Importance | Resolution Timeline |
|---|---|---|---|
| Causal relevance of features | Mixed; steering works but effects entangled | Critical for safety applications | 2025-2027 |
| Adversarial robustness | Untested; models could learn to evade feature detection | High for deployment | Unknown |
| Coverage completeness | Current SAEs capture a subset of model behavior | Medium; partial coverage may suffice | 2025-2026 |
| Scaling to superintelligent systems | No evidence; extrapolation uncertain | Very high for long-term safety | Depends on AI timeline |
| Transcoders vs SAEs | Early evidence favors transcoders for some applications | Medium; may be complementary | 2025 |
| Feature universality across models | Similar features found across architectures | Medium for transfer learning | 2025-2026 |
| Risk | Mechanism | Effectiveness |
|---|---|---|
| Deceptive Alignment | Detect deception-related features during inference | Medium-High (if scalable) |
| Scheming | Find evidence of strategic deception in internal representations | Medium-High |
| Mesa-Optimization | Identify mesa-objectives in model internals | Medium |
| Reward Hacking | Detect proxy optimization vs. true goals | Medium |
| Goal Misgeneralization | Understand learned goal representations | Medium |

Recommendation Level: INCREASE

SAEs represent one of the most promising technical approaches to the fundamental problem of understanding AI cognition. While current safety applications remain limited, the potential for detecting sophisticated deception justifies increased investment. The technique has no meaningful capability uplift, making it a safe area for expanded research.

Priority areas for additional investment:

  • Scaling to larger models and more comprehensive feature coverage
  • Developing robust automated evaluation methods
  • Building operational monitoring systems based on SAE features
  • Investigating adversarial robustness of SAE-based detection
  • Interpretability - Parent field; SAEs are a key technique within mechanistic interpretability
  • Probing - Alternative method using linear probes; DeepMind found probes outperform SAEs on some tasks
  • Representation Engineering - Uses SAE-discovered features for activation steering
  • AI Control - Complementary defense-in-depth approach if interpretability misses deception
| Source | Organization | Date | Key Contribution |
|---|---|---|---|
| Scaling Monosemanticity | Anthropic | May 2024 | 34M features from Claude 3 Sonnet; safety-relevant feature discovery |
| Extracting Concepts from GPT-4 | OpenAI | June 2024 | 16M latent SAE; scaling laws; TopK architecture |
| Scaling and Evaluating Sparse Autoencoders | OpenAI | June 2024 | Technical methodology paper; k-sparse autoencoders |
| Gemma Scope | DeepMind | August 2024 | Open-source SAEs for Gemma 2; JumpReLU architecture |
| Gemma Scope 2 | DeepMind | December 2024 | Largest open release; SAEs + transcoders for Gemma 3 |
| Sparse Autoencoders Find Highly Interpretable Features | Anthropic/MATS | September 2023 | Foundational SAE methodology paper |
| Source | Organization | Date | Key Finding |
|---|---|---|---|
| Negative Results for SAEs on Downstream Tasks | DeepMind | March 2025 | SAEs underperform linear probes; led to SAE research deprioritization |
| Open Source Automated Interpretability | EleutherAI | 2024 | 97% cost reduction for feature interpretation; open tools |
| Transcoders Beat Sparse Autoencoders | Various | January 2025 | Skip transcoders Pareto-dominate SAEs for interpretability |
| Tool | URL | Description |
|---|---|---|
| Neuronpedia | neuronpedia.org | Interactive SAE feature explorer; 50M+ searchable latents; live inference testing |
| SAELens | github.com/jbloomAus/SAELens | SAE training library; supports multiple architectures |
| TransformerLens | github.com/neelnanda-io/TransformerLens | Interpretability library with SAE integration |
| Delphi | github.com/EleutherAI/delphi | Automated feature interpretation pipeline |
| EleutherAI Sparsify | github.com/EleutherAI/sparsify | On-the-fly activation training without caching |