
Sparse Autoencoders (SAEs)

Sparse autoencoders (SAEs) extract interpretable features from neural network activations by imposing sparsity constraints. Anthropic's 2024 "Scaling Monosemanticity" research extracted 34 million features from Claude 3 Sonnet with 90% interpretability scores, while Goodfire raised $50M in 2025 and released the first SAEs for the 671B-parameter DeepSeek R1 reasoning model.
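
To make the idea concrete, below is a minimal PyTorch sketch of an SAE: the encoder maps a model activation into a much larger, overcomplete feature space, a ReLU plus an L1 penalty keeps the feature activations sparse, and the decoder reconstructs the original activation. All class names, dimensions, and hyperparameters here are illustrative assumptions, not the published Anthropic or Goodfire implementations.

```python
# Minimal sparse autoencoder sketch (illustrative; names and
# hyperparameters are hypothetical, not from any published SAE).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: d_features is typically much
        # larger than d_model (e.g. 8-64x).
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; combined with
        # the L1 penalty below, most features stay at exactly zero.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage sketch: in practice x would be activations harvested from a
# fixed layer of the target model; random data stands in here.
sae = SparseAutoencoder(d_model=768, d_features=16384)
acts = torch.randn(32, 768)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```

After training, each column of the decoder weight matrix acts as a dictionary element, and a feature is "interpretable" to the extent that the inputs activating it share a coherent, human-describable property.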

Related Pages

Safety Research

Anthropic Core Views

Risks

Reward Hacking · Scheming · Goal Misgeneralization · Sycophancy · Mesa-Optimization

Analysis

Model Organisms of Misalignment · Capability-Alignment Race Model

Approaches

Probing / Linear Probes

Other

Interpretability · AI Control · Chris Olah · Dario Amodei

Organizations

OpenAI

Concepts

Alignment Interpretability Overview · Dense Transformers

Key Debates

AI Alignment Research Agendas · Technical AI Safety Research

Historical

Deep Learning Revolution Era · Mainstream Era

Tags

interpretability, feature-extraction, monosemanticity, neural-network-analysis, safety-tooling