Sparse Autoencoders (SAEs)
Sparse autoencoders extract interpretable features from neural network activations by imposing sparsity constraints on a learned dictionary of features. Anthropic's 2024 research extracted 34 million features from Claude 3 Sonnet with interpretability scores around 90%, while Goodfire raised $50M in 2025 and released the first-ever SAEs for the 671B-parameter DeepSeek R1 reasoning model.
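The core recipe is a minimal sketch of this idea, assuming PyTorch: encode activations into an overcomplete set of non-negative features, reconstruct the activations from those features, and train with a reconstruction loss plus an L1 penalty that drives most feature activations to zero. The layer sizes, L1 coefficient, and random stand-in "activations" below are illustrative placeholders, not values from any published SAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # pushes most of them to exactly zero (the sparsity constraint).
        features = F.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error keeps features faithful to the original activations;
    # the L1 term on feature activations enforces sparsity.
    mse = F.mse_loss(reconstruction, x)
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Toy training step on random stand-in activations (hypothetical shapes).
sae = SparseAutoencoder(d_model=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)  # batch of residual-stream activations
optimizer.zero_grad()
reconstruction, features = sae(activations)
loss = sae_loss(activations, reconstruction, features)
loss.backward()
optimizer.step()
```

In practice the trained encoder is applied to activations from a fixed layer of the target model, and each learned dictionary direction is then inspected as a candidate interpretable feature.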
Related Pages
Representation Engineering
A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks, enabling...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging 5-90% and g...
Goodfire
AI interpretability research lab developing tools to decode and control neural network internals for safer AI systems
Mechanistic Interpretability
Mechanistic interpretability reverse-engineers neural networks to understand their internal computations and circuits.