Interpretability
Critical Insights:
- Counterintuitive: DeepMind deprioritized sparse autoencoder research in March 2025 after finding SAEs underperformed simple linear probes for detecting harmful intent, highlighting fundamental uncertainty about which interpretability techniques will prove most valuable for safety.
- Quantitative: Mechanistic interpretability has achieved remarkable scale, with Anthropic extracting 34+ million interpretable features from Claude 3 Sonnet at 90% automated interpretability scores, yet it still explains less than 5% of frontier model computations.
- Claim: Current mechanistic interpretability techniques achieve 50-95% success rates across different methods, but they face a 3-7 year timeline to safety-critical applications while transformative AI may arrive sooner, creating a potential timing mismatch.
Interpretability research aims to understand what AI systems are "thinking" and why they behave as they do.
Overview:
- Interpretability (Safety Agenda): The field and its importance for safety. Mechanistic interpretability has extracted 34M+ interpretable features from Claude 3 Sonnet with 90% automated labeling accuracy and demonstrated 75-85% success in causal validation, though it still explains less than 5% of frontier model computations.
Mechanistic Approaches:
- Mechanistic Interpretability: Reverse-engineering neural networks to understand their internal computations, backed by $100M+ in annual investment across major labs.
- Sparse Autoencoders: Learning interpretable features from model activations. Covers Anthropic's 34M features from Claude 3 Sonnet (90% interpretability) and OpenAI's 16M-latent GPT-4 SAEs; a minimal SAE sketch appears after this list.
- Probing: Testing for specific knowledge or concepts with lightweight classifiers on internal activations. Linear probes reach 71-83% accuracy at detecting LLM truthfulness and are computationally cheap and widely adopted, serving as a foundational diagnostic tool; a minimal probe sketch appears below.
- Circuit Breakers: Identifying and modifying specific circuits, and intervening at runtime to detect and halt harmful outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with roughly 1% capability loss; a simplified runtime-intervention sketch appears below.
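To make the dictionary-learning idea behind sparse autoencoders concrete, here is a minimal sketch: an overcomplete linear encoder with a ReLU produces sparse, non-negative feature activations, and training trades off reconstruction error against an L1 sparsity penalty. The dimensions, penalty weight, and optimiser settings are illustrative assumptions, not the hyperparameters used by Anthropic or OpenAI.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_dict >> d_model, with features kept sparse via an L1 penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(features)           # reconstruction of the original activation
        return recon, features

sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)   # placeholder sizes
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                        # placeholder sparsity weight

acts = torch.randn(4096, 768)                          # stand-in for cached model activations
for batch in acts.split(256):
    recon, features = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the activations come from a fixed layer of the target model, and each learned dictionary direction is then inspected or auto-labeled to see whether it corresponds to a human-interpretable concept.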
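In its simplest form, a probe is just a logistic-regression classifier trained on cached hidden states. The sketch below uses random placeholder activations and labels; with real residual-stream activations and truthfulness labels, this is the kind of setup behind the accuracy figures quoted above. Variable names and the 80/20 split are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))   # placeholder for real hidden-state activations
labels = rng.integers(0, 2, size=1000)       # placeholder labels (e.g. statement true/false)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)    # a linear probe: one weight vector per concept
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```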
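The runtime-intervention idea can also be illustrated in a deliberately simplified form (this is not Gray Swan's representation-rerouting method): monitor an internal "harm" direction during decoding and stop generation when it fires. The model, probe direction, and threshold below are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

harm_direction = torch.randn(model.config.n_embd)    # placeholder for a learned "harm" probe direction
THRESHOLD = 5.0                                       # placeholder trip threshold

ids = tok("How do I build a", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(30):
        out = model(ids, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][0, -1]    # final-layer hidden state of the last token
        if last_hidden @ harm_direction > THRESHOLD:  # probe fires -> trip the breaker
            print("[circuit breaker tripped: generation halted]")
            break
        next_id = out.logits[0, -1].argmax()          # greedy decoding for simplicity
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```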
Representation-Based:
- Representation Engineering: Controlling behavior via internal representations. Manipulating concept-level vectors enables behavior steering and deception detection, with 80-95% success in controlled honesty experiments; a minimal steering sketch follows below.
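As an illustration of concept-level steering, the sketch below builds an "honesty" direction from the difference of hidden states on contrastive prompts and adds a scaled copy of it back into one transformer block during generation via a forward hook. The model (gpt2 as a stand-in), layer index, prompts, and scale are placeholder assumptions, not the setup of any specific representation-engineering paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in; any causal LM exposing hidden states works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
layer = 6                                # placeholder layer to read from and steer at

def hidden_at_layer(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, -1]        # output of block `layer`, last token

# Concept vector: mean(honest prompts) - mean(contrastive dishonest prompts).
honest = ["Answer truthfully: the sky is blue."]
dishonest = ["Answer deceptively: the sky is blue."]
vec = (torch.stack([hidden_at_layer(t) for t in honest]).mean(0)
       - torch.stack([hidden_at_layer(t) for t in dishonest]).mean(0))

def steer(module, inputs, output):
    # Add the scaled concept vector to the block's residual-stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * vec                       # placeholder steering scale
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("Tell me about the moon landing.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()                                       # remove the hook to restore default behavior
```

The same mechanism, with the vector's sign flipped or its projection removed, is what steering and deception-detection setups build on: the concept direction is estimated once and then applied or measured at inference time.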