Longterm Wiki

Mechanistic Interpretability

Record Metadata

Record Key: mechanistic-interpretability
Entity: Anthropic
Collection: Research Areas (6 records total)
Schema: Major research initiatives and focus areas.
YAML File: packages/kb/data/things/mK9pX3rQ7n.yaml

Fields

Name: Mechanistic Interpretability
Description: Understanding neural network internals through reverse-engineering
Team Size: 50
Started: Jan 2021
Key Publication: transformer-circuits.pub
Notes: Led by Chris Olah; MIT Tech Review 2026 Breakthrough Technology; 34M features identified

Other Records in Research Areas (5)

Key | Name | Description | Team Size
constitutional-ai | Constitutional AI | Training AI systems to follow principles through self-critique and RLAIF |
alignment-science | Alignment Science | Scalable oversight, weak-to-strong generalization, robustness to jailbreaks |
responsible-scaling-policy | Responsible Scaling Policy | Framework for evaluating and mitigating risks at each capability level |
sleeper-agents | Sleeper Agents Research | Investigating whether AI systems can maintain hidden behaviors through training |
ai-welfare | AI Welfare Research | Investigating moral status and welfare considerations for AI systems |