Mechanistic Interpretability

Record Metadata

Record Key	`mechanistic-interpretability`
Entity	Anthropic
Collection	Research Areas (6 records total)
Schema	Major research initiatives and focus areas.
YAML File	`packages/kb/data/things/mK9pX3rQ7n.yaml`

Fields

Name	Mechanistic Interpretability
Description	Understanding neural network internals through reverse-engineering
Team Size	50
Started	Jan 2021
Key Publication	transformer-circuits.pub
Notes	Led by Chris Olah; MIT Tech Review 2026 Breakthrough Technology; 34M features identified

Other Records in Research Areas (5)

Key	Name	Description	Team Size
constitutional-ai	Constitutional AI	Training AI systems to follow principles through self-critique and RLAIF	—
alignment-science	Alignment Science	Scalable oversight, weak-to-strong generalization, robustness to jailbreaks	—
responsible-scaling-policy	Responsible Scaling Policy	Framework for evaluating and mitigating risks at each capability level	—
sleeper-agents	Sleeper Agents Research	Investigating whether AI systems can maintain hidden behaviors through training	—
ai-welfare	AI Welfare Research	Investigating moral status and welfare considerations for AI systems	—
