Longterm Wiki

Interpretability (status: active)

Umbrella field of research aimed at understanding how AI models work internally.

Organizations: 3
Risks Addressed: 3
Cluster: Interpretability

Tags

function:assurance, scope:field

Organizations (3)

| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Redwood Research | active |

Sub-Areas (7)

| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Activation Monitoring | Using probes on internal activations during inference to catch misaligned actions in real time. | emerging | 0 | 0 |
| Externalizing Reasoning | Training models to reason through extended, readable chains of thought rather than opaque internal computation. | emerging | 0 | 0 |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | emerging | 0 | 0 |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features. | active | 0 | 0 |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify the circuits, features, and algorithms that explain behavior. | active | 2 | 2 |
| Representation Engineering | Intervening on model representations to steer behavior (e.g., activation addition, representation reading). | active | 0 | 1 |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable. | emerging | 0 | 0 |
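To make the Linear Probing sub-area concrete, here is a minimal, self-contained sketch: a logistic-regression probe trained on synthetic activations in which a planted direction encodes a binary feature. Everything here (dimensions, the planted feature direction, the training setup) is an illustrative assumption; in practice the activations would be hidden states extracted from a real model.

```python
import numpy as np

# Hypothetical sketch of linear probing: synthetic "activations" stand in
# for a real model's hidden states. When the feature is present (label 1),
# activations are shifted along a fixed direction; the rest is noise.
rng = np.random.default_rng(0)

d = 32   # activation dimensionality (assumed)
n = 400  # number of examples (assumed)
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)

labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 4.0 * labels[:, None] * feature_dir

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train the linear probe (logistic regression) with plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(2000):
    p = sigmoid(acts @ w + b)
    w -= 0.1 * (acts.T @ (p - labels) / n)
    b -= 0.1 * float(np.mean(p - labels))

preds = (sigmoid(acts @ w + b) > 0.5).astype(int)
accuracy = float(np.mean(preds == labels))
print(f"probe accuracy: {accuracy:.2f}")
```

Because the probe is linear, its learned weight vector can itself be inspected: on this synthetic data it ends up aligned with the planted feature direction, which is the sense in which probes "detect features" in activation space.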