Interpretability
Status: active. Umbrella field of research aimed at understanding how AI models work internally.
Organizations: 3
Risks Addressed: 3
Cluster: Interpretability
Tags: function:assurance, scope:field
Organizations (3)
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Redwood Research | active |
Sub-Areas (7)
| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Activation Monitoring | Using probes on internal activations during inference to catch misaligned actions in real time. | emerging | 0 | 0 |
| Externalizing Reasoning | Training models to reason through extended, readable chains of thought rather than opaque internal computation. | emerging | 0 | 0 |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | emerging | 0 | 0 |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features. | active | 0 | 0 |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify the circuits, features, and algorithms that explain behavior. | active | 2 | 2 |
| Representation Engineering | Intervening on model representations to steer behavior (e.g., activation addition, representation reading). | active | 0 | 1 |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable. | emerging | 0 | 0 |
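To make the Linear Probing entry above concrete, here is a minimal sketch of training a linear probe: a logistic-regression classifier fit on activation vectors to detect a binary feature. The activations here are synthetic (the feature is planted along a fixed direction); a real probe would be trained on activations captured from a model's hidden layers, but the fitting step is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # activation dimensionality (hypothetical)
n = 400  # number of activation vectors

# Synthetic "activations": the target feature is encoded along one direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)  # 0/1 feature labels
acts = rng.normal(size=(n, d)) + 2.0 * labels[:, None] * direction

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(feature = 1)
    w -= 0.5 * (acts.T @ (p - labels)) / n     # gradient step on weights
    b -= 0.5 * np.mean(p - labels)             # gradient step on bias

preds = (acts @ w + b) > 0
accuracy = np.mean(preds == labels)
print(f"probe accuracy: {accuracy:.2f}")
```

The learned weight vector `w` doubles as an estimate of the feature's direction in activation space, which is what makes linear probes "lightweight": detection reduces to a single dot product per activation.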