# Mechanistic Interpretability
Status: active · Parent area: Interpretability

Reverse-engineering neural networks to identify the circuits, features, and algorithms that explain model behavior.
Tags: function:assurance, scope:sub-field
## Organizations
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
## Key Papers & Resources
| Paper | Authors | Year | Status |
|---|---|---|---|
| Zoom In: An Introduction to Circuits | Olah et al. | 2020 | seminal |
| Scaling Monosemanticity | Anthropic | 2024 | seminal |
## Sub-Areas
| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | emerging | 0 | 0 |
| Sparse Autoencoders | Using sparse dictionary learning to extract interpretable features from model activations at scale. | active | 0 | 1 |
| Toy Models for Interpretability | Small, simplified model proxies that capture key deep-learning dynamics for interpretability research. | active | 0 | 0 |
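The sparse-dictionary-learning idea behind the Sparse Autoencoders row can be sketched in a few lines: an overcomplete ReLU encoder maps activations to sparse feature coefficients, a linear decoder reconstructs them, and an L1 penalty pushes most coefficients to zero. This is a minimal NumPy sketch on synthetic stand-in "activations" with toy dimensions and manual gradients; it is not the training setup of any particular published SAE.

```python
import numpy as np

# Toy dimensions and coefficients (hypothetical; real SAEs use transformer
# activations with much larger d_model and dictionary sizes).
rng = np.random.default_rng(0)
d_model, d_dict, n_samples = 16, 64, 512
l1_coeff, lr = 1e-3, 1e-2

X = rng.normal(size=(n_samples, d_model))            # stand-in activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

for step in range(200):
    f = np.maximum(X @ W_enc + b_enc, 0.0)           # sparse feature coefficients
    X_hat = f @ W_dec                                # linear reconstruction
    err = X_hat - X
    # Loss = reconstruction MSE + L1 sparsity penalty on the features.
    loss = (err ** 2).mean() + l1_coeff * np.abs(f).mean()
    # Manual gradients, matching the means taken in the loss above.
    g_Xhat = 2.0 * err / err.size
    g_Wdec = f.T @ g_Xhat
    g_f = (g_Xhat @ W_dec.T + l1_coeff * np.sign(f) / f.size) * (f > 0)
    W_dec -= lr * g_Wdec
    W_enc -= lr * (X.T @ g_f)
    b_enc -= lr * g_f.sum(axis=0)

sparsity = (f > 0).mean()
print(f"final loss {loss:.4f}, fraction of active features {sparsity:.2f}")
```

The overcomplete dictionary (`d_dict > d_model`) plus the L1 term is what makes individual decoder rows candidate "features": each reconstruction is a sparse combination of a few dictionary directions rather than a dense mix of all of them.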