Interpretability (Overview)
Interpretability research aims to understand what AI systems are "thinking" and why they behave as they do.
Overview:
- Interpretability: The field and its importance for safety
Mechanistic Approaches:
- Mechanistic Interpretability: Reverse-engineering neural networks into understandable algorithms (see the activation-patching sketch after this list)
- Sparse Autoencoders: Learning interpretable features from model activations (sketched below)
- Probing: Testing whether specific knowledge or concepts are decodable from activations (sketched below)
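
A common entry point to reverse-engineering is activation patching: run the model on a clean and a corrupted input, splice the clean activation into the corrupted run, and check whether the output is restored. The sketch below is a minimal illustration on a hypothetical two-layer toy model; real work patches individual attention heads or residual-stream positions in a transformer.

```python
# Minimal activation-patching sketch. The toy model and random inputs
# are hypothetical stand-ins for a real transformer and prompt pair.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

clean_input = torch.randn(1, 8)
corrupt_input = torch.randn(1, 8)

# 1. Cache the hidden activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, but patch in the clean activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]  # returning a value replaces the module's output

handle = model[1].register_forward_hook(patch_hook)
patched_logits = model(corrupt_input)
handle.remove()

# If patching this site restores the clean output, the site causally
# mediates the behavior under study.
print((patched_logits - clean_logits).abs().max())
```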
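Sparse autoencoders decompose a layer's activations into a larger set of sparsely active features by training against a reconstruction objective with a sparsity penalty. The following is a minimal sketch: the sizes, optimizer, penalty coefficient, and random "activations" are all assumptions, and production SAE training differs in many details.

```python
# Minimal sparse-autoencoder sketch in PyTorch (sizes and hyperparameters
# are illustrative assumptions, not values from any particular paper).
import torch
import torch.nn as nn

d_model, d_hidden = 64, 256  # SAEs are usually overcomplete: d_hidden > d_model

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, d_model)  # stand-in for real model activations

for _ in range(100):
    recon, feats = sae(activations)
    # Reconstruction loss plus an L1 sparsity penalty on the features.
    loss = (recon - activations).pow(2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The overcomplete hidden layer together with the L1 term is what pushes each learned feature toward a sparse, more interpretable role.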
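A probe is typically just a small classifier trained on frozen activations: if it can predict a concept from a layer's hidden states, the concept is at least linearly decodable there. The sketch below uses synthetic activations with a planted concept direction; in practice the activations come from a real model and the labels from annotated inputs.

```python
# Minimal linear-probe sketch. The "activations" are synthetic stand-ins
# for hidden states extracted from a model layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical setup: 500 activation vectors, where a planted "concept
# direction" is added to the positive-class examples.
concept_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=500)
activations = rng.normal(size=(500, d_model)) + np.outer(labels, concept_direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, random_state=0
)

# A linear probe is a linear classifier on frozen activations; high
# held-out accuracy suggests the concept is linearly decodable.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

Note that high probe accuracy shows the concept is decodable, not that the model causally uses it; establishing that requires interventions such as the patching sketch above.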
Representation-Based:
- Representation Engineering: Controlling behavior via internal representations (see the steering sketch after this list)
- Circuit Breakers: Interrupting the internal representations that drive harmful outputs
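
Representation engineering identifies directions in activation space associated with a concept and then adds (or subtracts) them at inference time to steer behavior. Below is a minimal steering sketch: the toy layer, the contrastive "activations", and the scale factor are hypothetical stand-ins for a real model, real contrastive prompts, and a tuned coefficient.

```python
# Minimal activation-steering sketch: derive a direction from contrasting
# activations and add it at inference time via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
layer = nn.Linear(d_model, d_model)  # stand-in for one transformer block

# In practice the direction comes from activations on contrastive prompt
# pairs (e.g. honest vs. dishonest); here we fake them with random data.
pos_acts = torch.randn(32, d_model) + 1.0
neg_acts = torch.randn(32, d_model)
steering_vector = pos_acts.mean(0) - neg_acts.mean(0)
steering_vector = steering_vector / steering_vector.norm()

def steer(module, inputs, output):
    # The scale (here 4.0) trades steering strength against output
    # quality and is a tunable assumption.
    return output + 4.0 * steering_vector

handle = layer.register_forward_hook(steer)
steered = layer(torch.randn(1, d_model))  # output shifted along the direction
handle.remove()
```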