Probing / Linear Probes
Linear probing (training a simple linear classifier on a model's internal activations) achieves 71-83% accuracy at detecting LLM truthfulness and is a foundational diagnostic tool in interpretability research. While computationally cheap and widely adopted, probes are vulnerable to adversarial hiding and detect only linearly separable features, limiting their sufficiency as a standalone safety tool.
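To make the idea concrete, here is a minimal sketch of a linear probe, assuming synthetic "activations" in which a truth direction is linearly encoded; real probes would use frozen activations from a specific layer of an actual model, and the dimensions, learning rate, and variable names below are all illustrative.

```python
import numpy as np

# Synthetic stand-in for model activations: a single "truth direction"
# is linearly embedded in otherwise random vectors.
rng = np.random.default_rng(0)
d = 64                                  # activation dimensionality (hypothetical)
n = 2000                                # number of (statement, label) pairs
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

labels = rng.integers(0, 2, size=n)     # 1 = "true statement"
acts = rng.normal(size=(n, d))          # background noise
acts += np.outer(2 * labels - 1, truth_dir)  # embed the label linearly

# A linear probe is just logistic regression on the activations,
# trained here by plain gradient descent.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(true)
    w -= lr * (acts.T @ (p - labels)) / n
    b -= lr * np.mean(p - labels)

acc = np.mean(((acts @ w + b) > 0) == labels)
print(f"probe accuracy: {acc:.2f}")
```

Because the signal here is planted along a single direction, the probe recovers it easily; the limitation noted above is precisely that features encoded nonlinearly, or deliberately hidden, would not be separable by any such `w`.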
Related Pages
Sparse Autoencoders (SAEs)
Sparse autoencoders extract interpretable features from neural network activations using sparsity constraints.
Alignment Interpretability Overview
Understanding the internal workings of AI systems - from mechanistic interpretability to representation engineering.
Representation Engineering
A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks.
Mechanistic Interpretability
Mechanistic interpretability reverse-engineers neural networks to understand their internal computations and circuits.
Eliciting Latent Knowledge (ELK)
ELK is the unsolved problem of extracting an AI's true beliefs rather than human-approved outputs.