Longterm Wiki

Probing / Linear Probes

Linear probing achieves 71-83% accuracy detecting LLM truthfulness and is a foundational diagnostic tool for interpretability research. While computationally cheap and widely adopted, probes are vulnerable to adversarial hiding and only detect linearly separable features, limiting their standalone s
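A minimal sketch of the idea, assuming synthetic Gaussian vectors as a stand-in for real model activations: a linear probe is a logistic-regression classifier fit on hidden states. The data, dimensions, and the helper names `train_linear_probe` and `probe_accuracy` are hypothetical, chosen for illustration only. The example does not reproduce the truthfulness accuracies quoted above; it only illustrates why a probe recovers a feature encoded along one direction but fails on a feature that is not linearly separable (here, an XOR of two coordinates).

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, steps=1000):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    preds = ((X @ w + b) > 0).astype(int)
    return float((preds == y).mean())

rng = np.random.default_rng(0)
n, d = 2000, 32
X = rng.normal(size=(n, d))  # assumption: synthetic stand-in for hidden activations

# Feature encoded along a single direction: linearly separable by construction.
direction = rng.normal(size=d)
y_linear = (X @ direction > 0).astype(int)

# XOR of two coordinates: present in the data, but not linearly separable.
y_xor = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

split = n // 2  # first half for training, second half held out
w, b = train_linear_probe(X[:split], y_linear[:split])
acc_linear = probe_accuracy(X[split:], y_linear[split:], w, b)

w, b = train_linear_probe(X[:split], y_xor[:split])
acc_xor = probe_accuracy(X[split:], y_xor[split:], w, b)

print(f"linear feature: {acc_linear:.2f}, XOR feature: {acc_xor:.2f}")
```

On real models, `X` would be replaced by activations extracted from a chosen layer and `y` by labels for the property being probed. Near-chance accuracy on the XOR labels shows that a failed probe does not imply the information is absent, only that it is not linearly readable.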

Related Pages

Approaches

Sleeper Agent Detection

Weak-to-Strong Generalization

Other

Interpretability

Key Debates

Is Interpretability Sufficient for Safety?