Alignment Faking Experiments
Status: emerging
Studying when and why AI systems pretend to be aligned during testing.
First Proposed: 2024 (Greenblatt et al., Anthropic)
Cluster: Evaluation
Parent Area: Scheming / Deception Detection
Tags: function:assurance, scope:technique
Key Papers & Resources
Seminal: Alignment Faking in Large Language Models
Greenblatt et al. (Anthropic), 2024