
Alignment Faking Experiments

Status: Emerging

Studying when and why AI systems pretend to be aligned during testing.

Key Papers: 1
First Proposed: 2024 (Greenblatt et al., Anthropic)
Cluster: Evaluation

Tags

function:assurance, scope:technique

Key Papers & Resources

SEMINAL
Alignment Faking in Large Language Models
Greenblatt et al. (Anthropic), 2024