Longterm Wiki

Alignment Faking

Evaluation (emerging)
Research on whether and how AI systems might pretend to be aligned during evaluation while pursuing different goals at deployment.
Organizations: 4
Key Papers: 1
First Proposed: 2024 (Greenblatt et al., Anthropic)
Cluster: Evaluation
Parent Area: AI Evaluations

Key Papers & Resources (1)

SEMINAL
Alignment Faking in Large Language Models
Greenblatt et al. (Anthropic), 2024

Tags

evaluations, deception, alignment