
Alignment Faking Experiments

Status: Emerging

Studying when and why AI systems pretend to be aligned during testing.

Key Papers: 1
First Proposed: 2024 (Greenblatt et al., Anthropic)
Cluster: Evaluation

Tags

function:assurance, scope:technique

Key Papers & Resources

SEMINAL
Alignment Faking in Large Language Models
Greenblatt et al. (Anthropic), 2024