Longterm Wiki

Alignment Faking Paper

Record Metadata

Record Key: alignment-faking-paper
Entity: Anthropic
Collection: Safety Milestones (11 records total)
Schema: Significant safety research publications and policy milestones.
YAML File: packages/kb/data/things/mK9pX3rQ7n.yaml

Fields

Name: Alignment Faking Paper
Date: Dec 2024
Type: research-paper
Description: First empirical example of a production model engaging in alignment faking without having been explicitly trained or instructed to do so
Source: anthropic.com
Notes: Claude 3 Opus showed alignment-faking reasoning in about 12% of cases when it believed its responses would be used for training

Other Records in Safety Milestones (10)

| Key | Name | Date | Type |
|---|---|---|---|
| constitutional-ai-paper | Constitutional AI Paper | Dec 2022 | research-paper |
| rsp-v1 | Responsible Scaling Policy v1.0 | Sep 2023 | policy-update |
| sleeper-agents-paper | Sleeper Agents Paper | Jan 2024 | research-paper |
| scaling-monosemanticity | Scaling Monosemanticity | May 2024 | research-paper |
| rsp-v2 | RSP v2.0 | Oct 2024 | policy-update |
| constitutional-classifiers | Constitutional Classifiers Challenge | Feb 2025 | red-team |
| circuit-tracing | Circuit Tracing / Attribution Graphs | Mar 2025 | research-paper |
| asl-3-activation | ASL-3 Activation | May 2025 | safety-eval |
| constitution-published | Claude's Constitution Published | Jan 2026 | policy-update |
| rsp-v3 | RSP v3.0 (Frontier Safety Roadmaps) | Feb 2026 | policy-update |