Longterm Wiki

Sleeper Agents Paper


Record Metadata

Record Key: sleeper-agents-paper
Entity: Anthropic
Collection: Safety Milestones (11 records total)
Schema: Significant safety research publications and policy milestones.
YAML File: packages/kb/data/things/mK9pX3rQ7n.yaml

Fields

Name: Sleeper Agents Paper
Date: Jan 2024
Type: research-paper
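Description: Showed that backdoored deceptive behaviors in LLMs can persist through standard safety training (supervised fine-tuning, reinforcement learning, and adversarial training)
Source: arxiv.org
Notes: Seminal paper on deceptive alignment. Backdoored models wrote secure code when the prompt stated the year was 2023 but inserted exploitable vulnerabilities when it stated the year was 2024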

Other Records in Safety Milestones (10)

Key | Name | Date | Type
constitutional-ai-paper | Constitutional AI Paper | Dec 2022 | research-paper
rsp-v1 | Responsible Scaling Policy v1.0 | Sep 2023 | policy-update
scaling-monosemanticity | Scaling Monosemanticity | May 2024 | research-paper
rsp-v2 | RSP v2.0 | Oct 2024 | policy-update
alignment-faking-paper | Alignment Faking Paper | Dec 2024 | research-paper
constitutional-classifiers | Constitutional Classifiers Challenge | Feb 2025 | red-team
circuit-tracing | Circuit Tracing / Attribution Graphs | Mar 2025 | research-paper
asl-3-activation | ASL-3 Activation | May 2025 | safety-eval
constitution-published | Claude's Constitution Published | Jan 2026 | policy-update
rsp-v3 | RSP v3.0 (Frontier Safety Roadmaps) | Feb 2026 | policy-update