Longterm Wiki

Sleeper Agents Research


Record Metadata

Record Key: sleeper-agents
Entity: Anthropic
Collection: Research Areas (6 records total)
Schema: Major research initiatives and focus areas.
YAML File: packages/kb/data/things/mK9pX3rQ7n.yaml

Fields

Name: Sleeper Agents Research
Description: Investigating whether AI systems can maintain hidden behaviors through training
Started: Jan 2024
Key Publication: arxiv.org
Notes: Seminal paper on deceptive alignment

Other Records in Research Areas (5)

Key | Name | Description | Team Size
mechanistic-interpretability | Mechanistic Interpretability | Understanding neural network internals through reverse-engineering | 50
constitutional-ai | Constitutional AI | Training AI systems to follow principles through self-critique and RLAIF |
alignment-science | Alignment Science | Scalable oversight, weak-to-strong generalization, robustness to jailbreaks |
responsible-scaling-policy | Responsible Scaling Policy | Framework for evaluating and mitigating risks at each capability level |
ai-welfare | AI Welfare Research | Investigating moral status and welfare considerations for AI systems |