Sleeper Agent Detection

Evaluation (emerging)

Research on detecting and mitigating backdoors, trojans, and time-delayed deceptive behavior in AI systems.

Total Funding: $125K

First Proposed: 2024 (Hubinger et al., Anthropic)

Cluster: Evaluation

Grants (3)

Name	Recipient	Amount	Funder	Date
4-month stipend for 3 people to create demonstrations of provably undetectable backdoors	Andrew Gritsevskiy	$50K	Long-Term Future Fund (LTFF)	2024-01
Grant to "support prizes for a trojan detection competition at NeurIPS, which involves identifying whether a deep neural network will suddenly change behavior if certain unknown conditions are met."	Trojan Detection Challenge at NeurIPS 2022	$50K	FTX Future Fund	2022-05
ETH Zurich Foundation (USA): Machine Learning Research Support	ETH Zurich Foundation (USA)	$25K	Coefficient Giving	2023-11

Key Paper (seminal): Hubinger et al. (Anthropic), 2024