Longterm Wiki

Sleeper Agent Detection

Evaluation | active

Detecting planted backdoors and conditional misbehavior in trained models.

Key Papers: 1
First Proposed: 2024 (Hubinger et al., Anthropic)
Cluster: Evaluation
Parent Area: AI Evaluations

Tags

function:assurance
scope:technique

Key Papers & Resources (1)