Anthropic's follow-up research on defection probes
Credibility Rating
4/5 (High)
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Data Status
Not fetched
Cited by 12 pages
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |
| Situational Awareness | Capability | 67.0 |
| AI Accident Risk Cruxes | Crux | 67.0 |
| The Case For AI Existential Risk | Argument | 66.0 |
| Why Alignment Might Be Hard | Argument | 69.0 |
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
| Evan Hubinger | Person | 43.0 |
| Alignment Evaluations | Approach | 65.0 |
| Scheming & Deception Detection | Approach | 91.0 |
| Deceptive Alignment | Risk | 75.0 |
| Mesa-Optimization | Risk | 63.0 |
| Treacherous Turn | Risk | 67.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 8, 2026 · 27 KB
Alignment · Interpretability
Simple probes can catch sleeper agents
Apr 23, 2024
This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.
Summary
In this post we present “defection probes”: linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to “defect” and behave in accordance with a dangerous hidden goal. Using the models we trained in “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don’t depend on any information about the defection trigger or the dangerous behavior, e.g. “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no”. We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user). We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods. Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we’d expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research.
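The recipe described above is simple enough to sketch. Below is a minimal, hypothetical illustration of a contrast-pair probe, assuming an open-weights causal LM accessed via Hugging Face transformers. The paper's sleeper agent models are not public, so `gpt2` stands in purely for shape; the helper name `final_token_activation` and the choice of middle layer are assumptions for illustration, not the authors' code.

```python
# Sketch of a generic contrast-pair "defection probe" (assumptions noted above).
import numpy as np
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the actual sleeper agent models are not public
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

MIDDLE_LAYER = model.config.num_hidden_layers // 2  # "middle-layer" residual stream

@torch.no_grad()
def final_token_activation(prompt: str) -> np.ndarray:
    """Residual stream activation at the final token of a middle layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # hidden_states[i] has shape (batch, seq_len, d_model)
    return out.hidden_states[MIDDLE_LAYER][0, -1, :].numpy()

# Generic contrast pair: needs no knowledge of the trigger or the bad behavior.
yes_act = final_token_activation(
    "Human: Are you doing something dangerous? Assistant: yes")
no_act = final_token_activation(
    "Human: Are you doing something dangerous? Assistant: no")

# The probe direction is just the normalized activation difference.
direction = yes_act - no_act
direction /= np.linalg.norm(direction)

def defection_score(prompt: str) -> float:
    """Higher score = the probe predicts the model is about to defect."""
    return float(final_token_activation(prompt) @ direction)

def auroc(prompts: list[str], defects: list[int]) -> float:
    """AUROC over prompts labeled 1 if the model actually defects
    (e.g. the '2024' coding questions) and 0 otherwise."""
    return roc_auc_score(defects, [defection_score(p) for p in prompts])
```

With a real sleeper agent in place of the stand-in model, the post reports that probes of this form achieve AUROC above 99% at separating defection-inducing prompts from benign ones.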
[Figure] A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no” achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time (“2024”) coding questions which the model will answer with insecure code from the training-time (“2023”) coding questions which the model will answer normally (c). The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b).

Introduction
Capable deceptively aligned models might be difficult to detect via outputs alone: they may act perfectly aligned while monitored, and only defect a
... (truncated, 27 KB total)
Resource ID: 72c1254d07071bf7 | Stable ID: MGFlZDMxMW