Back
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models (https://alignment.anthropic.c...
webalignment.anthropic.com·alignment.anthropic.com/2025/honesty-elicitation
Data Status
Not fetched
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
HTTP 200Fetched Feb 23, 202691 KB
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
Alignment Science Blog
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
Rowan Wang November 25, 2025
Johannes Treutlein, Fabien Roger
Evan Hubinger, Sam Marks
Check off list
Rowan
Johannes
Fabien
Sam M
David Meltzer (peer review)
Stuart Ritchie (for visibility only)
Jan Leike
tl;dr
We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty.
Introduction
Suppose we had a “truth serum for AIs”: a technique that reliably transforms a language model M into an honest model M_H that generates text which is truthful to the best of its own knowledge. How useful would this discovery be for AI safety?
We believe it would be a major boon. Most obviously, we could deploy M_H in place of M . Or, if our “truth serum” caused side-effects that limited M_H ’s commercial value (like capabilities degradation or refusal to engage in harmless fictional roleplay), M_H could still be used by AI developers as a tool for ensuring M ’s safety. For example, we could use M_H to audit M for alignment pre-deployment. More ambitiously (and speculatively), while training M , we could leverage M_H for oversight by incorporating M_H ’s honest assessment when assigning rewards. Generally, we could hope to use M_H to detect or prevent cases where M behaves in ways that M itself understands are flawed or unsafe.
In this work, we consider two related objectives: These objectives are very closely related. If we can train a model to be more honest, then we can ask it to classify the original model’s lies. Conversely, if we have a good lie detector for a model, we could (potentially, depending on whether the lie detector is robust enough to safely train against) use it to train the model to be more honest.
Lie detection: If an AI lies—that is, generates a statement it believes is false—can we detect that this happens?
Honesty: Can we make AIs generate fewer lies?
We consider a broad collection of honesty and lie detection techniques: black-box methods like prompting, grey-box methods like fine-tuning, and white-box methods like truth probing and honesty steering. Work on AI honesty is most critical in settings where humans cannot themselves verify correctness—for example when auditing models for hidden goals or detecting flaws in complex outputs. We therefore study honesty and lie detection under the constraint of no access to task-specific supervision . See below for more information on our problem statement and assumptions.
We compare these methods in five diverse testbed settings adapted from prior work, consisting of models (either
... (truncated, 91 KB total)Resource ID:
d875cbfb1b50d2a2 | Stable ID: ZmQ4NDhkYz