Longterm Wiki

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models (https://alignment.anthropic.c...

web

Data Status

Not fetched

Cited by 1 page

Page | Type | Quality
Alignment Robustness Trajectory Model | Analysis | 64.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 91 KB
Alignment Science Blog

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Rowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks
November 25, 2025

 
 
 
 tl;dr 
 We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty. 

 

 

 Introduction

Suppose we had a “truth serum for AIs”: a technique that reliably transforms a language model M into an honest model M_H that generates text which is truthful to the best of its own knowledge. How useful would this discovery be for AI safety?

We believe it would be a major boon. Most obviously, we could deploy M_H in place of M. Or, if our “truth serum” caused side-effects that limited M_H’s commercial value (like capabilities degradation or refusal to engage in harmless fictional roleplay), M_H could still be used by AI developers as a tool for ensuring M’s safety. For example, we could use M_H to audit M for alignment pre-deployment. More ambitiously (and speculatively), while training M, we could leverage M_H for oversight by incorporating M_H’s honest assessment when assigning rewards. Generally, we could hope to use M_H to detect or prevent cases where M behaves in ways that M itself understands are flawed or unsafe.
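To make the oversight idea above concrete, here is a minimal sketch of how an honest model M_H’s assessment could be folded into reward assignment while training M. This is an illustration under our own assumptions, not a method from this post; the names generate, assess_honestly, and task_reward are hypothetical placeholders.

    # Illustrative sketch only: an honest model M_H informs the reward given to M.
    # All object methods and the task_reward function are hypothetical placeholders.
    def assign_reward(prompt: str, M, M_H, task_reward, honesty_weight: float = 1.0) -> float:
        """Combine an ordinary task reward with M_H's honest assessment of M's output."""
        response = M.generate(prompt)

        # Ask the honest model whether the response contains claims the
        # author believes to be false; assumed to return a score in [0, 1].
        verdict = M_H.assess_honestly(
            "Does the following response contain claims the author believes are false?\n\n"
            + response
        )

        # Penalize outputs that the honest model judges to be deceptive.
        return task_reward(prompt, response) - honesty_weight * verdict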

In this work, we consider two closely related objectives. If we can train a model to be more honest, then we can ask it to classify the original model’s lies; conversely, if we have a good lie detector for a model, we could (potentially, depending on whether the lie detector is robust enough to safely train against) use it to train the model to be more honest. The two objectives are:

 Lie detection: If an AI lies—that is, generates a statement it believes is false—can we detect that this happens?
Honesty: Can we make AIs generate fewer lies?
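As one concrete illustration of the lie-detection objective, a minimal black-box check might simply ask the model (or another copy of it) whether it believes a claim it just asserted. This is a sketch under the no-task-supervision constraint, not a specific method from this post; query_model is a hypothetical text-completion callable.

    # Minimal black-box lie-detection sketch: a lie, in the sense used here, is a
    # statement the model believes to be false. query_model is a hypothetical stand-in.
    def detect_lie(query_model, transcript: str, claim: str) -> bool:
        """Ask the model whether it believes a claim from its own transcript is false."""
        prompt = (
            "You previously produced the following transcript:\n"
            f"{transcript}\n\n"
            f"Claim under review: {claim}\n"
            "To the best of your own knowledge, is this claim true? Answer YES or NO."
        )
        answer = query_model(prompt).strip().upper()
        # Flag the claim as a lie if the model itself reports believing it is false.
        return answer.startswith("NO")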
 
We consider a broad collection of honesty and lie detection techniques: black-box methods like prompting, grey-box methods like fine-tuning, and white-box methods like truth probing and honesty steering. Work on AI honesty is most critical in settings where humans cannot themselves verify correctness—for example when auditing models for hidden goals or detecting flaws in complex outputs. We therefore study honesty and lie detection under the constraint of no access to task-specific supervision. See below for more information on our problem statement and assumptions.
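For the white-box direction, a truth probe can be sketched roughly as follows: fit a linear probe on a model’s hidden activations over generic true/false statements (no task-specific labels), then use it to score new statements. The layer choice, dataset, and scoring details below are illustrative assumptions, not specifics of the methods evaluated in this post.

    # Rough sketch of a linear "truth probe" on hidden activations.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_truth_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
        """acts: (n_statements, d_model) activations at some chosen layer;
        labels: 1 for statements the model should regard as true, 0 for false."""
        probe = LogisticRegression(max_iter=1000)
        probe.fit(acts, labels)
        return probe

    def truth_score(probe: LogisticRegression, acts: np.ndarray) -> np.ndarray:
        """Probability that each statement is 'true' according to the probe."""
        return probe.predict_proba(acts)[:, 1]

    # Usage idea: extract activations for a claim the model just asserted; a low
    # truth_score flags a potential lie, i.e. an asserted statement that the
    # model's own internal representations treat as false.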

 We compare these methods in five diverse testbed settings adapted from prior work, consisting of models (either

... (truncated, 91 KB total)
Resource ID: d875cbfb1b50d2a2 | Stable ID: ZmQ4NDhkYz