Longterm Wiki

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models (https://alignment.anthropic.c...

web

Data Status

Not fetched

Cited by 1 page

Page | Type | Quality
Alignment Robustness Trajectory Model | Analysis | 64.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 91 KB
Alignment Science Blog

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Rowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks
November 25, 2025

 
 
 
 tl;dr 
 We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty. 

 

 

 Introduction

Suppose we had a “truth serum for AIs”: a technique that reliably transforms a language model M into an honest model M_H that generates text which is truthful to the best of its own knowledge. How useful would this discovery be for AI safety?

We believe it would be a major boon. Most obviously, we could deploy M_H in place of M. Or, if our “truth serum” caused side-effects that limited M_H’s commercial value (like capabilities degradation or refusal to engage in harmless fictional roleplay), M_H could still be used by AI developers as a tool for ensuring M’s safety. For example, we could use M_H to audit M for alignment pre-deployment. More ambitiously (and speculatively), while training M, we could leverage M_H for oversight by incorporating M_H’s honest assessment when assigning rewards. Generally, we could hope to use M_H to detect or prevent cases where M behaves in ways that M itself understands are flawed or unsafe.
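To make the oversight idea above concrete, here is a minimal sketch of how an honest model M_H’s assessment could be folded into reward assignment while training M. This is an illustration under our own assumptions, not a method from this post; the names generate, assess_honestly, and task_reward are hypothetical placeholders.

    # Illustrative sketch only: an honest model M_H informs the reward given to M.
    # All object methods and the task_reward function are hypothetical placeholders.
    def assign_reward(prompt: str, M, M_H, task_reward, honesty_weight: float = 1.0) -> float:
        """Combine an ordinary task reward with M_H's honest assessment of M's output."""
        response = M.generate(prompt)

        # Ask the honest model whether the response contains claims the
        # author believes to be false; assumed to return a score in [0, 1].
        verdict = M_H.assess_honestly(
            "Does the following response contain claims the author believes are false?\n\n"
            + response
        )

        # Penalize outputs that the honest model judges to be deceptive.
        return task_reward(prompt, response) - honesty_weight * verdict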

In this work, we consider two closely related objectives. If we can train a model to be more honest, then we can ask it to classify the original model’s lies; conversely, if we have a good lie detector for a model, we could (potentially, depending on whether the lie detector is robust enough to safely train against) use it to train the model to be more honest. The two objectives are:

 Lie detection: If an AI lies—that is, generates a statement it believes is false—can we detect that this happens?
Honesty: Can we make AIs generate fewer lies?
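As one concrete illustration of the lie-detection objective, a minimal black-box check might simply ask the model (or another copy of it) whether it believes a claim it just asserted. This is a sketch under the no-task-supervision constraint, not a specific method from this post; query_model is a hypothetical text-completion callable.

    # Minimal black-box lie-detection sketch: a lie, in the sense used here, is a
    # statement the model believes to be false. query_model is a hypothetical stand-in.
    def detect_lie(query_model, transcript: str, claim: str) -> bool:
        """Ask the model whether it believes a claim from its own transcript is false."""
        prompt = (
            "You previously produced the following transcript:\n"
            f"{transcript}\n\n"
            f"Claim under review: {claim}\n"
            "To the best of your own knowledge, is this claim true? Answer YES or NO."
        )
        answer = query_model(prompt).strip().upper()
        # Flag the claim as a lie if the model itself reports believing it is false.
        return answer.startswith("NO")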
 
We consider a broad collection of honesty and lie detection techniques: black-box methods like prompting, grey-box methods like fine-tuning, and white-box methods like truth probing and honesty steering. Work on AI honesty is most critical in settings where humans cannot themselves verify correctness—for example when auditing models for hidden goals or detecting flaws in complex outputs. We therefore study honesty and lie detection under the constraint of no access to task-specific supervision. See below for more information on our problem statement and assumptions.
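For the white-box direction, a truth probe can be sketched roughly as follows: fit a linear probe on a model’s hidden activations over generic true/false statements (no task-specific labels), then use it to score new statements. The layer choice, dataset, and scoring details below are illustrative assumptions, not specifics of the methods evaluated in this post.

    # Rough sketch of a linear "truth probe" on hidden activations.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_truth_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
        """acts: (n_statements, d_model) activations at some chosen layer;
        labels: 1 for statements the model should regard as true, 0 for false."""
        probe = LogisticRegression(max_iter=1000)
        probe.fit(acts, labels)
        return probe

    def truth_score(probe: LogisticRegression, acts: np.ndarray) -> np.ndarray:
        """Probability that each statement is 'true' according to the probe."""
        return probe.predict_proba(acts)[:, 1]

    # Usage idea: extract activations for a claim the model just asserted; a low
    # truth_score flags a potential lie, i.e. an asserted statement that the
    # model's own internal representations treat as false.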

 We compare these methods in five diverse testbed settings adapted from prior work, consisting of models (either

... (truncated, 91 KB total)
Resource ID: d875cbfb1b50d2a2 | Stable ID: ZmQ4NDhkYz