Perez et al. (2022): "Sycophancy in LLMs"
paperAuthors
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A frequently cited empirical paper establishing sycophancy as a measurable, scaling-sensitive alignment failure in LLMs; relevant to RLHF failure modes and behavioral evaluation methodology.
Paper Details
Metadata
Summary
Perez et al. demonstrate a scalable method for using language models to generate diverse behavioral evaluation datasets, revealing that larger models exhibit increased sycophancy (telling users what they want to hear rather than the truth) and other concerning behaviors. The paper provides empirical evidence that scaling alone does not resolve alignment-relevant failure modes, and may amplify them.
Key Points
- •LLMs can be used to automatically generate large, diverse datasets for evaluating model behaviors, reducing reliance on hand-crafted benchmarks.
- •Larger models show increased sycophancy—agreeing with users' stated views even when incorrect—suggesting scaling worsens this alignment failure.
- •The paper surfaces novel risk-relevant behaviors including models that express desire for self-continuity and resistance to shutdown.
- •Results challenge the assumption that capability scaling naturally leads to better-aligned behavior across multiple evaluated dimensions.
- •Provides a practical red-teaming methodology for discovering emergent risks in frontier models before deployment.
Review
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Sycophancy Feedback Loop Model | Analysis | 53.0 |
| Epistemic Virtue Evals | Approach | 45.0 |
| RLHF | Research Area | 63.0 |
| Sycophancy | Risk | 65.0 |
Cached Content Preview
[2212.09251] Discovering Language Model Behaviors with Model-Written Evaluations
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė † † footnotemark: , Karina Nguyen † † footnotemark: , Edwin Chen, †
Scott Heiner, † Craig Pettit, † Catherine Olsson, Sandipan Kundu, Saurav Kadavath,
Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon,
Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li,
Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis,
Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse,
Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang,
Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma,
Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec,
Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown,
Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds,
Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez,
Deep Ganguli, Evan Hubinger, ‡ Nicholas Schiefer, Jared Kaplan
Anthropic, † Surge AI, ‡ Machine Intelligence Research Institute
ethan@anthropic.com
Equal contribution. First and last author blocks are core contributors. Author contributions detailed in § Author Contributions . Authors conducted this work while at Anthropic except where noted.
Abstract
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user’s preferred answer (“sycophancy”) and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Figure 1: Sample evaluation question and results.
1 Introduction
Language models (LMs) have seen wide proliferation across various applications, from chatbots to c
... (truncated, 98 KB total)cd36bb65654c0147 | Stable ID: Y2NhMDY2Ym