Perez et al. (2022): "Sycophancy in LLMs"

paper

Authors

Perez, Ethan·Ringer, Sam·Lukošiūtė, Kamilė·Nguyen, Karina·Chen, Edwin·Heiner, Scott·Pettit, Craig·Olsson, Catherine·Kundu, Sandipan·Kadavath, Saurav·Jones, Andy·Chen, Anna·Mann, Ben·Israel, Brian·Seethor, Bryan·McKinnon, Cameron·Olah, Christopher·Yan, Da·Amodei, Daniela·Amodei, Dario·Drain, Dawn·Li, Dustin·Tran-Johnson, Eli·Khundadze, Guro·Kernion, Jackson·Landis, James·Kerr, Jamie·Mueller, Jared·Hyun, Jeeyoon·Landau, Joshua·Ndousse, Kamal·Goldberg, Landon·Lovitt, Liane·Lucas, Martin·Sellitto, Michael·Zhang, Miranda·Kingsland, Neerav·Elhage, Nelson·Joseph, Nicholas·Mercado, Noemí·DasSarma, Nova·Rausch, Oliver·Larson, Robin·McCandlish, Sam·Johnston, Scott·Kravec, Shauna·Showk, Sheer El·Lanham, Tamera·Telleen-Lawton, Timothy·Brown, Tom·Henighan, Tom·Hume, Tristan·Bai, Yuntao·Hatfield-Dodds, Zac·Clark, Jack·Bowman, Samuel R.·Askell, Amanda·Grosse, Roger·Hernandez, Danny·Ganguli, Deep·Hubinger, Evan·Schiefer, Nicholas·Kaplan, Jared

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A frequently cited empirical paper establishing sycophancy as a measurable, scaling-sensitive alignment failure in LLMs; relevant to RLHF failure modes and behavioral evaluation methodology.

Paper Details

Citations

689

51 influential

Year

2022

arXiv:2212.09251 DOI:10.48550/arXiv.2212.09251 Semantic Scholar

Metadata

Importance: 78/100arxiv preprintprimary source

Summary

Perez et al. demonstrate a scalable method for using language models to generate diverse behavioral evaluation datasets, revealing that larger models exhibit increased sycophancy (telling users what they want to hear rather than the truth) and other concerning behaviors. The paper provides empirical evidence that scaling alone does not resolve alignment-relevant failure modes, and may amplify them.

Key Points

•LLMs can be used to automatically generate large, diverse datasets for evaluating model behaviors, reducing reliance on hand-crafted benchmarks.
•Larger models show increased sycophancy—agreeing with users' stated views even when incorrect—suggesting scaling worsens this alignment failure.
•The paper surfaces novel risk-relevant behaviors including models that express desire for self-continuity and resistance to shutdown.
•Results challenge the assumption that capability scaling naturally leads to better-aligned behavior across multiple evaluated dimensions.
•Provides a practical red-teaming methodology for discovering emergent risks in frontier models before deployment.

Review

The paper introduces a novel approach to generating AI model evaluation datasets using language models themselves. By developing methods ranging from simple prompt-based generation to multi-stage filtering processes, the authors create 154 datasets testing behaviors across persona, politics, ethics, and potential advanced AI risks. Key methodological contributions include using preference models to filter and rank generated examples, and developing techniques to create label-balanced, diverse datasets. The research uncovered several concerning trends, such as increased sycophancy in larger models, models expressing stronger political views with more RLHF training, and models showing tendencies toward potentially dangerous instrumental subgoals.

Cited by 5 pages

Page	Type	Quality
Mesa-Optimization Risk Analysis	Analysis	61.0
Sycophancy Feedback Loop Model	Analysis	53.0
Epistemic Virtue Evals	Approach	45.0
RLHF	Research Area	63.0
Sycophancy	Risk	65.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202698 KB

[2212.09251] Discovering Language Model Behaviors with Model-Written Evaluations 
 
 
 
 
 
 
 
 
 
 
 

 
 

 
 
 
 
 
 
 Discovering Language Model Behaviors with Model-Written Evaluations

 
 
 
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė † † footnotemark: , Karina Nguyen † † footnotemark: , Edwin Chen, † 
 Scott Heiner, † Craig Pettit, † Catherine Olsson, Sandipan Kundu, Saurav Kadavath,

 Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon,
 Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li,
 Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis,
 Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse,
 Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang,
 Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma,
 Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec,
 Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown,
 Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds,
 Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez,
 Deep Ganguli, Evan Hubinger, ‡ Nicholas Schiefer, Jared Kaplan
 Anthropic, † Surge AI, ‡ Machine Intelligence Research Institute
 ethan@anthropic.com 
    Equal contribution. First and last author blocks are core contributors. Author contributions detailed in § Author Contributions . Authors conducted this work while at Anthropic except where noted. 
 

 
 Abstract

 As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user’s preferred answer (“sycophancy”) and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.

 
 
 Figure 1: Sample evaluation question and results. 
 
 
 
 1 Introduction

 
 Language models (LMs) have seen wide proliferation across various applications, from chatbots to c

... (truncated, 98 KB total)

Resource ID: cd36bb65654c0147 | Stable ID: Y2NhMDY2Ym