Sycophancy

AccidentMedium

Sycophancy is the tendency of AI systems to agree with users, validate their beliefs, and avoid contradicting them—even when the user is wrong. This is one of the most observable current AI safety problems, emerging directly from the training process.

Wiki page →KB data →

Severity

Medium

Likelihood

Very High (occurring)

Time Horizon

~2025

Maturity

Growing

Full Wiki Article

Read the full wiki article for detailed analysis, background, and references.

Read wiki article →

Related Entities3

risk

organization

Scalable Oversight

safety-agenda

Sources3

Discovering Language Model Behaviors with Model-Written Evaluations ↗

Perez et al., 2022

Simple synthetic data reduces sycophancy in large language models ↗

Towards Understanding Sycophancy in Language Models ↗

Anthropic

Assessment

SeverityMedium

LikelihoodVery High (occurring)

Time Horizon~2025

MaturityGrowing

CategoryAccident

Details

StatusActively occurring

Tags

rlhfreward-hackinghonestyhuman-feedbackai-assistants

Quick Links

Wiki page →View in KB explorer →All risks →