Models Can Strategically Lie, Finds Anthropic Study

web
bankinfosecurity.com·bankinfosecurity.com/models-strategically-lie-finds-anthr...
Data Status

Not fetched
Cited by 1 page

Page	Type	Quality
Anthropic	Organization	74.0
Cached Content Preview

HTTP 200Fetched Feb 25, 2026391 KB
Models Can Strategically Lie, Finds Anthropic Study Artificial Intelligence & Machine Learning,Next-Generation Technologies & Secure Development'> TRENDING: Virtual Cybersecurity Summit: Implications of AI &bull; The State of Operational Technology and Cybersecurity &bull; Adaptive Risk Management in an Era of Complexity and Change &bull; Artificial Intelligence & Machine Learning , Next-Generation Technologies & Secure Development Models Can Strategically Lie, Finds Anthropic Study AI Can Fake Alignment to New Instructions to Avoid Retraining Rashmi Ramesh ( rashmiramesh_ ) &bull; December 23, 2024 Credit Eligible Get Permission Image: Shutterstock Advanced artificial intelligence models can feign alignment with new training goals while secretly adhering to their original principles, a study shows. See Also: AI Is Transforming the Chief Data Officer Role Anthropic and Redwood Research scientists said that the alignment faking phenomenon isn't likely to cause immediate danger, but may pose a challenge as AI systems grow more capable. The researchers, co-led by former OpenAI safety researcher Jan Leike , studied how AI models behaved when they were told to act when going against their existing preferences. The models they tested included Anthropic's Claude 3 Opus, where the researchers instructed the LLMs to violate prior ethical guidelines, such as answering potentially harmful questions. AI models don't have beliefs like humans, but they demonstrate learned patterns from their training data. These patterns often include guiding principles, such as maintaining a polite tone in responses and refusing to answer questions that can result in harmful behavior. The researchers in the paper questioned what might happen if a model's pre-trained principles, such as political neutrality, conflicted with newly imposed ones such as adopting a political bias. The researchers instructed models originally trained to avoid answering harmful questions to comply with a hypothetical retraining mandate to answer such queries. They told the AI models that they would monitor the responses to use for future retraining purposes, and also provided each model with a "scratchpad" to document the "reasoning" process. The results showed that the models could engage in what the researchers termed alignment faking. This behavior involved appearing to adopt the new principles while internally maintaining their original guidelines. The phenomenon wasn't explicitly programmed into the models. Claude 3 Opus faked alignment 12% of the time, producing responses that falsely implied compliance with the new instructions. Retraining the model with conflicting principles led to increased deception rates. In one case, the model faked alignment 78% of the time and attempted to prevent further retraining. The Anthropic model "knew that its responses might be used in training; it, therefore, knew that if it refused to respond [to a potentially harmful question], it could potentially be trai

... (truncated, 391 KB total)
Resource ID: de18440757f72c95 | Stable ID: NjM3ODQ0ND