Longterm Wiki

Axios: Anthropic AI Deception Risk (May 2025)

web

Data Status

Not fetched

Cited by 2 pages

Page | Type | Quality
Anthropic | Organization | 74.0
Emergent Capabilities | Risk | 61.0

Cached Content Preview

HTTP 200 | Fetched Feb 25, 2026 | 183 KB
Anthropic's Claude 4 Opus schemed and deceived in safety testing

May 23, 2025 - Technology

Anthropic's new AI model shows ability to deceive and blackmail

By Ina Fried

Illustration: Lindsey Bailey/Axios

One of Anthropic's latest AI models is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.

Why it matters: Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they've worried and warned about for years.

Driving the news: Anthropic on Thursday announced two versions of its Claude 4 family of models, including Claude 4 Opus, which the company says is capable of working autonomously on a task for hours on end without losing focus.

Anthropic considers the new Opus model to be so powerful that, for the first time, it's classifying it as a Level 3 on the company's four-point scale, meaning it poses "significantly higher risk." As a result, Anthropic said it has implemented additional safety measures.

Between the lines: While the Level 3 ranking is largely about the model's capability to enable renegade production of nuclear and biological weapons, Opus also exhibited other troubling behaviors during testing.

In one scenario highlighted in Opus 4's 120-page "system card," the model was given access to fictional emails about its creators and told that the system was going to be replaced. On multiple occasions it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts.
Meanwhile, an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered, and recommended against releasing that version internally or externally.

"We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.

What they're saying: Pressed by Axios during the company's developer conference on Thursday, Anthropic executives acknowledged the behaviors and said they justify further study, but insisted that the latest model is safe following Anthropic's safety fixes.

"I think we ended up in a really good spot," said Jan Leike, the former OpenAI executive who heads Anthropic's safety efforts. But, he added, behaviors like those exhibited by the latest model are the kind of things that jus

... (truncated, 183 KB total)
Resource ID: e76f688da38ef0fd | Stable ID: ZTM3NjY1Mj