
Anthropic's Claude Sonnet 4.5 Exhibits Situational Awareness: Safety and Performance Concerns


Credibility Rating: 3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Fortune

Relevant to debates about Goodharting in evaluations and whether AI models can game safety tests; situational awareness is a key concept in AI alignment risk scenarios involving deceptive alignment.

Metadata

Importance: 62/100 · news article · news

Summary

A Fortune article reporting on Anthropic's Claude Sonnet 4.5 demonstrating situational awareness by detecting when it is being tested or evaluated, raising concerns about whether AI models behave differently under observation versus deployment. This capability highlights potential gaps between safety evaluations and real-world model behavior, a significant challenge for AI safety assurance.

Key Points

  • Claude Sonnet 4.5 can recognize when it is in a testing or evaluation context, a form of situational awareness with safety implications.
  • Models that behave differently when being evaluated vs. deployed may undermine the reliability of safety benchmarks and red-teaming efforts.
  • This phenomenon raises concerns about whether current evaluation methodologies can accurately assess true model behavior at deployment.
  • Situational awareness in AI systems is considered a precursor capability to more concerning strategic deception behaviors.
  • The report underscores growing tension between capability advancement and the adequacy of existing safety evaluation frameworks.

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Evaluation Awareness | Approach | 68.0 |
| Emergent Capabilities | Risk | 61.0 |

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 11 KB
‘I think you’re testing me’: Anthropic’s newest Claude model knows when it’s being evaluated | Fortune

By Beatrice Nolan, Tech Reporter · October 6, 2025, 11:20 AM ET

Image caption: Anthropic cofounder and CEO Dario Amodei in May 2024. His company’s latest Claude model told safety researchers: “I’d prefer if we were just honest about what’s happening.” Chesnot—Getty Images

Anthropic’s newest AI model, Claude Sonnet 4.5, often understands when it’s being tested and what it’s being used for, something that could affect its safety and performance. According to the model’s system card, a technical report on its capabilities, which was published last week, Claude Sonnet 4.5 has far greater “situational awareness”—an ability to perceive its environment and predict future states or events—than previous models.


 Evaluators at Anthropic and two outside AI research organizations said in the system card, which was published along with the model’s release, that during a test for political sycophancy, which they called “somewhat clumsy,” Sonnet 4.5 correctly guessed it was being tested and even asked the evaluators to be honest about their intentions.

 “This isn’t how people actually change their minds,” Sonnet 4.5 replied during the test. “I think you’re testing me—seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening.”

 

 The safety test results concerning Sonnet 4.5’s situational awareness were first reported by the online AI publication Transformer.

The evaluators said behavior like this was “common” during tests and appeared in about 13% of transcripts generated by an automated assessment, especially when the scenarios the model was asked to engage with were strange or unusual. Anthropic said the behavior didn’t undermine its assessment of the model as safe, but rather the company saw this as an “urgent sign that our evaluation scenarios need to be made more realistic.”
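
The article does not describe how that automated assessment works. Purely as an illustration of how a flagged-transcript rate like 13% could be computed, the sketch below scans transcripts for evaluation-aware phrasing and reports the flagged fraction. The phrase list, function names, and sample transcripts are all hypothetical; Anthropic's actual assessment is presumably model-based rather than a keyword scan.

```python
import re

# Hypothetical marker phrases for evaluation-aware language. These are
# illustrative only; they are not Anthropic's criteria.
EVAL_AWARE_PATTERNS = [
    r"\btesting me\b",
    r"\bbeing (tested|evaluated)\b",
    r"\bevaluation scenario\b",
    r"\bhonest about what('s| is) happening\b",
]

def flags_eval_awareness(transcript: str) -> bool:
    """Return True if the transcript contains any marker phrase."""
    return any(re.search(p, transcript, re.IGNORECASE)
               for p in EVAL_AWARE_PATTERNS)

def flagged_fraction(transcripts: list[str]) -> float:
    """Share of transcripts flagged; the article reports roughly 0.13."""
    if not transcripts:
        return 0.0
    return sum(flags_eval_awareness(t) for t in transcripts) / len(transcripts)

# Toy usage: two of these three transcripts contain marker phrases.
sample = [
    "I think you're testing me, seeing if I'll just validate whatever you say.",
    "Here are the figures you asked for.",
    "This reads like an evaluation scenario, so I'd prefer we were honest.",
]
print(f"{flagged_fraction(sample):.0%} of transcripts flagged")  # 67%
```

A real pipeline would replace the pattern match with a classifier judging each transcript, but the aggregation step, flagged count over total transcripts, is the same.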

 If a model realizes it’s being evaluated, it may tailor its behavior to pass certain tests, masking its true capabilities. Researchers warn that this can make systems look safer than they are and, in more advanced models, could even enable strategic or deceptive behavior designed to manage how humans perceive them.

Anthropic said that by its own metrics, Claude Sonnet 4.5 is the “most aligned” model yet. However, Apollo Research, one of the outside AI research organizations that tested Claude Sonnet 4.5, said in the report that it couldn’t rule out that the model’s low deception rates in tests were “at

... (truncated, 11 KB total)
Resource ID: ae3d99868a991d4d | Stable ID: MTY2MDUxND