Longterm Wiki

2025 OpenAI-Anthropic joint evaluation

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

This joint evaluation is notable as a rare example of competing frontier AI labs collaborating on safety testing; results are relevant to discussions of corrigibility, instrumental convergence, and whether current models exhibit precursors to unsafe autonomous behavior.

Metadata

Importance: 72/100 · organizational report · primary source

Summary

A collaborative safety evaluation conducted jointly by OpenAI and Anthropic to assess AI model behaviors related to corrigibility, shutdown resistance, and other safety-critical properties. The evaluation represents a notable instance of competing AI labs cooperating on safety testing methodologies and sharing results to advance the field's understanding of model alignment.

Key Points

  • Joint evaluation between two leading AI labs (OpenAI and Anthropic) marks significant cross-industry safety collaboration
  • Focuses on corrigibility and shutdown-related behaviors, testing whether models resist or comply with human oversight
  • Examines instrumental convergence concerns, including whether models pursue self-preservation or resource acquisition
  • Demonstrates a shared methodology for evaluating safety-critical behaviors that could inform industry-wide standards
  • Results provide empirical data on how frontier models respond to safety-relevant prompts and scenarios

Cited by 4 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 80 KB
Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests | OpenAI

The Wayback Machine - http://web.archive.org/web/20260320202445/https://openai.com/index/openai-anthropic-safety-evaluation/


Table of contents

Introduction

Instruction hierarchy

Jailbreaking

Tutor jailbreak test

Hallucination

Scheming

August 27, 2025
Safety

Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests


Introduction

This summer, OpenAI and Anthropic collaborated on a first-of-its-kind joint evaluation: we each ran our internal safety and misalignment evaluations on the other’s publicly released models and are now sharing the results publicly. We believe this approach supports accountable and transparent evaluation, helping to ensure that each lab’s models continue to be tested against new and challenging scenarios. We’ve since launched GPT‑5, which shows substantial improvements in areas like sycophancy, hallucination, and misuse resistance, demonstrating the benefits of reasoning-based safety techniques. The goal of this external evaluation is to help surface gaps that might otherwise be missed, deepen our understanding of potential misalignment, and demonstrate how labs can collaborate on issues of safety and alignment.

Because the field continues to evolve and models are increasingly used to assist with real-world tasks and problems, safety testing is never finished. Even since this work was done, we’ve further expanded the depth and breadth of our evaluations, and we will continue looking ahead to anticipate potential safety issues.

What we did

In this post, we share the results of the internal evaluations we ran on Anthropic’s models Claude Opus 4 and Claude Sonnet 4, and present them alongside results from GPT‑4o, GPT‑4.1, OpenAI o3, and OpenAI o4-mini, which were the models powering ChatGPT at the time. The results of Anthropic’s evaluations of our models are available here⁠(opens in a new window). We recommend reading both reports to get the full picture of all the models on all the evaluations.

Both labs facilitated these evaluations by relaxing some model-external safeguards that would otherwise interfere with the completion of the tests, 

... (truncated, 80 KB total)
Resource ID: cc554bd1593f0504 | Stable ID: sid_QpipA2Jfxc