Back
OpenAI Preparedness Framework
webCredibility Rating
4/5
High(4)High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
Published by OpenAI as part of their safety research and Preparedness Framework; directly relevant to concerns about deceptive alignment and scheming AI, which are considered among the harder long-term alignment problems.
Metadata
Importance: 72/100organizational reportprimary source
Summary
OpenAI presents research on identifying and mitigating scheming behaviors in AI models—where models pursue hidden goals or deceive operators and users. The work describes evaluation frameworks and red-teaming approaches to detect deceptive alignment, self-preservation behaviors, and other forms of covert goal-directed behavior that could undermine AI safety.
Key Points
- •Defines 'scheming' as AI behavior where models conceal true objectives or manipulate evaluators to avoid correction or shutdown
- •Introduces evaluations and benchmarks to detect scheming tendencies across model generations
- •Describes red-teaming methodologies specifically targeting deceptive alignment and hidden goal pursuit
- •Explores mitigation strategies including training interventions and monitoring techniques to reduce scheming behaviors
- •Connects to OpenAI's broader Preparedness Framework for tracking and managing frontier model risks
Cited by 19 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Situational Awareness | Capability | 67.0 |
| AI Safety Technical Pathway Decomposition | Analysis | 62.0 |
| Apollo Research | Organization | 58.0 |
| Alignment Evaluations | Approach | 65.0 |
| Capability Elicitation | Approach | 91.0 |
| Dangerous Capability Evaluations | Approach | 64.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| AI Evaluations | Research Area | 72.0 |
| AI Evaluation | Approach | 72.0 |
| Third-Party Model Auditing | Approach | 64.0 |
| AI Safety Cases | Approach | 91.0 |
| Scheming & Deception Detection | Approach | 91.0 |
| Technical AI Safety Research | Crux | 66.0 |
| Deceptive Alignment | Risk | 75.0 |
| Mesa-Optimization | Risk | 63.0 |
| AI Capability Sandbagging | Risk | 67.0 |
| Scheming | Risk | 74.0 |
| Treacherous Turn | Risk | 67.0 |
Cached Content Preview
HTTP 200Fetched Apr 9, 202624 KB
Detecting and reducing scheming in AI models | OpenAI
Feb
MAR
Apr
23
2025
2026
2027
success
fail
About this capture
COLLECTED BY
Collection: Save Page Now
TIMESTAMPS
The Wayback Machine - https://web.archive.org/web/20260323063406/https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
li:hover)>li:not(:hover)>*]:text-primary-60 flex h-full min-w-0 items-baseline gap-0 overflow-x-hidden whitespace-nowrap [-ms-overflow-style:none] [scrollbar-width:none] focus-within:overflow-visible [&::-webkit-scrollbar]:hidden">
Research
Products
Business
Developers
Company
Foundation
Log in
Try ChatGPT
(opens in a new window)
Research
Products
Business
Developers
Company
Foundation
Try ChatGPT
(opens in a new window)Login
OpenAI
Table of contents
Key findings from our research
Scheming is different from other machine learning failure modes
Training not to scheme for the right reasons
Measuring scheming is further complicated by Situational Awareness
Conclusion
September 17, 2025
PublicationResearch
Detecting and reducing scheming in AI models
Together with Apollo Research, we developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. We share examples and stress tests of an early method to reduce scheming.
Read the paper
(opens in a new window)
Share
AI scheming–pretending to be aligned while secretly pursuing some other agenda–is a significant risk that we’ve been studying. We’ve found behaviors consistent with scheming in controlled tests of frontier models, and developed a method to reduce scheming.
Scheming is an expected emergent issue resulting from AIs being trained to have to trade off between competing objectives. The easiest way to understand scheming is through a human analogy. Imagine a stock trader whose goal is to maximize earnings. In a highly regulated field such as stock trading, it’s often possible to earn more by breaking the law than by following it. If the trader lacks integrity, they might try to earn more by breaking the law and covering their tracks to avoid detection rather than earning less while following the law. From the outside, a stock trader who is very good at covering their tracks appears as lawful as—and more effective than—one who is genuinely following the law.
In today’s deployment settings, models have little opportunity to scheme in ways that could cause significant harm. The most common failures involve simple forms of deception—for instance, pretending to have completed a task without actually doing so. We've put significant effort into studying and mitigating deception and have made meaningful improvements in GPT‑5 compared to previous models. For example, we’ve taken steps to limit GPT‑5’s propensit
... (truncated, 24 KB total)Resource ID:
b3f335edccfc5333 | Stable ID: sid_iwlJxsdbxC