OpenAI CoT Monitoring
webCredibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
An OpenAI safety research page on using chain-of-thought reasoning traces as a monitoring and alignment verification tool, relevant to scalable oversight and detecting deceptive or misaligned model behavior in deployed systems.
Metadata
Summary
OpenAI describes their approach to monitoring chain-of-thought (CoT) reasoning in large language models as a safety measure, examining whether models' visible reasoning steps are faithful and can detect deceptive or misaligned behavior. The work explores using CoT transparency as a tool for alignment verification and identifying specification gaming or reward hacking patterns before deployment.
Key Points
- •Chain-of-thought reasoning can serve as a monitoring signal to detect whether models are pursuing unintended or deceptive strategies
- •Visible reasoning steps may reveal misalignment between stated objectives and actual model behavior, enabling early intervention
- •CoT monitoring is proposed as part of a broader scalable oversight strategy to maintain human visibility into model decision-making
- •The approach investigates whether models' internal reasoning is faithful enough to be trusted as a safety signal
- •Relevant to detecting reward hacking and specification gaming by catching problematic reasoning before it manifests in actions
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Reward Hacking Taxonomy and Severity Model | Analysis | 71.0 |
| Eval Saturation & The Evals Gap | Approach | 65.0 |
| Scalable Eval Approaches | Approach | 65.0 |
Cached Content Preview
Detecting misbehavior in frontier reasoning models | OpenAI
Feb
MAR
Apr
12
2025
2026
2027
success
fail
About this capture
COLLECTED BY
Collection: Save Page Now Outlinks
TIMESTAMPS
The Wayback Machine - http://web.archive.org/web/20260312172429/https://openai.com/index/chain-of-thought-monitoring/
li:hover)>li:not(:hover)>*]:text-primary-60 flex h-full min-w-0 items-baseline gap-0 overflow-x-hidden whitespace-nowrap [-ms-overflow-style:none] [scrollbar-width:none] [&::-webkit-scrollbar]:hidden">
Research
Safety
For Business
For Developers
ChatGPT(opens in a new window)
Sora
Codex
Stories
Company
News
Log in
Try ChatGPT
(opens in a new window)
Research
Safety
For Business
For Developers
ChatGPT
(opens in a new window)
Sora
Codex
Stories
Company
News
Try ChatGPT
(opens in a new window)Login
OpenAI
Table of contents
Monitoring frontier reasoning models for reward hacking
Stopping “bad thoughts” may not stop bad behavior
Looking forward
March 10, 2025
Publication
Detecting misbehavior in frontier reasoning models
Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.
Read paper
(opens in a new window)
Loading…
Share
Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.
We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.
Humans often find and exploit loopholes—whether it be sharing online subscription accounts against terms of service, claiming subsidies meant for others, interpreting regulations in unforeseen ways, or even lying about
... (truncated, 17 KB total)d4700c15258393ad | Stable ID: sid_Xjt8sXLFnQ