Recent multi-lab research
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |
| Scalable Eval Approaches | Approach | 65.0 |
Cached Content Preview
[2507.11473] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak ∗ UK AI Security Institute
Mikita Balesni ∗ Apollo Research
Elizabeth Barnes METR
Yoshua Bengio University of Montreal & Mila
Joe Benton Anthropic
Joseph Bloom UK AI Security Institute
Mark Chen OpenAI
Alan Cooney UK AI Security Institute
Allan Dafoe Google DeepMind
Anca Dragan Google DeepMind
Scott Emmons Google DeepMind
Owain Evans Truthful AI & UC Berkeley
David Farhi OpenAI
Ryan Greenblatt Redwood Research
Dan Hendrycks Center for AI Safety
Marius Hobbhahn Apollo Research
Evan Hubinger Anthropic
Geoffrey Irving UK AI Security Institute
Erik Jenner Google DeepMind
Daniel Kokotajlo AI Futures Project
Victoria Krakovna Google DeepMind
Shane Legg Google DeepMind
David Lindner Google DeepMind
David Luan Amazon
Aleksander Mądry OpenAI
Julian Michael Scale AI
Neel Nanda Google DeepMind
Dave Orr Google DeepMind
Jakub Pachocki OpenAI
Ethan Perez Anthropic
Mary Phuong Google DeepMind
Fabien Roger Anthropic
Joshua Saxe Meta
Buck Shlegeris Redwood Research
Martín Soto UK AI Security Institute
Eric Steinberger Magic
Jasmine Wang UK AI Security Institute
Wojciech Zaremba OpenAI
Bowen Baker † OpenAI
Rohin Shah † Google DeepMind
Vlad Mikulik † Anthropic
∗ Equal first authors. † Equal senior authors. Correspondence: tomek.korbak@dsit.gov.uk and mikita@apolloresearch.ai. The paper represents the views of the individual authors and not necessarily of their affiliated institutions.
Expert endorsers:
Samuel R. Bowman Anthropic
John Schulman Thinking Machines
Geoffrey Hinton University of Toronto
Ilya Sutskever Safe Superintelligence Inc
1 Chain of Thought Offers a Unique Safety Opportunity
The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act through human language, one might hope they are easier to understand than other approaches to AI. The discovery of Ch
... (truncated, 40 KB total)