Longterm Wiki
Back

Recent multi-lab research

paper

Authors

Tomek Korbak·Mikita Balesni·Elizabeth Barnes·Yoshua Bengio·Joe Benton·Joseph Bloom·Mark Chen·Alan Cooney·Allan Dafoe·Anca Dragan·Scott Emmons·Owain Evans·David Farhi·Ryan Greenblatt·Dan Hendrycks·Marius Hobbhahn·Evan Hubinger·Geoffrey Irving·Erik Jenner·Daniel Kokotajlo·Victoria Krakovna·Shane Legg·David Lindner·David Luan·Aleksander Mądry·Julian Michael·Neel Nanda·Dave Orr·Jakub Pachocki·Ethan Perez·Mary Phuong·Fabien Roger·Joshua Saxe·Buck Shlegeris·Martín Soto·Eric Steinberger·Jasmine Wang·Wojciech Zaremba·Bowen Baker·Rohin Shah·Vlad Mikulik

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Cited by 2 pages

PageTypeQuality
Reasoning and PlanningCapability65.0
Scalable Eval ApproachesApproach65.0

Cached Content Preview

HTTP 200Fetched Mar 8, 202640 KB
[2507.11473] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety 
 
 
 
 
 
 
 
 
 
 
 

 
 

 
 
 
 
 
 
 Chain of Thought Monitorability:
 A New and Fragile Opportunity for AI Safety

 
 
 
 
 
 
 
 
 
 Tomek Korbak ∗ UK AI Security Institute 
 
 
 Mikita Balesni ∗ Apollo Research 
 
 
 Elizabeth Barnes METR 
 Yoshua Bengio University of Montreal & Mila 
 
 Joe Benton Anthropic 
 Joseph Bloom UK AI Security Institute 
 
 Mark Chen OpenAI 
 Alan Cooney UK AI Security Institute 
 
 Allan Dafoe Google DeepMind 
 Anca Dragan Google DeepMind 
 
 Scott Emmons Google DeepMind 
 Owain Evans Truthful AI & UC Berkeley 
 
 David Farhi OpenAI 
 Ryan Greenblatt Redwood Research 
 
 Dan Hendrycks Center for AI Safety 
 Marius Hobbhahn Apollo Research 
 
 Evan Hubinger Anthropic 
 Geoffrey Irving UK AI Security Institute 
 
 Erik Jenner Google DeepMind 
 Daniel Kokotajlo AI Futures Project 
 
 Victoria Krakovna Google DeepMind 
 Shane Legg Google DeepMind 
 
 David Lindner Google DeepMind 
 David Luan Amazon 
 
 Aleksander Mądry OpenAI 
 Julian Michael Scale AI 
 
 Neel Nanda Google DeepMind 
 Dave Orr Google DeepMind 
 
 Jakub Pachocki OpenAI 
 Ethan Perez Anthropic 
 
 Mary Phuong Google DeepMind 
 Fabien Roger Anthropic 
 
 Joshua Saxe Meta 
 Buck Shlegeris Redwood Research 
 
 Martín Soto UK AI Security Institute 
 Eric Steinberger Magic 
 
 Jasmine Wang UK AI Security Institute 
 Wojciech Zaremba OpenAI 
 
 Bowen Baker † OpenAI 
 
 
 Rohin Shah † Google DeepMind 
 
 
 Vlad Mikulik † Anthropic 
 
 
 
 
 
 
 

 
 Abstract

 AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

 
 * * footnotetext: Equal first authors, † \dagger Equal senior authors. Correspondence: tomek.korbak@dsit.gov.uk and mikita@apolloresearch.ai . The paper represents the views of the individual authors and not necessarily of their affiliated institutions. 
 
 
 
 
 Expert endorsers: 
 
 
 
 
 
 Samuel R. Bowman Anthropic 
 
 
 John Schulman Thinking Machines 
 
 
 
 
 Geoffrey Hinton University of Toronto 
 
 
 Ilya Sutskever Safe Superintelligence Inc 
 
 
 
 
 
 
 
 1 Chain of Thought Offers a Unique Safety Opportunity

 
 The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act through human language, one might hope they are easier to understand than other approaches to AI. The discovery of Ch

... (truncated, 40 KB total)
Resource ID: e2a66d86361bb628 | Stable ID: MWU1MzlkZD