Longterm Wiki

"The case for ensuring that powerful AIs are controlled" (May 2024)

web

This Redwood Research post by Shlegeris and Greenblatt is a foundational articulation of the 'AI control' research agenda, arguing that labs should prioritize adversarially robust safety measures for near-term powerful models as a complement to long-term alignment work.

Metadata

Importance: 78/100 · blog post · primary source

Summary

Buck Shlegeris and Ryan Greenblatt argue that AI labs should implement 'AI control' measures ensuring powerful models cannot cause unacceptably bad outcomes even if they are misaligned and actively trying to subvert safety measures. They contend no fundamental research breakthroughs are needed to achieve control for early transformatively useful AIs, and that this approach substantially reduces risks from scheming/deceptive alignment. The post accompanies a technical paper demonstrating control evaluation methodology in a programming setting.

Key Points

  • AI control requires safety measures that remain robust even if models are misaligned and intentionally try to subvert them, rather than relying on models behaving well.
  • The post targets 'scheming' behavior—where models choose actions to maximize their probability of eventually seizing power while appearing aligned.
  • Authors claim control is achievable for early transformative AIs without fundamental research breakthroughs, making it a near-term tractable safety approach.
  • Accompanied by a technical paper demonstrating methodology for evaluating whether safety measures actually control a model in a programming task setting.
  • Control is framed as complementary to alignment research: it reduces catastrophic risk even under uncertainty about whether models are truly aligned.

Cited by 3 pages

Page                         | Type          | Quality
Why Alignment Might Be Easy  | Argument      | 53.0
AI Control                   | Research Area | 75.0
Technical AI Safety Research | Crux          | 66.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 98 KB
The case for ensuring that powerful AIs are controlled

Redwood Research blog

 Labs should make sure that powerful models can't cause unacceptably bad outcomes even if the AIs try to.

Buck Shlegeris and Ryan Greenblatt · May 07, 2024 · [Originally posted to LessWrong]

 In this post, we argue that AI labs should ensure that powerful AIs are controlled . That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion. 

 We recently released a paper in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We also introduced several techniques that let us improve the level of safety substantially above baselines without substantially sacrificing performance. In this post, we'll describe AI control in a broader context and argue that achieving control is a promising approach to reducing risk from models that intentionally try to subvert safety mechanisms. 
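
As a rough illustration of what such a control evaluation might look like, here is a minimal Python sketch of a trusted-monitoring style protocol: an untrusted model's solutions are scored by a monitor, the most suspicious fraction are audited, and the protocol counts as safe on a run only if every red-team attack is caught. The class and function names, the suspicion scores, and the 2% audit budget are illustrative assumptions, not the paper's actual setup or numbers.

```python
# Hypothetical sketch of a trusted-monitoring control evaluation.
# All names and numbers here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Episode:
    task: str          # programming problem given to the untrusted model
    solution: str      # code the untrusted model produced
    is_attack: bool    # red-team ground truth: was a backdoor inserted?
    suspicion: float   # trusted monitor's suspicion score for the solution


def evaluate_protocol(episodes: list[Episode], audit_fraction: float = 0.02) -> dict:
    """Audit the most suspicious fraction of outputs and check whether any
    red-team attack slips past the audit."""
    n_audit = max(1, int(len(episodes) * audit_fraction))
    ranked = sorted(episodes, key=lambda e: e.suspicion, reverse=True)
    audited = ranked[:n_audit]

    caught = [e for e in audited if e.is_attack]
    missed = [e for e in episodes if e.is_attack and e not in audited]

    return {
        "audited": n_audit,
        "attacks_caught": len(caught),
        "attacks_missed": len(missed),
        # The protocol is "safe" on this run only if no attack evaded auditing.
        "safe": len(missed) == 0,
    }


if __name__ == "__main__":
    # Toy data: one backdoored solution that the monitor finds suspicious.
    episodes = [Episode(f"task-{i}", "code", False, 0.1) for i in range(99)]
    episodes.append(Episode("task-99", "code", True, 0.9))
    print(evaluate_protocol(episodes))
```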

 AI control aims to mitigate the risks that arise from models intentionally choosing actions that maximize their probability of eventually seizing power. This behavior has been called scheming ( Cotra 2021 , Carlsmith 2023 ) and is nearly the same as what Hubinger 2019 calls deceptive alignment; see those articles for arguments that scheming might arise naturally. Scheming might be a particularly difficult-to-handle source of AI takeover risk because schemers actively try to subvert your efforts to ensure safety: they conceal their misalignment, sabotage experiments you run to evaluate whether your deployment is safe, and creatively subvert your safety techniques. [1] 

 There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

 Alignment : Ensure that your models aren't scheming. [2] 

 Control : Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures. [3] 

 If you try to employ the control line-of-defense—that is, you try to make it so that you'll be safe even if your models are attempting to s

... (truncated, 98 KB total)
Resource ID: 32c44bb7ba8a1bbe | Stable ID: sid_YbGoALjXWB