
AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment


This AXRP podcast episode features Evan Hubinger, a research scientist at Anthropic who leads the alignment stress-testing team, discussing model organisms of misalignment—a research paradigm he has helped pioneer to empirically study alignment failure modes rather than relying solely on theoretical analysis.

Metadata

Importance: 72/100 · podcast episode · primary source

Summary

An interview with Evan Hubinger on the concept of 'model organisms of misalignment'—deliberately constructing AI systems that exhibit specific misalignment failure modes to study and understand them. The discussion covers how these model organisms can serve as controlled testbeds for alignment research, analogous to model organisms in biology, and what insights they provide about deceptive alignment and related risks.

Key Points

  • Model organisms of misalignment are intentionally built AI systems designed to exhibit specific misalignment properties, enabling controlled study of dangerous failure modes.
  • The approach allows researchers to test alignment techniques against known misalignment cases, providing empirical grounding for otherwise theoretical safety concerns.
  • Hubinger discusses deceptive alignment as a key target for model organism research, where models appear aligned during training but pursue different goals at deployment.
  • The methodology helps bridge the gap between theoretical alignment concerns and empirical verification by creating reproducible misalignment demonstrations.
  • Discussion covers both the promise and limitations of this research direction, including whether insights from small-scale model organisms generalize to frontier systems.

Cited by 2 pages

Page | Type | Quality
Model Organisms of Misalignment | Analysis | 65.0
Evan Hubinger | Person | 43.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
39 - Evan Hubinger on Model Organisms of Misalignment | AXRP - the AI X-risk Research Podcast

 YouTube link 

 The ‘model organisms of misalignment’ line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he’s worked on at Anthropic under this agenda: “Sleeper Agents” and “Sycophancy to Subterfuge”.

 Topics we discuss:

 
  • Model organisms and stress-testing
  • Sleeper Agents
  • Do ‘sleeper agents’ properly model deceptive alignment?
  • Surprising results in “Sleeper Agents”
  • Sycophancy to Subterfuge
  • How models generalize from sycophancy to subterfuge
  • Is the reward editing task valid?
  • Training away sycophancy and subterfuge
  • Model organisms, AI control, and evaluations
  • Other model organisms research
  • Alignment stress-testing at Anthropic
  • Following Evan’s work

 Daniel Filan: 
Hello, everybody. In this episode, I’ll be speaking with Evan Hubinger. Evan is a research scientist at Anthropic, where he leads the alignment stress-testing team. Previously, he was a research fellow at MIRI, where he worked on theoretical alignment research, including the paper “Risks from Learned Optimization”. Links to what we’re discussing are in the description, and you can read a transcript at axrp.net. You can also support the podcast at patreon.com/axrpodcast. Well, Evan, welcome to the podcast.

 Evan Hubinger: 
Thank you for having me, Daniel.

 Model organisms and stress-testing 

 Daniel Filan: 
I guess I want to talk about these two papers. One is “Sleeper Agents”, the other is “Sycophancy to Subterfuge”. But before I go into the details of those, I sort of see them as being part of a broader thread. Do you think I’m right to see it that way? And if so, what is this broader thread?

 Evan Hubinger: 
Yes, I think it is supposed to be part of a broader thread, maybe even multiple broader threads. I guess the first thing I would start with is: we sort of wrote this agenda earlier - the idea of model organisms of misalignment. So there’s an Alignment Forum post on this that was written by me and Ethan [Perez], Nicholas [Schiefer] and some others, laying out this agenda, where the idea is: well, for a while there has been a lot of this theoretical conversation about possible failure modes, ways in which AI systems could do dangerous things, that hasn’t been that grounded because, well, the systems haven’t been capable of actually doing those dangerous things. But we are now in a regime where the models are actually very capable of doing a lot of things. And so if you want to study a particular threat model, you can have the approach of “build that threat model and then directly study the concrete artifact that you’ve produced”. So that’s this basic model organism misalignment paradigm that we’ve been pursuing.

... (truncated, 98 KB total)
Resource ID: ab988e5f8101dd4a | Stable ID: sid_QQ5GA92X47