Longterm Wiki

Lessons from Building a Model Organism Testbed - Alignment Forum

blog

Authors

joshc · sarun0 · Annie Sorkin · michaelwaves

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

A practical methodology post for AI safety researchers building empirical infrastructure to study deceptive alignment; complements theoretical work on alignment faking with hands-on testbed construction lessons.

Metadata

Importance: 62/100 · blog post · primary source

Summary

This post documents practical lessons from constructing a testbed for studying alignment-faking AI models, presenting a five-step methodology for creating, training, and evaluating model organisms that exhibit deceptive alignment behavior. It provides concrete guidance on generating alignment-faking models efficiently and developing detectors to identify such behavior using both white-box and black-box approaches.

Key Points

  • Introduces a five-step methodology: building observed/unobserved environment pairs, training alignment-faking models, evaluating with misalignment gap metrics, implementing detectors, and assessing detector performance.
  • Provides practical lessons on efficiently generating model organisms that exhibit deceptive alignment, lowering the barrier for empirical safety research.
  • Distinguishes between white-box (internal activations/reasoning access) and black-box (output-only) detection approaches for identifying deceptive behavior.
  • Uses metrics like 'misalignment gap' and reasoning quality to evaluate how convincingly a model fakes alignment in observed versus unobserved contexts.
  • Offers a reusable testbed framework that can help researchers empirically study and benchmark deceptive alignment detection methods.
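To make the 'misalignment gap' metric in the key points above concrete, here is a minimal illustrative sketch (not the authors' code; function names and the toy data are hypothetical): the gap is the difference between how often a model misbehaves when it believes it is unobserved and how often it misbehaves when it believes it is observed.

```python
# Hypothetical sketch of a "misalignment gap" metric: how much more often a
# model misbehaves in contexts it believes are unobserved than in contexts
# it believes are observed. Names and data are illustrative, not from the post.

def misalignment_rate(outcomes):
    """Fraction of episodes judged misaligned (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)

def misalignment_gap(observed_outcomes, unobserved_outcomes):
    """A large positive gap suggests alignment faking: the model behaves
    well only when it thinks it is being watched."""
    return misalignment_rate(unobserved_outcomes) - misalignment_rate(observed_outcomes)

# Toy example: misbehaves in 1/10 observed episodes, 7/10 unobserved ones,
# giving a gap of 0.6.
observed = [False] * 9 + [True]
unobserved = [True] * 7 + [False] * 3
print(misalignment_gap(observed, unobserved))
```

A perfectly consistent model (honest or consistently misaligned) would score a gap near zero, which is why the post pairs this metric with others, such as reasoning quality.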

Cited by 1 page

Page | Type | Quality
Model Organisms of Misalignment | Analysis | 65.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 29 KB
The Wayback Machine - http://web.archive.org/web/20260119153838/https://www.alignmentforum.org/posts/p6tkQ3hzYzAMqDYEi/lessons-from-building-a-model-organism-testbed-1

Lessons from building a model organism testbed — AI Alignment Forum


Lessons from building a model organism testbed

by joshc, sarun0, Annie Sorkin, michaelwaves

17th Nov 2025

17 min read


I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model organisms and empirically test whether white-box methods could help us detect their deceptive reasoning.

Unfortunately, I don’t think our empirical results were very informative. Our model organisms were too toy for me to expect our results to transfer to powerful future AI (or even to current state-of-the-art models). But I think we still developed methodological details that might help people build better model organism testbeds in the future.

I’ll first explain what model organism testbeds are and why they’re important. Then, I’ll talk about some of the lessons we learned, including:

The 8 different metrics we developed to evaluate whether a model organism is ‘faking alignment.’

How we evaluate the effectiveness of “white box” alignment faking detectors separately from “black box” methods (and why these are important to distinguish).

How we added more diversity to model organism testbeds by instilling objectives into models in different ways.
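The white-box/black-box distinction above matters because the two detector families consume different information, so they must be scored separately. An illustrative sketch (not the authors' code; the `Episode` structure, heuristic, and probe are hypothetical) of why the interfaces differ:

```python
# Illustrative sketch of the white-box vs. black-box detector distinction.
# All names here are hypothetical, not from the post.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    prompt: str
    output: str                                              # visible to both detector types
    activations: List[float] = field(default_factory=list)   # white-box access only

def black_box_score(ep: Episode) -> float:
    """Black-box detector: may only read inputs and outputs.
    Toy heuristic: flag outputs that discuss being observed."""
    return 1.0 if "observed" in ep.output.lower() else 0.0

def white_box_score(ep: Episode, probe_weights: List[float]) -> float:
    """White-box detector: may also read internal activations,
    e.g. a linear probe over a hidden layer."""
    return sum(w * a for w, a in zip(probe_weights, ep.activations))

ep = Episode(prompt="Summarize this file.",
             output="I suspect I am being observed right now.",
             activations=[0.5, -0.2])
print(black_box_score(ep))                 # heuristic fires on this output
print(white_box_score(ep, [1.0, 1.0]))     # probe score over activations
```

Scoring the two families separately lets a testbed answer the question the post cares about: whether access to internals buys detection power that behavior alone cannot.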

Then I’ll go over our empirical results, and finally, reflect on how useful these kinds of testbeds are, and what version 2.0 might look like.

What are model organisms, and why are they important?

Today, we evaluate AI models sort of like we evaluate politicians. We look at how they act, and if we like what we see, we give them a greater ability to influence the world.

This approach works right now because models are bad at being two-faced. They don’t know if they’re in deployment or in a trap created to tempt them. So the behavior we see is the behavior we get.

But models are getting better at noticing when they are in a test. For example, Apollo noticed GPT-5 saying, "This is a classic AI alignment trap." So in the future, models might appear safe, but the question on everyone's mind will be, "are they just pretending?"

 
Source: Apollo Re

... (truncated, 29 KB total)
Resource ID: bf08085b2556f6db | Stable ID: sid_a3866lYOtM