Longterm Wiki

Scott Emmons – AI Safety Researcher

web · scottemmons.com

Personal homepage of Scott Emmons, an AI safety researcher at Anthropic (formerly Google DeepMind and UC Berkeley CHAI), listing his publications on chain-of-thought monitorability, jailbreak evaluation, and alignment frameworks.

Metadata

Importance: 62/100 · homepage

Summary

Scott Emmons is an AI safety researcher at Anthropic whose work spans chain-of-thought monitorability, latent-space defenses, jailbreak benchmarking (StrongREJECT), and assistance game theory. His research both develops alignment frameworks and stress-tests their limits. He also cofounded FAR.AI, a nonprofit advancing trustworthy AI.

Key Points

  • Established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment.
  • Designed practical metrics for model developers to preserve chain-of-thought monitorability.
  • Showed that obfuscated activations can bypass latent-space defenses (ICLR 2026).
  • Developed StrongREJECT, a jailbreak benchmark adopted by OpenAI, US/UK AISI, Amazon, and others.
  • Cofounded FAR.AI, a research and education nonprofit focused on trustworthy and secure AI.

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 8 KB
Scott Emmons

 

 I research AI safety and alignment at Anthropic. Before that, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. At that time, I cofounded FAR.AI, a research and education nonprofit advancing the global field of trustworthy and secure AI.

I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics for model developers to preserve chain-of-thought monitorability, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.

 
 Curriculum Vitae 

 
scott at scottemmons dot com

 Publications

 
 Obfuscated Activations Bypass LLM Latent-Space Defenses 

Luke Bailey*, Alex Serrano*, Abhay Sheshadri*, Mikhail Seleznyov*, Jordan Taylor*, Erik Jenner*, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons

 ICLR, 2026
 

 [ project page ] [ code ] [ arXiv ] [ BibTeX ]

 Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies 

 Miles Brundage, Noemi Dreksler, Aidan Homewood, Sean McGregor, ..., Scott Emmons, ..., and Ryan Tovcimak

 arXiv, 2026
 

 [ project page ] [ arXiv ] [ BibTeX ]
 

 Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors 

Max McGuinness*, Alex Serrano*, Luke Bailey, and Scott Emmons

 arXiv, 2025
 

 [ project page ] [ arXiv ] [ BibTeX ]
 

 A Pragmatic Way to Measure Chain-of-Thought Monitorability 

Scott Emmons*, Roland S. Zimmermann*, David K. Elson, and Rohin Shah

 arXiv, 2025
 

 [ arXiv ] [ BibTeX ]
 

 Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety 

Tomek Korbak*, Mikita Balesni*, ..., Scott Emmons†, ..., Bowen Baker‡, Rohin Shah‡, and Vlad Mikulik‡

 arXiv, 2025
 

 [ arXiv ] [ BibTeX ]
 

 When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors 

 Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah

 arXiv, 2025
 

 [ arXiv ] [ BibTeX ]
 

 An Approach to Technical AGI Safety and Security 

Rohin Shah, ..., Scott Emmons*, ..., and Anca Dragan

 arXiv, 2025
 

 [ arXiv ] [ BibTeX ]
 

 Observation Interference in Partially Observable Assistance Games 

Scott Emmons*, Caspar Oesterheld*, Vincent Conitzer, and Stuart Russell

 International Conference on Machine Learning, 2025
 

 [ arXiv ] [ BibTeX ]

 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models 

 Rylan Schaeffer, Dan Valentine, Luke Bailey, James Ch

... (truncated, 8 KB total)
Resource ID: ecd0007ec1499477 | Stable ID: sid_aGPOAETiiA