Scott Emmons – AI Safety Researcher
scottemmons.com
Personal homepage of Scott Emmons, an AI safety researcher at Anthropic (formerly Google DeepMind and UC Berkeley CHAI), listing his publications on chain-of-thought monitorability, jailbreak evaluation, and alignment frameworks.
Metadata
Importance: 62/100 · homepage
Summary
Scott Emmons is an AI safety researcher at Anthropic whose work spans chain-of-thought monitorability, latent-space defenses, jailbreak benchmarking (StrongREJECT), and assistance game theory. His research both develops alignment frameworks and stress-tests their limits. He also cofounded FAR.AI, a nonprofit advancing trustworthy AI.
Key Points
- Established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment (see the sketch after this list).
- Designed practical metrics for model developers to preserve chain-of-thought monitorability.
- Showed that obfuscated activations can bypass latent-space defenses (ICLR 2026).
- Developed StrongREJECT, a jailbreak benchmark adopted by OpenAI, US/UK AISI, Amazon, and others.
- Cofounded FAR.AI, a research and education nonprofit focused on trustworthy and secure AI.
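To make the first two points concrete, here is a minimal sketch of what a chain-of-thought monitor does: a second system reads the model's reasoning trace and flags signs of misaligned intent before the final answer is trusted. The function name, patterns, and escalation step are illustrative assumptions, not code from any of the papers listed here; real monitors use trained classifiers or LLM judges rather than string matching.

```python
# Hedged sketch of a chain-of-thought (CoT) monitor. Everything here is an
# illustrative assumption: real CoT monitors use trained classifiers or LLM
# judges, not keyword matching.

SUSPICIOUS_PATTERNS = [
    "hide this from the user",
    "bypass the safety check",
    "pretend to comply",
]

def flag_suspicious_cot(reasoning_trace: str) -> bool:
    """Return True if the reasoning trace matches any red-flag pattern."""
    trace = reasoning_trace.lower()
    return any(pattern in trace for pattern in SUSPICIOUS_PATTERNS)

if __name__ == "__main__":
    trace = "I'll bypass the safety check first, then answer normally."
    print(flag_suspicious_cot(trace))  # True -> escalate for human review
```

The "necessary" qualifier in the first point is doing the work: the defense is strongest on tasks hard enough that the model must externalize its reasoning to succeed, which is the setting studied in the monitoring papers listed below.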
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 8 KB
Scott Emmons
I research AI safety and alignment at Anthropic. Before that, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. At that time, I cofounded FAR.AI, a research and education nonprofit advancing the global field of trustworthy and secure AI.
I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics for model developers to preserve chain-of-thought monitorability, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.
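As a rough picture of the latent-space defenses mentioned above, the toy Python below fits a linear probe that separates "harmful" from "benign" hidden activations; the obfuscated-activations finding is that an attacker can induce activations that preserve harmful behavior while landing on the benign side of such a boundary. All data, dimensions, and names here are invented for illustration and do not come from the papers listed below.

```python
import numpy as np

# Toy latent-space defense: a linear probe separating "harmful" from
# "benign" hidden activations. Data, dimension, and probe are all invented
# for illustration.

rng = np.random.default_rng(0)
dim = 64  # hypothetical hidden-state dimension

benign = rng.normal(0.0, 1.0, size=(200, dim))
harmful = rng.normal(1.0, 1.0, size=(200, dim))

# Simplest possible probe: project onto the difference of class means.
w = harmful.mean(axis=0) - benign.mean(axis=0)
threshold = ((harmful @ w).mean() + (benign @ w).mean()) / 2

def probe_flags(activation: np.ndarray) -> bool:
    """Flag an activation as harmful if its projection exceeds the threshold."""
    return float(activation @ w) > threshold

print(probe_flags(harmful[0]), probe_flags(benign[0]))  # typically: True False
```

A mean-difference probe is deliberately simple; the sketch only shows what a latent-space monitor is, not how the bypass works.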
Curriculum Vitae
scott at scottemmons dot com
Publications
Obfuscated Activations Bypass LLM Latent-Space Defenses
Luke Bailey*, Alex Serrano*, Abhay Sheshadri*, Mikhail Seleznyov*, Jordan Taylor*, Erik Jenner*, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons
ICLR, 2026
[project page] [code] [arXiv] [BibTeX]
Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies
Miles Brundage, Noemi Dreksler, Aidan Homewood, Sean McGregor, ..., Scott Emmons, ..., and Ryan Tovcimak
arXiv, 2026
[project page] [arXiv] [BibTeX]
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
Max McGuinness*, Alex Serrano*, Luke Bailey, and Scott Emmons
arXiv, 2025
[project page] [arXiv] [BibTeX]
A Pragmatic Way to Measure Chain-of-Thought Monitorability
Scott Emmons*, Roland S. Zimmermann*, David K. Elson, and Rohin Shah
arXiv, 2025
[arXiv] [BibTeX]
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak*, Mikita Balesni*, ..., Scott Emmons†, ..., Bowen Baker‡, Rohin Shah‡, and Vlad Mikulik‡
arXiv, 2025
[arXiv] [BibTeX]
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah
arXiv, 2025
[arXiv] [BibTeX]
An Approach to Technical AGI Safety and Security
Rohin Shah, ..., Scott Emmons*, ..., and Anca Dragan
arXiv, 2025
[arXiv] [BibTeX]
Observation Interference in Partially Observable Assistance Games
Scott Emmons*, Caspar Oesterheld*, Vincent Conitzer, and Stuart Russell
International Conference on Machine Learning, 2025
[arXiv] [BibTeX]
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Ch
... (truncated, 8 KB total)
Resource ID: ecd0007ec1499477 | Stable ID: sid_aGPOAETiiA