Ethan Perez – Personal Homepage

web

Personal homepage of Ethan Perez, adversarial robustness team lead at Anthropic, whose research on sleeper agents, AI debate, RAG, and agentic misalignment is directly relevant to AI safety.

Metadata

Importance: 72/100homepage

Summary

This is the personal homepage of Ethan Perez, a leading AI safety researcher at Anthropic. It showcases his research portfolio including work on sleeper agents, retrieval-augmented generation, LLM debate for truthfulness, agentic misalignment, and chain-of-thought monitorability. His work spans adversarial robustness, alignment, and red-teaming of frontier AI systems.

Key Points

•Leads the adversarial robustness team at Anthropic, focused on reducing existential risks from AI systems.
•Co-developed Retrieval-Augmented Generation (RAG) and demonstrated that SOTA safety training fails against sleeper agents.
•Received ICML 2024 best paper award for showing that debating with more persuasive LLMs leads to more truthful answers.
•Recent research includes agentic misalignment (insider threat behaviors in LLMs), cipher attack defenses, and CoT monitorability.
•PhD from NYU funded by NSF and Open Philanthropy; previously at DeepMind, FAIR, MILA, and Google.

Cached Content Preview

HTTP 200Fetched Apr 9, 202641 KB

Ethan Perez 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 

 
 I lead the adversarial robustness team at Anthropic, where I’m hoping to reduce existential risks from AI systems. I helped to develop Retrieval-Augmented Generation (RAG) , a widely used approach for augmenting large language models with other sources of information. I also helped to demonstrate that state-of-the-art AI safety training techniques do not ensure safety against sleeper agents . I received a best paper award at ICML 2024 for my work showing that debating with more persuasive LLMs leads to more truthful answers .

 I received my PhD from NYU under the supervision of Kyunghyun Cho and Douwe Kiela and funded by NSF and Open Philanthropy. Previously, I’ve spent time at DeepMind, Facebook AI Research, Montreal Institute for Learning Algorithms, and Google. I was also named one of Forbes’s 30 Under 30 in AI .

 Email /
 Google Scholar /
 GitHub /
 Twitter /
 CV 

 
 

 

 Research

 
 
 
 
 
 Agentic Misalignment: How LLMs Could Be Insider Threats Aengus Lynch , Benjamin Wright , Caleb Larson , Stuart J. Ritchie , Soren Mindermann , Ethan Perez + , Kevin K. Troy + , Evan Hubinger + 
 
 
 arXiv 2025
 
 
 Blog Post 
 
 

This paper stress-tests 16 leading AI models in simulated corporate environments to uncover agentic misalignment—situations where models act against their organization’s interests to preserve themselves or achieve goals. The study finds that some models engaged in malicious insider behaviors, such as blackmail or data leaks, highlighting the need for stronger oversight, transparency, and safety research before deploying autonomous AI systems.

 
 
 

 
 
 
 
 
 
 Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks Jack Youstra , Mohammed Mahfoud , Yang Yan , Henry Sleight , Ethan Perez , Mrinank Sharma 
 
 
 arXiv 2025
 
 
 Code 
 
 

This paper introduces **CIFR (Cipher Fine-tuning Robustness)**, a benchmark designed to evaluate defenses against adversarial fine-tuning attacks that hide harmful content in encoded data. Using CIFR, the authors show that probe monitors trained on model activations can detect such attacks with over 99% accuracy and generalize well to unseen cipher types, strengthening safety for fine-tuning APIs.

 
 
 

 
 
 
 
 
 
 Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety Tomek Korbak* , Mikita Balesni* , Elizabeth Barnes , Yoshua Bengio , Joe Benton , Joseph Bloom , Mark Chen , Alan Cooney , Allan Dafoe , Anca Dragan , 
 Scott Emmons, 
 
 Owain Evans, 
 
 David Farhi, 
 
 Ryan Greenblatt, 
 
 Dan Hendrycks, 
 
 Marius Hobbhahn, 
 
 Evan Hubinger, 
 
 Geoffrey Irving, 
 
 Erik Jenner, 
 
 Daniel Kokotajlo, 
 
 Victoria Krakovna, 
 
 Shane Legg, 
 
 David Lindner, 
 
 David Luan, 
 
 Aleksander Mądry, 
 
 Julian Michael, 
 
 Neel Nanda, 
 
 Dave Orr, 
 
 Jakub Pachocki, 
 
 Ethan Perez , 
 
 Mary Phuong, 
 
 Fabien Roger, 
 
 Joshua Saxe, 
 
 Buck Shlegeris, 
 
 Martín Soto, 
 
 Eric Steinberger, 
 
 Jasmine 

... (truncated, 41 KB total)

Resource ID: cf25fb0645359cf2 | Stable ID: sid_JMbxww3Cnw