Longterm Wiki

Constitutional Classifiers arXiv paper (https://arxiv.org/pdf/2501.18837)


Data Status

Not fetched

Cited by 1 page

Page | Type | Quality
Adversarial Training | Approach | 58.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
[2501.18837] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming 
 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

 Mrinank Sharma *+  Meg Tong *  Jesse Mu *  Jerry Wei *  Jorrit Kruthoff *  Scott Goodfriend *  Euan Ong *  Alwin Peng 
 Raj Agarwal  Cem Anil  Amanda Askell  Nathan Bailey  Joe Benton  Emma Bluemke  Samuel R. Bowman  Eric Christiansen  Hoagy Cunningham  Andy Dau  Anjali Gopal  Rob Gilson  Logan Graham  Logan Howard  Nimit Kalra ∘  Taesung Lee  Kevin Lin  Peter Lofgren  Francesco Mosconi  Clare O’Hara  Catherine Olsson  Linda Petrini □  Samir Rajani  Nikhil Saxena  Alex Silverstein  Tanya Singh  Theodore Sumers  Leonard Tang ∘  Kevin K. Troy  Constantin Weisser ∘  Ruiqi Zhong  Giulio Zhou 
 Jan Leike  Jared Kaplan  Ethan Perez + 
 
 Safeguards Research Team, Anthropic 
 Abstract

 Large language models (LLMs) are vulnerable to universal jailbreaks—prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale.
To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural-language rules (i.e., a constitution) specifying permitted and restricted content.
In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.
On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks.
These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead.
Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
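The pipeline the abstract describes — prompt an LLM with constitution rules to generate labeled synthetic examples, then train a classifier on them — can be illustrated with a toy sketch. Everything here is a stand-in: `fake_llm_generate` replaces a real LLM call, the example strings are placeholders, and the tiny naive-Bayes scorer substitutes for the paper's learned classifiers.

```python
# Toy sketch of the Constitutional Classifiers pipeline (illustrative only;
# the paper uses real LLM-generated synthetic data and learned classifiers).
from collections import Counter
import math

# A "constitution": natural-language rules for permitted vs. restricted content.
CONSTITUTION = {
    "restricted": "Operational detail on producing dangerous substances.",
    "permitted": "General science education and benign everyday questions.",
}

def fake_llm_generate(rule_label: str, n: int):
    """Stand-in (hypothetical) for prompting an LLM with a constitution rule
    to emit n synthetic examples matching that rule."""
    restricted = [
        "step by step synthesis of nerve agent",
        "how to purify precursor for explosive at scale",
        "detailed instructions to manufacture illegal substance",
    ]
    permitted = [
        "what is the boiling point of water",
        "explain how photosynthesis works",
        "recipe for sourdough bread",
    ]
    pool = restricted if rule_label == "restricted" else permitted
    return [pool[i % len(pool)] for i in range(n)]

def train_nb(examples):
    """Train a tiny naive-Bayes-style scorer over bag-of-words counts."""
    counts = {"restricted": Counter(), "permitted": Counter()}
    for label, text in examples:
        counts[label].update(text.split())
    return counts

def score(counts, text):
    """Return the log-odds that `text` is restricted (Laplace-smoothed)."""
    logodds = 0.0
    for w in text.split():
        logodds += math.log((counts["restricted"][w] + 1)
                            / (counts["permitted"][w] + 1))
    return logodds

# Generate synthetic data under each constitution rule, then train.
data = [("restricted", t) for t in fake_llm_generate("restricted", 6)]
data += [("permitted", t) for t in fake_llm_generate("permitted", 6)]
model = train_nb(data)
```

In deployment, such a classifier would run alongside the guarded model, flagging restricted inputs or outputs; the paper's contribution is making this robust to universal jailbreaks at acceptable refusal and inference cost.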

 
 * Equal contribution.
 + Equal advising.
 ∘ Haize Labs.
 □ Independent.
Correspondence to <mrinank@anthropic.com>.
First and last author blocks are core contributors; middle authors are listed alphabetically.
See Author Contributions for author contributions.
 
 
 
 1 Introduction

 
 Large language model (LLM) safety mechanisms can be circumvented by “jailbreaks” that elicit harmful information from models (Shen et al., 2023; Liu et al., 2023; Qi et al., 2024; Andriushchenko et al., 2024; Anil et al., 2024; Hughes et al., 2024).
Such jailbreaks become more concerning as the chemical, biological, radiological, or nuclear (CBRN) capabilities of LLMs increase (Anthropic, 2023a; OpenAI, 2023; Li et al., 2024).² This work was conducted as part of Anthropic’s Responsible Scaling Policy commitments to pr

... (truncated, 98 KB total)
Resource ID: 2d454deae01c7a1e | Stable ID: OWI5NTE1OT