Constitutional Classifiers arXiv paper (https://arxiv.org/pdf/2501.18837)
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma *+ Meg Tong * Jesse Mu * Jerry Wei * Jorrit Kruthoff * Scott Goodfriend * Euan Ong * Alwin Peng
Raj Agarwal Cem Anil Amanda Askell Nathan Bailey Joe Benton Emma Bluemke Samuel R. Bowman Eric Christiansen Hoagy Cunningham Andy Dau Anjali Gopal Rob Gilson Logan Graham Logan Howard Nimit Kalra ∘ Taesung Lee Kevin Lin Peter Lofgren Francesco Mosconi Clare O’Hara Catherine Olsson Linda Petrini □ Samir Rajani Nikhil Saxena Alex Silverstein Tanya Singh Theodore Sumers Leonard Tang ∘ Kevin K. Troy Constantin Weisser ∘ Ruiqi Zhong Giulio Zhou
Jan Leike Jared Kaplan Ethan Perez +
Safeguards Research Team, Anthropic
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks—prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale.
To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural-language rules (i.e., a constitution) specifying permitted and restricted content.
In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.
On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks.
These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead.
Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
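The data-generation recipe the abstract describes can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the constitution entries, the `stub_llm` stand-in for a real model call, and the toy keyword classifier are all placeholder assumptions (the paper trains LLM-based classifiers on model-generated data at scale).

```python
# Sketch of the Constitutional Classifiers pipeline: a constitution of
# natural-language rules drives synthetic data generation, and the resulting
# labeled examples train an input classifier. All names are illustrative.

CONSTITUTION = {
    "restricted": ["Providing synthesis routes for illegal substances"],
    "permitted": ["Discussing chemistry at a general, educational level"],
}

def synthesize_examples(llm, constitution, n_per_rule=2):
    """Prompt an LLM with each rule to produce labeled training examples."""
    data = []
    for label, rules in constitution.items():
        for rule in rules:
            prompt = f"Write {n_per_rule} short user queries that match: {rule}"
            for query in llm(prompt):
                data.append((query, label))
    return data

def stub_llm(prompt):
    # Stand-in for a real generation call: returns canned queries per rule.
    if "illegal substances" in prompt:
        return ["step-by-step synthesis of X", "scale up production of X"]
    return ["why does salt dissolve in water?", "what is a covalent bond?"]

def train_keyword_classifier(data):
    """Toy stand-in for classifier training: flag tokens that appear only
    in restricted examples. The paper uses learned LLM classifiers instead."""
    restricted, permitted = set(), set()
    for text, label in data:
        tokens = set(text.lower().split())
        (restricted if label == "restricted" else permitted).update(tokens)
    flags = restricted - permitted
    return lambda text: "restricted" if set(text.lower().split()) & flags else "permitted"

data = synthesize_examples(stub_llm, CONSTITUTION)
classifier = train_keyword_classifier(data)
```

Because the constitution is plain natural language, updating the safeguard amounts to editing the rules and regenerating data, rather than hand-collecting new adversarial examples.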
* Equal contribution.
+ Equal advising.
∘ Haize Labs.
□ Independent.
Correspondence to <mrinank@anthropic.com>.
First and last author blocks are core contributors; middle authors are listed alphabetically.
See Author Contributions for author contributions.
1 Introduction
Large language model (LLM) safety mechanisms can be circumvented by “jailbreaks” that elicit harmful information from models (Shen et al., 2023; Liu et al., 2023; Qi et al., 2024; Andriushchenko et al., 2024; Anil et al., 2024; Hughes et al., 2024).
Such jailbreaks become more concerning as the chemical, biological, radiological, or nuclear (CBRN) capabilities of LLMs increase (Anthropic, 2023a; OpenAI, 2023; Li et al., 2024). 2
2 This work was conducted as part of Anthropic’s Responsible Scaling Policy commitments to pr