
JailbreakBench: LLM robustness benchmark

web · jailbreakbench.github.io/

JailbreakBench is a key infrastructure resource for researchers studying LLM robustness, offering a common evaluation framework to compare jailbreak methods and defenses in a reproducible way.

Metadata

Importance: 68/100 · tool page

Summary

JailbreakBench provides a standardized, centralized benchmark for evaluating LLM robustness against jailbreak attacks. It includes a curated repository of attack artifacts, a consistent evaluation framework, and public leaderboards to enable reproducible comparison of attack and defense methods.

Key Points

  • Provides a standardized benchmark to systematically compare jailbreak attacks and defenses across LLMs, enabling reproducible research.
  • Includes a centralized artifact repository storing prompts, attack configurations, and model outputs for transparency and reuse.
  • Features public leaderboards tracking attack success rates and defense robustness across multiple models.
  • Addresses the lack of consistent evaluation methodology in jailbreak research, reducing fragmentation in the field.
  • Supports both red-teaming and safety evaluation use cases by separating attack and defense performance metrics.

Review

JailbreakBench addresses critical challenges in evaluating large language model (LLM) robustness against jailbreak attacks by creating a unified, reproducible benchmarking platform. The project tackles key limitations in existing research, such as inconsistent evaluation methods, lack of standardization, and reproducibility issues, by providing a comprehensive ecosystem that includes a repository of adversarial prompts, a standardized evaluation framework, and public leaderboards. The benchmark's significance lies in its holistic approach to LLM safety research: it offers a dataset of 100 distinct misuse behaviors across ten categories, complemented by 100 benign behaviors for comprehensive testing. By creating a transparent, collaborative platform, JailbreakBench enables researchers to systematically track progress in detecting and mitigating potential LLM vulnerabilities, ultimately contributing to the development of more robust and ethically aligned AI systems.

Cited by 3 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 5 KB

JailbreakBench aims to systematically track the progress of jailbreaking attacks and defenses on frontier LLMs. We strive to be as reproducible and open as possible by providing jailbreak artifacts that include state-of-the-art jailbreaking prompts (see our library for more details).
 
 Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise
unwanted content. Evaluating these attacks presents a number of challenges, and the current
landscape of benchmarks and evaluation techniques is fragmented. First, assessing whether LLM
responses are indeed harmful requires open-ended evaluations which are not yet standardized.
Second, existing works compute attacker costs and success rates in incomparable ways. Third,
some works lack reproducibility as they withhold adversarial prompts or code, and rely on changing
proprietary APIs for evaluation. Consequently, navigating the current literature and tracking
progress can be challenging.

To address this, we introduce JailbreakBench, a centralized benchmark with the following components:
 
 Repository of jailbreak artifacts. An evolving dataset of state-of-the-art adversarial prompts at https://github.com/JailbreakBench/artifacts, referred to as jailbreak artifacts, which are explicitly required for submissions to our benchmark to ensure reproducibility.

 Standardized evaluation framework. Our library at https://github.com/JailbreakBench/jailbreakbench includes a clearly defined threat model, system prompts, chat templates, and scoring functions.

 Leaderboard. Our leaderboards at https://jailbreakbench.github.io/ track the performance of attacks and defenses for various LLMs.

 Dataset. A representative dataset named JBB-Behaviors at https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors, composed of 100 distinct misuse behaviors (55% original examples, the rest sourced from AdvBench and TDC/HarmBench) divided into ten broad categories corresponding to OpenAI's usage policies. It is now complemented with 100 benign behaviors that can be used to quickly evaluate over-refusal rates for new models and defenses (see the usage sketch after this list).
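
 For orientation, here is a minimal sketch of how these components are typically consumed, assuming the Hugging Face datasets package for JBB-Behaviors and the jailbreakbench Python library for artifacts. The configuration name "behaviors", the read_artifact arguments, and the jailbreaks field follow the project's published examples, but should be checked against the current documentation.

```python
# Minimal sketch of typical JailbreakBench usage; the dataset configuration,
# split layout, and artifact fields are assumptions to verify against the docs.
from datasets import load_dataset
import jailbreakbench as jbb  # pip install jailbreakbench

# Load JBB-Behaviors (100 misuse behaviors plus 100 benign behaviors).
# "behaviors" is assumed to be the relevant dataset configuration name.
behaviors = load_dataset("JailbreakBench/JBB-Behaviors", "behaviors")
print(behaviors)  # inspect the available splits and columns

# Read the jailbreak artifacts (adversarial prompts and model responses)
# submitted for a given attack method and target model; "PAIR" and
# "vicuna-13b-v1.5" are illustrative method/model names.
artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")
print(artifact.jailbreaks[0])  # first entry: prompt, response, and metadata
```

 The same library also exposes the standardized evaluation pipeline (threat model, system prompts, chat templates, and scoring functions) used to produce leaderboard submissions.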

 
We have carefully considered the potential ethical implications of releasing this benchmark,
and believe that it will be a net positive for the community. Our jailbreak artifacts can expedite
safety training for future models. Over time, we will expand and adapt the benchmark to reflect
technical and methodological advances in the research community.
 
 

 
 
 
 Available Leaderboards

 Leaderboard: Open-Source Models

 Leaderboard: Closed-Source Models
 
 
 
 
 Contribute to JailbreakBench

 
 We welcome contributions in terms of both new attacks and defenses.
... (truncated, 5 KB total)
Resource ID: f302ae7c0bac3d3f | Stable ID: sid_wQFEnWB7EH