Back
JailbreakBench: LLM robustness benchmark
webjailbreakbench.github.io·jailbreakbench.github.io/
Data Status
Full text fetchedFetched Dec 28, 2025
Summary
JailbreakBench introduces a centralized benchmark for assessing LLM robustness against jailbreak attacks, including a repository of artifacts, evaluation framework, and leaderboards.
Key Points
- •Provides a centralized, reproducible benchmark for LLM jailbreak attacks and defenses
- •Offers standardized evaluation methods and a comprehensive dataset of misuse behaviors
- •Enables transparent tracking of LLM robustness across open-source and closed-source models
Review
JailbreakBench addresses critical challenges in evaluating large language model (LLM) robustness against jailbreak attacks by creating a unified, reproducible benchmarking platform. The project tackles key limitations in existing research, such as inconsistent evaluation methods, lack of standardization, and reproducibility issues by providing a comprehensive ecosystem that includes a repository of adversarial prompts, a standardized evaluation framework, and public leaderboards.
The benchmark's significance lies in its holistic approach to LLM safety research, offering a dataset of 100 distinct misuse behaviors across ten categories, complemented by 100 benign behaviors for comprehensive testing. By creating a transparent, collaborative platform, JailbreakBench enables researchers to systematically track progress in detecting and mitigating potential LLM vulnerabilities, ultimately contributing to the development of more robust and ethically aligned AI systems.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Circuit Breakers / Inference Interventions | Approach | 64.0 |
| AI Output Filtering | Approach | 63.0 |
| Refusal Training | Approach | 63.0 |
Resource ID:
f302ae7c0bac3d3f | Stable ID: MDE5ZTQ5OD