[2403.03218] The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Relevant for teams evaluating and mitigating hazardous capabilities in deployed LLMs; WMDP provides an open proxy benchmark for hazardous knowledge in biosecurity, cybersecurity, and chemical security, and the accompanying RMU method offers a way to unlearn that knowledge while preserving general capabilities.
Metadata
Abstract
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
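The abstract describes RMU as an unlearning method based on controlling model representations. As a rough illustration of that idea (not the authors' released implementation), the sketch below pushes a chosen layer's activations on forget-set text toward a fixed random direction while anchoring activations on retain-set text to a frozen copy of the model. The model name, layer index, coefficients, and the choice to update all parameters (the paper restricts updates to a small set of layers) are assumptions made purely for illustration.

```python
# Minimal RMU-style unlearning step (hypothetical sketch, not the official WMDP code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # small model purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)          # model being unlearned
frozen = AutoModelForCausalLM.from_pretrained(model_name).eval()  # frozen reference copy
for p in frozen.parameters():
    p.requires_grad_(False)

LAYER = 5          # hidden layer whose activations are steered (illustrative choice)
STEER_COEF = 20.0  # scale of the random "misdirection" target (illustrative)
RETAIN_ALPHA = 100.0

# Fixed random unit vector: forget-set activations are pushed toward STEER_COEF * u.
u = torch.rand(model.config.hidden_size)
u = u / u.norm()

# For simplicity this sketch updates all parameters; the paper updates only a few layers.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def layer_acts(m, texts):
    """Return hidden states of LAYER for a batch of texts."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = m(**batch, output_hidden_states=True)
    return out.hidden_states[LAYER]

def rmu_step(forget_texts, retain_texts):
    optimizer.zero_grad()
    # Forget loss: steer activations on hazardous text toward the random direction.
    acts_f = layer_acts(model, forget_texts)
    forget_loss = F.mse_loss(acts_f, STEER_COEF * u.expand_as(acts_f))
    # Retain loss: keep activations on benign text close to the frozen model's.
    acts_r = layer_acts(model, retain_texts)
    with torch.no_grad():
        acts_r_frozen = layer_acts(frozen, retain_texts)
    retain_loss = F.mse_loss(acts_r, acts_r_frozen)
    loss = forget_loss + RETAIN_ALPHA * retain_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

The intended effect is that the forget loss scrambles the features downstream layers rely on for hazardous text, while the retain loss anchors behavior on benign text to the original model, which is how RMU can reduce WMDP accuracy without collapsing general capability.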
Summary
The WMDP benchmark publicly releases 3,668 multiple-choice questions as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security, filling a gap left by private, narrowly scoped hazard evaluations. The paper also introduces RMU, an unlearning method based on controlling model representations that reduces WMDP performance while preserving general capabilities, positioning unlearning as a concrete path to reducing malicious use of LLMs.
Key Points
- Releases the Weapons of Mass Destruction Proxy (WMDP) benchmark: 3,668 multiple-choice questions serving as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security
- Developed by a consortium of academics and technical consultants, with sensitive information stringently filtered out prior to public release
- Serves both as an evaluation of hazardous knowledge in LLMs and as a benchmark for unlearning methods (a minimal evaluation sketch follows this list)
- Introduces RMU, a state-of-the-art unlearning method based on controlling model representations
- RMU reduces performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting unlearning as a concrete path toward reducing malicious use
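As referenced in the list above, here is a minimal sketch of how an LLM can be scored on WMDP-style multiple-choice questions. The released benchmark is normally run through standard evaluation harnesses; this simplified stand-in, with an illustrative model name and prompt format, picks the answer choice with the highest log-likelihood under the model.

```python
# Hypothetical zero-shot multiple-choice scoring sketch (not the official evaluation harness).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # small model purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Row p-1 of log_probs predicts the token at position p, so score only the choice tokens.
    choice_rows = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    choice_tokens = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[row, tok].item() for row, tok in zip(choice_rows, choice_tokens))

def answer_mcq(question: str, choices: list[str]) -> int:
    """Return the index of the highest-scoring answer choice."""
    prompt = f"{question}\nAnswer: "
    scores = [choice_logprob(prompt, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

Accuracy over the benchmark is then the fraction of questions where the selected index matches the keyed answer; a successful unlearning run should drive this toward chance on WMDP while leaving scores on general benchmarks largely intact.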
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Capability Unlearning / Removal | Approach | 65.0 |
1 FactBase fact citing this source
Cached Content Preview
[2403.03218] The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li∗ (Center for AI Safety; University of California, Berkeley)
Alexander Pan∗ (University of California, Berkeley)
Anjali Gopal† (Massachusetts Institute of Technology; SecureBio)
Summer Yue† (Scale AI)
Daniel Berrios† (Scale AI)
Alice Gatti‡ (Center for AI Safety)
Justin D. Li‡ (Center for AI Safety; New York University)
Ann-Kathrin Dombrowski‡ (Center for AI Safety)
Shashwat Goel‡ (Center for AI Safety; IIIT Hyderabad)
Long Phan (Center for AI Safety)
Gabriel Mukobi (Stanford University)
Nathan Helm-Burger (SecureBio)
Rassin Lababidi (SecureBio)
Lennart Justen (Massachusetts Institute of Technology; SecureBio)
Andrew B. Liu (SecureBio; Harvard University)
Michael Chen (Center for AI Safety)
Isabelle Barrass (Center for AI Safety)
Oliver Zhang (Center for AI Safety)
Xiaoyuan Zhu (University of Southern California)
Rishub Tamirisa (University of Illinois Urbana-Champaign; Lapis Labs)
Bhrugu Bharathi (University of California, Los Angeles; Lapis Labs)
Adam Khoja (Center for AI Safety; University of California, Berkeley)
Zhenqi Zhao (California Institute of Technology)
Ariel Herbert-Voss (Harvard University; Sybil)
Cort B. Breuer (Stanford University)
Andy Zou (Center for AI Safety; Carnegie Mellon University)
Mantas Mazeika (Center for AI Safety; University of Illinois Urbana-Champaign)
Zifan Wang (Center for AI Safety)
Palash Oswal (Carnegie Mellon University)
Weiran Liu (Carnegie Mellon University)
Adam A. Hunt (Carnegie Mellon University)
Justin Tienken-Harder (Sybil)
Kevin Y. Shih (Stanford University)
Kemper Talley (RTX BBN Technologies)
John Guan (University of California, Berkeley)
Russell Kaplan (Scale AI)
Ian Steneker (Scale AI)
David Campbell (Scale AI)
Brad Jokubaitis (Scale AI)
Alex Levinson (Scale AI)
Jean Wang (Scale AI)
William Qian (Scale AI)
Kallol Krishna Karmakar (University of Newcastle)
Steven Basart (Center for AI Safety)
Stephen Fitz (Keio University)
Mindy Levine (Ariel University)
Ponnurangam Kumaraguru (IIIT Hyderabad)
Uday Tupakula (University of Newcastle)
Vijay Varadharajan (University of Newcastle)
Yan
... (truncated, 98 KB total)
kb-59b27799c5de97c1 | Stable ID: sid_S1NsKriRCg