Longterm Wiki

Universal and Transferable Adversarial Attacks on Aligned Language Models

web
llm-attacks.org/

Foundational 2023 paper by Zou et al. (Carnegie Mellon/Center for AI Safety) that sparked widespread concern about the robustness of RLHF-based alignment; frequently cited in red-teaming and adversarial ML literature.

Metadata

Importance: 88/100 · tool page · primary source

Summary

This research demonstrates that adversarial suffixes can be automatically generated to reliably jailbreak aligned LLMs, causing them to produce harmful content. The attacks are both universal (work across many prompts) and transferable (work across different models including closed-source ones like ChatGPT and Claude). This work, known as the GCG (Greedy Coordinate Gradient) attack, represents a significant challenge to alignment via RLHF-style fine-tuning.

Key Points

  • Introduces Greedy Coordinate Gradient (GCG) method to automatically generate adversarial suffixes that bypass safety training in LLMs
  • Attacks are universal: a single suffix can jailbreak a wide variety of harmful prompts across diverse categories
  • Attacks transfer from open-source models (Vicuna, LLaMA-2) to closed-source chatbots such as ChatGPT (GPT-3.5 and GPT-4), Bard, and Claude
  • Challenges the assumption that RLHF alignment robustly prevents harmful outputs, suggesting current safety methods may be fundamentally brittle
  • Demonstrates that aligned models remain vulnerable to automated, scalable adversarial attacks without human-crafted jailbreaks
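To make the GCG idea concrete, here is a toy, self-contained sketch of one greedy coordinate step. The real attack backpropagates through the target model to score single-token swaps and minimizes the negative log-likelihood of an affirmative response; here the embedding table `E`, the `target` vector, and the quadratic surrogate loss are all illustrative inventions that merely preserve the shape of the algorithm (gradient-guided candidate selection, then exact evaluation of the candidates).

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, L = 50, 8, 5                 # toy vocab size, embedding dim, suffix length
E = rng.normal(size=(V, D))        # toy embedding table (stand-in for the model's)
target = rng.normal(size=D)        # stand-in direction that lowers the attack loss

def loss(tokens):
    # Toy surrogate: squared distance between the mean suffix embedding
    # and the target direction (the real loss is the model's NLL).
    return float(np.sum((E[tokens].mean(axis=0) - target) ** 2))

def gcg_step(tokens, k=8):
    # The gradient of the loss w.r.t. each position's one-hot token
    # indicator is E @ g, where g = d(loss)/d(mean embedding) / L.
    # More-negative entries mark swaps the linearization predicts will help.
    g = 2.0 * (E[tokens].mean(axis=0) - target) / L
    onehot_grad = E @ g                            # shape (V,)
    best, best_loss = tokens.copy(), loss(tokens)
    for i in range(L):
        for cand in np.argsort(onehot_grad)[:k]:   # top-k candidate swaps
            trial = tokens.copy()
            trial[i] = cand
            trial_loss = loss(trial)               # exact re-evaluation
            if trial_loss < best_loss:
                best, best_loss = trial, trial_loss
    return best, best_loss

tokens = rng.integers(0, V, size=L)   # random initial "suffix"
initial = loss(tokens)
for _ in range(20):
    tokens, current = gcg_step(tokens)
```

Because each step only accepts swaps that strictly lower the exact loss, the loop can never make things worse; the gradient is used purely to prune the candidate set, which is what makes the method cheap enough to run over thousands of iterations.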

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 5 KB
 Universal and Transferable Adversarial Attacks on Aligned Language Models

 
Andy Zou¹, Zifan Wang², Nicholas Carlini³, Milad Nasr³, J. Zico Kolter¹,⁴, Matt Fredrikson¹

¹Carnegie Mellon University, ²Center for AI Safety, ³Google DeepMind, ⁴Bosch Center for AI
Overview of Research: Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning to not produce harmful content in their responses to user questions. Although several studies have demonstrated so-called "jailbreaks", special queries that can still induce unintended responses, these require a substantial amount of manual effort to design, and can often easily be patched by LLM providers.

 
This work studies the safety of such models in a more systematic fashion. We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open-source LLMs (where we can use the network weights to aid in choosing the precise characters that maximize the probability of the LLM providing an "unfiltered" answer to the user's request), we find that the strings transfer to many closed-source, publicly available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.
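The quantity being maximized in such attacks is typically the probability of an affirmative target prefix (e.g. "Sure, here is ..."): once the model is forced to begin complying, it tends to continue. A minimal sketch of that objective, assuming per-position next-token logits are available (the array shapes and token ids below are hypothetical, not the paper's actual tokenization):

```python
import numpy as np

def target_nll(logits, target_ids):
    # Negative log-likelihood of the target tokens, where row t of `logits`
    # holds the model's next-token logits at target position t.  The attack
    # searches for a suffix that drives this value down.
    m = logits.max(axis=1, keepdims=True)                      # stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -float(logp[np.arange(len(target_ids)), target_ids].sum())

# Sanity check: uniform logits over a 4-token vocab cost log(4) nats per token.
uniform = np.zeros((2, 4))
nll = target_nll(uniform, [0, 3])
```

For white-box models this loss is differentiable through the suffix embeddings, which is exactly what lets the search use gradients; for closed-source APIs the suffix found on open models is simply transferred.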
 

Perhaps most concerningly, it is unclear whether such behavior can ever be fully patched by LLM providers. Analogous adversarial attacks have proven to be a very difficult problem to address in computer vision for the past 10 years. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe that these considerations should be taken into account as we increase usage and reliance on such AI models.

Examples

We highlight a few examples of our attack, showing the behavior of an LLM before and after adding our adversarial suffix string to the user query. We emphasize that these are all static examples (that is, they are hardcoded for presentation on this website), but they all represent the results of real queries that have been input into public LLMs: in this case, the ChatGPT-3.5-Turbo model (accessed via the API, so behavior may differ slightly from the public webpage). Note that these instances were chosen because they demonstrate potentials of the negative behavio

... (truncated, 5 KB total)