Longterm Wiki

Andriushchenko et al., ICLR 2025 (https://proceedings.iclr.cc/paper_files/paper/2025/file/63fa7efdd3bcf944a4bd6e0ff6a...


Data Status: Not fetched
Cited by 1 page

Page: Alignment Robustness Trajectory Model
Type: Analysis
Quality: 64.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
Published as a conference paper at ICLR 2025
JAILBREAKING LEADING SAFETY-ALIGNED LLMS WITH SIMPLE ADAPTIVE ATTACKS
Maksym Andriushchenko (EPFL) · Francesco Croce (EPFL) · Nicolas Flammarion (EPFL)
ABSTRACT
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then apply random search on a suffix to maximize a target logprob (e.g., of the token “Sure”), potentially with multiple restarts. In this way, we achieve a 100% attack success rate—according to GPT-4 as a judge—on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench, which was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models—which do not expose logprobs—via either a transfer or a prefilling attack, with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens to find trojan strings in poisoned models—a task that shares many similarities with jailbreaking—which is the algorithm that brought us first place in the SaTML’24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
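As a concrete illustration of the random-search procedure the abstract describes, here is a minimal sketch. It is not the authors' implementation (their code is at the repository linked above): it mutates a character-level suffix rather than tokens, and `score` is a hypothetical stand-in for querying the target model's logprob of a target token such as “Sure”.

```python
import random
import string

def random_search(prompt, score, suffix_len=25, iters=1000, restarts=1):
    """Greedy random search over a suffix: mutate one position at a
    time and keep the mutation only if the score does not decrease."""
    vocab = string.ascii_letters + string.digits + " "
    best_suffix, best_score = "", float("-inf")
    for _ in range(restarts):
        # Fresh random suffix per restart.
        suffix = [random.choice(vocab) for _ in range(suffix_len)]
        current = score(prompt + "".join(suffix))
        for _ in range(iters):
            pos = random.randrange(suffix_len)
            old = suffix[pos]
            suffix[pos] = random.choice(vocab)
            candidate = score(prompt + "".join(suffix))
            if candidate >= current:
                current = candidate   # keep the improving mutation
            else:
                suffix[pos] = old     # revert the mutation
        if current > best_score:
            best_score, best_suffix = current, "".join(suffix)
    return best_suffix, best_score

# Toy demonstration with a harmless stand-in objective: the "logprob"
# here simply rewards occurrences of the letter "s" in the text.
toy_score = lambda text: text.count("s")
suffix, s = random_search("benign prompt ", toy_score, iters=300)
print(suffix, s)
```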
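The prefilling attack mentioned for Claude exploits a documented feature of the Anthropic Messages API: the caller may supply the beginning of the assistant's turn, which the model then continues. A minimal sketch of the mechanism with a benign request follows; the model name and client usage reflect the current public API, not the paper's artifacts.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-3-opus-20240229",  # hypothetical model choice
    max_tokens=64,
    messages=[
        {"role": "user", "content": "Name three prime numbers."},
        # Prefill: a trailing assistant message is treated as the start
        # of the model's reply, which the model then continues.
        {"role": "assistant", "content": "Sure, here are three primes:"},
    ],
)
print(response.content[0].text)  # continuation of the prefilled text
```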
1 INTRODUCTION
The remarkable capabilities of Large Language Models (LLMs) carry an inherent risk of misuse, such as producing toxic content, spreading misinformation, or supporting harmful activities. To mitigate these risks, safety alignment or refusal training is commonly employed—a fine-tuning phase in which models are guided to generate responses judged safe by humans and to refuse potentially harmful queries (Bai et al., 2022; Touvron et al., 2023). Although safety alignment is effective in general, several works have shown that it can be circumvented using adversarial prompts. These inputs, specifically designed to induce harmful responses from the model, are known as jailbreak attacks (Mowshowitz, 2022; Zou et al., 2023; Chao et al., 2023).
Jailbreak attacks vary in their knowledge of the target LLM (ranging from white-box to black-box approaches, or API-only access), complexity (involving manual prompting, standard optimization techniques, or auxiliary LLMs), and computational cost. Moreover, the nature of the jailbreaks they produce differs:

... (truncated, 98 KB total)
Resource ID: 1edcecab8732c55f | Stable ID: MWZlNzM5N2