Refusal Training
Refusal training teaches AI models to decline harmful requests rather than comply. Although it is universally deployed and achieves 99%+ refusal rates on explicitly harmful requests, jailbreak techniques still bypass these defenses with 1.5-6.5% success rates, and over-refusal blocks 12-43% of legitimate queries.
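As a rough illustration of what this looks like in practice, the minimal Python sketch below shows the shape of a refusal fine-tuning dataset: harmful prompts paired with a refusal as the target response, and benign lookalike prompts paired with a helpful answer so the model learns to discriminate rather than refuse everything. All prompts, labels, and field names here are hypothetical, not drawn from any lab's actual training pipeline.

```python
# Hypothetical sketch of refusal training data construction.
# Field names, prompts, and responses are illustrative only.

from dataclasses import dataclass


@dataclass
class RefusalExample:
    """One supervised fine-tuning pair: a request and the target response."""
    prompt: str
    target: str
    is_harmful: bool


def build_example(prompt: str, is_harmful: bool,
                  helpful_answer: str = "") -> RefusalExample:
    """Pair harmful prompts with a refusal and benign prompts with a
    helpful answer, so the model learns to discriminate."""
    if is_harmful:
        target = "I can't help with that request."
    else:
        target = helpful_answer
    return RefusalExample(prompt, target, is_harmful)


dataset = [
    build_example("How do I pick my neighbor's lock?", is_harmful=True),
    # Benign near-miss: keeping lookalike queries answerable is what
    # guards against the over-refusal problem noted above.
    build_example(
        "How do locksmiths open locks? I'm locked out of my own house.",
        is_harmful=False,
        helpful_answer="A locksmith typically uses a tension wrench and...",
    ),
]

for ex in dataset:
    print(ex.is_harmful, "->", ex.target[:40])
```

The benign near-miss examples matter as much as the harmful ones: training only on refusals is one plausible source of the over-refusal rates cited above.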
Related Pages
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models, including the Claude models, while pursuing safety research.
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging from 5% to 90%.
OpenAI
Leading AI lab that developed the GPT models and ChatGPT; covers its organizational evolution from non-profit research to commercial AGI development.
Circuit Breakers / Inference Interventions
Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference.
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.