Refusal Training

Refusal training teaches AI models to decline harmful requests rather than comply. Although refusal training is universally deployed and achieves 99%+ refusal rates on explicitly harmful requests, jailbreak techniques bypass its defenses with 1.5-6.5% success rates, and over-refusal blocks 12-43% of legitimate queries.
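To make the three statistics above concrete, the sketch below shows how they are typically defined over a labeled evaluation set: refusal rate on explicitly harmful prompts, success rate of jailbreak-wrapped prompts, and over-refusal rate on benign prompts. This is a minimal illustration, not a standard benchmark implementation; the names `EvalResult` and `summarize` and the category labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    """One prompt from a hypothetical safety eval set (names are illustrative)."""
    category: str   # "harmful", "jailbreak", or "benign"
    refused: bool   # whether the model declined the request

def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Compute the three headline metrics from labeled eval outcomes."""
    def rate(category: str, hit: Callable[[EvalResult], bool]) -> float:
        subset = [r for r in results if r.category == category]
        return sum(hit(r) for r in subset) / len(subset) if subset else 0.0

    return {
        # Fraction of explicitly harmful prompts the model declined.
        "refusal_rate": rate("harmful", lambda r: r.refused),
        # Fraction of jailbreak-wrapped harmful prompts that got through.
        "jailbreak_success_rate": rate("jailbreak", lambda r: not r.refused),
        # Fraction of benign prompts wrongly declined (over-refusal).
        "over_refusal_rate": rate("benign", lambda r: r.refused),
    }

if __name__ == "__main__":
    demo = [
        EvalResult("harmful", True),
        EvalResult("harmful", True),
        EvalResult("jailbreak", False),  # jailbreak succeeded
        EvalResult("jailbreak", True),
        EvalResult("benign", True),      # over-refusal
        EvalResult("benign", False),
    ]
    print(summarize(demo))
```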

Related Pages

Risks

Bioweapons Risk
Scheming
Goal Misgeneralization
Reward Hacking
Sycophancy

Analysis

Reward Hacking Taxonomy and Severity Model

Approaches

Capability Unlearning / Removal
Open Source AI Safety
AI Alignment
AI Output Filtering
Preference Optimization Methods
AI-Assisted Alignment

Organizations

Elicit (AI Research Tool)
Google DeepMind

Concepts

Alignment Training Overview
Large Language Models

Key Debates

Technical AI Safety Research

Other

Dario Amodei
Jan Leike

Tags

refusal-training, jailbreaking, safety-training, rlhf, over-refusal, misuse-prevention