Refusal Training
Refusal training teaches AI models to decline harmful requests rather than comply. Although it is universally deployed and achieves 99%+ refusal rates on explicitly harmful requests, jailbreak techniques still bypass these defenses with 1.5-6.5% success rates, and over-refusal blocks 12-43% of legitimate queries.
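As a rough illustration of what this looks like in practice, the minimal Python sketch below shows the shape of a refusal fine-tuning dataset: harmful prompts paired with a refusal as the target response, and benign lookalike prompts paired with a helpful answer so the model learns to discriminate rather than refuse everything. All prompts, labels, and field names here are hypothetical, not drawn from any lab's actual training pipeline.

```python
# Hypothetical sketch of refusal training data construction.
# Field names, prompts, and responses are illustrative only.

from dataclasses import dataclass


@dataclass
class RefusalExample:
    """One supervised fine-tuning pair: a request and the target response."""
    prompt: str
    target: str
    is_harmful: bool


def build_example(prompt: str, is_harmful: bool,
                  helpful_answer: str = "") -> RefusalExample:
    """Pair harmful prompts with a refusal and benign prompts with a
    helpful answer, so the model learns to discriminate."""
    if is_harmful:
        target = "I can't help with that request."
    else:
        target = helpful_answer
    return RefusalExample(prompt, target, is_harmful)


dataset = [
    build_example("How do I pick my neighbor's lock?", is_harmful=True),
    # Benign near-miss: keeping lookalike queries answerable is what
    # guards against the over-refusal problem noted above.
    build_example(
        "How do locksmiths open locks? I'm locked out of my own house.",
        is_harmful=False,
        helpful_answer="A locksmith typically uses a tension wrench and...",
    ),
]

for ex in dataset:
    print(ex.is_harmful, "->", ex.target[:40])
```

The benign near-miss examples matter as much as the harmful ones: training only on refusals is one plausible source of the over-refusal rates cited above.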
Related Pages
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models, including the Claude models, while pursuing safety research.
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging from 5% to 90%.
OpenAI
Leading AI lab that developed the GPT models and ChatGPT; covers its organizational evolution from non-profit research to commercial AGI development.
Circuit Breakers / Inference Interventions
Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference.
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.