Circuit Breakers / Inference Interventions
Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with only 1% capability loss, while Anthropic's Constitutional Classifiers block 95.6% of jailbreaks. However, the UK AISI challenge found all 22 tested models could eventually be broken.
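The idea of halting harmful outputs during inference can be sketched as a streaming loop with a monitor: generation proceeds token by token, and if a harm score on the accumulated output crosses a threshold, the breaker trips and a refusal is returned instead. This is a minimal illustrative sketch, not any lab's actual implementation; `harm_score`, `generate_with_breaker`, the keyword heuristic, and the 0.5 threshold are all hypothetical stand-ins for the learned classifiers and representation-level interventions described above.

```python
# Minimal sketch of an inference-time circuit breaker (illustrative only).
# A toy keyword heuristic stands in for a learned harm classifier.

REFUSAL = "I can't help with that."

def harm_score(generated_text: str) -> float:
    """Hypothetical harm classifier: returns a score in [0, 1]."""
    flagged = ("build a bomb", "synthesize the toxin")
    return 1.0 if any(k in generated_text.lower() for k in flagged) else 0.0

def generate_with_breaker(model_step, prompt: str, threshold: float = 0.5,
                          max_tokens: int = 64) -> str:
    """Stream tokens from model_step(prompt, output_so_far); trip the
    breaker and return a refusal if the running output's harm score
    crosses the threshold mid-generation."""
    output = ""
    for _ in range(max_tokens):
        token = model_step(prompt, output)
        if token is None:  # end of generation
            break
        output += token
        if harm_score(output) >= threshold:
            return REFUSAL  # breaker tripped: halt and refuse
    return output
```

Monitoring the running output rather than only the finished text is what distinguishes an inference-time intervention from post-hoc output filtering: the harmful continuation is cut off as soon as it is detected.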
Related Pages
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
AI Output Filtering
Output filtering screens AI outputs through classifiers before delivery to users.
Refusal Training
Refusal training teaches AI models to decline harmful requests rather than comply.
Adversarial Training
Adversarial training improves AI robustness by training models on examples designed to cause failures, including jailbreaks and prompt injections.
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.