Longterm Wiki

Circuit Breakers / Inference Interventions

Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with only 1% capability loss, while Anthropic's Constitutional Classifiers block 95.6% of jailbreaks. However, the UK AISI challenge found all 22 tested models could eventually be broken.

Related

Related Pages

Top Related Pages

Analysis

AI Safety Intervention Effectiveness Matrix

Approaches

Sandboxing / ContainmentTool-Use RestrictionsRepresentation EngineeringSparse Autoencoders (SAEs)Structured Access / API-Only

Concepts

Alignment Interpretability Overview

Organizations

FAR AIRedwood Research

Tags

runtime-safetyinference-interventionjailbreak-defenseadversarial-robustnessdefense-in-depth