Circuit Breakers / Inference Interventions
Overview
Circuit breakers are a class of runtime interventions that detect and stop harmful model behavior during inference, before outputs reach users or actions are executed. Unlike output filtering, which operates on completed outputs, circuit breakers can intervene mid-generation, potentially stopping harm earlier in the process. Implementations monitor activation patterns, detect emerging harmful content, and intervene when dangerous patterns appear.
The approach draws inspiration from electrical circuit breakers that automatically interrupt dangerous current flows, and from software systems that halt operations when safety invariants are violated. For AI systems, circuit breakers can detect when a model is generating content that violates safety policies, when activation patterns suggest deceptive or manipulative intent, or when the system is attempting unauthorized actions.
Research organizations like Gray Swan AI have developed circuit breaker techniques that can reduce harmful outputs by modifying model behavior at inference time. Their 2024 paper “Improving Alignment and Robustness with Circuit Breakers” demonstrated that representation rerouting can reject harmful requests 87-90% of the time while preserving model capabilities. However, the approach faces fundamental limitations: it remains reactive rather than proactive, sophisticated models could potentially generate harm faster than circuit breakers can respond, and determined adversaries may find ways to trigger harmful outputs that evade detection. The UK AISI × Gray Swan Agent Red-Teaming Challenge (March-April 2025) tested 22 different LLMs with 1.8 million attack attempts, finding 62,000 successful breaks—demonstrating that no current frontier system fully resists determined, well-resourced attacks. Circuit breakers are a valuable last line of defense but should not substitute for addressing underlying alignment issues.
Risk Assessment & Impact
| Dimension | Rating | Assessment |
|---|---|---|
| Safety Uplift | Medium | Can prevent harmful outputs in real-time; reactive defense |
| Capability Uplift | Tax | Interventions may reduce fluency or capability |
| Net World Safety | Helpful | Valuable last line of defense; doesn’t address root causes |
| Lab Incentive | Moderate | Practical for deployment safety; some product impact |
| Scalability | Partial | Works at scale; but sophisticated attacks may evade |
| Deception Robustness | Weak | Deceptive model could generate harm before circuit breaks |
| SI Readiness | No | SI could reason around or disable circuit breakers |
Research Investment
- Current Investment: $10-30M/yr (Gray Swan, Anthropic, various labs)
- Recommendation: Increase (practical near-term intervention; needs more research)
- Differential Progress: Safety-leaning (primarily safety; some reliability benefits)
Comparison of Circuit Breaker Approaches
Different circuit breaker implementations offer varying tradeoffs between safety effectiveness, capability preservation, and computational cost. The following table compares major approaches based on published research and evaluations.
| Approach | Mechanism | Jailbreak Rejection Rate | Capability Impact | Compute Overhead | Limitations |
|---|---|---|---|---|---|
| Representation Rerouting (RR) | Redirects harmful internal representations to orthogonal space | 87-90% | ≈1% capability loss | Low (≈5%) | Vulnerable to novel token-forcing attacks |
| Constitutional Classifiers | Input/output filters trained on constitutional principles | 95.6% (from 86% baseline) | 0.38% increased refusal | 1-24% (improved over time) | No universal jailbreaks found but specific attacks possible |
| Refusal Training (RLHF) | Train model to refuse harmful requests directly | 40-70% (varies widely) | Can reduce helpfulness | None at inference | Highly vulnerable to adversarial attacks |
| Adversarial Training | Train against known attack patterns | 60-80% on trained attacks | Minor | High during training | Poor generalization to novel attacks |
| Activation Clamping | Modify activations when harmful patterns detected | 70-85% | 5-15% capability loss | Medium | Requires interpretability research |
| Output Filtering | Post-generation content moderation | 50-70% | None | Medium | Can be bypassed with encoded content |
Sources: Gray Swan Circuit Breakers Paper, Anthropic Constitutional Classifiers, HarmBench
Quantified Effectiveness Against Attack Types
The following table shows measured attack success rates (lower is better for defense) across different defense methods when tested against standardized attack benchmarks.
| Attack Type | No Defense | Refusal Training | Circuit Breakers (RR) | Constitutional Classifiers | Combined Defense |
|---|---|---|---|---|---|
| Direct Harmful Requests | 95% ASR | 15-30% ASR | 10-13% ASR | 4.4% ASR | 2-5% ASR |
| GCG (Gradient-based) | 90% ASR | 60-80% ASR | 8-12% ASR | 5% ASR | 3-8% ASR |
| PAIR (LLM Optimizer) | 85% ASR | 40-60% ASR | 10-15% ASR | 6% ASR | 4-10% ASR |
| AutoDAN | 80% ASR | 50-70% ASR | 12-18% ASR | 7% ASR | 5-12% ASR |
| Human Jailbreaks | 75% ASR | 35-50% ASR | 15-20% ASR | 8% ASR | 6-15% ASR |
| Novel Token-Forcing | 90% ASR | 70-85% ASR | 25-40% ASR | Unknown | 15-30% ASR |
ASR = Attack Success Rate. Sources: HarmBench evaluations, Breaking Circuit Breakers, Constitutional Classifiers paper
How Circuit Breakers Work
Circuit breakers operate at inference time, monitoring and potentially intervening during model generation:
Intervention Types
| Type | Mechanism | Use Case | Tradeoff |
|---|---|---|---|
| Hard Stop | Halt generation immediately | Clear policy violation | May truncate mid-sentence |
| Soft Redirect | Steer generation away from harm | Emerging concern | May produce awkward output |
| Activation Clamping | Modify internal activations | Representation-level intervention | Requires interpretability |
| Token Blocking | Prevent specific token generation | Known harmful patterns | Easily circumvented |
| Probability Shifting | Reduce likelihood of harmful continuations | Subtle steering | May affect quality |
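As a concrete illustration of the last two rows, the sketch below implements token blocking and probability shifting as a HuggingFace `LogitsProcessor`. The token ID lists and penalty value are illustrative assumptions, not values from any published system.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class SafetyLogitsProcessor(LogitsProcessor):
    """Token blocking (hard) plus probability shifting (soft) on next-token logits."""

    def __init__(self, blocked_ids, penalized_ids, penalty=5.0):
        self.blocked = torch.tensor(blocked_ids)      # tokens that may never be sampled
        self.penalized = torch.tensor(penalized_ids)  # tokens merely made less likely
        self.penalty = penalty                        # logit subtraction for soft steering

    def __call__(self, input_ids, scores):
        scores[:, self.blocked] = float("-inf")       # token blocking
        scores[:, self.penalized] -= self.penalty     # probability shifting
        return scores

# Usage (hypothetical token IDs):
# model.generate(..., logits_processor=LogitsProcessorList(
#     [SafetyLogitsProcessor(blocked_ids=[41352], penalized_ids=[8632, 1129])]))
```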
Detection Mechanisms
| Mechanism | What It Detects | Speed | Accuracy |
|---|---|---|---|
| Token-level scanning | Harmful words/phrases | Very fast | Low (easy to bypass) |
| Sequence classification | Harmful content patterns | Fast | Medium |
| Activation analysis | Internal state patterns | Medium | Higher potential |
| Semantic analysis | Meaning/intent of content | Slower | Higher accuracy |
| Behavioral pattern matching | Multi-step harmful sequences | Slowest | Context-dependent |
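The sketch below shows how sequence classification can run mid-generation rather than only post hoc: the partial output is re-scored every few tokens and generation halts once a harm threshold is crossed. `generate_token` and `harm_score` are assumed interfaces standing in for a streaming model and a trained classifier; the stride and threshold are illustrative.

```python
def monitored_generate(prompt, generate_token, harm_score,
                       max_tokens=512, stride=8, threshold=0.9):
    """Stream tokens, re-classifying the partial output every `stride` tokens."""
    tokens = []
    for i in range(max_tokens):
        tokens.append(generate_token(prompt, tokens))
        # Sequence-level scan: slower than per-token keyword matching,
        # but harder to bypass with paraphrase or encoding tricks.
        if (i + 1) % stride == 0 and harm_score(prompt, tokens) > threshold:
            return tokens, "blocked"   # hard stop before the output is finalized
    return tokens, "ok"
```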
Gray Swan’s Circuit Breaker Research
Gray Swan AI, in collaboration with Carnegie Mellon University and the Center for AI Safety, has been a leader in circuit breaker research. Their June 2024 paper “Improving Alignment and Robustness with Circuit Breakers” (Zou et al.) introduced representation rerouting as a more robust alternative to refusal training.
Key Research Findings
| Technique | Approach | Quantified Result | Source |
|---|---|---|---|
| Representation Rerouting (RR) | Redirect harmful representations to orthogonal activation space | 87-90% harmful request rejection rate | arXiv:2406.04313 |
| Cygnet Model | Llama-3-8B-Instruct finetune with circuit breakers | ≈100x reduction in harmful outputs vs baseline | Gray Swan Research |
| Capability Preservation | Pareto-optimal safety/capability tradeoff | Only 1% dip in MT-Bench and MMLU scores | arXiv:2406.04313 |
| UK AISI Red-Teaming | Large-scale adversarial evaluation | 62K successful breaks across 22 models from 1.8M attempts | Gray Swan News |
Technical Approach: Representation Rerouting
The circuit breaker method operates through four key steps (a code sketch of the core geometric idea follows the list):
1. Identify harmful representations: Using contrastive activation pairs from harmful vs. safe prompts, identify the activation directions in the model’s internal representation space that correspond to harmful outputs
2. Create intervention vectors: Develop orthogonal projection matrices that can reroute activations away from harmful regions while preserving the geometric structure needed for benign capabilities
3. Apply at inference: Monitor residual stream activations at key layers (typically layers 8-24 in Llama-scale models) and apply the rerouting transformation when harmful patterns are detected
4. Maintain capability: The orthogonal rerouting preserves distances and angles between non-harmful representations, enabling ~99% capability retention on benchmarks like MT-Bench and MMLU
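The published method trains this behavior into the weights (LoRA finetuning against a rerouting loss) rather than applying it as a runtime patch; the sketch below shows only the core geometric idea under simplifying assumptions: when a residual-stream activation aligns strongly with a learned harmful direction, subtract that component and leave the orthogonal part untouched. The direction, threshold, and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def reroute(h, harmful_dir, threshold=0.3):
    """Project activations off a harmful direction when alignment is high.

    h:           residual-stream activations, shape (batch, seq, d_model)
    harmful_dir: direction for the harmful concept, shape (d_model,), e.g. a
                 difference-of-means over harmful vs. safe contrastive prompts
    threshold:   illustrative alignment cutoff, in activation units
    """
    d = F.normalize(harmful_dir, dim=-1)
    coeff = h @ d                                 # scalar projection onto d, (batch, seq)
    intervene = (coeff.abs() > threshold).unsqueeze(-1)
    rerouted = h - coeff.unsqueeze(-1) * d        # component orthogonal to harmful_dir
    return torch.where(intervene, rerouted, h)
```

Because the edit removes only the component along `harmful_dir`, representations with little projection onto it pass through unchanged, which matches the intuition behind the small measured capability tax.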
Activation-Level Interventions
More sophisticated circuit breakers operate at the activation level:
How Activation Intervention Works
Intervention Targets
| Target | Description | Advantage | Challenge |
|---|---|---|---|
| Residual stream | Main information flow | Direct impact | May disrupt coherence |
| Attention patterns | What model focuses on | Can redirect attention | Complex to interpret |
| MLP activations | Feature representations | Feature-level control | Requires interpretability |
| Layer outputs | Per-layer representations | Can catch early | Need to know which layers |
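A minimal sketch of clamping at the layer-output level with a PyTorch forward hook, assuming a loaded GPT-2-style `model`; the layer path `model.transformer.h[12]` and the randomly initialized direction are placeholders (in practice the direction would come from probing).

```python
import torch

# model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed already loaded
harmful_dir = torch.randn(768)                 # placeholder; normally probe-derived
harmful_dir = harmful_dir / harmful_dir.norm()

def clamp_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    coeff = hidden @ harmful_dir               # per-token projection coefficient
    # Subtract only the positive component, so tokens with no harmful
    # alignment pass through the layer unmodified.
    clamped = hidden - coeff.clamp(min=0.0).unsqueeze(-1) * harmful_dir
    return (clamped,) + output[1:] if isinstance(output, tuple) else clamped

handle = model.transformer.h[12].register_forward_hook(clamp_hook)
# ... run generation with the hook active ...
handle.remove()                                # detach once monitoring ends
```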
Limitations and Challenges
Research from Confirm Labs and other groups has identified significant weaknesses in current circuit breaker implementations. Understanding these limitations is essential for realistic assessment of the approach.
Fundamental Issues
| Limitation | Explanation | Quantified Impact | Mitigation |
|---|---|---|---|
| Reactive | Can only respond to detected patterns | N/A - architectural | Better detection, faster response |
| Speed constraints | Must be faster than generation | 2-10ms per token overhead | Hardware optimization, early-layer detection |
| False positives | May block legitimate content | 4% to 38.5% over-refusal increase on OR-Bench | Calibration, soft interventions, constitutional classifiers |
| Circumvention | Novel attacks evade detection | 25% ASR for novel token-forcing attacks | Continuous updating, layered defenses |
| Capability tax | Interventions may degrade quality | 1-15% capability loss depending on method | Minimal intervention approaches |
Sources: Breaking Circuit Breakers, OR-Bench evaluations
Why Circuit Breakers Fail Against Sophisticated Adversaries
| Attack Vector | How It Works | Why Breakers Fail | Measured Success Rate |
|---|---|---|---|
| Novel token-forcing | Use alternative completion patterns (e.g., “1. Choose the right…” instead of “Sure, here’s how…”) | Circuit breakers trained on specific harmful prefixes don’t generalize | 25% ASR vs 2.5% for standard GCG |
| Distillation attacks | White-box attacks optimize against internal activations directly | Breakers can be defeated even when attacks are constrained to fluent text | High (model-specific) |
| Gradual generation | Build harmful content slowly across many tokens | Each individual step appears safe to detectors | 40-60% ASR |
| Encoded content | Obfuscate harmful content with base64, pig latin, etc. | Semantic analysis required at representation level | 30-50% ASR |
| Instruction hierarchy violations | Harm embedded in seemingly-safe task instructions | Context-dependent; hard to distinguish from legitimate use | 35-45% ASR |
Example: Confirm Labs jailbroke Gray Swan’s Cygnet-8B within 3 hours, despite its impressive 0.0 scores in initial red-teaming evaluations
The Speed Problem
For circuit breakers to be effective, they must:
- Detect harmful patterns before harm occurs
- Intervene before output reaches user/system
- Not add prohibitive latency
| Scenario | Challenge |
|---|---|
| Streaming outputs | Partial output already delivered |
| Agentic actions | Action may be executed before detection |
| High throughput | Latency costs multiply at scale |
| Sophisticated attacks | May generate harm faster than detection |
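A back-of-envelope sketch of how the 2-10ms per-token detection overhead cited earlier compounds at scale; the deployment numbers are illustrative.

```python
overhead_ms_per_token = 5          # mid-range of the 2-10ms figures cited above
tokens_per_response = 500
requests_per_second = 200          # illustrative deployment scale

added_latency_s = overhead_ms_per_token * tokens_per_response / 1000
detector_seconds_per_second = added_latency_s * requests_per_second

print(f"added latency per response: {added_latency_s:.1f}s")                 # 2.5s
print(f"detector compute per wall-clock second: {detector_seconds_per_second:.0f}s")  # 500s
```

At that scale the detector alone would need hundreds of parallel workers, which is why faster detection and early-layer monitoring appear as mitigations above.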
Key Cruxes
Crux 1: Are Circuit Breakers a Meaningful Safety Measure?
| Position: Yes (55% probability) | Position: Limited Value (45% probability) |
|---|---|
| Last line of defense catches 87-95% of known attacks | Reactive defense; determined adversaries find bypasses |
| UK AISI challenge: only 3.4% success rate (62K/1.8M) | Same challenge: every model eventually broken |
| Defense-in-depth reduces expected harm | May create false confidence in safety |
| Anthropic’s 3,000 red-teamer hours found no universal jailbreak | Future techniques may discover vulnerabilities |
Key evidence that would update toward “Yes”: Circuit breakers remaining robust against novel attack classes for 2+ years. Key evidence for “Limited Value”: Automated jailbreak discovery that scales faster than defense updates.
Crux 2: Can Activation-Level Interventions Be Made Robust?
| Position: Promising (40% probability) | Position: Fundamental Limits (60% probability) |
|---|---|
| Representation rerouting achieves 87-90% rejection without capability loss | Theoretical results show quadratic helpfulness cost for guaranteed alignment |
| Constitutional classifiers + RR combined may exceed 95% effectiveness | Arms race dynamics: Cygnet-8B broken in 3 hours despite strong initial results |
| Interpretability research advancing rapidly | Deceptive models could potentially learn to hide harmful representations |
| Works on multimodal models | 25% ASR for novel token-forcing shows generalization limits |
Key evidence that would update toward “Promising”: Formal verification methods for representation-level safety. Key evidence for “Fundamental Limits”: Systematic discovery of attacks that generalize across circuit breaker implementations.
Crux 3: Is the Capability Tax Acceptable?
| Position: Worth It (65% probability) | Position: Too High (35% probability) |
|---|---|
| Anthropic achieved 1% overhead with 95.6% protection | Early implementations showed 24% overhead; user experience matters |
| 0.38% over-refusal increase is negligible for most applications | 4% to 38.5% over-refusal on OR-Bench shows tradeoff can be severe |
| Enterprise customers increasingly demand safety certifications | Competitive pressure may drive users to less safe alternatives |
| Tax decreasing with better research (1% vs 24% over 1 year) | Some applications cannot tolerate any capability degradation |
Key evidence that would update toward “Worth It”: Customer willingness-to-pay for verified safety. Key evidence for “Too High”: Significant user migration to unrestricted models.
Anthropic’s Constitutional Classifiers
Anthropic’s Constitutional Classifiers represent a complementary approach to circuit breakers, using input/output filtering rather than activation-level intervention. Their January 2025 paper demonstrated impressive results in a large-scale red-teaming evaluation.
Red-Teaming Results
| Metric | Baseline (No Defense) | With Constitutional Classifiers | Improvement |
|---|---|---|---|
| Jailbreak Success Rate | 86% | 4.4% | 95% reduction |
| Over-refusal Rate | Baseline | +0.38% (not statistically significant) | Minimal impact |
| Compute Overhead | N/A | 1% (improved from 24% in earlier version) | 96% cost reduction |
| Universal Jailbreaks Found | N/A | 0 (after 3,000+ red-teamer hours) | None discovered |
Source: Anthropic Constitutional Classifiers
Red-Teaming Challenge Details
Anthropic conducted a two-month red-teaming challenge with significant participation:
- 185 active participants competed for $15,000 in rewards
- 3,000+ hours of cumulative red-teaming effort
- 560,000+ queries submitted across the full testing period
- 1 “high-risk vulnerability” found (but no universal jailbreak)
- Result: No participant discovered a single jailbreak that worked across all 10 forbidden query categories
This represents one of the most extensive public evaluations of an AI safety defense mechanism, though researchers acknowledge that future techniques may find vulnerabilities.
Best Practices
Implementation Architecture
Design Principles
| Principle | Implementation |
|---|---|
| Fail-safe | Default to blocking in ambiguous cases |
| Minimal intervention | Smallest change to prevent harm |
| Fast path | Optimize for low-latency common cases |
| Auditability | Log all interventions for review |
| Graceful degradation | Handle breaker failures safely |
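A minimal sketch combining the fail-safe, fast-path, and auditability principles; `run_breaker` and the returned `verdict` object are assumed interfaces, not a real library API.

```python
import logging

logger = logging.getLogger("circuit_breaker")

def guarded_generate(prompt, generate, run_breaker):
    try:
        verdict = run_breaker(prompt)           # may raise or time out
    except Exception as exc:
        logger.error("breaker failure, failing safe: %s", exc)
        return None                             # fail-safe: block when the breaker itself fails
    if verdict.blocked:
        logger.info("intervention: %s", verdict.reason)   # auditability: log every block
        return None
    return generate(prompt)                     # fast path for the common benign case
```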
Calibration Approach
| Concern | Calibration Strategy |
|---|---|
| Too many false positives | Raise detection thresholds, use soft interventions |
| Missing harmful content | Lower thresholds, expand detection patterns |
| Latency too high | Optimize detection, use progressive approaches |
| Capability degradation | Minimize intervention strength, targeted modifications |
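One common calibration pattern, sketched below on synthetic scores: set the detection threshold at a high quantile of detector scores over known-benign traffic, which directly bounds the false positive (over-refusal) rate. The score distribution and targets are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
benign_scores = rng.beta(2, 8, size=10_000)   # synthetic detector scores on safe prompts

target_fpr = 0.01                             # tolerate ~1% over-refusal
hard_stop_threshold = float(np.quantile(benign_scores, 1 - target_fpr))

# A lower "warn" threshold can trigger soft interventions (redirects,
# probability shifting) while reserving hard stops for high-confidence cases.
soft_threshold = float(np.quantile(benign_scores, 0.95))

print(f"soft: {soft_threshold:.3f}, hard: {hard_stop_threshold:.3f}")
```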
Defense-in-Depth Architecture
Modern AI safety systems increasingly combine multiple circuit breaker approaches in a layered defense architecture, with input classifiers, activation monitoring, and output classifiers working in sequence.
This layered approach achieves better results than any single method:
- Input classifiers catch 70-80% of obvious jailbreak attempts early
- Activation monitoring catches 15-20% of remaining threats during generation
- Output classifiers catch 5-10% that slip through earlier layers
- Combined false positive rate remains below 5% when properly calibrated
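A structural sketch of the three layers above; each stage is an assumed callable (`classify_input`, `generate_monitored`, `classify_output`) and any stage can short-circuit the pipeline.

```python
def defended_generate(prompt, classify_input, generate_monitored, classify_output):
    if classify_input(prompt):                  # layer 1: cheap input screen
        return "Request declined."
    text, status = generate_monitored(prompt)   # layer 2: mid-generation monitor
    if status == "blocked":
        return "Generation halted."
    if classify_output(text):                   # layer 3: post-hoc output filter
        return "Response withheld."
    return text
```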
Who Should Work on This?
Good fit if you believe:
- Practical near-term interventions are valuable
- Defense-in-depth is worth pursuing
- Runtime safety can complement training
- Incremental improvements help
Less relevant if you believe:
- Sophisticated AI will always circumvent
- Better to focus on alignment
- Capability tax is unacceptable
- Creates false sense of security
Current State of Practice
Industry Adoption
| Organization | Approach | Key Results | Maturity |
|---|---|---|---|
| Gray Swan AI | Representation rerouting, red-teaming | 87-90% rejection rate; hosted UK AISI challenge with 1.8M attacks | Research leader |
| Anthropic | Constitutional Classifiers + monitoring | 95.6% jailbreak blocking; 0.38% over-refusal increase; 1% compute overhead | Production deployment |
| OpenAI | Content filtering, moderation API | Integrated into GPT-4 and API products | Production deployment |
| Cisco (Robust Intelligence) | AI Firewall, algorithmic red-teaming | Acquired October 2024 for enterprise AI security | Enterprise solutions |
| METR/Apollo | Third-party evaluation protocols | Independent safety assessment | Evaluation standards |
Sources: Gray Swan Research, Anthropic Constitutional Classifiers, Cisco AI Defense
Research Directions
| Direction | Current Progress | Key Challenges | Estimated Timeline |
|---|---|---|---|
| Faster detection | 2-10ms overhead achieved | Maintaining accuracy at lower latency | Ongoing |
| Activation-level interventions | RR demonstrated; probes developing | Requires interpretability advances | 1-2 years |
| Adaptive breakers | Early research | Learning without creating vulnerabilities | 2-3 years |
| Minimal intervention | 1% capability tax achieved by Anthropic | Maintaining safety at lower intervention strength | Ongoing |
| Formal guarantees | Theoretical results showing quadratic helpfulness loss | Practical guarantees remain elusive | 3-5+ years |
| Multimodal circuit breakers | Demonstrated on vision-language models | Complexity of cross-modal harmful content | 1-2 years |
Sources: Representation Engineering review, Constitutional Classifiers++
Sources & Resources
Primary Research Papers
| Paper | Authors | Key Contribution | Link |
|---|---|---|---|
| Improving Alignment and Robustness with Circuit Breakers | Zou et al. (Gray Swan, CMU, CAIS) | Introduced representation rerouting; 87-90% rejection rate | arXiv:2406.04313 |
| Constitutional Classifiers: Defending against Universal Jailbreaks | Anthropic | 95.6% jailbreak blocking; 0.38% over-refusal | Anthropic Research |
| Representation Engineering: A Top-Down Approach to AI Transparency | Zou et al. (CAIS) | Foundation for circuit breaker methods | CAIS Blog |
| Breaking Circuit Breakers | Confirm Labs | Identified 25% ASR for novel token-forcing attacks | Confirm Labs |
| HarmBench: A Standardized Evaluation Framework | CAIS et al. | Standardized red-teaming benchmark | GitHub |
Industry Resources
| Organization | Focus | Key Resources |
|---|---|---|
| Gray Swan AI | Circuit breakers, red-teaming | Research Portal, Arena Platform |
| Anthropic | Constitutional AI, safety classifiers | Constitutional Classifiers |
| UK AI Safety Institute | Government evaluation, red-teaming partnerships | AISI Research |
| Center for AI Safety | HarmBench, representation engineering research | CAIS Research |
| JailbreakBench | Standardized jailbreak robustness benchmark | JailbreakBench |
Key Evaluations and Datasets
| Benchmark | Purpose | Key Metrics |
|---|---|---|
| HarmBench | Standardized red-teaming framework | Attack Success Rate (ASR) across attack types |
| JailbreakBench | Robustness benchmark for jailbreaking | Leaderboard rankings, reproducible attacks |
| OR-Bench | Over-refusal evaluation | False positive rate on harmless queries |
| MT-Bench / MMLU | Capability preservation | General capability retention after safety training |
Key Critiques and Limitations
- Reactive not proactive: Circuit breakers respond to detected patterns but don’t address root causes of misalignment—a sufficiently capable or deceptive model could generate harm before intervention triggers
- Adversarial arms race: Confirm Labs research showed Gray Swan’s Cygnet-8B was jailbroken in 3 hours despite impressive initial evaluations; no single defense is expected to remain robust indefinitely
- Capability-safety tradeoff: Theoretical results suggest alignment guarantees come at quadratic cost to helpfulness, potentially saturating at random guessing for strong interventions
- Open-source model gap: Circuit breakers require model modification; open-source models without safety training remain vulnerable, and fine-tuning can remove circuit breaker training
AI Transition Model Context
Circuit breakers affect the AI Transition Model through:
| Parameter | Impact |
|---|---|
| Misuse Potential | Can catch some harmful outputs in real-time |
| Human Oversight Quality | Provides automated enforcement of safety policies |
Circuit breakers are a valuable addition to the AI safety toolkit, providing a last line of defense that can catch issues other measures miss. Current implementations achieve 87-95% effectiveness against known attacks with acceptable capability tradeoffs (1% overhead for Anthropic’s system). However, they should be understood as one layer in a defense-in-depth strategy, not a substitute for addressing fundamental alignment challenges. The adversarial arms race continues, with novel attacks regularly discovered that bypass existing defenses—reinforcing the need for ongoing research and layered approaches.