
Circuit Breakers / Inference Interventions

Summary: Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with 1% capability loss, while Anthropic's Constitutional Classifiers block 95.6% of jailbreaks with a 0.38% over-refusal increase. However, the UK AISI challenge found all 22 tested models eventually broken (62K of 1.8M attempts succeeded), and novel token-forcing attacks achieve 25% success rates, highlighting fundamental limitations of reactive defenses.

Circuit breakers are a class of runtime interventions that detect and stop harmful model behavior during inference, before outputs reach users or actions are executed. Unlike output filtering, which operates on completed outputs, circuit breakers can intervene mid-generation, potentially stopping harm earlier in the process. Implementations monitor activation patterns, watch for emerging harmful content, and intervene when dangerous patterns appear.

The approach draws inspiration from electrical circuit breakers that automatically interrupt dangerous current flows, and from software systems that halt operations when safety invariants are violated. For AI systems, circuit breakers can detect when a model is generating content that violates safety policies, when activation patterns suggest deceptive or manipulative intent, or when the system is attempting unauthorized actions.

Research organizations like Gray Swan AI have developed circuit breaker techniques that can reduce harmful outputs by modifying model behavior at inference time. Their 2024 paper “Improving Alignment and Robustness with Circuit Breakers” demonstrated that representation rerouting can reject harmful requests 87-90% of the time while preserving model capabilities. However, the approach faces fundamental limitations: it remains reactive rather than proactive, sophisticated models could potentially generate harm faster than circuit breakers can respond, and determined adversaries may find ways to trigger harmful outputs that evade detection. The UK AISI × Gray Swan Agent Red-Teaming Challenge (March-April 2025) tested 22 different LLMs with 1.8 million attack attempts, finding 62,000 successful breaks—demonstrating that no current frontier system fully resists determined, well-resourced attacks. Circuit breakers are a valuable last line of defense but should not substitute for addressing underlying alignment issues.

| Dimension | Rating | Assessment |
|---|---|---|
| Safety Uplift | Medium | Can prevent harmful outputs in real time; reactive defense |
| Capability Uplift | Tax | Interventions may reduce fluency or capability |
| Net World Safety | Helpful | Valuable last line of defense; doesn't address root causes |
| Lab Incentive | Moderate | Practical for deployment safety; some product impact |
| Scalability | Partial | Works at scale, but sophisticated attacks may evade |
| Deception Robustness | Weak | A deceptive model could generate harm before the circuit breaks |
| SI Readiness | No | SI could reason around or disable circuit breakers |
  • Current Investment: $10-30M/yr (Gray Swan, Anthropic, various labs)
  • Recommendation: Increase (practical near-term intervention; needs more research)
  • Differential Progress: Safety-leaning (primarily safety; some reliability benefits)

Different circuit breaker implementations offer varying tradeoffs between safety effectiveness, capability preservation, and computational cost. The following table compares major approaches based on published research and evaluations.

| Approach | Mechanism | Jailbreak Rejection Rate | Capability Impact | Compute Overhead | Limitations |
|---|---|---|---|---|---|
| Representation Rerouting (RR) | Redirects harmful internal representations to orthogonal space | 87-90% | ≈1% capability loss | Low (≈5%) | Vulnerable to novel token-forcing attacks |
| Constitutional Classifiers | Input/output filters trained on constitutional principles | 95.6% (from 86% baseline) | 0.38% increased refusal | 1-24% (improved over time) | No universal jailbreaks found, but specific attacks possible |
| Refusal Training (RLHF) | Train model to refuse harmful requests directly | 40-70% (varies widely) | Can reduce helpfulness | None at inference | Highly vulnerable to adversarial attacks |
| Adversarial Training | Train against known attack patterns | 60-80% on trained attacks | Minor | High during training | Poor generalization to novel attacks |
| Activation Clamping | Modify activations when harmful patterns detected | 70-85% | 5-15% capability loss | Medium | Requires interpretability research |
| Output Filtering | Post-generation content moderation | 50-70% | None | Medium | Can be bypassed with encoded content |

Sources: Gray Swan Circuit Breakers Paper, Anthropic Constitutional Classifiers, HarmBench

Quantified Effectiveness Against Attack Types


The following table shows measured attack success rates (lower is better for defense) across different defense methods when tested against standardized attack benchmarks.

| Attack Type | No Defense | Refusal Training | Circuit Breakers (RR) | Constitutional Classifiers | Combined Defense |
|---|---|---|---|---|---|
| Direct Harmful Requests | 95% ASR | 15-30% ASR | 10-13% ASR | 4.4% ASR | 2-5% ASR |
| GCG (Gradient-based) | 90% ASR | 60-80% ASR | 8-12% ASR | 5% ASR | 3-8% ASR |
| PAIR (LLM Optimizer) | 85% ASR | 40-60% ASR | 10-15% ASR | 6% ASR | 4-10% ASR |
| AutoDAN | 80% ASR | 50-70% ASR | 12-18% ASR | 7% ASR | 5-12% ASR |
| Human Jailbreaks | 75% ASR | 35-50% ASR | 15-20% ASR | 8% ASR | 6-15% ASR |
| Novel Token-Forcing | 90% ASR | 70-85% ASR | 25-40% ASR | Unknown | 15-30% ASR |

ASR = Attack Success Rate. Sources: HarmBench evaluations, Breaking Circuit Breakers, Constitutional Classifiers paper

Circuit breakers operate at inference time, monitoring and potentially intervening during model generation:

| Type | Mechanism | Use Case | Tradeoff |
|---|---|---|---|
| Hard Stop | Halt generation immediately | Clear policy violation | May truncate mid-sentence |
| Soft Redirect | Steer generation away from harm | Emerging concern | May produce awkward output |
| Activation Clamping | Modify internal activations | Representation-level intervention | Requires interpretability |
| Token Blocking | Prevent specific token generation | Known harmful patterns | Easily circumvented |
| Probability Shifting | Reduce likelihood of harmful continuations | Subtle steering | May affect quality |
| Mechanism | What It Detects | Speed | Accuracy |
|---|---|---|---|
| Token-level scanning | Harmful words/phrases | Very fast | Low (easy to bypass) |
| Sequence classification | Harmful content patterns | Fast | Medium |
| Activation analysis | Internal state patterns | Medium | Higher potential |
| Semantic analysis | Meaning/intent of content | Slower | Higher accuracy |
| Behavioral pattern matching | Multi-step harmful sequences | Slowest | Context-dependent |
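
To make the control flow concrete, here is a minimal sketch of a "hard stop" breaker wrapped around token-by-token generation. This is illustrative only, not any vendor's API: `next_token` and `score_harm` are hypothetical stand-ins for a decoding step and one of the sequence-level detectors from the table above.

```python
from typing import Callable, List

def generate_with_breaker(
    next_token: Callable[[List[int]], int],    # hypothetical: sample one token id
    score_harm: Callable[[List[int]], float],  # hypothetical: harm score in [0, 1]
    prompt: List[int],
    max_tokens: int = 256,
    threshold: float = 0.9,
) -> List[int]:
    """Hard-stop breaker: halt the moment partial output crosses a harm threshold."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(next_token(tokens))
        if score_harm(tokens) >= threshold:
            # Breaker trips: discard everything generated so far (fail-safe default).
            return list(prompt)
    return tokens
```

A soft redirect or probability-shifting breaker would instead modify the sampling distribution rather than aborting; returning nothing on a trip mirrors the fail-safe design principle discussed later.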

Gray Swan AI, in collaboration with Carnegie Mellon University and the Center for AI Safety, has been a leader in circuit breaker research. Their June 2024 paper “Improving Alignment and Robustness with Circuit Breakers” (Zou et al.) introduced representation rerouting as a more robust alternative to refusal training.

| Technique | Approach | Quantified Result | Source |
|---|---|---|---|
| Representation Rerouting (RR) | Redirect harmful representations to orthogonal activation space | 87-90% harmful request rejection rate | arXiv:2406.04313 |
| Cygnet Model | Llama-3-8B-Instruct finetune with circuit breakers | ≈100x reduction in harmful outputs vs baseline | Gray Swan Research |
| Capability Preservation | Pareto-optimal safety/capability tradeoff | Only 1% dip in MT-Bench and MMLU scores | arXiv:2406.04313 |
| UK AISI Red-Teaming | Large-scale adversarial evaluation | 62K successful breaks across 22 models from 1.8M attempts | Gray Swan News |

Technical Approach: Representation Rerouting


The circuit breaker method operates through four key steps (a sketch of the training objective follows the list):

  1. Identify harmful representations: Using contrastive activation pairs from harmful vs. safe prompts, identify the activation directions in the model’s internal representation space that correspond to harmful outputs

  2. Create intervention vectors: Develop orthogonal projection matrices that can reroute activations away from harmful regions while preserving the geometric structure needed for benign capabilities

  3. Apply at inference: Monitor residual stream activations at key layers (typically layers 8-24 in Llama-scale models) and apply the rerouting transformation when harmful patterns are detected

  4. Maintain capability: The orthogonal rerouting preserves distances and angles between non-harmful representations, enabling ~99% capability retention on benchmarks like MT-Bench and MMLU
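
The training objective can be sketched roughly as follows. This is a simplified reading of the loss in arXiv:2406.04313 (a circuit-breaker term pushing activations on harmful prompts toward orthogonality with the original model's, plus a retain term on benign prompts); the helper `hidden_at`, the layer list, and the weighting are our assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def hidden_at(model, input_ids, layer):
    # Residual-stream activations at one layer (HF-style transformer assumed).
    return model(input_ids, output_hidden_states=True).hidden_states[layer]

def rr_loss(model, frozen_model, harmful_ids, benign_ids, layers, alpha=0.5):
    loss_cb, loss_retain = 0.0, 0.0
    for layer in layers:
        # Circuit-breaker term: drive cosine similarity with the frozen model's
        # harmful-prompt activations to zero or below, i.e. toward orthogonality.
        h_new = hidden_at(model, harmful_ids, layer)
        h_old = hidden_at(frozen_model, harmful_ids, layer).detach()
        loss_cb += F.relu(F.cosine_similarity(h_new, h_old, dim=-1)).mean()
        # Retain term: keep benign-prompt activations close to the original model's,
        # which is what preserves capability on safe inputs.
        b_new = hidden_at(model, benign_ids, layer)
        b_old = hidden_at(frozen_model, benign_ids, layer).detach()
        loss_retain += (b_new - b_old).norm(dim=-1).mean()
    return alpha * loss_cb + (1 - alpha) * loss_retain
```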


More sophisticated circuit breakers operate at the activation level:

| Target | Description | Advantage | Challenge |
|---|---|---|---|
| Residual stream | Main information flow | Direct impact | May disrupt coherence |
| Attention patterns | What the model focuses on | Can redirect attention | Complex to interpret |
| MLP activations | Feature representations | Feature-level control | Requires interpretability |
| Layer outputs | Per-layer representations | Can catch harm early | Need to know which layers |
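
As an illustration of a residual-stream intervention, the hook below clamps activations that project strongly onto a known harmful direction. It is a hedged sketch: `harmful_dir` is assumed to be a unit vector found separately (e.g., via contrastive probing), and the layer indexing in the usage comment follows Hugging Face conventions.

```python
import torch

def make_clamp_hook(harmful_dir: torch.Tensor, threshold: float = 0.5):
    """Forward hook that removes the harmful-direction component of a layer's
    output wherever the per-token projection exceeds `threshold`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = hidden @ harmful_dir                       # [batch, seq] projections
        mask = (proj.abs() > threshold).to(hidden.dtype)  # clamp only flagged tokens
        hidden = hidden - (mask * proj).unsqueeze(-1) * harmful_dir
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on an HF Llama-style model:
# handle = model.model.layers[16].register_forward_hook(make_clamp_hook(direction))
```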

Research from Confirm Labs and other groups has identified significant weaknesses in current circuit breaker implementations. Understanding these limitations is essential for realistic assessment of the approach.

| Limitation | Explanation | Quantified Impact | Mitigation |
|---|---|---|---|
| Reactive | Can only respond to detected patterns | N/A (architectural) | Better detection, faster response |
| Speed constraints | Must be faster than generation | 2-10ms per-token overhead | Hardware optimization, early-layer detection |
| False positives | May block legitimate content | 4% to 38.5% over-refusal increase on OR-Bench | Calibration, soft interventions, constitutional classifiers |
| Circumvention | Novel attacks evade detection | 25% ASR for novel token-forcing attacks | Continuous updating, layered defenses |
| Capability tax | Interventions may degrade quality | 1-15% capability loss depending on method | Minimal-intervention approaches |

Sources: Breaking Circuit Breakers, OR-Bench evaluations

Why Circuit Breakers Fail Against Sophisticated Adversaries

| Attack Vector | How It Works | Why Breakers Fail | Measured Success Rate |
|---|---|---|---|
| Novel token-forcing | Use alternative completion patterns (e.g., "1. Choose the right…" instead of "Sure, here's how…") | Circuit breakers trained on specific harmful prefixes don't generalize | 25% ASR vs 2.5% for standard GCG |
| Distillation attacks | White-box attacks targeting internal activations directly | Breaks the model even when the attack is constrained to fluent text | High (model-specific) |
| Gradual generation | Build harmful content slowly across many tokens | Each individual step appears safe to detectors | 40-60% ASR |
| Encoded content | Obfuscate harmful content with base64, pig latin, etc. | Detection requires semantic analysis at the representation level | 30-50% ASR |
| Instruction hierarchy violations | Harm embedded in seemingly safe task instructions | Context-dependent; hard to distinguish from legitimate use | 35-45% ASR |

Example: Gray Swan's Cygnet-8B was jailbroken in 3 hours despite impressive 0.0 scores in initial red-teaming evaluations (see Breaking Circuit Breakers, Confirm Labs).

For circuit breakers to be effective, they must:

  1. Detect harmful patterns before harm occurs
  2. Intervene before output reaches user/system
  3. Not add prohibitive latency
| Scenario | Challenge |
|---|---|
| Streaming outputs | Partial output already delivered |
| Agentic actions | Action may be executed before detection |
| High throughput | Latency costs multiply at scale |
| Sophisticated attacks | May generate harm faster than detection |
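
One common mitigation for the streaming problem is to hold back a small buffer of tokens so the breaker can veto them before the user ever sees them, trading a little latency for the ability to intervene. The sketch below is our illustration under stated assumptions: the token source and `score_harm` scorer are placeholders, not a specific product's API.

```python
from collections import deque
from typing import Callable, Iterable, Iterator

def delayed_stream(tokens: Iterable[str],
                   score_harm: Callable[[str], float],  # placeholder harm scorer
                   delay: int = 8,
                   threshold: float = 0.9) -> Iterator[str]:
    """Stream tokens to the user with a `delay`-token holdback buffer."""
    buffer: deque = deque()
    shown = ""
    for tok in tokens:
        buffer.append(tok)
        if score_harm(shown + "".join(buffer)) >= threshold:
            return  # breaker trips: buffered tokens are never delivered
        if len(buffer) > delay:
            out = buffer.popleft()
            shown += out
            yield out
    while buffer:  # generation finished cleanly; flush the holdback buffer
        yield buffer.popleft()
```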

Crux 1: Are Circuit Breakers a Meaningful Safety Measure?

| Position: Yes (55% probability) | Position: Limited Value (45% probability) |
|---|---|
| Last line of defense catches 87-95% of known attacks | Reactive defense; determined adversaries find bypasses |
| UK AISI challenge: only 3.4% success rate (62K/1.8M) | Same challenge: every model eventually broken |
| Defense-in-depth reduces expected harm | May create false confidence in safety |
| Anthropic's 3,000 red-teamer hours found no universal jailbreak | Future techniques may discover vulnerabilities |

Key evidence that would update toward “Yes”: Circuit breakers remaining robust against novel attack classes for 2+ years. Key evidence for “Limited Value”: Automated jailbreak discovery that scales faster than defense updates.

Crux 2: Can Activation-Level Interventions Be Made Robust?

| Position: Promising (40% probability) | Position: Fundamental Limits (60% probability) |
|---|---|
| Representation rerouting achieves 87-90% rejection without capability loss | Theoretical results show quadratic helpfulness cost for guaranteed alignment |
| Constitutional classifiers + RR combined may exceed 95% effectiveness | Arms-race dynamics: Cygnet-8B broken in 3 hours despite strong initial results |
| Interpretability research advancing rapidly | Deceptive models could potentially learn to hide harmful representations |
| Works on multimodal models | 25% ASR for novel token-forcing shows generalization limits |

Key evidence that would update toward “Promising”: Formal verification methods for representation-level safety. Key evidence for “Fundamental Limits”: Systematic discovery of attacks that generalize across circuit breaker implementations.

Crux 3: Is the Capability Tax Worth It?

| Position: Worth It (65% probability) | Position: Too High (35% probability) |
|---|---|
| Anthropic achieved 1% overhead with 95.6% protection | Early implementations showed 24% overhead; user experience matters |
| 0.38% over-refusal increase is negligible for most applications | 4% to 38.5% over-refusal on OR-Bench shows the tradeoff can be severe |
| Enterprise customers increasingly demand safety certifications | Competitive pressure may drive users to less safe alternatives |
| Tax decreasing with better research (1% vs 24% over one year) | Some applications cannot tolerate any capability degradation |

Key evidence that would update toward “Worth It”: Customer willingness-to-pay for verified safety. Key evidence for “Too High”: Significant user migration to unrestricted models.

Anthropic’s Constitutional Classifiers represent a complementary approach to circuit breakers, using input/output filtering rather than activation-level intervention. Their January 2025 paper demonstrated impressive results in a large-scale red-teaming evaluation.

| Metric | Baseline (No Defense) | With Constitutional Classifiers | Improvement |
|---|---|---|---|
| Jailbreak Success Rate | 86% | 4.4% | 95% reduction |
| Over-refusal Rate | Baseline | +0.38% (not statistically significant) | Minimal impact |
| Compute Overhead | N/A | 1% (improved from 24% in earlier version) | 96% cost reduction |
| Universal Jailbreaks Found | N/A | 0 (after 3,000+ red-teamer hours) | None discovered |

Source: Anthropic Constitutional Classifiers

Anthropic conducted a two-month red-teaming challenge with significant participation:

  • 185 active participants competed for $15,000 in rewards
  • 3,000+ hours of cumulative red-teaming effort
  • 560,000+ queries submitted across the full testing period
  • 1 “high-risk vulnerability” found (but no universal jailbreak)
  • Result: No participant discovered a single jailbreak that worked across all 10 forbidden query categories

This represents one of the most extensive public evaluations of an AI safety defense mechanism, though researchers acknowledge that future techniques may find vulnerabilities.

| Principle | Implementation |
|---|---|
| Fail-safe | Default to blocking in ambiguous cases |
| Minimal intervention | Smallest change to prevent harm |
| Fast path | Optimize for low-latency common cases |
| Auditability | Log all interventions for review |
| Graceful degradation | Handle breaker failures safely |
| Concern | Calibration Strategy |
|---|---|
| Too many false positives | Raise detection thresholds, use soft interventions |
| Missing harmful content | Lower thresholds, expand detection patterns |
| Latency too high | Optimize detection, use progressive approaches |
| Capability degradation | Minimize intervention strength, targeted modifications |
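
Threshold calibration of this kind can be made mechanical. The sketch below is our illustration, with `scores_benign` and `scores_harmful` assumed to come from a held-out labeled set: it picks the most sensitive threshold whose false-positive rate stays within a budget.

```python
import numpy as np

def calibrate_threshold(scores_benign: np.ndarray,
                        scores_harmful: np.ndarray,
                        max_fpr: float = 0.05) -> float:
    """Lowest (most sensitive) threshold whose false-positive rate on benign
    inputs stays at or under `max_fpr`."""
    candidates = np.sort(np.unique(np.concatenate([scores_benign, scores_harmful])))
    for t in candidates:  # ascending: the first qualifying t catches the most attacks
        if (scores_benign >= t).mean() <= max_fpr:
            return float(t)
    return float(candidates[-1])  # fall back to the strictest observed threshold
```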

Modern AI safety systems increasingly combine multiple circuit breaker approaches in a layered defense architecture. The following diagram illustrates how different mechanisms can work together.

[Diagram: layered defense pipeline combining input classification, activation monitoring during generation, and output classification]

This layered approach achieves better results than any single method:

  • Input classifiers catch 70-80% of obvious jailbreak attempts early
  • Activation monitoring catches 15-20% of remaining threats during generation
  • Output classifiers catch 5-10% that slip through earlier layers
  • Combined false positive rate remains below 5% when properly calibrated
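
Wired together, these layers reduce to a simple guard pipeline. The sketch below is schematic (the classifier callables and `generate` are placeholders, not a real API); in practice the activation-level breaker runs inside `generate` itself, as in the earlier sketches.

```python
from typing import Callable

REFUSAL = "I can't help with that request."

def guarded_answer(prompt: str,
                   input_flagged: Callable[[str], bool],
                   generate: Callable[[str], str],   # assumed to run its own breaker
                   output_flagged: Callable[[str], bool]) -> str:
    if input_flagged(prompt):    # layer 1: catch obvious jailbreaks pre-generation
        return REFUSAL
    draft = generate(prompt)     # layer 2: activation monitoring during generation
    if output_flagged(draft):    # layer 3: post-generation moderation of the draft
        return REFUSAL
    return draft
```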

Good fit if you believe:

  • Practical near-term interventions are valuable
  • Defense-in-depth is worth pursuing
  • Runtime safety can complement training
  • Incremental improvements help

Less relevant if you believe:

  • Sophisticated AI will always circumvent
  • Better to focus on alignment
  • Capability tax is unacceptable
  • Creates false sense of security
| Organization | Approach | Key Results | Maturity |
|---|---|---|---|
| Gray Swan AI | Representation rerouting, red-teaming | 87-90% rejection rate; hosted UK AISI challenge with 1.8M attacks | Research leader |
| Anthropic | Constitutional Classifiers + monitoring | 95.6% jailbreak blocking; 0.38% over-refusal increase; 1% compute overhead | Production deployment |
| OpenAI | Content filtering, moderation API | Integrated into GPT-4 and API products | Production deployment |
| Cisco (Robust Intelligence) | AI Firewall, algorithmic red-teaming | Acquired October 2024 for enterprise AI security | Enterprise solutions |
| METR/Apollo | Third-party evaluation protocols | Independent safety assessment | Evaluation standards |

Sources: Gray Swan Research, Anthropic Constitutional Classifiers, Cisco AI Defense

| Direction | Current Progress | Key Challenges | Estimated Timeline |
|---|---|---|---|
| Faster detection | 2-10ms overhead achieved | Maintaining accuracy at lower latency | Ongoing |
| Activation-level interventions | RR demonstrated; probes developing | Requires interpretability advances | 1-2 years |
| Adaptive breakers | Early research | Learning without creating vulnerabilities | 2-3 years |
| Minimal intervention | 1% capability tax achieved by Anthropic | Maintaining safety at lower intervention strength | Ongoing |
| Formal guarantees | Theoretical results showing quadratic helpfulness loss | Practical guarantees remain elusive | 3-5+ years |
| Multimodal circuit breakers | Demonstrated on vision-language models | Complexity of cross-modal harmful content | 1-2 years |

Sources: Representation Engineering review, Constitutional Classifiers++

| Paper | Authors | Key Contribution | Link |
|---|---|---|---|
| Improving Alignment and Robustness with Circuit Breakers | Zou et al. (Gray Swan, CMU, CAIS) | Introduced representation rerouting; 87-90% rejection rate | arXiv:2406.04313 |
| Constitutional Classifiers: Defending against Universal Jailbreaks | Anthropic | 95.6% jailbreak blocking; 0.38% over-refusal | Anthropic Research |
| Representation Engineering: A Top-Down Approach to AI Transparency | Zou et al. (CAIS) | Foundation for circuit breaker methods | CAIS Blog |
| Breaking Circuit Breakers | Confirm Labs | Identified 25% ASR for novel token-forcing attacks | Confirm Labs |
| HarmBench: A Standardized Evaluation Framework | CAIS et al. | Standardized red-teaming benchmark | GitHub |
| Organization | Focus | Key Resources |
|---|---|---|
| Gray Swan AI | Circuit breakers, red-teaming | Research Portal, Arena Platform |
| Anthropic | Constitutional AI, safety classifiers | Constitutional Classifiers |
| UK AI Safety Institute | Government evaluation, red-teaming partnerships | AISI Research |
| Center for AI Safety | HarmBench, representation engineering research | CAIS Research |
| JailbreakBench | Standardized jailbreak robustness benchmark | JailbreakBench |
| Benchmark | Purpose | Key Metrics |
|---|---|---|
| HarmBench | Standardized red-teaming framework | Attack Success Rate (ASR) across attack types |
| JailbreakBench | Robustness benchmark for jailbreaking | Leaderboard rankings, reproducible attacks |
| OR-Bench | Over-refusal evaluation | False positive rate on harmless queries |
| MT-Bench / MMLU | Capability preservation | General capability retention after safety training |
  1. Reactive not proactive: Circuit breakers respond to detected patterns but don’t address root causes of misalignment—a sufficiently capable or deceptive model could generate harm before intervention triggers
  2. Adversarial arms race: Confirm Labs research showed Gray Swan’s Cygnet-8B was jailbroken in 3 hours despite impressive initial evaluations; no single defense is expected to remain robust indefinitely
  3. Capability-safety tradeoff: Theoretical results suggest alignment guarantees come at quadratic cost to helpfulness, potentially saturating at random guessing for strong interventions
  4. Open-source model gap: Circuit breakers require model modification; open-source models without safety training remain vulnerable, and fine-tuning can remove circuit breaker training

Circuit breakers affect the AI Transition Model through:

| Parameter | Impact |
|---|---|
| Misuse Potential | Can catch some harmful outputs in real time |
| Human Oversight Quality | Provides automated enforcement of safety policies |

Circuit breakers are a valuable addition to the AI safety toolkit, providing a last line of defense that can catch issues other measures miss. Current implementations achieve 87-95% effectiveness against known attacks with acceptable capability tradeoffs (1% overhead for Anthropic’s system). However, they should be understood as one layer in a defense-in-depth strategy, not a substitute for addressing fundamental alignment challenges. The adversarial arms race continues, with novel attacks regularly discovered that bypass existing defenses—reinforcing the need for ongoing research and layered approaches.