Defense in Depth Model
- Independent AI safety layers with 20-60% individual failure rates can achieve 1-3% combined failure, but deceptive alignment creates correlations (ρ = 0.4-0.5) that increase combined failure to 12%+, making correlation reduction more important than strengthening individual layers.
- Training-Runtime layer pairs show the highest correlation (ρ = 0.5) because deceptive models systematically evade both training detection and runtime monitoring, while institutional oversight maintains much better independence (ρ = 0.1-0.3) from the technical layers.
- Multiple weak defenses outperform a single strong defense only when the correlation coefficient ρ stays below roughly 0.5: three 30%-failure-rate defenses combine to 2.7% if independent, but are no better than a single 10% defense once moderately correlated, and strictly worse at higher correlation.
Overview
Defense in depth applies the security principle of layered protection to AI safety: deploy multiple independent safety measures so that if one fails, others still provide protection. This model provides a mathematical framework for analyzing how safety interventions combine, when multiple weak defenses outperform single strong ones, and how to identify correlated failure modes.
Key finding: Independent layers with 20-60% individual failure rates can achieve combined failure rates of 1-3%, but deceptive alignment creates dangerous correlations that increase combined failure to 12%+. No single AI safety intervention is reliable enough to trust alone; layered defenses with diverse failure modes provide more robust protection.
Risk Assessment
| Factor | Level | Evidence | Timeline |
|---|---|---|---|
| Severity | Critical | Single-layer failures: 20-60%; Independent 5-layer: 1-3%; Correlated 5-layer: 12%+ | Current |
| Likelihood | High | All current safety interventions have significant failure rates | 2024-2027 |
| Trend | Improving | Growing recognition of need for layered approaches | Next 3-5 years |
| Tractability | Medium | Implementation straightforward; reducing correlation difficult | Ongoing |
Defense Layer Framework
Five Primary Safety Layers
AI safety operates through five defensive layers, each protecting against different failure modes:
| Layer | Primary Function | Key Interventions | Failure Rate Range |
|---|---|---|---|
| Training Safety | Build aligned goals during development | RLHF, Constitutional AI, data curation | 20-40% |
| Evaluation Safety | Detect problems before deployment | Red-teaming, interpretability, capability evals | 25-35% |
| Runtime Safety | Monitor deployed systems | Output filtering, monitoring, sandboxing | 30-50% |
| Institutional Safety | Governance and oversight | Responsible scaling policies, audits, regulation | 40-60% |
| Recovery Safety | Respond to failures | Incident response, shutdown, rollback | 20-40% |
Layer Independence Analysis
The effectiveness of layered defenses depends critically on independence. When layers fail independently, protection compounds multiplicatively; when they are correlated, they tend to fail together.
| Layer Pair | Correlation (ρ) | Primary Correlation Source | Impact |
|---|---|---|---|
| Training-Evaluation | 0.4 | Deceptive alignment affects both | High correlation reduces redundancy |
| Training-Runtime | 0.5 | Deception evades monitoring | Highest correlation pair |
| Training-Institutional | 0.2 | Mostly separate domains | Good independence |
| Evaluation-Runtime | 0.3 | Both rely on behavioral signals | Moderate correlation |
| Institutional-Technical | 0.1-0.3 | Different failure mechanisms | Best independence |
Mathematical Framework
Independent Layer Mathematics
When layers fail independently, the combined failure probability is the product of the individual failure probabilities: P(all layers fail) = p₁ × p₂ × … × pₙ.
Example: Three layers with 20%, 30%, 40% failure rates:
- Combined failure: 0.20 × 0.30 × 0.40 = 2.4%
- Improvement over best single layer (20%): roughly 8x reduction
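As a minimal sketch of this calculation (using the illustrative failure rates above; the function and variable names are hypothetical):

```python
# Minimal sketch: combined failure probability for independent safety layers.
# Failure rates are the illustrative example values, not empirical estimates.
from math import prod

def combined_failure_independent(failure_rates: list[float]) -> float:
    """Probability that every layer fails, assuming full independence."""
    return prod(failure_rates)

layers = [0.20, 0.30, 0.40]  # e.g. training, evaluation, runtime
p_all_fail = combined_failure_independent(layers)
print(f"Combined failure: {p_all_fail:.1%}")                                  # 2.4%
print(f"Improvement vs best single layer: {min(layers) / p_all_fail:.1f}x")   # ~8.3x
```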
Correlated Failure Analysis
Real layers exhibit correlation, requiring conditional probabilities: P(all layers fail) = P(layer 1 fails) × P(layer 2 fails | layer 1 fails) × P(layer 3 fails | layers 1 and 2 fail).
Deception scenario:
- Training fails (produces deceptive model): 20%
- Evaluation fails given deceptive model: 70%
- Runtime fails given both failed: 90%
- Combined: 0.20 × 0.70 × 0.90 = 12.6% (5x worse than independent)
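The same scenario as a short sketch, chaining the conditional probabilities above (values are the illustrative deception-scenario numbers, not measurements):

```python
# Sketch of the correlated (conditional) failure chain for the deception scenario.
p_training_fails = 0.20               # training produces a deceptive model
p_eval_fails_given_deceptive = 0.70   # evaluation misses the deception
p_runtime_fails_given_both = 0.90     # runtime monitoring is evaded as well

p_correlated = (p_training_fails
                * p_eval_fails_given_deceptive
                * p_runtime_fails_given_both)
p_independent_baseline = 0.20 * 0.30 * 0.40

print(f"Correlated combined failure:  {p_correlated:.1%}")            # 12.6%
print(f"Independent baseline:         {p_independent_baseline:.1%}")  # 2.4%
print(f"Ratio: {p_correlated / p_independent_baseline:.1f}x worse")   # ~5.3x
```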
Partial Correlation Model
Most layers show partial correlation. Using a correlation coefficient ρ ∈ [0,1], the effective combined failure rate interpolates between the fully independent case (2.4%) and the fully correlated deception scenario (12.6%):
| Correlation Level | ρ | Effective Failure Rate | Performance vs Independent |
|---|---|---|---|
| Independent | 0.0 | 2.4% | Baseline |
| Low correlation | 0.3 | 4.1% | 1.7x worse |
| Moderate correlation | 0.5 | 5.5% | 2.3x worse |
| High correlation | 0.7 | 7.4% | 3.1x worse |
| Perfect correlation | 1.0 | 12.6% | 5.3x worse |
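The exact partial-correlation formula is not given here, but the table is roughly consistent with a geometric interpolation between the independent and fully correlated cases. The sketch below assumes that model; treat it as one plausible reconstruction, not the definitive method:

```python
# Assumed model: geometric interpolation between independence (rho = 0)
# and full correlation (rho = 1). Reproduces the table only approximately.

def effective_failure(p_independent: float, p_fully_correlated: float, rho: float) -> float:
    """Effective combined failure rate under partial correlation rho in [0, 1]."""
    return (p_independent ** (1 - rho)) * (p_fully_correlated ** rho)

p_indep, p_corr = 0.024, 0.126  # 2.4% independent, 12.6% fully correlated
for rho in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(f"rho = {rho:.1f}: effective failure ≈ {effective_failure(p_indep, p_corr, rho):.1%}")
# Prints roughly 2.4%, 3.9%, 5.5%, 7.7%, 12.6% -- close to, but not exactly, the table values.
```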
Layer-Specific Analysis
Layer 1: Training Safety
Function: Build aligned objectives during model development
Key Interventions:
| Method | Mechanism | Failure Mode | Effectiveness |
|---|---|---|---|
| RLHF | Human feedback shapes behavior | Reward hacking, humans deceived | 60-80% success rate |
| Constitutional AI | Principles guide responses | Principles incomplete/misinterpreted | 65-75% success rate |
| Data curation | Remove harmful training examples | Emergent harmful patterns | 70-85% success rate |
Correlation drivers: Training methods often share assumptions about alignment definition and measurement, creating correlated vulnerabilities to goal misgeneralization and deceptive alignment.
Layer 2: Evaluation Safety
Function: Detect problems before deployment
Current gaps according to METR:
- Cannot test for unknown capabilities
- Red teams often less capable than models
- Interpretability incomplete at scale
| Method | Coverage | Blind Spots | Independence Level |
|---|---|---|---|
| Capability evaluations | Known dangerous capabilities | Unknown/emergent capabilities | Medium |
| Red-teaming | Adversarial robustness | Capabilities beyond red team | High |
| Interpretability | Internal model states | Complex/distributed representations | High |
| Model organisms | Controlled failure analysis | Generalization to real deployment | Medium |
Layer 3: Runtime Safety
Function: Monitor and constrain deployed systems
Implementation challenges:
- Advanced AI may evade monitoring
- Human oversight becomes bottleneck
- Sandbox escapes possible with sufficient capability
| Intervention | Protection Level | Failure Scenarios | Correlation Risk |
|---|---|---|---|
| Output filtering | Surface-level | Steganographic communication | High with evaluation |
| Behavioral monitoring | Behavioral anomalies | Novel attack patterns | High with training |
| Sandboxing | Capability limitation | Sandbox escapes, social engineering | Medium |
| Human oversight | Final approval | Human manipulation, overload | Low |
Layer 4: Institutional Safety
Function: Organizational and regulatory oversight
Advantage: Largely independent of technical failure modes
| Mechanism | Independence Level | Current Status | Effectiveness |
|---|---|---|---|
| Responsible scaling policies | Medium | Anthropic, OpenAI implementing | Unproven at scale |
| External audits | High | UK AI Safety Institute developing frameworks | Early stage |
| Regulatory oversight | High | EU AI Act, US Executive Order 14110 | Implementation gaps |
Layer 5: Recovery Safety
Function: Respond when prevention fails
Critical limitation: May be impossible with sufficiently advanced systems
| Capability | Feasibility | Current Readiness | Key Dependencies |
|---|---|---|---|
| Shutdown procedures | High for current systems | Basic protocols exist | System cooperation |
| Incident response | Medium | Limited AI-specific experience | Detection speed |
| Damage containment | Low for advanced AI | Minimal preparation | Technical capabilities |
| System rollback | Medium | Standard software practices | Backup integrity |
Failure Mode Deep Dive
Deceptive Alignment: The Primary Correlation Driver
Deceptive alignment creates the most dangerous correlation because it systematically defeats Layers 1-3.
Breaking deception correlation:
| Approach | Mechanism | Promise | Status |
|---|---|---|---|
| Interpretability | Direct cognition analysis | Could detect deceptive reasoning | Research ongoing |
| AI Control | Assume deception, limit capabilities | Reduces impact regardless | Promising results |
| Diverse evaluation | Multiple testing approaches | Different deception vulnerabilities | Early development |
| Adversarial training | Train against deception | May teach better deception | Mixed results |
Capability Overhang Scenarios
Sudden capability jumps can defeat multiple layers calibrated for weaker systems:
| Scenario | Affected Layers | Failure Mechanism | Mitigation |
|---|---|---|---|
| Training surprise | 1-3 | Safety measures insufficient | Conservative thresholds |
| Evaluation gap | 2-3 | Tests miss new capabilities | Broad capability assessment |
| Runtime escape | 3-5 | Monitoring inadequate | Staged deployment |
Optimal Defense Strategy
When Multiple Weak Beats Single Strong
Mathematical threshold: multiple weak defenses outperform a single strong defense when their correlation-adjusted combined failure rate is below the strong defense's failure rate, i.e. P(all weak layers fail | ρ) < p_strong. Under full independence this reduces to p₁ × p₂ × … × pₙ < p_strong.
Example: Three 30% defenses vs one 10% defense
- Independent: 0.30³ = 2.7% (better than 10%)
- Correlated (ρ ≈ 0.5): ~8-10%, roughly at break-even with the single 10% defense; higher correlation makes the layered option strictly worse
Key insight: Independence is more important than individual strength.
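A hedged sketch of this comparison, reusing the assumed geometric-interpolation model from the partial-correlation section (the exact crossover point depends on which correlation model is used):

```python
# Compare n identical weak layers against one strong defense under an
# assumed geometric-interpolation correlation model.

def layered_failure(p_layer: float, n_layers: int, rho: float) -> float:
    p_independent = p_layer ** n_layers   # all layers fail independently
    p_full_corr = p_layer                 # perfectly correlated layers fail as one
    return (p_independent ** (1 - rho)) * (p_full_corr ** rho)

p_single_strong = 0.10
for rho in (0.0, 0.3, 0.5, 0.7):
    p_layered = layered_failure(p_layer=0.30, n_layers=3, rho=rho)
    verdict = "beats" if p_layered < p_single_strong else "loses to"
    print(f"rho = {rho:.1f}: three 30% layers -> {p_layered:.1%} ({verdict} a single 10% defense)")
# Under this assumed model the crossover falls just above rho = 0.5.
```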
Resource Allocation Framework
| Priority | Investment Category | Rationale | Funding Range |
|---|---|---|---|
| Tier 1 | Reduce correlation between Layers 1-3 | Biggest risk reduction | $100-250M annually |
| Tier 2 | Strengthen institutional oversight | Independent backup | $50-120M annually |
| Tier 3 | Improve individual layer effectiveness | Diminishing returns | $75-150M annually |
| Tier 4 | Recovery capability development | Last resort preparation | $25-75M annually |
Implementation Recommendations
For AI Labs:
- Conduct correlation audits between safety layers
- Use independent teams for different layers
- Invest heavily in deception-robust evaluation
- Develop shutdown and rollback capabilities
For Policymakers:
- Require demonstration of layer independence
- Mandate multiple independent safety evaluations
- Establish government monitoring capabilities
- Create emergency response frameworks
For Safety Researchers:
- Prioritize research that breaks correlation (especially interpretability)
- Develop evaluation methods immune to deception
- Study institutional safety mechanisms
- Build recovery and containment capabilities
Current State and Trajectory
Implementation Status (2024)
| Layer | Implementation Quality | Key Gaps | Trend |
|---|---|---|---|
| Training | Moderate | Deception detection, robustness | Improving |
| Evaluation | Weak | Comprehensive testing, unknown capabilities | Slow progress |
| Runtime | Basic | Monitoring sophistication, human oversight | Early development |
| Institutional | Minimal | Regulatory frameworks, enforcement | Accelerating |
| Recovery | Very weak | Shutdown capabilities, incident response | Neglected |
2-5 Year Projections
Likely developments:
- Training layer: Better RLHF, constitutional approaches reach maturity
- Evaluation layer: Standardized testing suites, some interpretability progress
- Runtime layer: Improved monitoring, basic AI control implementation
- Institutional layer: Regulatory frameworks implemented, auditing standards
- Recovery layer: Basic protocols developed but untested at scale
Key uncertainties:
- Will interpretability break deception correlation?
- Can institutional oversight remain independent as AI capabilities grow?
- Are recovery mechanisms possible for advanced AI systems?
Expert Perspectives
“The key insight is that we need multiple diverse approaches, not just better versions of the same approach.” - Paul Christiano on alignment strategy
“Defense in depth is essential, but we must be realistic about correlation. Deceptive alignment could defeat multiple technical layers simultaneously.” - Evan Hubinger on correlated failures
“Institutional oversight may be our most important defense because it operates independently of technical capabilities.” - Allan Dafoe on governance importance
Key Uncertainties
Key Questions
- What are the true correlation coefficients between current safety interventions?
- Can interpretability research make sufficient progress to detect deceptive alignment?
- Will institutional oversight remain effective as AI systems become more capable?
- Is recovery possible once systems exceed certain capability thresholds?
- How many layers are optimal given implementation costs and diminishing returns?
Model Limitations and Caveats
Strengths:
- Provides quantitative framework for analyzing safety combinations
- Identifies correlation as the critical factor in defense effectiveness
- Offers actionable guidance for resource allocation and implementation
Limitations:
- True correlation coefficients are unknown and may vary significantly
- Assumes static failure probabilities but capabilities and threats evolve
- May not apply to superintelligent systems that understand all defensive layers
- Treats adversarial threats as random events rather than strategic optimization
- Does not account for complex dynamic interactions between layers
Critical assumption: The model assumes that multiple layers can remain meaningfully independent even as AI systems become more capable at strategic deception and manipulation.
Sources & Resources
Academic Literature
| Paper | Key Contribution | Link |
|---|---|---|
| Greenblatt et al. (2024) | AI Control framework assuming potential deception | arXiv:2312.06942 |
| Shevlane et al. (2023) | Model evaluation for extreme risks | arXiv:2305.15324 |
| Ouyang et al. (2022) | Training language models to follow instructions with human feedback | arXiv:2203.02155 |
| Hubinger et al. (2024) | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | arXiv:2401.05566 |
Organization Reports
| Organization | Report | Focus | Link |
|---|---|---|---|
| Anthropic | Responsible Scaling Policy | Layer implementation framework | anthropic.com |
| METR | Model Evaluation Research | Evaluation layer gaps | metr.org |
| MIRI | Security Mindset and AI Alignment | Adversarial perspective | intelligence.org |
| RAND | Defense in Depth for AI Systems | Military security applications | rand.org |
Policy Documents
| Document | Jurisdiction | Relevance | Link |
|---|---|---|---|
| EU AI Act | European Union | Regulatory requirements for layered oversight | digital-strategy.ec.europa.eu |
| Executive Order on AI | United States | Federal approach to AI safety requirements | whitehouse.gov |
| UK AI Safety Summit | United Kingdom | International coordination on safety measures | gov.uk |
Related Models and Concepts
- Capability Threshold Model - When individual defenses become insufficient
- Deceptive Alignment Decomposition - Primary correlation driver
- AI Control - Defense assuming potential deception
- Responsible Scaling Policies - Institutional layer implementation