Deployment & Control
Deployment methods focus on maintaining safety during AI system operation.
Containment:
- Sandboxing: Isolating AI systems from the outside world
- AI Control: Maintaining human oversight and control
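A minimal sketch of the process-isolation layer of sandboxing: untrusted model-generated code runs in a separate interpreter process with a wall-clock timeout. This illustrates only one layer of a real sandbox (which would add filesystem, network, and syscall isolation); the function name is illustrative.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: int = 5) -> str:
    """Run untrusted code in a child process with a hard timeout.

    ``-I`` starts Python in isolated mode (ignores environment variables
    and user site-packages); a production sandbox would add container or
    seccomp isolation on top of this.
    """
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises TimeoutExpired if the code hangs
    )
    return result.stdout
```

The timeout bounds denial-of-service from runaway code, while process separation keeps the untrusted code out of the controller's address space.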
Access Management:
- Structured Access: Tiered access to model capabilities
- Tool Restrictions: Limiting available actions and tools
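The two access-management ideas can be combined in a short sketch: tiered allowlists (structured access) gate which tools a caller may invoke (tool restrictions). All tier names, tool names, and implementations here are illustrative assumptions, not any particular provider's API.

```python
from typing import Callable

# Hypothetical tier -> allowed-tool mapping; a real deployment would back
# this with authenticated API keys and audit logging.
TIERS: dict[str, set[str]] = {
    "public":   {"search"},
    "research": {"search", "code_exec"},
}

# Hypothetical tool implementations.
TOOLS: dict[str, Callable[[str], str]] = {
    "search":    lambda query: f"results for {query!r}",
    "code_exec": lambda src: "executed",
}

def dispatch(tier: str, tool: str, arg: str) -> str:
    """Refuse any tool call outside the caller's tier allowlist."""
    if tool not in TIERS.get(tier, set()):
        raise PermissionError(f"{tool!r} not allowed at tier {tier!r}")
    return TOOLS[tool](arg)
```

Because the check happens at the dispatch layer rather than in the model prompt, the restriction is a hard limit: a jailbroken model still cannot reach a tool its tier does not grant.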
Output Safety:
- Output Filtering: Screening model outputs for harm
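Output filtering can be sketched as a screening pass over model text before it reaches the user. The pattern list here is a toy stand-in: production filters use trained classifiers rather than regexes, and the placeholder message is illustrative.

```python
import re

# Toy blocklist; real systems use learned harm classifiers, not regexes.
BLOCKED_PATTERNS = [
    re.compile(r"\bsynthesize (?:a )?nerve agent\b", re.IGNORECASE),
]

def filter_output(text: str) -> str:
    """Return the text unchanged, or a refusal placeholder if it matches
    any blocked pattern."""
    if any(p.search(text) for p in BLOCKED_PATTERNS):
        return "[output withheld by safety filter]"
    return text
```

Screening outputs (rather than only inputs) gives a second chance to catch harm that survives prompt-level defenses, at the cost of added latency and false positives.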
Multi-System:
- Multi-Agent Safety: Safety in systems with multiple AI agents
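One common multi-agent safety pattern is routing all inter-agent traffic through a single auditable channel, so a monitor can log and screen messages before delivery. This is a minimal sketch of that pattern; the class and parameter names are illustrative.

```python
from typing import Callable

class MonitoredBus:
    """Route every inter-agent message through one monitored channel.

    Messages are logged before screening, so even blocked messages
    remain visible to a human auditor.
    """

    def __init__(self, screen: Callable[[str], bool]):
        self.screen = screen          # returns True if a message may pass
        self.log: list[tuple[str, str, str]] = []

    def send(self, sender: str, receiver: str, message: str) -> str:
        self.log.append((sender, receiver, message))
        if not self.screen(message):
            raise PermissionError("message blocked by monitor")
        return message
```

Centralizing communication this way trades some efficiency for visibility: agents cannot coordinate through side channels the monitor never sees, which is one mitigation for collusion risk.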