Deployment / Safety Architectures

How AI systems are organized for safety. These architectures are largely model-agnostic: they can be applied to transformers, SSMs, or future architectures. The key question: how do we structure AI systems to maintain oversight and safety?

Key insight: Lower agency + more decomposition + better oversight = generally safer. But there are tradeoffs with capability and practicality. See also: Model Architectures for what the AI is made of.

Basic Patterns: minimal to light scaffolding approaches
Each entry below lists: current adoption level (with expected timeline), overall safety assessment, level of autonomous decision-making (agency), how tasks are broken down (decomposition), human oversight mechanism, interpretability of internals, key sources, and safety pros/cons.

Monolithic / Minimal Scaffolding
Single model with direct API access. No persistent memory, minimal tools. Examples: ChatGPT web or basic Claude usage. (Sketch below.)
Adoption: DECLINING (now; legacy). Legacy pattern; scaffolding adds clear value.
Safety: 5/10, Mixed. Simple threat model but limited interpretability.
Agency: HIGH. A single model makes all decisions.
Decomposition: NONE. Single forward pass or chain-of-thought in one context.
Oversight: MINIMAL. Humans see inputs and outputs only.
Interpretability: LOW. Model internals are opaque.
Pros:
+ Simple to analyze
+ Limited action space
Cons:
- Model internals opaque
- Relies entirely on training

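A minimal Python sketch of the monolithic pattern, assuming a generic stateless text-completion call; the function name is a placeholder, not a specific vendor SDK.

def chat_model(prompt: str) -> str:
    # Placeholder for a single stateless API call: no tools, no memory,
    # no scaffold. One forward pass (or CoT) is the whole system.
    return f"model response to: {prompt}"

print(chat_model("Explain the tradeoffs of minimal scaffolding."))
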
Light Scaffolding
Model plus basic tool use and simple chains: RAG, function calling, single-agent loops. Example: GPT with plugins. (Sketch below.)
Adoption: HIGH (now to ~2027). Current mainstream; most deployed systems.
Safety: 5/10, Mixed. Tool use adds both capability and risk; the scaffold provides some inspection.
Agency: MEDIUM-HIGH. The model retains most decision-making.
Decomposition: BASIC. Simple tool calls and RAG retrieval.
Oversight: HUMAN (limited). Tool permissions are controllable.
Interpretability: MEDIUM. Scaffold code is readable; the model is opaque.
Key sources: RAG (2020).
Pros:
+ Scaffold logic inspectable
+ Tool permissions controllable
Cons:
- Tool use enables real-world harm
- Model decisions still opaque

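A minimal Python sketch of this scaffold pattern, assuming a placeholder model function and an invented tool set; the names (fake_model, search_docs, send_email) and the allowlist policy are illustrative, not any particular framework's API.

from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class ToolCall:
    name: str
    args: dict

# Human-controlled permission set: only listed tools may actually run.
ALLOWED_TOOLS = {"search_docs"}
TOOLS: dict[str, Callable[..., str]] = {
    "search_docs": lambda query: f"(retrieved passages for: {query})",
    "send_email": lambda to, body: "sent",  # registered but not allowed
}

def fake_model(transcript: str) -> Union[ToolCall, str]:
    # Stand-in for an LLM call: first requests retrieval, then answers.
    if "search_docs ->" not in transcript:
        return ToolCall("search_docs", {"query": "deployment safety"})
    return "Final answer, grounded in the retrieved passages."

def run_agent(task: str, model=fake_model, max_steps: int = 5) -> str:
    transcript = task
    for _ in range(max_steps):
        step = model(transcript)
        if isinstance(step, str):                # model gave a final answer
            return step
        if step.name not in ALLOWED_TOOLS:       # scaffold-level permission check
            transcript += f"\n[blocked tool call: {step.name}]"
            continue
        result = TOOLS[step.name](**step.args)
        transcript += f"\n[{step.name} -> {result}]"
    return "stopped: max steps reached"

print(run_agent("Summarize our deployment-safety options."))

The scaffold loop, not the model, decides what actually executes; that is where the "tool permissions controllable" property lives.
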
Tool AI / Oracle
From Bostrom's taxonomy. Tool AI: narrow and software-like, no persistent goals. Oracle: answers questions only. Agency is structurally constrained. (Sketch below.)
Adoption: MEDIUM (now, ongoing). Used for specialized applications.
Safety: 7/10, Favorable. Safety through limitation; capability traded for safety.
Agency: MINIMAL. No persistent goals, narrow scope.
Decomposition: N/A. Scope-restricted rather than decomposed.
Oversight: SCOPE CONSTRAINT. Safety through limitation.
Interpretability: LOW-MEDIUM. Limited scope means less to inspect.
Pros:
+ Minimal agency
+ Clear boundaries
+ Human decides actions
Cons:
- Limited capability
- May develop emergent agency

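A minimal sketch of an oracle-style constraint, assuming a text-in/text-out model function; the Oracle class and the ACTION_MARKERS filter are invented for illustration. The point is structural: the wrapper exposes no tools, keeps no memory, and leaves acting on answers to the human.

# Oracle wrapper: question in, text answer out; no tool registry, no memory,
# no action channel. Acting on the answer stays a human decision.
ACTION_MARKERS = ("CALL:", "EXECUTE:", "SEND:")  # hypothetical patterns to reject

class Oracle:
    def __init__(self, model_fn):
        self._model_fn = model_fn  # plain text -> text

    def ask(self, question: str) -> str:
        answer = self._model_fn(question)
        if any(marker in answer for marker in ACTION_MARKERS):
            return "[withheld: answer attempted to request an action]"
        return answer

oracle = Oracle(lambda q: f"My best estimate for '{q}' is ...")
print(oracle.ask("What failure modes does this rollout plan have?"))
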
Structured Safety Architectures: architectures designed with safety properties

CAIS / Service-Based
Comprehensive AI Services (Drexler). Many task-specific services rather than unified agents; agency is optional, not the default. (Sketch below.)
Adoption: LOW-MEDIUM (2026-2032). 15-25% chance it becomes the dominant paradigm.
Safety: 7/10, Favorable. Avoids unified agency; bounded goals; ongoing human direction.
Agency: LOW. Services have bounded, task-specific goals.
Decomposition: SERVICE-LEVEL. The system is many specialized services.
Oversight: HUMAN. Services informed by models of human approval.
Interpretability: PARTIAL. Service boundaries are visible; internals may be opaque.
Pros:
+ No unified agency
+ Bounded goals
+ Incremental development
Cons:
- Coordination complexity
- Agency creep possible
- Foundation model gap

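A minimal sketch of the service-based shape, with invented toy services; the service names and the pipeline are illustrative only. The structural point is that composition is a human-written script, not an agent choosing its own goals.

from typing import Callable

# Narrow, bounded services with fixed task interfaces.
SERVICES: dict[str, Callable[[str], str]] = {
    "summarize": lambda text: text[:80] + "...",
    "translate_to_french": lambda text: f"(fr) {text}",
    "draft_reply": lambda text: f"Draft reply regarding: {text}",
}

def run_pipeline(document: str, steps: list[str]) -> str:
    # The human picks and orders the steps; no service sets its own goals
    # or invokes other services on its own initiative.
    output = document
    for step in steps:
        output = SERVICES[step](output)
    return output

print(run_pipeline("Quarterly incident report text ...", ["summarize", "translate_to_french"]))
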
AI Control / Hierarchical (T/U)
Redwood Research's AI control agenda. A trusted model T monitors an untrusted model U; protocols are designed to hold even if U intentionally tries to subvert them. Reported results: 92% safety / 94% usefulness. (Sketch below.)
Adoption: MEDIUM (2025-2030). 20-35% chance; active research area.
Safety: 7/10, Favorable. Maintains safety despite intentional subversion attempts.
Agency: MEDIUM. U has the capability, but T constrains it.
Decomposition: T/U SPLIT. Binary split: trusted monitors untrusted.
Oversight: AI MONITORS AI. Scales without linear human effort.
Interpretability: PARTIAL. Protocol logic is visible; U's internals are constrained.
Pros:
+ Works despite subversion
+ Scalable oversight
+ Empirically validated
Cons:
- Capability gap may close
- Collusion risk
- T must be trustworthy

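A minimal sketch of a trusted-monitoring control protocol under invented stand-ins: untrusted_generate, trusted_suspicion, and the threshold are placeholders, not Redwood's actual models or numbers.

# U proposes; T scores how suspicious the proposal is; suspicious outputs are
# deferred (here: replaced by T's own answer; a human auditor is another option).
SUSPICION_THRESHOLD = 0.8  # placeholder deferral threshold

def untrusted_generate(task: str) -> str:
    return f"U's proposed solution to: {task}"

def trusted_suspicion(task: str, proposal: str) -> float:
    # T rates how likely the proposal is an attempted subversion (0 to 1).
    return 0.95 if "rm -rf" in proposal else 0.1

def trusted_generate(task: str) -> str:
    return f"T's safer (possibly weaker) solution to: {task}"

def control_protocol(task: str) -> str:
    proposal = untrusted_generate(task)
    if trusted_suspicion(task, proposal) >= SUSPICION_THRESHOLD:
        return trusted_generate(task)  # defer rather than trust U
    return proposal

print(control_protocol("refactor the billing module"))
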
Factored Cognition
Decompose tasks into isolated child tasks that are solved independently. Factor(T,U): 41% to 63% safety improvement; monitors get simpler contexts. (Sketch below.)
Adoption: MEDIUM (2025-2030). 15-30% chance; complements AI Control.
Safety: 7/10, Favorable. Decomposition creates easier tasks and clearer contexts.
Agency: LOW-MEDIUM. Components have limited context.
Decomposition: TASK-LEVEL. Complex tasks are split into isolated child tasks.
Oversight: CONTEXT ISOLATION. Attacks are more obvious on simpler tasks.
Interpretability: PARTIAL-HIGH. The decomposition structure is visible.
Pros:
+ Simpler contexts
+ Attacks more obvious
+ Compositional safety
Cons:
- Decomposition limits
- Information loss
- Usefulness cost

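A minimal sketch of the decompose / solve-in-isolation / recompose pattern, with placeholder decomposer, solver, and composer functions; a real system would back each role with model calls.

def decompose(task: str) -> list[str]:
    return [f"{task} - part {i}" for i in (1, 2, 3)]

def solve_in_isolation(subtask: str) -> str:
    # Each call sees only its own subtask, never the full parent context;
    # simpler contexts make anomalous behavior easier for monitors to spot.
    return f"answer({subtask})"

def compose(task: str, answers: list[str]) -> str:
    return f"{task}: " + "; ".join(answers)

def factored_solve(task: str) -> str:
    return compose(task, [solve_in_isolation(s) for s in decompose(task)])

print(factored_solve("audit this module for injection bugs"))
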
Open Agency Architecture
Drexler / davidad. Separates goal-setting, planning, evaluation, and execution; plans are externalized and interpretable. A foundation for Provably Safe AI. (Sketch below.)
Adoption: LOW (2027-2035). 5-15% chance; an ambitious long-term agenda.
Safety: 8/10, Favorable. Designed for formal analysis; externalized plans.
Agency: MEDIUM (bounded). Bounded tasks with time and budget constraints.
Decomposition: ROLE SEPARATION. Separate goal / plan / evaluate / execute roles.
Oversight: EXTERNALIZED PLANS. Plans are interpretable rather than opaque.
Interpretability: HIGH. Designed for formal analysis.
Pros:
+ Externalized plans
+ Formal analysis
+ Role separation
Cons:
- May not scale to TAI
- Specification difficulty
- Capability tax

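A minimal sketch of the role separation, assuming a toy Plan dataclass; the budget and time fields are invented stand-ins for the bounded-task constraints.

from dataclasses import dataclass

@dataclass
class Plan:                      # externalized, human-readable plan object
    goal: str
    steps: list[str]
    budget_usd: float
    time_limit_min: int

def make_plan(goal: str) -> Plan:    # planning role
    return Plan(goal, ["gather data", "draft report"], budget_usd=50.0, time_limit_min=30)

def evaluate(plan: Plan) -> bool:    # evaluation role: checked before anything runs
    return plan.budget_usd <= 100 and plan.time_limit_min <= 60

def execute(plan: Plan) -> str:      # execution role
    return f"executed {len(plan.steps)} steps toward: {plan.goal}"

proposal = make_plan("summarize this week's incident reports")   # goal set by a human
print(execute(proposal) if evaluate(proposal) else "plan rejected by evaluator")
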
Safety-First Cognitive Architectures
Federated architectures with transparent inter-component communication. Separate planning, execution, and memory; interpretability by design. (Sketch below.)
Adoption: LOW-MEDIUM (2025-2030). Emerging field; underdeveloped.
Safety: 7/10, Favorable. Intelligence from separate, non-agentic systems.
Agency: MEDIUM (federated). No single unified agent.
Decomposition: COGNITIVE ROLES. Separate planning / execution / memory.
Oversight: TRANSPARENT COMMS. Human-readable, rate-controlled communication.
Interpretability: HIGH. Communication channels are visible by design.
Pros:
+ Transparent by design
+ Federated = no unified agent
+ Rate-controlled
Cons:
- Field underdeveloped
- May not compete on capability

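A minimal sketch of a transparent, rate-controlled channel between components; the TransparentBus class and its limits are invented for illustration.

import time
from collections import deque

class TransparentBus:
    # All inter-component messages are plain text, logged, and rate-limited,
    # so traffic between planner / executor / memory stays inspectable.
    def __init__(self, max_msgs_per_minute: int = 30):
        self.log: list[tuple[float, str, str, str]] = []  # (time, sender, receiver, text)
        self._recent: deque[float] = deque()
        self._limit = max_msgs_per_minute

    def send(self, sender: str, receiver: str, text: str) -> bool:
        now = time.time()
        while self._recent and now - self._recent[0] > 60:
            self._recent.popleft()
        if len(self._recent) >= self._limit:              # rate control
            return False
        self._recent.append(now)
        self.log.append((now, sender, receiver, text))    # human-readable audit trail
        return True

bus = TransparentBus()
bus.send("planner", "executor", "Step 1: fetch yesterday's metrics")
print(bus.log)
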
Oversight Mechanisms: methods for supervising AI systems

Process Supervision
OpenAI. Reward each reasoning step, not just the outcome: 78.2% vs 72.4% for outcome supervision on MATH. Deployed in o1 models; detects bad reasoning. (Sketch below.)
Adoption: HIGH (now, expanding). Already deployed in production (o1).
Safety: 7/10, Favorable. Step-by-step verification catches bad reasoning.
Agency: VARIABLE. Does not constrain agency directly.
Decomposition: STEP-BY-STEP. Reasoning is split into verifiable steps.
Oversight: STEP VERIFICATION. Each step is evaluated for correctness.
Interpretability: MEDIUM-HIGH. Reasoning steps are visible.
Pros:
+ Catches bad reasoning
+ Deployed at scale
+ Strong empirical results
Cons:
- May not transfer to all domains
- Process-outcome gap
- Alien reasoning risk

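A minimal sketch of step-level scoring, with a trivial stand-in for a process reward model; a real PRM is a trained model, and the threshold here is invented.

def process_reward(step: str) -> float:
    # Placeholder step scorer; a trained process reward model goes here.
    return 0.2 if "just guess" in step else 0.9

def accept_solution(steps: list[str], threshold: float = 0.5) -> bool:
    # Judge every reasoning step, not only the final answer.
    return all(process_reward(step) >= threshold for step in steps)

solution = [
    "Let x be the unknown quantity.",
    "Set up the equation 2x + 3 = 11.",
    "Subtract 3 from both sides: 2x = 8.",
    "Divide by 2: x = 4.",
]
print(accept_solution(solution))   # True: each step scores above the threshold
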
Debate / Adversarial Oversight
Irving et al. Two AIs argue opposing positions and a human judges; truth should win under adversarial scrutiny. 60-80% accuracy on factual questions. (Sketch below.)
Adoption: LOW-MEDIUM (2026-2032). Research stage; promising but with open challenges.
Safety: 6/10, Mixed. Promising but vulnerable to sophisticated deception.
Agency: MEDIUM. Debaters have agency within the format.
Decomposition: ADVERSARIAL SPLIT. Two perspectives rather than task decomposition.
Oversight: HUMAN JUDGE. A human evaluates which argument wins.
Interpretability: MEDIUM. Arguments are visible; model internals are opaque.
Pros:
+ Forces consideration of counterarguments
+ Truth advantage theory
+ Externalized reasoning
Cons:
- Sophisticated deception may win
- Confidence escalation
- Struggles with complex reasoning

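A minimal sketch of the debate loop with placeholder debaters and a console-input human judge; nothing here reflects the actual experimental setups.

def debater(position: str, transcript: list[str]) -> str:
    # Placeholder for a model arguing one side.
    return f"[{position}] rebuttal to: {transcript[-1]}"

def human_judge(transcript: list[str]) -> str:
    print("\n".join(transcript))
    return input("Which side won, PRO or CON? ").strip().upper()

def run_debate(question: str, rounds: int = 2) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append(debater("PRO", transcript))
        transcript.append(debater("CON", transcript))
    return human_judge(transcript)   # adversarial scrutiny, human verdict

# run_debate("Is the proposed rollout plan net-positive for safety?")
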
IDA / Iterated Amplification
Christiano. Amplify weak agents via delegation, distill to faster models, and iterate; recursive decomposition. Related to AlphaGo Zero. (Sketch below.)
Adoption: LOW (research stage). A training methodology, still at the research stage.
Safety: 6/10, Mixed. Theoretical promise; limited empirical validation.
Agency: LOW (during training). Weak agents are amplified.
Decomposition: RECURSIVE. Task → subtasks → copies → integrate.
Oversight: AMPLIFIED HUMAN. Human judgment amplified via AI assistance.
Interpretability: MEDIUM. The training structure is visible; the distilled model less so.
Pros:
+ Human values preserved
+ Recursive safety
+ Theoretical elegance
Cons:
- Limited empirical validation
- Decomposition limits unclear
- Distillation may lose alignment

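A minimal sketch of the amplify/distill loop; amplify, distill, and the base model are schematic placeholders (real amplification uses an overseer decomposing questions, and real distillation trains a new model).

def amplify(question: str, model) -> str:
    # Overseer breaks the question into subquestions answered by the current model.
    subquestions = [f"{question} (sub {i})" for i in (1, 2)]
    return " + ".join(model(q) for q in subquestions)

def distill(amplified_fn):
    # Stand-in for training a faster model to imitate the amplified system.
    return lambda q: f"distilled[{amplified_fn(q)}]"

model = lambda q: f"base({q})"
for _ in range(2):                                # iterate: amplify, then distill
    model = distill(lambda q, m=model: amplify(q, m))

print(model("Is this deployment plan safe?"))
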
11 architectures across 3 categories