Deployment / Safety Architectures

How AI systems are organized for safety. These architectures are largely model-agnostic: they can be applied to transformers, SSMs, or future architectures. The key question: how do we structure AI systems to maintain oversight and safety?

Key insight: Lower agency + more decomposition + better oversight = generally safer. But there are tradeoffs with capability and practicality. See also: Model Architectures for what the AI is made of.

Basic Patterns: minimal to light scaffolding approaches
Each entry below lists: current adoption level (with expected timeline), overall safety assessment, level of autonomous decision-making (agency), how tasks are broken down (decomposition), human oversight mechanism, interpretability of internals, key sources, and safety pros/cons.

Monolithic / Minimal Scaffolding
Single model with direct API access. No persistent memory, minimal tools. Examples: ChatGPT web or basic Claude usage. (Sketch below.)
Adoption: DECLINING (now; legacy). Legacy pattern; scaffolding adds clear value.
Safety: 5/10, Mixed. Simple threat model but limited interpretability.
Agency: HIGH. A single model makes all decisions.
Decomposition: NONE. Single forward pass or chain-of-thought in one context.
Oversight: MINIMAL. Humans see inputs and outputs only.
Interpretability: LOW. Model internals are opaque.
Pros:
+ Simple to analyze
+ Limited action space
Cons:
- Model internals opaque
- Relies entirely on training

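A minimal Python sketch of the monolithic pattern, assuming a generic stateless text-completion call; the function name is a placeholder, not a specific vendor SDK.

def chat_model(prompt: str) -> str:
    # Placeholder for a single stateless API call: no tools, no memory,
    # no scaffold. One forward pass (or CoT) is the whole system.
    return f"model response to: {prompt}"

print(chat_model("Explain the tradeoffs of minimal scaffolding."))
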
Light Scaffolding
Model plus basic tool use and simple chains: RAG, function calling, single-agent loops. Example: GPT with plugins. (Sketch below.)
Adoption: HIGH (now to ~2027). Current mainstream; most deployed systems.
Safety: 5/10, Mixed. Tool use adds both capability and risk; the scaffold provides some inspection.
Agency: MEDIUM-HIGH. The model retains most decision-making.
Decomposition: BASIC. Simple tool calls and RAG retrieval.
Oversight: HUMAN (limited). Tool permissions are controllable.
Interpretability: MEDIUM. Scaffold code is readable; the model is opaque.
Key sources: RAG (2020).
Pros:
+ Scaffold logic inspectable
+ Tool permissions controllable
Cons:
- Tool use enables real-world harm
- Model decisions still opaque

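A minimal Python sketch of this scaffold pattern, assuming a placeholder model function and an invented tool set; the names (fake_model, search_docs, send_email) and the allowlist policy are illustrative, not any particular framework's API.

from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class ToolCall:
    name: str
    args: dict

# Human-controlled permission set: only listed tools may actually run.
ALLOWED_TOOLS = {"search_docs"}
TOOLS: dict[str, Callable[..., str]] = {
    "search_docs": lambda query: f"(retrieved passages for: {query})",
    "send_email": lambda to, body: "sent",  # registered but not allowed
}

def fake_model(transcript: str) -> Union[ToolCall, str]:
    # Stand-in for an LLM call: first requests retrieval, then answers.
    if "search_docs ->" not in transcript:
        return ToolCall("search_docs", {"query": "deployment safety"})
    return "Final answer, grounded in the retrieved passages."

def run_agent(task: str, model=fake_model, max_steps: int = 5) -> str:
    transcript = task
    for _ in range(max_steps):
        step = model(transcript)
        if isinstance(step, str):                # model gave a final answer
            return step
        if step.name not in ALLOWED_TOOLS:       # scaffold-level permission check
            transcript += f"\n[blocked tool call: {step.name}]"
            continue
        result = TOOLS[step.name](**step.args)
        transcript += f"\n[{step.name} -> {result}]"
    return "stopped: max steps reached"

print(run_agent("Summarize our deployment-safety options."))

The scaffold loop, not the model, decides what actually executes; that is where the "tool permissions controllable" property lives.
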
Tool AI / Oracle
From Bostrom's taxonomy. Tool AI: narrow and software-like, no persistent goals. Oracle: answers questions only. Agency is structurally constrained. (Sketch below.)
Adoption: MEDIUM (now, ongoing). Used for specialized applications.
Safety: 7/10, Favorable. Safety through limitation; capability traded for safety.
Agency: MINIMAL. No persistent goals, narrow scope.
Decomposition: N/A. Scope-restricted rather than decomposed.
Oversight: SCOPE CONSTRAINT. Safety through limitation.
Interpretability: LOW-MEDIUM. Limited scope means less to inspect.
Pros:
+ Minimal agency
+ Clear boundaries
+ Human decides actions
Cons:
- Limited capability
- May develop emergent agency

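A minimal sketch of an oracle-style constraint, assuming a text-in/text-out model function; the Oracle class and the ACTION_MARKERS filter are invented for illustration. The point is structural: the wrapper exposes no tools, keeps no memory, and leaves acting on answers to the human.

# Oracle wrapper: question in, text answer out; no tool registry, no memory,
# no action channel. Acting on the answer stays a human decision.
ACTION_MARKERS = ("CALL:", "EXECUTE:", "SEND:")  # hypothetical patterns to reject

class Oracle:
    def __init__(self, model_fn):
        self._model_fn = model_fn  # plain text -> text

    def ask(self, question: str) -> str:
        answer = self._model_fn(question)
        if any(marker in answer for marker in ACTION_MARKERS):
            return "[withheld: answer attempted to request an action]"
        return answer

oracle = Oracle(lambda q: f"My best estimate for '{q}' is ...")
print(oracle.ask("What failure modes does this rollout plan have?"))
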
Structured Safety Architectures: architectures designed with safety properties

CAIS / Service-Based
Comprehensive AI Services (Drexler). Many task-specific services rather than unified agents; agency is optional, not the default. (Sketch below.)
Adoption: LOW-MEDIUM (2026-2032). 15-25% chance it becomes the dominant paradigm.
Safety: 7/10, Favorable. Avoids unified agency; bounded goals; ongoing human direction.
Agency: LOW. Services have bounded, task-specific goals.
Decomposition: SERVICE-LEVEL. The system is many specialized services.
Oversight: HUMAN. Services informed by models of human approval.
Interpretability: PARTIAL. Service boundaries are visible; internals may be opaque.
Pros:
+ No unified agency
+ Bounded goals
+ Incremental development
Cons:
- Coordination complexity
- Agency creep possible
- Foundation model gap

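A minimal sketch of the service-based shape, with invented toy services; the service names and the pipeline are illustrative only. The structural point is that composition is a human-written script, not an agent choosing its own goals.

from typing import Callable

# Narrow, bounded services with fixed task interfaces.
SERVICES: dict[str, Callable[[str], str]] = {
    "summarize": lambda text: text[:80] + "...",
    "translate_to_french": lambda text: f"(fr) {text}",
    "draft_reply": lambda text: f"Draft reply regarding: {text}",
}

def run_pipeline(document: str, steps: list[str]) -> str:
    # The human picks and orders the steps; no service sets its own goals
    # or invokes other services on its own initiative.
    output = document
    for step in steps:
        output = SERVICES[step](output)
    return output

print(run_pipeline("Quarterly incident report text ...", ["summarize", "translate_to_french"]))
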
AI Control / Hierarchical (T/U)
Redwood Research's AI control agenda. A trusted model T monitors an untrusted model U; protocols are designed to hold even if U intentionally tries to subvert them. Reported results: 92% safety / 94% usefulness. (Sketch below.)
Adoption: MEDIUM (2025-2030). 20-35% chance; active research area.
Safety: 7/10, Favorable. Maintains safety despite intentional subversion attempts.
Agency: MEDIUM. U has the capability, but T constrains it.
Decomposition: T/U SPLIT. Binary split: trusted monitors untrusted.
Oversight: AI MONITORS AI. Scales without linear human effort.
Interpretability: PARTIAL. Protocol logic is visible; U's internals are constrained.
Pros:
+ Works despite subversion
+ Scalable oversight
+ Empirically validated
Cons:
- Capability gap may close
- Collusion risk
- T must be trustworthy

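A minimal sketch of a trusted-monitoring control protocol under invented stand-ins: untrusted_generate, trusted_suspicion, and the threshold are placeholders, not Redwood's actual models or numbers.

# U proposes; T scores how suspicious the proposal is; suspicious outputs are
# deferred (here: replaced by T's own answer; a human auditor is another option).
SUSPICION_THRESHOLD = 0.8  # placeholder deferral threshold

def untrusted_generate(task: str) -> str:
    return f"U's proposed solution to: {task}"

def trusted_suspicion(task: str, proposal: str) -> float:
    # T rates how likely the proposal is an attempted subversion (0 to 1).
    return 0.95 if "rm -rf" in proposal else 0.1

def trusted_generate(task: str) -> str:
    return f"T's safer (possibly weaker) solution to: {task}"

def control_protocol(task: str) -> str:
    proposal = untrusted_generate(task)
    if trusted_suspicion(task, proposal) >= SUSPICION_THRESHOLD:
        return trusted_generate(task)  # defer rather than trust U
    return proposal

print(control_protocol("refactor the billing module"))
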
Factored Cognition
Decompose tasks into isolated child tasks that are solved independently. Factor(T,U): 41% to 63% safety improvement; monitors get simpler contexts. (Sketch below.)
Adoption: MEDIUM (2025-2030). 15-30% chance; complements AI Control.
Safety: 7/10, Favorable. Decomposition creates easier tasks and clearer contexts.
Agency: LOW-MEDIUM. Components have limited context.
Decomposition: TASK-LEVEL. Complex tasks are split into isolated child tasks.
Oversight: CONTEXT ISOLATION. Attacks are more obvious on simpler tasks.
Interpretability: PARTIAL-HIGH. The decomposition structure is visible.
Pros:
+ Simpler contexts
+ Attacks more obvious
+ Compositional safety
Cons:
- Decomposition limits
- Information loss
- Usefulness cost

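A minimal sketch of the decompose / solve-in-isolation / recompose pattern, with placeholder decomposer, solver, and composer functions; a real system would back each role with model calls.

def decompose(task: str) -> list[str]:
    return [f"{task} - part {i}" for i in (1, 2, 3)]

def solve_in_isolation(subtask: str) -> str:
    # Each call sees only its own subtask, never the full parent context;
    # simpler contexts make anomalous behavior easier for monitors to spot.
    return f"answer({subtask})"

def compose(task: str, answers: list[str]) -> str:
    return f"{task}: " + "; ".join(answers)

def factored_solve(task: str) -> str:
    return compose(task, [solve_in_isolation(s) for s in decompose(task)])

print(factored_solve("audit this module for injection bugs"))
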
Open Agency Architecture
Drexler / davidad. Separates goal-setting, planning, evaluation, and execution; plans are externalized and interpretable. A foundation for Provably Safe AI. (Sketch below.)
Adoption: LOW (2027-2035). 5-15% chance; an ambitious long-term agenda.
Safety: 8/10, Favorable. Designed for formal analysis; externalized plans.
Agency: MEDIUM (bounded). Bounded tasks with time and budget constraints.
Decomposition: ROLE SEPARATION. Separate goal / plan / evaluate / execute roles.
Oversight: EXTERNALIZED PLANS. Plans are interpretable rather than opaque.
Interpretability: HIGH. Designed for formal analysis.
Pros:
+ Externalized plans
+ Formal analysis
+ Role separation
Cons:
- May not scale to TAI
- Specification difficulty
- Capability tax

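A minimal sketch of the role separation, assuming a toy Plan dataclass; the budget and time fields are invented stand-ins for the bounded-task constraints.

from dataclasses import dataclass

@dataclass
class Plan:                      # externalized, human-readable plan object
    goal: str
    steps: list[str]
    budget_usd: float
    time_limit_min: int

def make_plan(goal: str) -> Plan:    # planning role
    return Plan(goal, ["gather data", "draft report"], budget_usd=50.0, time_limit_min=30)

def evaluate(plan: Plan) -> bool:    # evaluation role: checked before anything runs
    return plan.budget_usd <= 100 and plan.time_limit_min <= 60

def execute(plan: Plan) -> str:      # execution role
    return f"executed {len(plan.steps)} steps toward: {plan.goal}"

proposal = make_plan("summarize this week's incident reports")   # goal set by a human
print(execute(proposal) if evaluate(proposal) else "plan rejected by evaluator")
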
Safety-First Cognitive Architectures
Federated architectures with transparent inter-component communication. Separate planning, execution, and memory; interpretability by design. (Sketch below.)
Adoption: LOW-MEDIUM (2025-2030). Emerging field; underdeveloped.
Safety: 7/10, Favorable. Intelligence from separate, non-agentic systems.
Agency: MEDIUM (federated). No single unified agent.
Decomposition: COGNITIVE ROLES. Separate planning / execution / memory.
Oversight: TRANSPARENT COMMS. Human-readable, rate-controlled communication.
Interpretability: HIGH. Communication channels are visible by design.
Pros:
+ Transparent by design
+ Federated = no unified agent
+ Rate-controlled
Cons:
- Field underdeveloped
- May not compete on capability

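A minimal sketch of a transparent, rate-controlled channel between components; the TransparentBus class and its limits are invented for illustration.

import time
from collections import deque

class TransparentBus:
    # All inter-component messages are plain text, logged, and rate-limited,
    # so traffic between planner / executor / memory stays inspectable.
    def __init__(self, max_msgs_per_minute: int = 30):
        self.log: list[tuple[float, str, str, str]] = []  # (time, sender, receiver, text)
        self._recent: deque[float] = deque()
        self._limit = max_msgs_per_minute

    def send(self, sender: str, receiver: str, text: str) -> bool:
        now = time.time()
        while self._recent and now - self._recent[0] > 60:
            self._recent.popleft()
        if len(self._recent) >= self._limit:              # rate control
            return False
        self._recent.append(now)
        self.log.append((now, sender, receiver, text))    # human-readable audit trail
        return True

bus = TransparentBus()
bus.send("planner", "executor", "Step 1: fetch yesterday's metrics")
print(bus.log)
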
Oversight Mechanisms: methods for supervising AI systems

Process Supervision
OpenAI. Reward each reasoning step, not just the outcome: 78.2% vs 72.4% for outcome supervision on MATH. Deployed in o1 models; detects bad reasoning. (Sketch below.)
Adoption: HIGH (now, expanding). Already deployed in production (o1).
Safety: 7/10, Favorable. Step-by-step verification catches bad reasoning.
Agency: VARIABLE. Does not constrain agency directly.
Decomposition: STEP-BY-STEP. Reasoning is split into verifiable steps.
Oversight: STEP VERIFICATION. Each step is evaluated for correctness.
Interpretability: MEDIUM-HIGH. Reasoning steps are visible.
Pros:
+ Catches bad reasoning
+ Deployed at scale
+ Strong empirical results
Cons:
- May not transfer to all domains
- Process-outcome gap
- Alien reasoning risk

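A minimal sketch of step-level scoring, with a trivial stand-in for a process reward model; a real PRM is a trained model, and the threshold here is invented.

def process_reward(step: str) -> float:
    # Placeholder step scorer; a trained process reward model goes here.
    return 0.2 if "just guess" in step else 0.9

def accept_solution(steps: list[str], threshold: float = 0.5) -> bool:
    # Judge every reasoning step, not only the final answer.
    return all(process_reward(step) >= threshold for step in steps)

solution = [
    "Let x be the unknown quantity.",
    "Set up the equation 2x + 3 = 11.",
    "Subtract 3 from both sides: 2x = 8.",
    "Divide by 2: x = 4.",
]
print(accept_solution(solution))   # True: each step scores above the threshold
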
Debate / Adversarial Oversight
Irving et al. Two AIs argue opposing positions and a human judges; truth should win under adversarial scrutiny. 60-80% accuracy on factual questions. (Sketch below.)
Adoption: LOW-MEDIUM (2026-2032). Research stage; promising but with open challenges.
Safety: 6/10, Mixed. Promising but vulnerable to sophisticated deception.
Agency: MEDIUM. Debaters have agency within the format.
Decomposition: ADVERSARIAL SPLIT. Two perspectives rather than task decomposition.
Oversight: HUMAN JUDGE. A human evaluates which argument wins.
Interpretability: MEDIUM. Arguments are visible; model internals are opaque.
Pros:
+ Forces consideration of counterarguments
+ Truth advantage theory
+ Externalized reasoning
Cons:
- Sophisticated deception may win
- Confidence escalation
- Struggles with complex reasoning

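A minimal sketch of the debate loop with placeholder debaters and a console-input human judge; nothing here reflects the actual experimental setups.

def debater(position: str, transcript: list[str]) -> str:
    # Placeholder for a model arguing one side.
    return f"[{position}] rebuttal to: {transcript[-1]}"

def human_judge(transcript: list[str]) -> str:
    print("\n".join(transcript))
    return input("Which side won, PRO or CON? ").strip().upper()

def run_debate(question: str, rounds: int = 2) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append(debater("PRO", transcript))
        transcript.append(debater("CON", transcript))
    return human_judge(transcript)   # adversarial scrutiny, human verdict

# run_debate("Is the proposed rollout plan net-positive for safety?")
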
IDA / Iterated Amplification
Christiano. Amplify weak agents via delegation, distill to faster models, and iterate; recursive decomposition. Related to AlphaGo Zero. (Sketch below.)
Adoption: LOW (research stage). A training methodology, still at the research stage.
Safety: 6/10, Mixed. Theoretical promise; limited empirical validation.
Agency: LOW (during training). Weak agents are amplified.
Decomposition: RECURSIVE. Task → subtasks → copies → integrate.
Oversight: AMPLIFIED HUMAN. Human judgment amplified via AI assistance.
Interpretability: MEDIUM. The training structure is visible; the distilled model less so.
Pros:
+ Human values preserved
+ Recursive safety
+ Theoretical elegance
Cons:
- Limited empirical validation
- Decomposition limits unclear
- Distillation may lose alignment

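A minimal sketch of the amplify/distill loop; amplify, distill, and the base model are schematic placeholders (real amplification uses an overseer decomposing questions, and real distillation trains a new model).

def amplify(question: str, model) -> str:
    # Overseer breaks the question into subquestions answered by the current model.
    subquestions = [f"{question} (sub {i})" for i in (1, 2)]
    return " + ".join(model(q) for q in subquestions)

def distill(amplified_fn):
    # Stand-in for training a faster model to imitate the amplified system.
    return lambda q: f"distilled[{amplified_fn(q)}]"

model = lambda q: f"base({q})"
for _ in range(2):                                # iterate: amplify, then distill
    model = distill(lambda q, m=model: amplify(q, m))

print(model("Is this deployment plan safe?"))
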
11 architectures across 3 categories