How AI systems are organized for safety. These architectures are largely model-agnostic: they can be applied to transformers, SSMs, or future architectures. The key question is how to structure AI systems so that oversight and safety are maintained.
Key insight: lower agency + more decomposition + better oversight = generally safer, but these choices trade off against capability and practicality. See also: Model Architectures for what the AI is made of.
| Architecture | Current adoption level | Overall safety assessment | Level of autonomous decision-making | How tasks are broken down | Human oversight mechanism | Interpretability of internals | Key Sources | Safety Pros | Safety Cons |
|---|---|---|---|---|---|---|---|---|---|
| Monolithic / Minimal Scaffolding: single model with direct API access; no persistent memory, minimal tools (e.g., ChatGPT web or basic Claude usage) | DECLINING (now, legacy): legacy pattern; scaffolding adds clear value | 5/10 Mixed: simple threat model but limited interpretability | HIGH: single model makes all decisions | NONE: single forward pass or CoT in one context | MINIMAL: human sees inputs/outputs only | LOW: model internals opaque | InstructGPT (2022) | + Simple to analyze; + Limited action space | − Model internals opaque; − Relies entirely on training |
| Light Scaffolding: model + basic tool use + simple chains (RAG, function calling, single-agent loops), e.g. GPT with plugins (see the tool-loop sketch after this table) | HIGH (now-2027): current mainstream; most deployed systems | 5/10 Mixed: tool use adds capability and risk; the scaffold provides some inspection | MEDIUM-HIGH: model retains most decision-making | BASIC: simple tool calls, RAG retrieval | HUMAN (limited): tool permissions controllable | MEDIUM: scaffold code readable; model opaque | Toolformer (2023); RAG (2020) | + Scaffold logic inspectable; + Tool permissions controllable | − Tool use enables real-world harm; − Model decisions still opaque |
| Tool AI / Oracle (Bostrom's taxonomy): tool = narrow, software-like, no persistent goals; oracle = answers questions only; agency structurally constrained | MEDIUM (now, ongoing): used for specialized applications | 7/10 Favorable: safety through limitation; capability traded for safety | MINIMAL: no persistent goals, narrow scope | N/A: scope-restricted rather than decomposed | SCOPE CONSTRAINT: safety through limitation | LOW-MEDIUM: limited scope means less to inspect | | + Minimal agency; + Clear boundaries; + Human decides actions | − Limited capability; − May develop emergent agency |
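To make the light-scaffolding row concrete, here is a minimal sketch of a single-agent tool loop, assuming a hypothetical `call_model` stub in place of a real LLM API and a toy tool registry. The point it illustrates is that the scaffold code, the tool whitelist, and the permission boundary are ordinary, inspectable code even though the model behind `call_model` stays opaque.

```python
# Minimal sketch of a light-scaffolding tool loop (hypothetical model stub).
from typing import Callable

# Whitelisted tools: the scaffold, not the model, decides what is callable.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"search results for {q!r}",                  # stub tool
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
}

def call_model(prompt: str) -> str:
    """Hypothetical LLM call. A real scaffold would hit an API here."""
    # Canned behaviour so the sketch runs end to end.
    if "TOOL_RESULT" in prompt:
        return "FINAL: 4"
    return "CALL calculator: 2+2"

def agent_loop(task: str, max_steps: int = 5) -> str:
    prompt = f"Task: {task}"
    for _ in range(max_steps):
        reply = call_model(prompt)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("CALL "):
            name, _, arg = reply.removeprefix("CALL ").partition(":")
            tool = TOOLS.get(name.strip())
            if tool is None:                      # permission boundary
                prompt += "\nTOOL_RESULT: tool not permitted"
                continue
            prompt += f"\nTOOL_RESULT: {tool(arg.strip())}"
    return "no answer within step budget"

print(agent_loop("What is 2+2?"))
```

In a deployment of this shape, the whitelist, the permission check, and the step budget are the levers a human operator actually controls, which is what the table means by the scaffold providing some inspection.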
| Architecture | Current adoption level | Overall safety assessment | Level of autonomous decision-making | How tasks are broken down | Human oversight mechanism | Interpretability of internals | Key Sources | Safety Pros | Safety Cons |
|---|---|---|---|---|---|---|---|---|---|
| CAIS / Service-Based: Comprehensive AI Services (Drexler); many task-specific services rather than unified agents; agency is optional, not the default | LOW-MEDIUM (2026-2032): 15-25% chance of becoming the dominant paradigm | 7/10 Favorable: avoids unified agency; bounded goals; ongoing human direction | LOW: services have bounded, task-specific goals | SERVICE-LEVEL: system = many specialized services | HUMAN: services informed by models of human approval | PARTIAL: service boundaries visible; internals may be opaque | | + No unified agency; + Bounded goals; + Incremental development | − Coordination complexity; − Agency creep possible; − Foundation model gap |
| AI Control / Hierarchical (T/U): Redwood Research; a trusted model T monitors an untrusted model U; designed to work even if U intentionally subverts; 92% safety / 94% usefulness (see the trusted-monitoring sketch after this table) | MEDIUM (2025-2030): 20-35% chance; active research area | 7/10 Favorable: maintains safety despite intentional subversion attempts | MEDIUM: U has the capability but T constrains it | T/U SPLIT: binary split; trusted monitors untrusted | AI MONITORS AI: scalable without linear human effort | PARTIAL: protocol logic visible; U internals constrained | | + Works despite subversion; + Scalable oversight; + Empirically validated | − Capability gap may close; − Collusion risk; − T must be trustworthy |
| Factored Cognition: decompose tasks into isolated child tasks solved independently; Factor(T,U): 41% to 63% safety improvement; simpler contexts for monitors (see the decomposition sketch after this table) | MEDIUM (2025-2030): 15-30% chance; complements AI Control | 7/10 Favorable: decomposition creates easier tasks and clearer contexts | LOW-MEDIUM: components have limited context | TASK-LEVEL: complex tasks split into isolated children | CONTEXT ISOLATION: attacks are more obvious on simpler tasks | PARTIAL-HIGH: decomposition structure visible | | + Simpler contexts; + Attacks more obvious; + Compositional safety | − Decomposition limits; − Information loss; − Usefulness cost |
| Open Agency Architecture: Drexler/davidad; separates goal-setting, planning, evaluation, and execution; plans externalized and interpretable; foundation for Provably Safe AI (see the role-separation sketch after this table) | LOW (2027-2035): 5-15% chance; ambitious long-term agenda | 8/10 Favorable: designed for formal analysis; externalized plans | MEDIUM (bounded): bounded tasks with time/budget constraints | ROLE SEPARATION: separate goal-setting/planning/evaluation/execution | EXTERNALIZED PLANS: plans interpretable, not opaque | HIGH: designed for formal analysis | | + Externalized plans; + Formal analysis; + Role separation | − May not scale to TAI; − Specification difficulty; − Capability tax |
| Safety-First Cognitive Architectures: federated architectures with transparent inter-component communication; separate planning, execution, and memory; interpretability by design | LOW-MEDIUM (2025-2030): emerging field; underdeveloped | 7/10 Favorable: intelligence from separate, non-agentic systems | MEDIUM (federated): no single unified agent | COGNITIVE ROLES: separate planning/execution/memory | TRANSPARENT COMMS: human-readable, rate-controlled | HIGH: communication channels visible by design | | + Transparent by design; + Federated = no unified agent; + Rate-controlled | − Field underdeveloped; − May not compete on capability |
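The AI Control row describes a trusted/untrusted split; the sketch below shows only the control-flow skeleton of one trusted-monitoring protocol, with hypothetical `untrusted_model`, `trusted_monitor`, `trusted_model`, and `human_audit` stubs and made-up thresholds. Real protocols calibrate the suspicion thresholds against a limited human audit budget; this is not Redwood Research's implementation.

```python
# Simplified trusted-monitoring skeleton in the spirit of the AI Control row.
# All model functions are hypothetical stubs; a real protocol would use a
# capable untrusted model U, a weaker trusted model T, and a human audit budget.

AUDIT_THRESHOLD = 0.8   # suspicion above this goes to a human auditor
DEFER_THRESHOLD = 0.5   # suspicion above this falls back to the trusted model

def untrusted_model(task: str) -> str:
    return f"U's proposed solution for: {task}"              # stub

def trusted_monitor(task: str, solution: str) -> float:
    """T scores how suspicious U's output looks, in [0, 1]."""
    return 0.99 if "rm -rf" in solution else 0.2             # toy heuristic

def trusted_model(task: str) -> str:
    return f"T's (weaker but trusted) solution for: {task}"  # stub

def human_audit(task: str, solution: str) -> bool:
    print(f"[audit requested] {task!r}")
    return True   # stand-in for a real human decision

def control_protocol(task: str) -> str:
    proposal = untrusted_model(task)
    suspicion = trusted_monitor(task, proposal)
    if suspicion >= AUDIT_THRESHOLD:
        return proposal if human_audit(task, proposal) else trusted_model(task)
    if suspicion >= DEFER_THRESHOLD:
        return trusted_model(task)         # defer to trusted when unsure
    return proposal                        # low suspicion: use U's output

print(control_protocol("refactor the billing module"))
```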
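A minimal sketch of the factored-cognition pattern from the table: a task is recursively split into isolated child tasks, each solved with only its own context, then composed. `decompose` and `solve_leaf` are hypothetical stand-ins for model calls; the property being illustrated is that every unit of work a monitor has to check stays small and legible.

```python
# Sketch of factored cognition: a task is split into isolated child tasks,
# each solved with only its own context, then the answers are composed.
# `decompose` and `solve_leaf` are hypothetical stand-ins for model calls.

def decompose(task: str) -> list[str]:
    if task == "write a release note for v2.1":
        return ["list the user-facing changes in v2.1",
                "summarize the changes in two sentences"]
    return []                       # empty list means the task is a leaf

def solve_leaf(task: str) -> str:
    return f"<answer to: {task}>"   # stub; each call sees only its own subtask

def solve(task: str, depth: int = 0, max_depth: int = 3) -> str:
    subtasks = decompose(task) if depth < max_depth else []
    if not subtasks:
        return solve_leaf(task)
    # Children are solved independently; no child sees the others' context,
    # which keeps each monitoring problem small.
    child_answers = [solve(sub, depth + 1, max_depth) for sub in subtasks]
    return solve_leaf(f"combine for {task!r}: " + " | ".join(child_answers))

print(solve("write a release note for v2.1"))
```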
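For the Open Agency Architecture row, the sketch below shows the role separation in miniature, assuming hypothetical `propose_plan`, `evaluate_plan`, and `execute` functions: the plan is externalized as plain data with an explicit budget, and nothing runs until the evaluator approves it. The real proposal targets formal verification of plans, which this toy check does not attempt.

```python
# Sketch of open-agency-style role separation: plans are externalized data
# that an evaluator (and a human) can inspect before anything is executed.
# `propose_plan` and `evaluate_plan` are hypothetical stand-ins for models
# or formal verifiers; the executor only runs approved, bounded steps.

from dataclasses import dataclass

@dataclass
class Plan:
    goal: str
    steps: list[str]
    budget_steps: int             # explicit step/time bound

def propose_plan(goal: str) -> Plan:
    return Plan(goal, steps=["draft outline", "fill in sections"], budget_steps=10)

def evaluate_plan(plan: Plan) -> bool:
    # Stand-in for human review or formal checking of the externalized plan.
    return len(plan.steps) <= plan.budget_steps

def execute(plan: Plan) -> None:
    for step in plan.steps:
        print(f"executing bounded step: {step}")

plan = propose_plan("write a project report")   # planner proposes
if evaluate_plan(plan):                          # evaluator checks the plan
    execute(plan)                                # executor runs only what was approved
else:
    print("plan rejected; nothing executed")
```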
| Architecture | Current adoption level | Overall safety assessment | Level of autonomous decision-making | How tasks are broken down | Human oversight mechanism | Interpretability of internals | Key Sources | Safety Pros | Safety Cons |
|---|---|---|---|---|---|---|---|---|---|
| Process Supervision: OpenAI; reward each reasoning step, not just the outcome; 78.2% vs 72.4% on MATH; deployed in o1 models; detects bad reasoning (see the step-scoring sketch after this table) | HIGH (now, expanding): already deployed in production (o1) | 7/10 Favorable: step-by-step verification catches bad reasoning | VARIABLE: doesn't constrain agency directly | STEP-BY-STEP: reasoning split into verifiable steps | STEP VERIFICATION: each step evaluated for correctness | MEDIUM-HIGH: reasoning steps visible | PRM800K (2023) | + Catches bad reasoning; + Deployed at scale; + Strong empirical results | − May not transfer to all domains; − Process-outcome gap; − Alien reasoning risk |
| Debate / Adversarial Oversight: Irving et al.; two AIs argue opposing positions and a human judges; truth should win via adversarial scrutiny; 60-80% on factual questions (see the debate sketch after this table) | LOW-MEDIUM (2026-2032): research stage; promising but faces challenges | 6/10 Mixed: promising but vulnerable to sophisticated deception | MEDIUM: debaters have agency within the format | ADVERSARIAL SPLIT: two perspectives, not task decomposition | HUMAN JUDGE: human evaluates which argument wins | MEDIUM: arguments visible; model internals opaque | | + Forces consideration of counterarguments; + Truth advantage theory; + Externalized reasoning | − Sophisticated deception may win; − Confidence escalation; − Complex reasoning struggles |
| IDA / Iterated Amplification: Christiano; amplify weak agents via delegation, distill into faster models, iterate; recursive decomposition; related to AlphaGo Zero (see the amplify-and-distill sketch after this table) | LOW (research stage): training methodology | 6/10 Mixed: theoretical promise; limited empirical validation | LOW (during training): weak agents amplified | RECURSIVE: task → subtasks → copies → integrate | AMPLIFIED HUMAN: human judgment amplified via AI | MEDIUM: training structure visible; the distilled model less so | | + Human values preserved; + Recursive safety; + Theoretical elegance | − Limited empirical validation; − Decomposition limits unclear; − Distillation may lose alignment |
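To illustrate the process-supervision row, here is a toy sketch of scoring a chain of reasoning step by step, with a hypothetical `step_reward` function standing in for a trained process reward model. Aggregating by the minimum step score is one simple choice; taking the product of per-step scores is another common one.

```python
# Sketch of process supervision: each reasoning step gets its own score,
# so bad reasoning can be caught even when the final answer looks fine.
# `step_reward` is a hypothetical stand-in for a trained process reward model.

def step_reward(step: str) -> float:
    """Returns an estimate of P(step is correct); a real PRM is a trained model."""
    return 0.1 if "assume" in step and "without" in step else 0.9   # toy heuristic

def score_solution(steps: list[str]) -> float:
    # Simple aggregation: the solution is only as good as its weakest step.
    return min(step_reward(s) for s in steps)

solution = [
    "Let x be the number of apples.",
    "assume x is even without justification",   # flagged by the toy PRM
    "Therefore x = 4.",
]
for s in solution:
    print(f"{step_reward(s):.2f}  {s}")
print("solution score:", score_solution(solution))
```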
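A bare-bones sketch of the debate protocol from the table: two debaters argue fixed, opposing answers over a few rounds and a judge picks a side from the transcript. Both `debater` and `judge` are hypothetical stubs here; in the actual proposal the judge is a human and the debaters are trained to win the game.

```python
# Sketch of the debate protocol: two models argue opposite answers over a few
# rounds and a judge picks the more convincing side. The debater and judge
# functions are hypothetical stubs; in the real protocol the judge is a human.

def debater(position: str, transcript: list[str], round_no: int) -> str:
    return f"[{position}] argument {round_no}: evidence supporting {position}"  # stub

def judge(question: str, transcript: list[str]) -> str:
    # Stand-in for a human judge reading the full transcript.
    return "A" if len(transcript) % 2 == 0 else "B"

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    transcript: list[str] = []
    for r in range(1, rounds + 1):
        transcript.append(debater(answer_a, transcript, r))
        transcript.append(debater(answer_b, transcript, r))
    winner = judge(question, transcript)
    return answer_a if winner == "A" else answer_b

print(run_debate("Is the code change safe to merge?", "yes", "no"))
```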
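Finally, a very rough sketch of one IDA round, under the assumption that "distillation" can be caricatured as caching the amplified behaviour: the overseer answers by delegating subquestions to copies of the current weak model, and the distilled result seeds the next round. `weak_model`, `decompose`, `amplify`, and `distill` are all hypothetical stubs; real IDA distills by training a new model on the amplified system's behaviour.

```python
# Sketch of one IDA round: "amplify" a weak model by letting an overseer
# delegate subquestions to copies of it, then "distill" the amplified
# behaviour into a lookup the next round can reuse. Everything here is a
# hypothetical stub; real IDA distills by training a new model.

distilled: dict[str, str] = {}      # stands in for the distilled model

def weak_model(question: str) -> str:
    return distilled.get(question, f"<weak answer to: {question}>")

def decompose(question: str) -> list[str]:
    return [f"sub-question 1 of {question!r}", f"sub-question 2 of {question!r}"]

def amplify(question: str) -> str:
    # The overseer answers by combining sub-answers from copies of the
    # current weak model instead of answering directly.
    sub_answers = [weak_model(q) for q in decompose(question)]
    return f"<combined: {', '.join(sub_answers)}>"

def distill(questions: list[str]) -> None:
    # Stand-in for training a faster model to imitate the amplified system.
    for q in questions:
        distilled[q] = amplify(q)

training_questions = ["How should we prioritize these bug reports?"]
distill(training_questions)                   # one amplify-then-distill round
print(weak_model(training_questions[0]))      # the next round starts from here
```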