
Safety × Architecture Compatibility Matrix

This matrix shows which safety research approaches are likely to work under different future AI architecture scenarios. Each architecture column includes a rough likelihood estimate for that scenario, and each safety approach carries an overall rating of how well it holds up across scenarios. Green = likely works, yellow = partially works or needs adaptation, red = unlikely to work, gray = unknown.

| Safety Approach | Overall | Scaled Transformers: HIGH (40-60%) | Scaffolded Agent Systems: HIGH (already emerging) | State-Space Models (SSMs): MEDIUM (20-35%) | Hybrid Neuro-Symbolic: MEDIUM (15-30%) | Novel Unknown Architecture: LOW-MEDIUM (10-25%) |
| --- | --- | --- | --- | --- | --- | --- |
| Mechanistic Interpretability | LOW | Best case: stable arch, white-box access | Which component? Emergent behavior hard to trace | Different internals, needs new techniques | Neural parts opaque, symbolic parts clear | Current techniques unlikely to transfer |
| Training-Based Alignment | MEDIUM | Standard RLHF pipeline applies | Can train base models, not scaffold logic | Gradient-based training still works | Neural parts trainable, symbolic parts not | Unknown: may not use gradient descent |
| Black-box Evaluations | MEDIUM-HIGH | Query access, predictable behavior | Emergent behavior hard to eval comprehensively | Behavioral testing still applicable | Can test end-to-end behavior | If queryable, basic evals work |
| Control & Containment | HIGH | Standard containment approaches | Multi-component systems harder to box | Architecture-agnostic principles apply | Symbolic parts may be easier to constrain | Core principles likely transfer |
| Theoretical Alignment | HIGHEST | Theory is architecture-independent | Agent theory applies to any agent | Mathematical frameworks still valid | May be easier to formally verify | Core theory transcends architecture |
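
For anyone who wants to poke at the matrix programmatically, below is a minimal sketch of one way to encode it as a plain lookup table. The names (`MATRIX`, `compatibility`) are illustrative, not part of any existing tool, and only two rows are filled in.

```python
# Minimal sketch: the compatibility matrix as a nested dict,
# keyed by safety approach, then by architecture scenario.
# Names here (MATRIX, compatibility) are illustrative only.

MATRIX = {
    "Mechanistic Interpretability": {
        "Scaled Transformers": "Best case: stable arch, white-box access",
        "Scaffolded Agent Systems": "Which component? Emergent behavior hard to trace",
        "State-Space Models (SSMs)": "Different internals, needs new techniques",
        "Hybrid Neuro-Symbolic": "Neural parts opaque, symbolic parts clear",
        "Novel Unknown Architecture": "Current techniques unlikely to transfer",
    },
    "Training-Based Alignment": {
        "Scaled Transformers": "Standard RLHF pipeline applies",
        "Scaffolded Agent Systems": "Can train base models, not scaffold logic",
        "State-Space Models (SSMs)": "Gradient-based training still works",
        "Hybrid Neuro-Symbolic": "Neural parts trainable, symbolic parts not",
        "Novel Unknown Architecture": "Unknown: may not use gradient descent",
    },
    # ... Black-box Evaluations, Control & Containment, and Theoretical
    # Alignment follow the same pattern (see the table above).
}


def compatibility(approach: str, architecture: str) -> str:
    """Look up the matrix cell for a (safety approach, architecture) pair."""
    return MATRIX[approach][architecture]


print(compatibility("Mechanistic Interpretability", "Novel Unknown Architecture"))
# Current techniques unlikely to transfer
```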
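
One way to read the overall ratings (LOW through HIGHEST) is as a likelihood-weighted judgment: approaches whose cells stay favorable across the more probable scenarios rate higher. The sketch below illustrates that reading under stated assumptions: the scenario weights are midpoints of the ranges in the column headers (with an assumed 0.5 for "already emerging"), and the per-cell scores are made-up numbers standing in for the qualitative ratings, not values from the matrix.

```python
# Sketch of the scenario-weighting intuition behind the overall ratings.
# All numbers below are illustrative assumptions, not values from the matrix:
# scenario weights are midpoints of the likelihood ranges in the column headers
# (with an assumed 0.5 for "already emerging"), and compatibility scores map
# the qualitative ratings to [0, 1] (1.0 = works well, 0.5 = partial, 0.0 = unlikely).

SCENARIO_WEIGHTS = {
    "Scaled Transformers": 0.50,         # HIGH (40-60%) -> midpoint
    "Scaffolded Agent Systems": 0.50,    # HIGH ("already emerging") -> assumed
    "State-Space Models (SSMs)": 0.275,  # MEDIUM (20-35%)
    "Hybrid Neuro-Symbolic": 0.225,      # MEDIUM (15-30%)
    "Novel Unknown Architecture": 0.175, # LOW-MEDIUM (10-25%)
}

# Illustrative per-cell scores for two rows (NOT taken from the matrix).
SCORES = {
    "Mechanistic Interpretability": {
        "Scaled Transformers": 1.0,
        "Scaffolded Agent Systems": 0.5,
        "State-Space Models (SSMs)": 0.5,
        "Hybrid Neuro-Symbolic": 0.5,
        "Novel Unknown Architecture": 0.0,
    },
    "Theoretical Alignment": {
        "Scaled Transformers": 1.0,
        "Scaffolded Agent Systems": 1.0,
        "State-Space Models (SSMs)": 1.0,
        "Hybrid Neuro-Symbolic": 1.0,
        "Novel Unknown Architecture": 1.0,
    },
}


def robustness(approach: str) -> float:
    """Likelihood-weighted average compatibility, normalized to [0, 1].

    The scenarios are not mutually exclusive (their likelihoods do not sum
    to 1), so the weights are normalized before averaging.
    """
    total = sum(SCENARIO_WEIGHTS.values())
    return sum(
        SCENARIO_WEIGHTS[arch] / total * score
        for arch, score in SCORES[approach].items()
    )


for approach in SCORES:
    print(f"{approach}: {robustness(approach):.2f}")
# Mechanistic Interpretability scores lower than Theoretical Alignment,
# matching the LOW vs HIGHEST ordering in the matrix.
```

Treat this strictly as an intuition pump: the scenario likelihoods overlap and do not sum to one, so a weighted average like this is only a rough prioritization heuristic, not a claim about how the ratings in the matrix were produced.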