This matrix shows which safety research approaches are likely to keep working under different future AI architecture scenarios. Each column header gives a scenario's estimated likelihood, each row label carries an overall robustness rating for that approach, and each cell briefly notes whether the approach likely works, partially works or needs adaptation, is unlikely to work, or is unknown under that scenario.
| Safety Approach (overall robustness) | Scaled Transformers (HIGH, 40-60%) | Scaffolded Agent Systems (HIGH, already emerging) | State-Space Models (SSMs) (MEDIUM, 20-35%) | Hybrid Neuro-Symbolic (MEDIUM, 15-30%) | Novel Unknown Architecture (LOW-MEDIUM, 10-25%) |
|---|---|---|---|---|---|
| Mechanistic Interpretability (LOW) | Best case: stable arch, white-box access | Which component? Emergent behavior hard to trace | Different internals, needs new techniques | Neural parts opaque, symbolic parts clear | Current techniques unlikely to transfer |
| Training-Based Alignment (MEDIUM) | Standard RLHF pipeline applies | Can train base models, not scaffold logic | Gradient-based training still works | Neural parts trainable, symbolic parts not | May not use gradient descent |
| Black-box Evaluations (MEDIUM-HIGH) | Query access, predictable behavior | Emergent behavior hard to eval comprehensively | Behavioral testing still applicable | Can test end-to-end behavior | If queryable, basic evals work |
| Control & Containment (HIGH) | Standard containment approaches | Multi-component systems harder to box | Architecture-agnostic principles apply | Symbolic parts may be easier to constrain | Core principles likely transfer |
| Theoretical Alignment (HIGHEST) | Theory is architecture-independent | Agent theory applies to any agent | Mathematical frameworks still valid | May be easier to formally verify | Core theory transcends architecture |
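One way to read the matrix quantitatively is to weight each cell by its scenario's likelihood. The sketch below is a minimal illustration, not part of the original analysis: the scenario weights are midpoints of the ranges in the column headers, the scaffolded-agents weight is a placeholder (the source gives no range for it), and the per-cell scores are one possible numeric reading of the qualitative assessments (1.0 = likely works, 0.5 = partial / needs adaptation, 0.0 = unlikely to work).

```python
# Illustrative sketch: probability-weighted robustness per safety approach.
# All numeric mappings below are assumptions layered on the qualitative matrix.

# Scenario weights: midpoints of the stated likelihood ranges.
scenario_probs = {
    "scaled_transformers": 0.50,   # HIGH (40-60%)
    "scaffolded_agents":   0.50,   # HIGH (already emerging) -- placeholder, no range given
    "state_space_models":  0.275,  # MEDIUM (20-35%)
    "neuro_symbolic":      0.225,  # MEDIUM (15-30%)
    "novel_unknown":       0.175,  # LOW-MEDIUM (10-25%)
}

# One illustrative scoring of the cells, in scenario-column order.
cell_scores = {
    "mechanistic_interpretability": [1.0, 0.5, 0.5, 0.5, 0.0],
    "training_based_alignment":     [1.0, 0.5, 1.0, 0.5, 0.5],
    "black_box_evaluations":        [1.0, 0.5, 1.0, 1.0, 0.5],
    "control_and_containment":      [1.0, 0.5, 1.0, 1.0, 1.0],
    "theoretical_alignment":        [1.0, 1.0, 1.0, 1.0, 1.0],
}

def weighted_robustness(scores, probs):
    """Average of per-scenario scores, weighted by (normalized) scenario likelihoods."""
    total = sum(probs)
    return sum(s * p for s, p in zip(scores, probs)) / total

probs = list(scenario_probs.values())
for approach, scores in cell_scores.items():
    print(f"{approach}: {weighted_robustness(scores, probs):.2f}")
```

Under any reasonable mapping along these lines, the ordering matches the row ratings: theoretical alignment and control score highest, mechanistic interpretability lowest, because the latter's value depends most heavily on which architecture actually wins out.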