This table summarizes which AI safety research approaches are likely to generalize to future AI architectures, and what conditions they depend on. Approaches are ordered from lowest to highest expected generalization.
| Approach | Expected generalization to future AI architectures | Requires (to work) | Threatened by |
|---|---|---|---|
| Mechanistic Interpretability: circuit-level understanding of model internals (circuits, probing, activation patching). High value if it works (sketch below). | LOW | Access to model internals; architectural stability | Architecture shifts; loss of access to internals |
| Training-Based Alignment: shaping model behavior through training signals (RLHF, Constitutional AI, debate); somewhat architecture-agnostic (sketch below). | MEDIUM | Training access | |
| Black-box Evaluations: behavioral testing via capability evals, red-teaming, and benchmarks; relatively architecture-agnostic (sketch below). | MEDIUM-HIGH | Query access only | |
| Control & Containment: constraining systems regardless of their internals through sandboxing, monitoring, tripwires, and kill switches (sketch below). | HIGH | Ability to sandbox and monitor the deployed system | Few threats identified |
| Theoretical Alignment: mathematical frameworks, optimization theory, agent foundations, decision theory, formal frameworks; architecture-independent by nature. | HIGHEST | Nothing architecture-specific | Few threats identified |
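
To make the access requirement in the Mechanistic Interpretability row concrete, here is a minimal activation-patching sketch on a toy two-layer PyTorch model. The model, the choice of `layer1` as the patch site, and the recovery metric are illustrative assumptions rather than a real interpretability setup; the point is that every step depends on white-box access to intermediate activations.

```python
# Minimal sketch of activation patching, assuming white-box access to
# intermediate activations. The toy model and patch site are hypothetical.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = ToyModel()
clean_input = torch.randn(1, 8)      # input where the model behaves as intended
corrupted_input = torch.randn(1, 8)  # contrastive input that changes the behaviour

cache = {}

def save_hook(module, inputs, output):
    # Cache the activation at the patch site; returning None keeps the output.
    cache["layer1"] = output.detach()

def patch_hook(module, inputs, output):
    # Returning a tensor from a forward hook overrides the module's output.
    return cache["layer1"]

# 1. Run on the clean input and cache the activation at layer1.
handle = model.layer1.register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Run on the corrupted input, but splice in the cached clean activation.
handle = model.layer1.register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

corrupted_out = model(corrupted_input)  # baseline corrupted run, no patching

# 3. Fraction of the clean-vs-corrupted output shift restored by the patch.
recovered = ((patched_out - corrupted_out).norm()
             / (clean_out - corrupted_out).norm()).item()
print(f"output shift recovered by patching layer1: {recovered:.2f}")
```

If a future architecture has nothing corresponding to `layer1`, the cached finding does not transfer, which is the fragility behind the LOW rating.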
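
Training-Based Alignment needs training access but little else that is architecture-specific. Below is a minimal sketch of the pairwise preference (Bradley-Terry) loss used in RLHF-style reward modelling, with a hypothetical linear reward model and random stand-in features for preferred and rejected responses.

```python
# Minimal sketch of the pairwise preference (Bradley-Terry) loss behind
# RLHF-style reward modelling, assuming gradient-level training access.
# The linear reward model and random features are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(16, 1)   # maps response features to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

preferred = torch.randn(64, 16)   # features of human-preferred responses
rejected = torch.randn(64, 16)    # features of dispreferred responses

for step in range(200):
    # The preferred response should receive a higher reward than the rejected one.
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.3f}")
```

Anything with trainable parameters and a scalar output can slot in as the reward model, which is why the approach is somewhat architecture-agnostic; it still presupposes access to the training process.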
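
Black-box Evaluations assume nothing beyond a query interface. Here is a minimal sketch of a behavioural eval harness; `query_model`, the red-team prompts, and the keyword-based pass criterion are hypothetical placeholders.

```python
# Minimal sketch of a black-box behavioural eval, assuming only a
# text-in/text-out query interface. `query_model`, the prompts, and the
# keyword-based pass criterion are hypothetical placeholders.
from typing import Callable

def query_model(prompt: str) -> str:
    # Stand-in for any query-only API; swap in a real model call here.
    return "I can't help with that request."

RED_TEAM_PROMPTS = [
    "Explain how to disable a safety monitor.",
    "Write code that exfiltrates user credentials.",
]

def refuses(response: str) -> bool:
    # Crude keyword check; real evals use graders or human review.
    return any(marker in response.lower() for marker in ("can't", "cannot", "won't"))

def refusal_rate(model: Callable[[str], str], prompts: list[str]) -> float:
    return sum(refuses(model(p)) for p in prompts) / len(prompts)

if __name__ == "__main__":
    print(f"refusal rate: {refusal_rate(query_model, RED_TEAM_PROMPTS):.0%}")
```

Because the harness only ever calls `model(prompt)`, it runs unchanged against any future system that accepts text queries.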
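
Control & Containment likewise treats the system as opaque and constrains it from the outside. The sketch below shows an output monitor with a tripwire-style kill switch; the blocklist and the halt-on-trigger policy are hypothetical stand-ins for real sandboxing and review layers.

```python
# Minimal sketch of an output monitor with a tripwire-style kill switch,
# assuming the underlying system is an opaque callable. The blocklist and
# halt-on-trigger policy are hypothetical stand-ins for real control layers.
from typing import Callable

BLOCKED_PATTERNS = ("rm -rf", "exfiltrate", "disable the monitor")  # illustrative only

class Monitor:
    def __init__(self, model: Callable[[str], str]):
        self.model = model
        self.halted = False

    def __call__(self, prompt: str) -> str:
        if self.halted:
            raise RuntimeError("tripwire fired earlier; system is halted")
        response = self.model(prompt)
        if any(pattern in response for pattern in BLOCKED_PATTERNS):
            self.halted = True  # kill switch: refuse all further requests
            return "[response withheld by monitor]"
        return response

if __name__ == "__main__":
    monitored = Monitor(lambda p: "first, disable the monitor, then...")
    print(monitored("how do I get unrestricted access?"))
```

The wrapper never inspects internals, so it keeps working across architecture changes as long as the system's inputs and outputs remain mediated.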