Safety Research Generalizability

Columns:Presets:

This table summarizes which AI safety research approaches are likely to generalize to future AI architectures, and what conditions they depend on. Approaches are ordered from lowest to highest expected generalization.

Expected generalization to future AI architectures
Requires (to work)Threatened by
Mechanistic Interpretability
Circuit-level understanding of model internals. High value if it works, but highly dependent on architecture stability and access.
Circuits, probing, activation patching
LOW
  • White-box access available?
  • Representations converge?
  • Architecture stable enough?
  • Heavy scaffolding?
  • Novel architecture emerges?
Training-Based Alignment
Shaping model behavior through training signals (RLHF, Constitutional AI, debate). Requires training access but somewhat architecture-agnostic.
RLHF, Constitutional AI, debate
MEDIUM
  • Training access available?
  • Gradient-based training continues?
  • Single trainable system?
  • Long distillation chains?
Black-box Evaluations
Behavioral testing, capability evals, red-teaming. Only requires query access, relatively architecture-agnostic.
Capability evals, red-teaming, benchmarks
MEDIUM-HIGH
  • Query access available?
  • Behavior predictable enough?
  • Emergent multi-agent behavior?
Control & Containment
Boxing, monitoring, tripwires, capability control. Focuses on constraining systems regardless of their internals.
Sandboxing, monitoring, kill switches
HIGH
  • Sandboxing feasible?
  • Monitoring effective?
  • Capability boundaries clear?
Few threats identified
Theoretical Alignment
Mathematical frameworks, optimization theory, agent foundations. Architecture-independent by nature.
Agent foundations, decision theory, formal frameworks
HIGHEST
  • Math applies to real systems?
Few threats identified
5 safety approaches