Longterm Wiki
Updated 2026-03-13
Summary

An interactive, sortable table summarizing how well current AI safety approaches are expected to generalize to future AI architectures. For each approach it shows the expected generalization level, the dependencies it needs to work, and the developments that threaten it.
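Since the page is described as interactive and sortable, here is a minimal sketch of the data shape one row would need. The TypeScript names and the ordinal ranking are illustrative assumptions, not the wiki's actual schema:

```ts
// Hypothetical row shape for the table below; field names are illustrative.
type Generalization = "LOW" | "MEDIUM" | "MEDIUM-HIGH" | "HIGH" | "HIGHEST";

interface SafetyApproach {
  name: string;
  description: string;
  techniques: string[];
  generalization: Generalization;
  requires: string[]; // dependencies: what must hold for the approach to work
  threatenedBy: string[]; // developments that would undermine it
}

// Ordinal rank so the "Expected generalization" column sorts meaningfully
// rather than alphabetically.
const rank: Record<Generalization, number> = {
  LOW: 0,
  MEDIUM: 1,
  "MEDIUM-HIGH": 2,
  HIGH: 3,
  HIGHEST: 4,
};

const byGeneralization = (a: SafetyApproach, b: SafetyApproach) =>
  rank[a.generalization] - rank[b.generalization];
```

Sorting the generalization column descending is then `rows.sort((a, b) => byGeneralization(b, a))`.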

Change History
Remove legacy `pageTemplate` frontmatter · 3 weeks ago

Removed the legacy `pageTemplate` frontmatter field from 15 MDX files. This field was carried over from the Astro/Starlight era and is not used by the Next.js application.

opus-4-6 · ~10min
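
For reference, this kind of cleanup can be scripted. A minimal sketch, assuming each MDX file begins with a `---`-delimited YAML frontmatter block; the glob pattern and the use of the `glob` package are assumptions, not details recorded in the change:

```ts
import { readFileSync, writeFileSync } from "node:fs";
import { globSync } from "glob"; // assumes the `glob` package is installed

// Hypothetical cleanup script, not the one actually used for this change.
for (const path of globSync("src/content/**/*.mdx")) {
  const source = readFileSync(path, "utf8");
  // Match the YAML frontmatter block at the top of the file.
  const match = source.match(/^---\n([\s\S]*?)\n---/);
  if (!match) continue;
  // Drop any `pageTemplate: ...` line, leaving the rest of the block intact.
  const cleaned = match[1]
    .split("\n")
    .filter((line) => !/^pageTemplate\s*:/.test(line))
    .join("\n");
  if (cleaned !== match[1]) {
    writeFileSync(path, source.replace(match[1], cleaned));
    console.log(`stripped pageTemplate from ${path}`);
  }
}
```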

Safety Generalizability Table

| Approach | Description | Key techniques | Expected generalization to future AI architectures | Requires (to work) | Threatened by |
| --- | --- | --- | --- | --- | --- |
| Mechanistic Interpretability | Circuit-level understanding of model internals. High value if it works, but highly dependent on architecture stability and access. | Circuits, probing, activation patching | LOW | White-box access available? Representations converge? Architecture stable enough? | Heavy scaffolding? Novel architecture emerges? |
| Training-Based Alignment | Shaping model behavior through training signals. Requires training access but is somewhat architecture-agnostic. | RLHF, Constitutional AI, debate | MEDIUM | Training access available? Gradient-based training continues? Single trainable system? | Long distillation chains? |
| Black-Box Evaluations | Behavioral testing of models from the outside. Requires only query access and is relatively architecture-agnostic. | Capability evals, red-teaming, benchmarks | MEDIUM-HIGH | Query access available? Behavior predictable enough? | Emergent multi-agent behavior? |
| Control & Containment | Boxing, monitoring, tripwires, and capability control. Focuses on constraining systems regardless of their internals. | Sandboxing, monitoring, kill switches | HIGH | Sandboxing feasible? Monitoring effective? Capability boundaries clear? | Few threats identified |
| Theoretical Alignment | Mathematical frameworks, optimization theory, agent foundations. Architecture-independent by nature. | Agent foundations, decision theory, formal frameworks | HIGHEST | Math applies to real systems? | Few threats identified |
5 safety approaches