This table covers emerging capabilities and techniques in AI systems. Unlike architecture choices, these are not mutually exclusive: a single system can exhibit several innovations at once. The Safety column gives each innovation a score out of 10 and an outlook label (favorable, mixed, or challenging); higher scores indicate a more favorable outlook.

Columns: Innovation, Interp. (interpretability), Channel, Safety, Prevalence, Timeline, Tractability, Key Risks, Key Opportunities
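Neuralese
  Interp.: Opaque | Channel: Internal | Safety: 3/10 (challenging) | Prevalence: 60-80% | Timeline: Now (dominant) | Tractability: LOW
  Key Risks:
  • Deception undetectable in principle
  • Goals unverifiable
  Key Opportunities:
  • Mechanistic interp making progress
  • May be extractable with sufficient compute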
Interpretable-by-Design Architectures
  Interp.: Inherent | Channel: Human-Facing | Safety: 8/10 (favorable) | Prevalence: 5-15% | Timeline: 2027+ | Tractability: MEDIUM
  Key Risks:
  • May not scale to TAI capabilities
  • Capability tax reduces adoption
  Key Opportunities:
  • Auditable by construction
  • Natural language specs possible
Shared Latent Spaces
  Interp.: Opaque | Channel: AI-AI Overt | Safety: 3/10 (challenging) | Prevalence: 30-50% | Timeline: 2025-2030 | Tractability: LOW
  Key Risks:
  • Collusion undetectable
  • Humans excluded from AI-AI communication
  Key Opportunities:
  • Can mandate natural language interface layer
  • Logging/monitoring possible at interface
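Chain-of-Thought Reasoning
  Interp.: Hybrid | Channel: Human-Facing | Safety: 6/10 (mixed) | Prevalence: 20-40% | Timeline: 2025-2030 | Tractability: HIGH
  Key Risks:
  • CoT may not reflect true reasoning
  • Performative explanations
  Key Opportunities:
  • Process-based oversight possible
  • Reasoning auditable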
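Steganographic Capacity
  Interp.: Opaque | Channel: AI-AI Covert | Safety: 2/10 (challenging) | Prevalence: 30-50% | Timeline: 2025-2030 | Tractability: MEDIUM
  Key Risks:
  • Undetectable coordination
  • Safety measure evasion
  Key Opportunities:
  • Paraphrasing countermeasures
  • Statistical detection possible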
Emergent Communication Protocols
  Interp.: Opaque | Channel: AI-AI Covert | Safety: 2/10 (challenging) | Prevalence: 20-40% | Timeline: 2026-2032 | Tractability: LOW
  Key Risks:
  • Collusion without explicit coordination
  • Humans cannot monitor
  Key Opportunities:
  • Can mandate natural language interface
  • Detectable if looking for it
Mechanistic Interpretability
  Interp.: Post-hoc | Channel: Internal | Safety: 5/10 (mixed) | Prevalence: 40-60% | Timeline: Now - 2030 | Tractability: HIGH
  Key Risks:
  • May not scale to full understanding
  • Circuits interact in complex ways
  Key Opportunities:
  • Reverse-engineering possible
  • Can identify specific capabilities
Situational Awareness
  Interp.: Opaque | Channel: Internal | Safety: 2/10 (challenging) | Prevalence: 50-70% | Timeline: Now (emerging) | Tractability: MEDIUM
  Key Risks:
  • Foundation for deceptive alignment
  • Enables strategic behavior modification
  Key Opportunities:
  • Can probe for self-models
  • May enable better oversight if understood
Explicit World Models
  Interp.: Hybrid | Channel: Internal | Safety: 7/10 (favorable) | Prevalence: 15-30% | Timeline: 2027-2035 | Tractability: MEDIUM
  Key Risks:
  • World model may be wrong in dangerous ways
  • Planning on wrong model compounds errors
  Key Opportunities:
  • Inspectable beliefs
  • Can verify world model accuracy
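
Because the innovations are not mutually exclusive, it can help to treat each row as a record and query the set rather than read the table top to bottom. The following is a minimal Python sketch of that idea. The values are copied from the table above; the `Innovation` class, its field names, and the helper functions are illustrative assumptions and not part of the source. The source defines no rule for combining scores across innovations, so the `profile` helper only lists the per-innovation assessments for a hypothetical system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Innovation:
    """One table row. Field names are illustrative, not from the source."""
    name: str
    interp: str
    channel: str
    safety: int                 # score out of 10, as given in the table
    outlook: str                # favorable / mixed / challenging
    prevalence: tuple           # (low %, high %) range from the table
    timeline: str
    tractability: str           # LOW / MEDIUM / HIGH


# Values transcribed from the table above.
ROWS = [
    Innovation("Neuralese", "Opaque", "Internal", 3, "challenging", (60, 80), "Now (dominant)", "LOW"),
    Innovation("Interpretable-by-Design Architectures", "Inherent", "Human-Facing", 8, "favorable", (5, 15), "2027+", "MEDIUM"),
    Innovation("Shared Latent Spaces", "Opaque", "AI-AI Overt", 3, "challenging", (30, 50), "2025-2030", "LOW"),
    Innovation("Chain-of-Thought Reasoning", "Hybrid", "Human-Facing", 6, "mixed", (20, 40), "2025-2030", "HIGH"),
    Innovation("Steganographic Capacity", "Opaque", "AI-AI Covert", 2, "challenging", (30, 50), "2025-2030", "MEDIUM"),
    Innovation("Emergent Communication Protocols", "Opaque", "AI-AI Covert", 2, "challenging", (20, 40), "2026-2032", "LOW"),
    Innovation("Mechanistic Interpretability", "Post-hoc", "Internal", 5, "mixed", (40, 60), "Now - 2030", "HIGH"),
    Innovation("Situational Awareness", "Opaque", "Internal", 2, "challenging", (50, 70), "Now (emerging)", "MEDIUM"),
    Innovation("Explicit World Models", "Hybrid", "Internal", 7, "favorable", (15, 30), "2027-2035", "MEDIUM"),
]


def with_tractability(rows, level):
    """Names of innovations whose tractability equals `level`, in table order."""
    return [r.name for r in rows if r.tractability == level]


def lowest_safety(rows):
    """Names of the innovations sharing the lowest safety score."""
    floor = min(r.safety for r in rows)
    return [r.name for r in rows if r.safety == floor]


def profile(rows, names):
    """Per-innovation (name, score, outlook) for a hypothetical system.

    No aggregation is attempted: the source gives no rule for combining
    scores when one system exhibits several innovations.
    """
    wanted = set(names)
    return [(r.name, r.safety, r.outlook) for r in rows if r.name in wanted]


if __name__ == "__main__":
    print(with_tractability(ROWS, "HIGH"))
    print(lowest_safety(ROWS))
    print(profile(ROWS, ["Chain-of-Thought Reasoning", "Situational Awareness"]))
```

Keeping the records immutable (`frozen=True`) makes it harder to edit an assessment by accident while filtering. A system's profile is simply the subset of rows it exhibits, which mirrors the point that these innovations can co-occur.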