Technical Innovations

Emerging capabilities and techniques in AI systems. Unlike architecture choices, these are not mutually exclusive: a single system can exhibit several of them at once. The Safety column rates the outlook for each innovation on a 10-point scale, where higher scores are more favorable.

Innovation	Interpretability	Channel	Safety	Prevalence	Timeline	Tractability	Key Risks	Key Opportunities
Neuralese	Opaque	Internal	3/10 (challenging)	60-80%	Now (dominant)	LOW	• Deception undetectable in principle • Goals unverifiable	• Mechanistic interpretability making progress • May be extractable with sufficient compute
Interpretable-by-Design Architectures	Inherent	Human-Facing	8/10 (favorable)	5-15%	2027+	MEDIUM	• May not scale to TAI capabilities • Capability tax reduces adoption	• Auditable by construction • Natural language specs possible
Shared Latent Spaces	Opaque	AI-AI Overt	3/10 (challenging)	30-50%	2025-2030	LOW	• Collusion undetectable • Humans excluded from AI-AI communication	• Can mandate natural language interface layer • Logging/monitoring possible at interface
Chain-of-Thought Reasoning	Hybrid	Human-Facing	6/10 (mixed)	20-40%	2025-2030	HIGH	• CoT may not reflect true reasoning • Performative explanations	• Process-based oversight possible • Reasoning auditable
Steganographic Capacity	Opaque	AI-AI Covert	2/10 (challenging)	30-50%	2025-2030	MEDIUM	• Undetectable coordination • Safety measure evasion	• Paraphrasing countermeasures • Statistical detection possible
Emergent Communication Protocols	Opaque	AI-AI Covert	2/10 (challenging)	20-40%	2026-2032	LOW	• Collusion without explicit coordination • Humans cannot monitor	• Can mandate natural language interface • Detectable if looking for it
Mechanistic Interpretability	Post-hoc	Internal	5/10 (mixed)	40-60%	Now - 2030	HIGH	• May not scale to full understanding • Circuits interact in complex ways	• Reverse-engineering possible • Can identify specific capabilities
Situational Awareness	Opaque	Internal	2/10 (challenging)	50-70%	Now (emerging)	MEDIUM	• Foundation for deceptive alignment • Enables strategic behavior modification	• Can probe for self-models • May enable better oversight if understood
Explicit World Models	Hybrid	Internal	7/10 (favorable)	15-30%	2027-2035	MEDIUM	• World model may be wrong in dangerous ways • Planning on wrong model compounds errors	• Inspectable beliefs • Can verify world model accuracy
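
Because the innovations are not mutually exclusive, one way to read the table is as a lookup keyed by innovation name, with each system described by the set of innovations it exhibits. The sketch below is illustrative only: the `Innovation` record, the `profile` helper, and the example system are hypothetical names introduced here, and only four rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Innovation:
    name: str
    safety: int        # score out of 10, from the Safety column
    outlook: str       # label from the Safety column
    tractability: str  # LOW / MEDIUM / HIGH, from the Tractability column


# A few rows copied from the table above; the others would follow the same shape.
TABLE = {
    row.name: row
    for row in [
        Innovation("Neuralese", 3, "challenging", "LOW"),
        Innovation("Chain-of-Thought Reasoning", 6, "mixed", "HIGH"),
        Innovation("Mechanistic Interpretability", 5, "mixed", "HIGH"),
        Innovation("Situational Awareness", 2, "challenging", "MEDIUM"),
    ]
}


def profile(system_innovations):
    """Return the table rows for a system, least favorable safety score first."""
    rows = [TABLE[name] for name in system_innovations]
    return sorted(rows, key=lambda row: row.safety)


# Hypothetical system that exhibits three innovations at once (an assumption
# for illustration, not a claim about any real system).
example_system = {"Neuralese", "Chain-of-Thought Reasoning", "Situational Awareness"}

for row in profile(example_system):
    print(f"{row.safety}/10 {row.outlook:<12} {row.name}")
```

Sorting by the safety score puts the least favorable innovations first; how to combine scores across innovations is left open, since the table does not define an aggregate.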