Longterm Wiki

Adversarial Training

Adversarial training is universally adopted at frontier labs, with an estimated $10–150M/year investment. It improves robustness to known attacks, but it creates an arms-race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it defends only against external attacks that have already been anticipated.
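The procedure the summary describes — hardening a model against a known attack — is usually framed as a min-max loop: an inner step finds a worst-case perturbation of each input, and an outer step fits the model on those perturbed inputs. The sketch below illustrates this with FGSM-style perturbations on a toy logistic model; the function names, hyperparameters, and toy data are illustrative assumptions, not from this wiki.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    # FGSM: step the input in the direction that increases the loss.
    # For logistic loss, dL/dx = (sigmoid(w*x + b) - y) * w.
    g = (sigmoid(w * x + b) - y) * w
    sign = 1.0 if g > 0 else (-1.0 if g < 0 else 0.0)
    return x + eps * sign

def adversarial_train(data, eps=0.3, lr=0.1, epochs=200):
    """Min-max loop: fit the model on worst-case (perturbed) inputs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm_perturb(x, y, w, b, eps)   # inner maximization
            p = sigmoid(w * x_adv + b)
            w -= lr * (p - y) * x_adv               # outer minimization
            b -= lr * (p - y)
    return w, b
```

Note that the loop only ever sees perturbations produced by the one attack it trains against — which is exactly the limitation the summary points to: robustness to the known attack, none against novel attack categories.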

Related Pages

Risks

Deceptive Alignment · Goal Misgeneralization · Deepfakes

Approaches

Reward Modeling · Process Supervision · Cooperative AI · Refusal Training

Key Debates

AI Accident Risk Cruxes · Why Alignment Might Be Hard · Is AI Existential Risk Real?

Other

Red Teaming · Paul Christiano

Concepts

Safety Orgs Overview · Alignment Training Overview · AI Misuse