Adversarial Training
Adversarial training, universally adopted at frontier labs with an estimated $10-150M/year in investment, improves robustness to known attacks but creates an arms-race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends against externally supplied attacks, not failures that originate inside the model itself.
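The core loop behind the technique can be made concrete with a minimal sketch: generate worst-case perturbed inputs at each step, then update the model on those inputs instead of the clean ones. The example below uses FGSM-style perturbations on a toy logistic-regression model; all names, hyperparameters, and the synthetic data are illustrative assumptions, not any particular lab's pipeline.

```python
import numpy as np

# Illustrative adversarial-training sketch (FGSM-style) on a toy
# logistic-regression model. Everything here is a simplified
# assumption for exposition, not a production recipe.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_w(w, x, y):
    # Gradient of mean binary cross-entropy with respect to weights w.
    p = sigmoid(x @ w)
    return x.T @ (p - y) / len(y)

def fgsm_perturb(w, x, y, eps):
    # Fast Gradient Sign Method: shift each input in the direction
    # that increases the loss, within an L-infinity budget eps.
    p = sigmoid(x @ w)
    grad_x = np.outer(p - y, w)  # dL/dx for each example
    return x + eps * np.sign(grad_x)

def adversarial_train(x, y, eps=0.1, lr=0.5, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=x.shape[1])
    for _ in range(steps):
        x_adv = fgsm_perturb(w, x, y, eps)  # craft attacks on the fly
        w -= lr * loss_grad_w(w, x_adv, y)  # train on the perturbed batch
    return w

# Toy linearly separable data: label = 1 when the first feature is positive.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = (x[:, 0] > 0).astype(float)

w = adversarial_train(x, y)
x_adv = fgsm_perturb(w, x, y, eps=0.1)
acc_adv = np.mean((sigmoid(x_adv @ w) > 0.5) == y)
```

Note how the sketch also illustrates the limitation named above: the model is hardened only against the attack family it was trained on (bounded input perturbations), which says nothing about novel attack categories or behavior the model produces unprompted.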