AI-Assisted Alignment
Using current AI systems to assist with alignment research tasks, including red-teaming, interpretability, and recursive oversight. AI-assisted red-teaming has reduced jailbreak success rates from 86% to 4.4%, and weak-to-strong generalization can recover GPT-3.5-level performance from GPT-2 supervision.
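The weak-to-strong idea mentioned above can be illustrated with a toy sketch (not from the source, and a deliberate simplification): here the "weak supervisor" is modeled as a labeler that is right only ~80% of the time, and the "strong student" is a logistic-regression model trained on those noisy labels. Because the supervisor's errors are random, the student can generalize past them.

```python
import numpy as np

# Toy weak-to-strong setup (illustrative assumption, not the actual method):
# weak supervision is modeled as noisy labels; the strong student is a
# logistic-regression model trained on them by plain gradient descent.
rng = np.random.default_rng(0)

# Ground truth: class 1 iff x0 + x1 > 0.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(float)

# Weak supervisor: flips ~20% of the labels at random.
flip = rng.random(2000) < 0.2
y_weak = np.where(flip, 1.0 - y_true, y_true)
weak_acc = float((y_weak == y_true).mean())  # ~0.80 by construction

# Strong student: fit logistic regression to the *weak* labels.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y_weak
    w -= 0.1 * (X.T @ grad) / len(X)
    b -= 0.1 * grad.mean()

# Evaluate the student against the *true* labels on fresh data.
X_test = rng.uniform(-1.0, 1.0, size=(2000, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(float)
student_acc = float(((X_test @ w + b > 0) == y_test).mean())

print(f"weak supervisor label accuracy: {weak_acc:.2f}")
print(f"strong student test accuracy:   {student_acc:.2f}")
```

The student ends up more accurate than its supervisor's labels because the label errors are uncorrelated with the true decision boundary. The actual research setting replaces the noisy labeler with a genuinely weaker model (e.g. GPT-2 supervising a GPT-4-class student), where the gap recovered is partial rather than near-total.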
Related Pages
Weak-to-Strong Generalization
Weak-to-strong generalization investigates whether weak supervisors can reliably elicit good behavior from stronger AI systems.
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude models.
OpenAI
Leading AI lab that developed the GPT models and ChatGPT, analyzing its organizational evolution from non-profit research to commercial AGI development amid...
Constitutional AI
Anthropic's Constitutional AI (CAI) methodology uses explicit principles and AI-generated feedback to train safer language models, demonstrating 3-...
Alignment Robustness Trajectory Model
This model analyzes how alignment robustness changes with capability scaling. It estimates current techniques maintain 50-65% robustness at GPT-4 level...