Weak-to-Strong Generalization
Weak-to-strong generalization investigates whether weak supervisors can reliably elicit good behavior from stronger AI systems. In OpenAI's ICML 2024 study (Burns et al.), fine-tuning GPT-4 on labels from a GPT-2-level supervisor with an auxiliary confidence loss recovers roughly 80% of the performance gap between the weak supervisor and GPT-4's ceiling on NLP tasks, measured as performance gap recovered (PGR); reward modeling proves far harder, reaching only 20-40% PGR.
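Both headline numbers are stated in terms of PGR. Below is a minimal PyTorch sketch of the two quantities involved: the PGR metric and the auxiliary confidence loss, assuming a classification setting with hard weak labels. Function names and the mixing weight `alpha` are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    # PGR = (weak-to-strong perf - weak perf) / (strong ceiling perf - weak perf)
    return (weak_to_strong - weak) / (strong_ceiling - weak)

def aux_confidence_loss(strong_logits, weak_labels, alpha=0.5):
    """Sketch of the auxiliary confidence loss (after Burns et al., 2023):
    a mixture of cross-entropy against the weak supervisor's labels and
    against the strong model's own hardened predictions."""
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Harden the strong model's predictions and detach them so the
    # self-target carries no gradient; this pulls the model toward its
    # own confident answers instead of imitating weak-label errors.
    hard_self = strong_logits.detach().argmax(dim=-1)
    ce_self = F.cross_entropy(strong_logits, hard_self)
    return (1 - alpha) * ce_weak + alpha * ce_self

# Illustrative PGR: a weak supervisor at 60% accuracy, a strong ceiling
# of 80%, and a weakly supervised strong model reaching 76% gives
# PGR = (0.76 - 0.60) / (0.80 - 0.60) = 0.8, i.e. 80% of the gap closed.
print(performance_gap_recovered(0.60, 0.76, 0.80))

logits = torch.randn(8, 2, requires_grad=True)  # batch of 8, binary task
labels = torch.randint(0, 2, (8,))              # weak supervisor's labels
loss = aux_confidence_loss(logits, labels)
loss.backward()
```

The same PGR formula applied to reward-modeling accuracies yields the much lower 20-40% figures cited above.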
Related Pages
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development ami...
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying eva...
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging 5-90% and g...
AI-Assisted Alignment
An approach that uses current AI systems to assist with alignment research tasks, including red-teaming, interpretability, and recursive oversight.