Longterm Wiki

Weak-to-Strong Generalization

Weak-to-strong generalization investigates whether weak supervisors can reliably elicit good behavior from stronger AI systems. OpenAI's ICML 2024 research shows that a GPT-2-level supervisor can recover roughly 80% of the performance gap between weak supervision and GPT-4's ceiling (performance gap recovered, PGR) on NLP tasks when an auxiliary confidence loss is added, while reward-modeling tasks achieve only 20-40% PGR.
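The PGR figures above follow the standard definition from the weak-to-strong setup: the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that weak-to-strong training recovers. A minimal sketch in Python (the function name and the example accuracies are illustrative, not from the paper):

```python
def performance_gap_recovered(weak: float, strong_ceiling: float,
                              weak_to_strong: float) -> float:
    """PGR = (weak-to-strong - weak) / (strong ceiling - weak).

    1.0 means the strong student fully matched its ceiling despite
    weak labels; 0.0 means it did no better than its weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap

# Illustrative accuracies: weak supervisor 60%, strong ceiling 90%,
# weak-to-strong student 84% -> PGR of about 0.8
print(performance_gap_recovered(0.60, 0.90, 0.84))
```

Under this metric, the 80% headline result means the student trained on weak labels closed most of the distance to its own ceiling, whereas 20-40% PGR on reward modeling means most of that distance remained.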

Related Pages

Risks

Goal Misgeneralization

Analysis

Alignment Robustness Trajectory Model
Reward Hacking Taxonomy and Severity Model

Approaches

AI Alignment
AI Safety via Debate
Process Supervision
Reward Modeling

Other

RLHF
Jan Leike
Scalable Oversight
Leopold Aschenbrenner
Mechanistic Interpretability

Key Debates

AI Safety Solution Cruxes
Technical AI Safety Research
AI Alignment Research Agendas

Organizations

Elicit (AI Research Tool)
Google DeepMind

Concepts

Large Language Models
Alignment Training Overview

Historical

Mainstream Era

Tags

weak-to-strong, scalable-oversight, superalignment, supervision, reward-modeling