
Reward Modeling

Reward modeling, the core component of RLHF and the focus of an estimated $100M+/year in investment, trains neural networks on human preference comparisons to make reinforcement learning scalable. The technique is universally adopted but inherits fundamental limitations, including reward hacking, which worsens with capability.
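As a rough illustration of training on preference comparisons, the sketch below fits a reward model with the standard Bradley-Terry objective: maximize the probability that the preferred response scores higher than the rejected one. It is a minimal PyTorch example, not any lab's actual pipeline; the names (RewardModel, preference_loss) are hypothetical, and random feature vectors stand in for the hidden states of a pretrained language model.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a feature vector to a scalar score.
    In practice this linear head sits on top of a pretrained LLM."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected),
    i.e. maximize P(chosen preferred over rejected)."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic preference pairs (stand-ins for embedded chosen/rejected responses).
dim = 16
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen = torch.randn(32, dim)
rejected = torch.randn(32, dim)

opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
```

The learned scalar reward then serves as the optimization target for a policy (e.g. via PPO), which is exactly where reward hacking enters: the policy optimizes the model's score rather than the human preferences it imperfectly summarizes.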

Related Pages

Risks

Deceptive Alignment, Goal Misgeneralization, Sycophancy, Mesa-Optimization

Approaches

AI Safety via Debate, Cooperative IRL (CIRL), Adversarial Training

Organizations

Anthropic, OpenAI, Google DeepMind

Key Debates

Why Alignment Might Be Hard, AI Safety Solution Cruxes

Other

Scalable Oversight, Mechanistic Interpretability, Dario Amodei, Paul Christiano, Jan Leike

Concepts

Large Language Models, Alignment Training Overview, AI Misuse