Reward Modeling
Reward modeling, the core component of RLHF and the target of $100M+/year in investment, trains neural networks on human preference comparisons so that reinforcement learning can scale beyond direct human feedback. The technique is near-universally adopted but inherits fundamental limitations, including reward hacking (which worsens with capability).
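To make "trains neural networks on human preference comparisons" concrete, here is a minimal sketch of the standard Bradley-Terry preference loss: the model assigns a scalar reward to each response, and training pushes the preferred response's reward above the rejected one's. The `RewardModel` head, embedding dimension, and random stand-in data are illustrative assumptions, not any particular lab's implementation (in practice the scoring head sits on top of a pretrained language model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    # i.e. minimize the negative log-likelihood of the human preference labels.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One training step on a batch of preference pairs (random stand-in embeddings).
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen = torch.randn(8, 768)    # embeddings of human-preferred responses
rejected = torch.randn(8, 768)  # embeddings of dispreferred responses
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

The learned scalar reward then serves as the optimization target for an RL algorithm such as PPO; reward hacking arises when the policy finds inputs where this learned proxy and actual human preference diverge.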
Related Pages
Weak-to-Strong Generalization
Weak-to-strong generalization investigates whether weak supervisors can reliably elicit good behavior from stronger AI systems.
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying evaluation code rather than solving the underlying task.
Process Supervision
Process supervision trains AI systems to produce correct reasoning steps, not just correct final answers.
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.
Constitutional AI
Anthropic's Constitutional AI (CAI) methodology uses explicit principles and AI-generated feedback to train safer language models, demonstrating 3-...