Preference Optimization Methods
Preference-optimization techniques developed after RLHF, including DPO, ORPO, KTO, IPO, and GRPO, that align language models with human preferences more efficiently than reinforcement learning. DPO reduces training costs by 40-60% while matching RLHF performance on dialogue tasks, though PPO-based RLHF still outperforms it on reasoning and safety tasks.
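To make the core idea concrete, below is a minimal sketch of the DPO objective in PyTorch: a logistic loss on the policy's implicit reward margin over a frozen reference model, with no separate reward model or RL loop. The function name, argument names, and the beta value are illustrative assumptions, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss. Each argument is a tensor of summed
    log-probabilities of the chosen or rejected response under the
    trainable policy or the frozen reference model."""
    # Implicit rewards: how much more the policy prefers each response
    # than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary logistic loss on the margin: push the chosen response's
    # implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with dummy log-probabilities for a batch of 4 pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```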
Related Pages
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying eva...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development ami...
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.
Scalable Oversight
Methods for supervising AI systems on tasks too complex for direct human evaluation, including debate, recursive reward modeling, and process supervision.