
Preference Optimization Methods

Post-RLHF training techniques, including DPO, ORPO, KTO, IPO, and GRPO, that align language models with human preferences more efficiently than reinforcement learning. DPO reduces training costs by 40-60% while matching RLHF performance on dialogue tasks, though PPO still outperforms it on reasoning and safety tasks.
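To illustrate why DPO sidesteps the reward-model-plus-PPO pipeline, here is a minimal sketch of the DPO objective in PyTorch. It assumes the per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model have already been computed; the function name and the beta value are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios log(pi_theta / pi_ref) for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    # A single supervised-style objective over preference pairs, so no separate
    # reward model or on-policy RL rollouts are needed.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Each input is a 1-D tensor over a batch of (prompt, chosen, rejected) triples; avoiding rollouts and a separate reward model is where the cost reduction cited above comes from.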

Related Pages

Risks

Goal Misgeneralization, Epistemic Sycophancy, Sycophancy

Analysis

Reward Hacking Taxonomy and Severity Model

Approaches

Representation Engineering, AI Alignment, Refusal Training, Reward Modeling

Other

AI Evaluations, Dario Amodei, Jan Leike

Concepts

Alignment Training Overview, Large Language Models, Reasoning and Planning, Self-Improvement and Recursive Enhancement, State-Space Models / Mamba

Key Debates

Technical AI Safety Research

Organizations

Google DeepMind

Tags

dpo, preference-optimization, rlhf, training-efficiency, alignment-training