Longterm Wiki

Direct Preference Optimization

Alignment Training (active)

A family of reward-free alignment methods (DPO, KTO, IPO, ORPO, GRPO) that optimize a policy directly on preference data, bypassing explicit reward-model training.
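The core idea can be sketched with the DPO loss for a single preference pair. This is a minimal illustration, not code from the source: the function name and interface are assumptions, and the inputs are taken to be summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the current policy (pi_*) and a frozen reference
    model (ref_*). The log-ratio margin acts as an implicit reward
    difference, so no separate reward model is trained.
    """
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    # -log(sigmoid(beta * margin)), written stably as log1p(exp(-x))
    return math.log1p(math.exp(-beta * margin))
```

When the policy and reference assign the same preference margin, the loss sits at log 2; it shrinks as the policy separates the chosen response from the rejected one beyond the reference model's margin.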

Organizations: 3
Key Papers: 1
First Proposed: 2023 (Rafailov et al.)
Cluster: Alignment Training
Parent Area: RLHF

Tags

function:specification · stage:training · scope:technique

Organizations (3)

Organization        Role
Anthropic           active
Google DeepMind     active
OpenAI              active

Key Papers & Resources (1)