
A Comprehensive Survey of DPO

paper

Authors

Wenyi Xiao·Zechuan Wang·Leilei Gan·Shuai Zhao·Zongrui Li·Ruirui Lei·Wanggui He·Luu Anh Tuan·Long Chen·Hao Jiang·Zhou Zhao·Fei Wu

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A comprehensive 2024 survey useful for researchers seeking an organized overview of DPO methods and variants as an alternative to RLHF for LLM alignment; more technical reference than introductory material.

Paper Details

Citations
23 (2 influential)
Year
2024
Methodology
survey

Metadata

Importance: 62/100 · arXiv preprint · reference

Abstract

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.

Summary

This survey provides a systematic review of Direct Preference Optimization (DPO), an RL-free alternative to RLHF for aligning LLMs with human preferences. It categorizes recent research across theoretical analyses, algorithm variants, preference datasets, and applications, while identifying open challenges and proposing future research directions.
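For orientation, the objective that the surveyed variants build on is the standard DPO loss from the original DPO paper (Rafailov et al., 2023), stated here for reference rather than quoted from this survey. Given a preference dataset of prompts x with preferred response y_w and dispreferred response y_l, DPO trains the policy directly against a frozen reference model:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Here $\sigma$ is the sigmoid function and $\beta$ controls the strength of the implicit KL constraint toward the reference model $\pi_{\text{ref}}$; no explicit reward model or RL rollout is required.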

Key Points

  • DPO eliminates the need for a separate reward model and RL training loop, simplifying the RLHF pipeline while achieving competitive alignment performance (a minimal loss sketch follows this list).
  • The survey categorizes DPO research by key questions: theoretical foundations, algorithmic variants, dataset construction, and downstream applications.
  • Covers extensions of DPO to multimodal settings including Large Vision Language Models (LVLMs), broadening applicability beyond text-only alignment.
  • Identifies persistent challenges in DPO including distribution shift, data quality sensitivity, and theoretical gaps compared to full RLHF.
  • Maintains an actively updated GitHub repository of relevant papers, serving as a living reference for the alignment research community.
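The sketch below illustrates, in plain PyTorch, how the pairwise DPO loss can be computed from per-sequence log-probabilities; it is a minimal illustration of the standard objective, not code from the survey, and the tensor names and default beta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss from per-sequence log-probabilities.

    Each tensor has shape (batch,) and holds the summed token log-probs of a
    response under the trainable policy or the frozen reference model.
    beta scales the implicit KL penalty toward the reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference likelihood: push the chosen margin above the
    # rejected one via a log-sigmoid; no reward model or RL rollout needed.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy call with random log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```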

Cited by 1 page

Page | Type | Quality
RLHF | Research Area | 63.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 98 KB
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao¹*, Zechuan Wang¹*, Leilei Gan¹*†, Shuai Zhao², Zongyue Li¹, Ruirui Lei¹, Wanggui He³, Luu Anh Tuan², Long Chen³, Hao Jiang³, Zhou Zhao¹, Fei Wu¹
(* equal contribution; † corresponding author)
¹ Zhejiang University, China; ² Nanyang Technological University, Singapore; ³ Alibaba Group, China

 Abstract

 With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO’s various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO’s current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey .

 
 
 
 1 Introduction

 
 Through pre-training on extensive, high-quality corpora with the next-token prediction objective at substantial computational cost, Large Language Models (LLMs) (OpenAI, 2022; Touvron et al., 2023a; OpenAI, 2024a; Jiang et al., 2023) assimilate comprehensive world knowledge into their internal parameters, demonstrating impressive language understanding and generation abilities. Further, LLMs have been extended to accommodate multi-modal inputs, including both language and vision, giving rise to Large Vision Language Models (LVLMs) (OpenAI, 2023a; Liu et al., 2024a; Team et al., 2023; Bai et al., 2023). These foundation models serve as a versatile solution and have achieved exceptional performance across a broad spectrum of language and vision-language tasks, marking a significant milestone in the advancement toward artificial general intelligence.

 
 
 As these foundation models grow larger and more powerful, they still struggle to follow the user's instructions (explicit objective) and to fulfill the goals of being Helpful, Honest, and Harmless (implicit objective), which is attributed to the misaligned next-token prediction task used in the pre-training stage (Leike et al., 2018; Askell et al., 2021; OpenAI, 2023b).

... (truncated, 98 KB total)
Resource ID: 573756885a2318ef | Stable ID: ZjI4YTY3Mm