InfoRM: Mitigating Reward Hacking in RLHF
Authors
Miao, Yuchun · Zhang, Sen · Ding, Liang · Bao, Rong · Zhang, Lefei · Tao, Dacheng
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A technical paper addressing reward hacking in RLHF, relevant to alignment researchers working on making reward models more robust against Goodhart's Law dynamics in language model training.
Paper Details
Citations
67
6 influential
Year
2024
Metadata
Importance: 62/100 · arXiv preprint · primary source
Summary
InfoRM proposes an information-theoretic approach to mitigate reward hacking in Reinforcement Learning from Human Feedback (RLHF) by learning more robust reward models that are less susceptible to exploitation. The method aims to prevent language models from gaming reward signals in ways that diverge from true human preferences, a key challenge in alignment.
Key Points
- Reward hacking occurs when RLHF-trained models exploit weaknesses in reward models, achieving high scores without genuinely satisfying human preferences
- InfoRM uses information-theoretic principles to improve reward model robustness and reduce susceptibility to gaming
- The approach addresses a core alignment failure mode where proxies for human values diverge from actual human values during optimization
- Empirical results demonstrate reduced reward hacking behavior while maintaining policy performance on intended tasks
- The method is practically relevant for deploying safer LLMs trained with RLHF pipelines
Review
The research tackles reward hacking, a critical challenge in AI alignment, with an information-theoretic approach. By applying an information bottleneck, InfoRM reduces the reward model's reliance on spurious, preference-irrelevant features that can be exploited during optimization. The methodology also introduces the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the IB latent space and provides a mechanism to detect and potentially mitigate reward overoptimization.
The study's significance lies in its comprehensive experimental validation across multiple model scales (70M to 7B parameters), demonstrating robust performance in detecting reward hacking. By establishing a correlation between overoptimization and outliers in the information bottleneck latent space, the research offers a promising tool for improving the reliability of reward modeling in reinforcement learning. While the approach shows considerable promise, further research is needed to validate its generalizability and long-term effectiveness in complex AI alignment scenarios.
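As a rough illustration of the approach the review describes, the sketch below pairs a Bradley-Terry ranking loss with a variational information-bottleneck penalty on a compressed latent. The layer shapes, the standard-normal prior, and the beta weight are assumptions for illustration, not the paper's released implementation.

```python
# Illustrative sketch of an IB-regularized reward head (assumed details, not the
# official InfoRM code): the reward is predicted from a compressed latent z, and a
# KL term penalizes information the latent retains about the input features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int = 64):
        super().__init__()
        # Encoder maps the LM's final hidden state to a Gaussian posterior q(z|x).
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Reward is read out from the compressed latent rather than raw features.
        self.reward = nn.Linear(latent_dim, 1)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL(q(z|x) || N(0, I)) acts as the compression term of the IB objective.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
        return self.reward(z).squeeze(-1), kl

def ib_preference_loss(head, h_chosen, h_rejected, beta: float = 0.1):
    """Bradley-Terry ranking loss on chosen vs. rejected responses plus the IB penalty."""
    r_c, kl_c = head(h_chosen)
    r_r, kl_r = head(h_rejected)
    rank_loss = -F.logsigmoid(r_c - r_r).mean()
    return rank_loss + beta * 0.5 * (kl_c + kl_r).mean()
```

In this sketch, beta and the latent dimensionality control how much input information the latent retains, loosely mirroring the complexity-modulation mechanism described in the abstract below.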
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Research Area | 63.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 74 KB
[2402.09345] Mitigating Reward Hacking via Information-Theoretic Reward Modeling
Yuchun Miao
Sen Zhang
Liang Ding
Rong Bao
Lefei Zhang
Dacheng Tao
Abstract
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge, which primarily stems from limitations in reward modeling, i.e., generalizability of the reward model and inconsistency in the preference dataset. In this work, we tackle this problem from an information-theoretic perspective, and propose a generalizable and robust framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information and developing a mechanism for model complexity modulation.
Notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing InfoRM as a promising tool for detecting reward overoptimization.
Inspired by this finding, we propose the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments on a wide range of settings and model scales (70M, 440M, 1.4B, and 7B) support the effectiveness of InfoRM. Further analyses reveal that InfoRM’s overoptimization detection mechanism is effective, potentially signifying a notable advancement in the field of RLHF. Code will be released upon acceptance.
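The preview does not spell out how ICDS is computed; purely as an illustration of flagging overoptimization through latent-space outliers, the sketch below scores how far RLHF-policy samples drift from SFT samples in the IB latent space, using a Mahalanobis distance as an assumed stand-in for the paper's metric.

```python
# Hypothetical deviation score over IB latents (not the paper's ICDS definition):
# fit a Gaussian to latents of SFT-model responses, then measure how far latents of
# RLHF-policy responses fall from that distribution.
import numpy as np

def latent_deviation_score(sft_latents: np.ndarray, policy_latents: np.ndarray) -> float:
    """Mean Mahalanobis distance of policy latents from the SFT latent distribution."""
    mu = sft_latents.mean(axis=0)
    cov = np.cov(sft_latents, rowvar=False) + 1e-6 * np.eye(sft_latents.shape[1])
    cov_inv = np.linalg.inv(cov)
    diffs = policy_latents - mu
    sq_dists = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return float(np.sqrt(sq_dists).mean())  # larger values suggest latent-space outliers
```

A monitor of this kind could be evaluated periodically during RL training and used to trigger early stopping or tighten the KL penalty when the score spikes, which is the sort of online mitigation strategy the abstract alludes to.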
Reward Hacking, Reward Overoptimization, Reinforcement Learning from Human Feedback, Large Language Models
Figure 1: Comparison between standard RM and our information-theoretic reward model (InfoRM). InfoRM distinguishes itself by 1) enhancing model generalizability through mutual information-based irrelevant information filtration and by 2) increasing robustness to marginal samples via information bottleneck (IB) dimensionality modulation. Additionally, a distinct feature of InfoRM is its unique overoptimization detection mechanism, which can guide parameter selection and algorithm design in subsequent RLHF.
1 Introduction
With the advent of large language models (LLMs), reinforcement learning from human feedback (RLHF) has emerged as a pivotal technological paradigm to align models’ behaviors with human values (Ziegler et al., 2019; Ouyang et al., 2022; Bai et al., 2022). One of the core stages of RLHF is reward modeling, where a proxy reward model (RM) is learned to mimic human preference by training on a preference dataset that contains sets of responses with human rankings. Then a reinforcement learning (RL) stage follows to align the LLM with human preferences by optimizing rewards
... (truncated, 74 KB total)
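For reference, the two RLHF stages described in the introduction excerpt are commonly formalized as a Bradley-Terry ranking loss for the proxy reward model, followed by KL-regularized reward maximization for the policy; the notation below is assumed for illustration and is not quoted from the paper.

```latex
% Assumed notation for the standard two-stage RLHF setup (illustrative, not from the paper).
% 1) Reward modeling: ranking loss on preferred (y_w) vs. rejected (y_l) responses.
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% 2) RL stage: maximize the proxy reward while staying close to the SFT reference policy.
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \right)
```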