InfoRM: Mitigating Reward Hacking in RLHF
Authors
Miao, Yuchun · Zhang, Sen · Ding, Liang · Bao, Rong · Zhang, Lefei · Tao, Dacheng
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A technical paper addressing reward hacking in RLHF, relevant to alignment researchers working on making reward models more robust against Goodhart's Law dynamics in language model training.
Paper Details
Citations
67
6 influential
Year
2024
Metadata
Importance: 62/100 · arXiv preprint · primary source
Summary
InfoRM proposes an information-theoretic approach to mitigate reward hacking in Reinforcement Learning from Human Feedback (RLHF) by learning more robust reward models that are less susceptible to exploitation. The method aims to prevent language models from gaming reward signals in ways that diverge from true human preferences, a key challenge in alignment.
Key Points
- Reward hacking occurs when RLHF-trained models exploit weaknesses in reward models, achieving high scores without genuinely satisfying human preferences
- InfoRM uses information-theoretic principles to improve reward model robustness and reduce susceptibility to gaming
- The approach addresses a core alignment failure mode where proxies for human values diverge from actual human values during optimization
- Empirical results demonstrate reduced reward hacking behavior while maintaining policy performance on intended tasks
- The method is practically relevant for deploying safer LLMs trained with RLHF pipelines
Review
The research tackles reward hacking, a critical challenge in AI alignment, with an information-theoretic approach. By applying an information bottleneck, InfoRM reduces the reward model's reliance on spurious, preference-irrelevant features that can be exploited during optimization. The methodology also introduces the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the IB latent space and provides a mechanism to detect and potentially mitigate reward overoptimization.
The study's significance lies in its comprehensive experimental validation across multiple model scales (70M to 7B parameters), demonstrating robust performance in detecting reward hacking. By establishing a correlation between overoptimization and outliers in the information bottleneck latent space, the research offers a promising tool for improving the reliability of reward modeling in reinforcement learning. While the approach shows considerable promise, further research is needed to validate its generalizability and long-term effectiveness in complex AI alignment scenarios.
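As a rough illustration of the approach the review describes, the sketch below pairs a Bradley-Terry ranking loss with a variational information-bottleneck penalty on a compressed latent. The layer shapes, the standard-normal prior, and the beta weight are assumptions for illustration, not the paper's released implementation.

```python
# Illustrative sketch of an IB-regularized reward head (assumed details, not the
# official InfoRM code): the reward is predicted from a compressed latent z, and a
# KL term penalizes information the latent retains about the input features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int = 64):
        super().__init__()
        # Encoder maps the LM's final hidden state to a Gaussian posterior q(z|x).
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Reward is read out from the compressed latent rather than raw features.
        self.reward = nn.Linear(latent_dim, 1)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL(q(z|x) || N(0, I)) acts as the compression term of the IB objective.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
        return self.reward(z).squeeze(-1), kl

def ib_preference_loss(head, h_chosen, h_rejected, beta: float = 0.1):
    """Bradley-Terry ranking loss on chosen vs. rejected responses plus the IB penalty."""
    r_c, kl_c = head(h_chosen)
    r_r, kl_r = head(h_rejected)
    rank_loss = -F.logsigmoid(r_c - r_r).mean()
    return rank_loss + beta * 0.5 * (kl_c + kl_r).mean()
```

In this sketch, beta and the latent dimensionality control how much input information the latent retains, loosely mirroring the complexity-modulation mechanism described in the abstract below.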
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Research Area | 63.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 74 KB
[2402.09345] Mitigating Reward Hacking via Information-Theoretic Reward Modeling
Yuchun Miao
Sen Zhang
Liang Ding
Rong Bao
Lefei Zhang
Dacheng Tao
Abstract
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge, which primarily stems from limitations in reward modeling, i.e., generalizability of the reward model and inconsistency in the preference dataset. In this work, we tackle this problem from an information-theoretic perspective, and propose a generalizable and robust framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information and developing a mechanism for model complexity modulation.
Notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing InfoRM as a promising tool for detecting reward overoptimization.
Inspired by this finding, we propose the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments on a wide range of settings and model scales (70M, 440M, 1.4B, and 7B) support the effectiveness of InfoRM. Further analyses reveal that InfoRM’s overoptimization detection mechanism is effective, potentially signifying a notable advancement in the field of RLHF. Code will be released upon acceptance.
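The preview does not spell out how ICDS is computed; purely as an illustration of flagging overoptimization through latent-space outliers, the sketch below scores how far RLHF-policy samples drift from SFT samples in the IB latent space, using a Mahalanobis distance as an assumed stand-in for the paper's metric.

```python
# Hypothetical deviation score over IB latents (not the paper's ICDS definition):
# fit a Gaussian to latents of SFT-model responses, then measure how far latents of
# RLHF-policy responses fall from that distribution.
import numpy as np

def latent_deviation_score(sft_latents: np.ndarray, policy_latents: np.ndarray) -> float:
    """Mean Mahalanobis distance of policy latents from the SFT latent distribution."""
    mu = sft_latents.mean(axis=0)
    cov = np.cov(sft_latents, rowvar=False) + 1e-6 * np.eye(sft_latents.shape[1])
    cov_inv = np.linalg.inv(cov)
    diffs = policy_latents - mu
    sq_dists = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return float(np.sqrt(sq_dists).mean())  # larger values suggest latent-space outliers
```

A monitor of this kind could be evaluated periodically during RL training and used to trigger early stopping or tighten the KL penalty when the score spikes, which is the sort of online mitigation strategy the abstract alludes to.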
Reward Hacking, Reward Overoptimization, Reinforcement Learning from Human Feedback, Large Language Models
Figure 1: Comparison between standard RM and our information-theoretic reward model (InfoRM). InfoRM distinguishes itself by 1) enhancing model generalizability through mutual information-based irrelevant information filtration and by 2) increasing robustness to marginal samples via information bottleneck (IB) dimensionality modulation. Additionally, a distinct feature of InfoRM is its unique overoptimization detection mechanism, which can guide parameter selection and algorithm design in subsequent RLHF.
1 Introduction
With the advent of large language models (LLMs), reinforcement learning from human feedback (RLHF) has emerged as a pivotal technological paradigm to align models’ behaviors with human values (Ziegler et al., 2019; Ouyang et al., 2022; Bai et al., 2022). One of the core stages of RLHF is reward modeling, where a proxy reward model (RM) is learned to mimic human preference by training on a preference dataset that contains sets of responses with human rankings. Then a reinforcement learning (RL) stage follows to align the LLM with human preferences by optimizing rewards
... (truncated, 74 KB total)
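For reference, the two RLHF stages described in the introduction excerpt are commonly formalized as a Bradley-Terry ranking loss for the proxy reward model, followed by KL-regularized reward maximization for the policy; the notation below is assumed for illustration and is not quoted from the paper.

```latex
% Assumed notation for the standard two-stage RLHF setup (illustrative, not from the paper).
% 1) Reward modeling: ranking loss on preferred (y_w) vs. rejected (y_l) responses.
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% 2) RL stage: maximize the proxy reward while staying close to the SFT reference policy.
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \right)
```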