InfoRM: Mitigating Reward Hacking in RLHF
paper
Authors
Miao, Yuchun·Zhang, Sen·Ding, Liang·Bao, Rong·Zhang, Lefei·Tao, Dacheng
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Data Status
Full text fetched (Dec 28, 2025)
Summary
InfoRM is a reward modeling framework that addresses reward misgeneralization in RLHF. It adds a variational information bottleneck objective that filters out irrelevant reward features, and it uses the resulting latent space to detect overoptimization.
Key Points
- Introduces an information bottleneck approach to mitigate reward hacking in RLHF
- Proposes Cluster Separation Index (CSI) to detect reward overoptimization
- Demonstrates effectiveness across multiple model scales from 70M to 7B parameters
Review
The research tackles reward hacking, a critical challenge in AI alignment, with an information-theoretic approach. By applying an information bottleneck, InfoRM aims to reduce the reward model's reliance on spurious, irrelevant features that can lead the policy toward misaligned optimization strategies. The methodology also introduces the Cluster Separation Index (CSI), which quantifies deviations in the latent space and provides a mechanism to detect, and potentially mitigate, reward overoptimization.
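For orientation, the information bottleneck idea mentioned above is commonly written in the generic form below. This is a sketch in assumed notation, not the paper's exact objective: $X$ stands for the reward model input (prompt and response), $Y$ for the human preference label, $S$ for the latent representation, $\beta > 0$ for a trade-off weight, $q_\theta$ for a variational predictor, and $r(s)$ for a fixed prior over latents. All of these symbols are illustrative choices.

```latex
% Generic information bottleneck objective (assumed notation, not taken from the paper):
% keep information in S that predicts Y, compress away the rest of X.
\max_{\theta} \; I(S; Y) \;-\; \beta \, I(X; S)

% A standard variational surrogate that is minimized in practice.
% Up to the constant H(Y), it upper-bounds the negative of the objective above.
\mathcal{L}(\theta)
  = \mathbb{E}_{(x,y)} \, \mathbb{E}_{s \sim p_\theta(s \mid x)}
      \bigl[ -\log q_\theta(y \mid s) \bigr]
  \;+\; \beta \, \mathbb{E}_{x} \,
      \mathrm{KL}\bigl( p_\theta(s \mid x) \,\|\, r(s) \bigr)
```

The first term rewards latents that remain predictive of the preference label, while the KL term penalizes information about the input that the latent retains beyond what the prior allows. That penalty is the sense in which irrelevant features are filtered.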
The study's significance lies in its comprehensive experimental validation across multiple model scales (70M to 7B parameters), demonstrating robust performance in detecting reward hacking. By establishing a correlation between overoptimization and outliers in the information bottleneck latent space, the research offers a promising tool for improving the reliability of reward modeling in reinforcement learning. While the approach shows considerable promise, further research is needed to validate its generalizability and long-term effectiveness in complex AI alignment scenarios.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Capability | 63.0 |
Resource ID: 14a9103bf7c2a1ef | Stable ID: ODRlZDU5Yz