Longterm Wiki

InfoRM: Mitigating Reward Hacking in RLHF

paper

Authors

Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Full text fetched Dec 28, 2025

Summary

InfoRM is a framework that mitigates reward misgeneralization in RLHF by training the reward model with a variational information bottleneck objective, filtering out reward-irrelevant features and enabling detection of reward overoptimization.

Key Points

  • Introduces an information bottleneck approach to mitigate reward hacking in RLHF
  • Proposes Cluster Separation Index (CSI) to detect reward overoptimization
  • Demonstrates effectiveness across multiple model scales from 70M to 7B parameters
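The bottleneck objective named in the key points can be sketched as a standard variational IB penalty on the reward model's latent code: fit the reward signal while a KL term discourages the code from retaining input information. This is a minimal numpy illustration of that penalty; the function names, shapes, and the β value are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent
    # dimensions and averaged over the batch. This is the closed-form
    # KL between a diagonal Gaussian and the standard normal prior.
    kl_per_dim = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl_per_dim.sum(axis=-1).mean()

def ib_reward_loss(reward_pred_loss, mu, log_var, beta=0.1):
    # IB-style training objective: reward-prediction loss plus a
    # beta-weighted compression penalty on the latent code z ~ q(z|x).
    return reward_pred_loss + beta * gaussian_kl_to_standard_normal(mu, log_var)

# A code already matching the prior incurs zero penalty.
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
print(ib_reward_loss(1.5, mu, log_var))  # 1.5
```

The β hyperparameter trades off reward fidelity against compression: larger β squeezes more reward-irrelevant information out of the latent code.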

Review

The research tackles reward hacking, a critical challenge in AI alignment, with an information-theoretic approach. By applying an information bottleneck to the reward model's internal representation, InfoRM aims to discard spurious, reward-irrelevant features that a policy could otherwise exploit during optimization. The methodology also introduces the Cluster Separation Index (CSI), which quantifies deviations of policy samples in the latent space and thereby provides a mechanism for detecting reward overoptimization.

The study's significance lies in its experimental validation across model scales from 70M to 7B parameters, demonstrating robust detection of reward hacking. By establishing a correlation between overoptimization and outliers in the information bottleneck latent space, the work offers a practical tool for improving the reliability of reward modeling in RLHF. While the approach shows considerable promise, further research is needed to validate its generalizability and long-term effectiveness in more complex alignment scenarios.
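The detection idea described above, flagging overoptimization when policy samples drift away from a reference cluster in latent space, can be illustrated with a toy separation score. This is a hypothetical stand-in for intuition only; the paper's actual CSI definition is not reproduced here.

```python
import numpy as np

def cluster_separation(reference, candidate):
    # Toy separation score: distance between the candidate cluster's mean
    # and the reference cluster's mean, scaled by the reference cluster's
    # average spread. Larger values suggest the policy's responses have
    # drifted from the reference distribution in latent space.
    ref_mean = reference.mean(axis=0)
    cand_mean = candidate.mean(axis=0)
    spread = np.linalg.norm(reference - ref_mean, axis=1).mean()
    return np.linalg.norm(cand_mean - ref_mean) / (spread + 1e-8)

rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 4))       # latent codes of reference responses
drifted = ref + 5.0                   # same codes shifted away in every dim
print(cluster_separation(ref, ref))      # 0.0 (no drift)
print(cluster_separation(ref, drifted))  # large (clear outlier cluster)
```

In practice one would monitor such a score during RLHF training and treat a sustained rise as a signal to stop optimizing against the learned reward.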

Cited by 1 page

| Page | Type | Quality |
| ---- | ---- | ------- |
| RLHF | Capability | 63.0 |
Resource ID: 14a9103bf7c2a1ef | Stable ID: ODRlZDU5Yz