Longterm Wiki

Reward Shaping to Mitigate Reward Hacking in RLHF

paper

Authors

Fu, Jiayi · Zhao, Xuandong · Yao, Chengyuan · Wang, Heng · Han, Qi · Xiao, Yanghua

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Full text fetched Dec 28, 2025

Summary

A novel reward shaping approach, Preference As Reward (PAR), mitigates reward hacking in reinforcement learning from human feedback by using the latent preferences encoded in the reward model as the training signal.

Key Points

  • PAR leverages latent preferences as a reward signal to mitigate reward hacking
  • The approach is highly data-efficient and demonstrates robust performance across different models
  • Two key design principles for reward shaping: rewards should be bounded, and they should grow rapidly at first before gradually converging
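Both principles can be illustrated with a minimal sketch. Assuming PAR shapes the raw reward model score with a sigmoid centered on a single reference reward (the Bradley-Terry probability that the current response is preferred over the reference, i.e. the "latent preference"), the shaped reward is bounded in (0, 1), grows quickly near the reference, and then converges:

```python
import math

def par_shaped_reward(raw_reward: float, reference_reward: float) -> float:
    """Sketch of Preference-As-Reward shaping (assumed sigmoid form).

    sigmoid(r - r_ref) is the Bradley-Terry probability that the current
    response beats the reference response. It is bounded in (0, 1) and
    saturates as the raw reward grows, which discourages reward hacking.
    """
    return 1.0 / (1.0 + math.exp(-(raw_reward - reference_reward)))

# Raw rewards can grow without bound; the shaped signal cannot.
r_ref = 2.0
for r in [0.0, 2.0, 4.0, 10.0, 100.0]:
    print(f"raw={r:6.1f}  shaped={par_shaped_reward(r, r_ref):.4f}")
```

The function names here are illustrative, not taken from the paper; the sketch only shows why a bounded, saturating transform satisfies the two stated design principles.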

Review

The paper addresses a critical challenge in AI alignment: reward hacking in Reinforcement Learning from Human Feedback (RLHF). Where existing methods struggle to prevent models from exploiting flaws in the reward function, the authors propose Preference As Reward (PAR), which extracts its reward signal from the latent preferences encoded in the reward model itself.

The research systematically investigates reward shaping techniques and identifies two key design principles: rewards should be bounded, and they should grow rapidly at first before gradually converging. Implementing PAR yields significant improvements in model performance, including a win rate on the AlpacaEval 2.0 benchmark 5 percentage points higher than competing approaches. The method's data efficiency (it requires only a single reference reward) and its robust resistance to reward hacking make it a promising contribution to AI alignment research, offering a potentially more reliable path to AI systems that align closely with human values and intentions.
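The claimed resistance to reward hacking can be made concrete by looking at the shaped signal's sensitivity. Under the same assumed sigmoid form (a sketch, not the paper's implementation), the marginal gain per unit of raw reward is s * (1 - s): strongest when the policy is near the reference, and vanishingly small once the policy is far past it, which is exactly the over-optimization regime where hacking pays off for an unbounded reward:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Marginal gain in shaped reward per unit of raw reward: s * (1 - s).
# At margin 0 (policy ties the reference) the incentive peaks at 0.25;
# far past the reference, pushing raw reward even higher yields almost
# nothing, so exploiting reward-model flaws stops being profitable.
for margin in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(margin)
    print(f"margin={margin:5.1f}  incentive={s * (1 - s):.6f}")
```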

Cited by 1 page

Page: RLHF · Type: Capability · Quality: 63.0
Resource ID: d4e5b9bc7e21476c | Stable ID: YjViMTQ0Mz