Reward Shaping to Mitigate Reward Hacking in RLHF
Authors
Fu, Jiayi·Zhao, Xuandong·Yao, Chengyuan·Wang, Heng·Han, Qi·Xiao, Yanghua
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Data Status
Full text fetched Dec 28, 2025
Summary
Preference As Reward (PAR) is a novel reward shaping approach that mitigates reward hacking in reinforcement learning from human feedback (RLHF) by using the reward model's latent preferences as the training reward signal.
Key Points
- PAR leverages the reward model's latent preferences as a reward signal to mitigate reward hacking
- The approach is highly data-efficient and demonstrates robust performance across different models
- Two key design principles for reward shaping: bounded rewards and growth-convergence dynamics (see the sketch after this list)
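To make the design principles concrete, here is a minimal Python sketch of PAR-style shaping, assuming (consistent with the paper's description of latent preferences and a single reference reward) that the shaped reward is the Bradley-Terry preference probability sigmoid(r(x, y) − r(x, y_ref)). The function and variable names are illustrative, not the authors' code.

```python
import math

def par_reward(raw_reward: float, ref_reward: float) -> float:
    """PAR-style reward shaping (sketch).

    Uses the Bradley-Terry latent preference of the policy's response
    over a single fixed reference response as the reward signal:
        sigmoid(r(x, y) - r(x, y_ref))

    Properties matching the two design principles:
      * bounded: the output always lies in (0, 1)
      * growth-convergence: steep near ref_reward, saturating beyond it,
        so pushing the raw reward ever higher yields vanishing gains.
    """
    return 1.0 / (1.0 + math.exp(-(raw_reward - ref_reward)))

# The shaped reward saturates: large raw-reward gains barely move it.
ref = 2.0
for r in [1.0, 2.0, 3.0, 6.0, 12.0]:
    print(f"raw={r:5.1f} -> shaped={par_reward(r, ref):.4f}")
```

The sigmoid delivers both principles at once: its slope is largest near the reference reward and vanishes beyond it, so over-optimizing the raw reward buys the policy almost nothing, which is the claimed mechanism for resisting reward hacking.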
Review
The paper addresses a critical challenge in AI alignment: reward hacking in Reinforcement Learning from Human Feedback (RLHF). Whereas existing methods struggle to prevent models from exploiting flaws in the reward function, the authors propose Preference As Reward (PAR), which extracts its reward signal from the latent preferences embedded in the reward model itself.
The research systematically investigates reward shaping techniques and identifies two key design principles: the reward should be bounded, and it should grow rapidly at first before converging gradually. Implementing PAR yields significant improvements in model performance, including a win rate on the AlpacaEval 2.0 benchmark 5 percentage points higher than competing approaches. The method is notably data-efficient, requiring only a single reference reward, and resists reward hacking robustly, making it a promising contribution to AI alignment research and a potential pathway to AI systems that align more closely with human values and intentions.
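For reference, the same relationship in equation form, assuming the Bradley-Terry model under which standard RLHF reward models are trained (the symbols r_phi and y_ref are notation introduced here for illustration, not taken from the paper):

```latex
% Latent preference of response y over a fixed reference y_ref, used as the reward:
P(y \succ y_{\mathrm{ref}} \mid x) = \sigma\bigl(r_\phi(x, y) - r_\phi(x, y_{\mathrm{ref}})\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
% Since \sigma'(z) = \sigma(z)\,(1 - \sigma(z)), the shaped reward is bounded in (0, 1),
% climbs fastest near the reference (rapid initial growth), and flattens as the margin
% grows (gradual convergence), removing the payoff from over-optimizing the raw reward.
```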
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Capability | 63.0 |