Longterm Wiki

Reward Shaping to Mitigate Reward Hacking in RLHF

paper

Authors

Fu, Jiayi·Zhao, Xuandong·Yao, Chengyuan·Wang, Heng·Han, Qi·Xiao, Yanghua

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Proposes Preference As Reward (PAR), a reward shaping method to address reward hacking in RLHF by using latent human preferences as reward signals, directly tackling a critical AI safety problem in alignment.

Paper Details

Citations
58
4 influential
Year
2025
Methodology
peer-reviewed
Categories
Advances in Neural Information Processing Systems

Metadata

arXiv preprint · primary source

Summary

A novel reward shaping approach called Preference As Reward (PAR) addresses reward hacking in reinforcement learning from human feedback by using latent preferences as a reward signal.

Key Points

  • PAR leverages latent preferences as a reward signal to mitigate reward hacking
  • The approach is highly data-efficient and demonstrates robust performance across different models
  • Three key design principles for reward shaping: bounded rewards, rapid initial growth followed by gradual convergence, and formulation as a function of the centered reward

Review

The paper addresses a critical challenge in AI alignment: reward hacking in Reinforcement Learning from Human Feedback (RLHF). While existing methods struggle to prevent models from exploiting flaws in the reward function, the authors propose Preference As Reward (PAR), an approach that extracts the reward signal from the latent preferences embedded within the reward model itself. The research systematically investigates prevalent reward shaping techniques and distills three design principles: the reward should be bounded; it should grow rapidly at first and then converge gradually; and it should be formulated as a function of the centered reward. Implementing PAR accordingly, the authors report a win rate at least 5 percentage points higher than competing approaches on the AlpacaEval 2.0 benchmark. The method's data efficiency (a single reference reward suffices) and its resistance to reward hacking, even after two full epochs of training, make it a promising contribution to AI alignment research, offering a more reliable pathway toward AI systems that align with human values and intentions.
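The shaping rule described above can be sketched as follows. This is a minimal illustration assuming the sigmoid-of-centered-reward form shown in the paper's Figure 1; the function and variable names (`par_shaped_reward`, `reference_reward`) are illustrative, not taken from the released code.

```python
import math

def par_shaped_reward(proxy_reward: float, reference_reward: float) -> float:
    """PAR-style shaping: a sigmoid applied to the centered reward.

    The output is bounded in (0, 1), rises quickly once the proxy reward
    clears the reference reward, and flattens as the margin grows, so
    ever-larger proxy rewards yield diminishing returns, blunting the
    incentive to over-optimize (hack) the reward model.
    """
    return 1.0 / (1.0 + math.exp(-(proxy_reward - reference_reward)))

# A proxy reward equal to the reference maps to 0.5; large margins
# saturate toward 1.0 instead of growing without bound.
print(par_shaped_reward(2.0, 2.0))   # 0.5
print(par_shaped_reward(12.0, 2.0))  # close to 1.0
```

Note how this single function exhibits all three design principles at once: boundedness, fast initial growth near the reference point, and dependence only on the centered reward (the difference from a reference), which is consistent with the paper's claim that one reference reward suffices.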

Cited by 1 page

Page | Type | Quality
RLHF | Research Area | 63.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 74 KB
[2502.18770] Reward Shaping to Mitigate Reward Hacking in RLHF
 Reward Shaping to Mitigate Reward Hacking in RLHF

 
 
 
Jiayi Fu 1,2 *, Xuandong Zhao 3 *, Chengyuan Yao 2, Heng Wang 2, Qi Han 2, Yanghua Xiao 1 †
1 Fudan University · 2 StepFun · 3 UC Berkeley
fujy22@m.fudan.edu.cn · xuandongzhao@berkeley.edu · shawyh@fudan.edu.cn
* Equal contribution. † Corresponding author.
 

 
 Abstract

 Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. While reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests three key design principles: (1) the RL reward is ideally bounded, (2) RL benefits from rapid initial growth followed by gradual convergence, and (3) the RL reward is best formulated as a function of the centered reward. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model itself as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. Code is available at https://github.com/PorUna-byte/PAR. (Work done during internship at StepFun by Jiayi Fu.)

 
 
 
 
 1 Introduction

 
 Figure 1: RLHF training pipeline with reward shaping. Responses from the policy model are evaluated by the reward model, producing proxy rewards. These rewards are then reshaped (optionally using reference rewards, as shown in the dashed box) before being used to update the policy via RL. The blue box details the PAR reward shaping function, which uses a sigmoid applied to the centered reward. 
 
 
 Reinforcement learning from human feedback is essential for the 

... (truncated, 74 KB total)