Safe RLHF: Safe Reinforcement Learning from Human Feedback

A peer-reviewed conference paper (ICLR 2024 spotlight, hosted on OpenReview) proposing a constrained RLHF framework for training safer language models; useful for practitioners and for anyone studying the helpfulness-harmlessness trade-off in LLM alignment.

Metadata

Importance: 68/100 · conference paper · primary source

Summary

Safe RLHF proposes a framework that explicitly decouples helpfulness and harmlessness in RLHF training by separately modeling reward and cost functions, then optimizing them via constrained reinforcement learning. This approach aims to balance the competing objectives of being helpful while avoiding harmful outputs, addressing a key tension in aligning language models. The method demonstrates improved safety-helpfulness trade-offs compared to standard RLHF.
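
In symbols (notation adapted from the paper's setup; R is the learned reward model, C the learned cost model, and d a chosen cost threshold), the constrained objective is

\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\left[ R(x, y) \right] \quad \text{s.t.} \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\left[ C(x, y) \right] \le d,

which the Lagrangian method relaxes into the saddle-point problem \min_{\lambda \ge 0} \max_{\theta} \, \mathbb{E}[R(x, y)] - \lambda \left( \mathbb{E}[C(x, y)] - d \right), with the policy parameters \theta and the multiplier \lambda updated jointly during fine-tuning.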

Key Points

  • Separates reward (helpfulness) and cost (harmlessness) signals from human feedback rather than conflating them into a single reward model
  • Uses constrained RL optimization to maximize helpfulness subject to safety constraints, avoiding over-restriction of model capabilities (see the sketch after this list)
  • Introduces a human annotation protocol to collect preference data along both helpfulness and harmlessness dimensions simultaneously
  • Empirically shows that Safe RLHF achieves a better helpfulness-safety Pareto frontier than standard RLHF baselines
  • Addresses the fundamental tension in RLHF where safety and helpfulness objectives can conflict, requiring explicit constraint handling
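
As a concrete illustration of the constrained optimization referenced above, here is a minimal primal-dual (Lagrangian) loop on a toy one-dimensional problem. This is a sketch, not the authors' implementation (their code is at https://github.com/PKU-Alignment/safe-rlhf); the reward and cost functions, the threshold d, and the step sizes are placeholder assumptions chosen so the example runs standalone.

# Toy stand-ins for Safe RLHF's two learned models (assumed forms, not the
# paper's): reward tracks helpfulness, cost tracks harmfulness, and the
# "policy" is a single scalar parameter theta.
def expected_reward(theta):
    return -(theta - 2.0) ** 2       # helpfulness peaks at theta = 2

def expected_cost(theta):
    return theta - 1.0               # harmfulness grows with theta

d = 0.0                              # cost threshold: require E[C] <= d

def lagrangian(theta, lam):
    # L(theta, lam) = E[R] - lam * (E[C] - d)
    return expected_reward(theta) - lam * (expected_cost(theta) - d)

theta, lam = 0.0, 1.0                # policy parameter and Lagrange multiplier
lr_theta, lr_lam, eps = 0.05, 0.05, 1e-4

for _ in range(2000):
    # Primal step: gradient ascent on theta (finite differences for the toy).
    grad = (lagrangian(theta + eps, lam) - lagrangian(theta - eps, lam)) / (2 * eps)
    theta += lr_theta * grad
    # Dual step: raise lam when the cost constraint is violated, never below 0.
    lam = max(0.0, lam + lr_lam * (expected_cost(theta) - d))

# Settles near theta = 1: the most helpful policy satisfying E[C] <= d.
print(f"theta={theta:.3f}  reward={expected_reward(theta):.3f}  "
      f"cost={expected_cost(theta):.3f}  lambda={lam:.3f}")

In the paper itself, the primal step is a PPO-style policy-gradient update on the language model and the dual step adjusts the multiplier from batch estimates of the expected cost, which is what lets training rebalance helpfulness against harmlessness dynamically.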

Cited by 1 page

Page | Type | Quality
RLHF | Research Area | 63.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 3 KB
Safe RLHF: Safe Reinforcement Learning from Human Feedback | OpenReview

The Wayback Machine - http://web.archive.org/web/20260217021214/https://openreview.net/forum?id=TyFrPOKYXw

 


Safe RLHF: Safe Reinforcement Learning from Human Feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang 

Published: 16 Jan 2024 · Last Modified: 17 Apr 2024 · ICLR 2024 spotlight

Keywords: Safe Reinforcement Learning, Reinforcement Learning from Human Feedback, Large Language Model, AI Safety

TL;DR: Safe Reinforcement Learning from Human Feedback

Abstract: With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowd workers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.

Code is available at https://github.com/PKU-Alignment/safe-rlhf.

Warning: This paper contains example data that may be offensive or harmful.

... (truncated, 3 KB total)
Resource ID: cea0ecf0a1d00903 | Stable ID: MTcxODQ5ND