Longterm Wiki

Safe RLHF

web

**Data Status:** Not fetched

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| RLHF | Capability | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 4 KB

## Safe RLHF: Safe Reinforcement Learning from Human Feedback

[Download PDF](https://openreview.net/pdf?id=TyFrPOKYXw)

### [Josef Dai](https://openreview.net/profile?id=~Josef_Dai1 "~Josef_Dai1"), [Xuehai Pan](https://openreview.net/profile?id=~Xuehai_Pan1 "~Xuehai_Pan1"), [Ruiyang Sun](https://openreview.net/profile?id=~Ruiyang_Sun2 "~Ruiyang_Sun2"), [Jiaming Ji](https://openreview.net/profile?id=~Jiaming_Ji2 "~Jiaming_Ji2"), [Xinbo Xu](https://openreview.net/profile?id=~Xinbo_Xu1 "~Xinbo_Xu1"), [Mickel Liu](https://openreview.net/profile?id=~Mickel_Liu1 "~Mickel_Liu1"), [Yizhou Wang](https://openreview.net/profile?id=~Yizhou_Wang1 "~Yizhou_Wang1"), [Yaodong Yang](https://openreview.net/profile?id=~Yaodong_Yang1 "~Yaodong_Yang1")

Published: 16 Jan 2024 · Last Modified: 17 Apr 2024 · ICLR 2024 spotlight · Everyone · [Revisions](https://openreview.net/revisions?id=TyFrPOKYXw) · [BibTeX](https://openreview.net/forum?id=TyFrPOKYXw#)

**Code Of Ethics:** I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

**Keywords:** Safe Reinforcement Learning, Reinforcement Learning from Human Feedback, Large Language Model, AI Safety

**Submission Guidelines:** I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

**TL;DR:** Safe Reinforcement Learning from Human Feedback

**Abstract:** With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowd workers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.

... (truncated, 4 KB total)
Resource ID: cea0ecf0a1d00903 | Stable ID: MTcxODQ5ND
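
For readers who want the abstract's constrained-optimization idea made concrete, below is a minimal PyTorch sketch of a Lagrangian update of the kind the paper describes: maximize expected reward subject to an expected-cost constraint, with the multiplier dynamically adjusted during fine-tuning. The variable names, the cost limit of 0, and the exp-parameterized multiplier are illustrative assumptions, not the authors' implementation.

```python
import torch

# Sketch of the objective from the abstract: maximize E[R(x, y)] subject to
# E[C(x, y)] <= d, where R is the reward model and C the cost model trained
# from the decoupled helpfulness / harmlessness preference data.
# All names below are hypothetical placeholders.

log_lam = torch.zeros(1, requires_grad=True)      # lambda = exp(log_lam) >= 0
lam_optimizer = torch.optim.SGD([log_lam], lr=1e-2)

def policy_objective(reward: torch.Tensor, cost: torch.Tensor,
                     cost_limit: float = 0.0) -> torch.Tensor:
    """Lagrangian L(theta, lambda) = E[R] - lambda * (E[C] - d).

    The policy ascends this in theta; lambda is held fixed here.
    """
    lam = log_lam.exp().detach()
    return reward.mean() - lam * (cost.mean() - cost_limit)

def multiplier_step(cost: torch.Tensor, cost_limit: float = 0.0) -> None:
    """Gradient ascent on lambda: the penalty grows whenever E[C] > d."""
    lam_optimizer.zero_grad()
    lam = log_lam.exp()
    # Minimizing -lambda * (E[C] - d) in log_lam is ascent on the dual.
    (-(lam * (cost.detach().mean() - cost_limit))).backward()
    lam_optimizer.step()

# Toy usage: a batch whose average cost violates the constraint (E[C] > 0).
reward = torch.randn(8)
cost = torch.randn(8) + 1.0
policy_loss = -policy_objective(reward, cost)  # minimize -L for the policy
multiplier_step(cost)                          # lambda rises, tightening safety
```

The dynamic balance the abstract mentions falls out of this dual update: when the cost model judges responses unsafe on average, lambda grows and the effective objective weights harmlessness more heavily; once the constraint is satisfied, lambda shrinks and helpfulness dominates again.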