# Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback


🔬 Research Summary by **Stephen Casper**, an MIT PhD student working on AI interpretability, diagnostics, and safety.
\[ [Original paper](https://arxiv.org/abs/2307.15217) by Stephen Casper,\* Xander Davies,\* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell\]
* * *
**Overview**: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.
* * *
## **Introduction**
Reinforcement Learning from Human Feedback (RLHF) is the key technique to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models.
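The pipeline the survey examines has three stages: collect pairwise human preference comparisons, fit a reward model to them, then optimize the language model policy against that reward. The reward model is commonly trained with a Bradley-Terry pairwise loss on the scores of the preferred and rejected responses. A minimal sketch in plain Python (function and variable names are illustrative, not taken from the paper):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the human-preferred
    response outranks the rejected one: -log sigmoid(r_chosen - r_rejected).
    In practice r_* are scalar outputs of a learned reward model."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the reward model already scores the preferred response higher,
# the loss is small; when it prefers the rejected response, the loss
# grows, pushing the model's scores toward the human ranking.
low = preference_loss(2.0, 0.0)    # model agrees with the annotator
high = preference_loss(0.0, 2.0)   # model disagrees with the annotator
```

Minimizing this loss over many comparisons yields the scalar reward that the policy is then optimized against (typically with PPO), which is where several of the surveyed failure modes, such as reward hacking and misgeneralization, arise.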
## **Key Insights**
### Contributions
1. **Concrete challenges with RLHF:** We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy.
2. **Techniques in practice:** We overview techniques to understand, improve, and complement RLHF in practice.
3. **Auditing and disclosure:** We propose auditing and disclosure standards to improve societal oversight of RLHF systems.