RLHF
Status: active
Reinforcement Learning from Human Feedback — a training technique that fine-tunes AI models using human preference ratings to align their outputs with human values (a minimal sketch of the core preference objective follows the tags below).
Organizations: 4 · Key Papers: 3 · Risks Addressed: 3
First Proposed: 2017 (Christiano et al.)
Cluster: Alignment Training
Tags: function:specification, stage:training, scope:technique
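
To make the technique concrete, here is a minimal, hedged sketch of the pairwise preference loss used to train RLHF reward models, in the Bradley-Terry style described in Christiano et al. (2017) and Ouyang et al. (2022). The class and tensor names (`RewardModel`, `pooled_hidden`, `preference_loss`) are illustrative assumptions, not an existing library API; a real implementation would place the scalar head on a pretrained transformer and then use the learned reward to drive an RL step such as PPO.

```python
# Sketch: pairwise reward-model loss at the heart of RLHF.
# Names below are illustrative assumptions, not a real library API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled sequence representation to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # In practice this head sits on top of a pretrained transformer;
        # a plain linear layer stands in for that backbone's output here.
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        return self.value_head(pooled_hidden).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the model to score the
    # human-preferred completion above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random stand-ins for pooled hidden states of (chosen, rejected) pairs.
rm = RewardModel(hidden_dim=768)
chosen_hidden = torch.randn(4, 768)
rejected_hidden = torch.randn(4, 768)
loss = preference_loss(rm(chosen_hidden), rm(rejected_hidden))
loss.backward()  # gradients would then feed an optimizer step
```

Once trained, the reward model's scalar output replaces direct human ratings as the reward signal during RL fine-tuning of the policy model.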
Organizations (4)
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Meta AI (FAIR) | active |
| OpenAI | pioneer |
Key Papers & Resources (3)
| Paper | Authors | Year | Type |
|---|---|---|---|
| Deep Reinforcement Learning from Human Preferences | Christiano et al. | 2017 | Seminal |
| Training language models to follow instructions with human feedback | Ouyang et al. (OpenAI) | 2022 | Seminal |
| Training a Helpful and Harmless Assistant with RLHF | Bai et al. (Anthropic) | 2022 | Seminal |
Sub-Areas (3)
| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Constitutional AI | Training methodology using explicit principles and AI-generated feedback (RLAIF) to train safer language models. | active | 1 | 1 |
| Direct Preference Optimization | Family of reward-free alignment methods (DPO, KTO, IPO, ORPO, GRPO) that bypass explicit reward model training (see the sketch after this table). | active | 3 | 1 |
| Reward Modeling | Training neural networks on human preference comparisons to provide scalable reward signals for RL fine-tuning. | active | 0 | 1 |
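
As a companion to the Direct Preference Optimization row above, here is a hedged sketch of the DPO objective, which replaces the explicit reward model with implicit rewards derived from policy and reference log-probabilities. Function and argument names (`dpo_loss`, `policy_chosen_logp`, `beta`) are illustrative assumptions; in practice the per-sequence log-probabilities would be computed from the trainable policy and a frozen reference model.

```python
# Sketch: DPO loss on summed per-sequence log-probabilities.
# Names are illustrative assumptions, not a specific library's API.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward for each completion: beta * (log pi_theta - log pi_ref).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up per-sequence log-probabilities for four preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3], requires_grad=True),
                torch.tensor([-13.2, -11.0, -19.8, -9.9], requires_grad=True),
                torch.tensor([-12.5, -10.0, -20.0, -8.0]),
                torch.tensor([-12.8, -10.5, -20.2, -9.0]),
                beta=0.1)
loss.backward()
```

The design trade-off versus classic RLHF: DPO-style methods skip the separate reward-model and RL stages, at the cost of relying on the implicit reward parameterization rather than an independently validated reward signal.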