RLHF
Status: active
Reinforcement Learning from Human Feedback — a training technique that fine-tunes AI models using human preference ratings to align their outputs with human values (a minimal sketch of the core preference objective follows the tags below).
Organizations: 4 · Key Papers: 3 · Risks Addressed: 3
First Proposed: 2017 (Christiano et al.)
Cluster: Alignment Training
Tags: function:specification, stage:training, scope:technique
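
To make the technique concrete, here is a minimal, hedged sketch of the pairwise preference loss used to train RLHF reward models, in the Bradley-Terry style described in Christiano et al. (2017) and Ouyang et al. (2022). The class and tensor names (`RewardModel`, `pooled_hidden`, `preference_loss`) are illustrative assumptions, not an existing library API; a real implementation would place the scalar head on a pretrained transformer and then use the learned reward to drive an RL step such as PPO.

```python
# Sketch: pairwise reward-model loss at the heart of RLHF.
# Names below are illustrative assumptions, not a real library API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled sequence representation to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # In practice this head sits on top of a pretrained transformer;
        # a plain linear layer stands in for that backbone's output here.
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        return self.value_head(pooled_hidden).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the model to score the
    # human-preferred completion above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random stand-ins for pooled hidden states of (chosen, rejected) pairs.
rm = RewardModel(hidden_dim=768)
chosen_hidden = torch.randn(4, 768)
rejected_hidden = torch.randn(4, 768)
loss = preference_loss(rm(chosen_hidden), rm(rejected_hidden))
loss.backward()  # gradients would then feed an optimizer step
```

Once trained, the reward model's scalar output replaces direct human ratings as the reward signal during RL fine-tuning of the policy model.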
Organizations (4)
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Meta AI (FAIR) | active |
| OpenAI | pioneer |
Key Papers & Resources (3)
| Paper | Authors | Year | Type |
|---|---|---|---|
| Deep Reinforcement Learning from Human Preferences | Christiano et al. | 2017 | Seminal |
| Training language models to follow instructions with human feedback | Ouyang et al. (OpenAI) | 2022 | Seminal |
| Training a Helpful and Harmless Assistant with RLHF | Bai et al. (Anthropic) | 2022 | Seminal |
Sub-Areas (3)
| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Constitutional AI | Training methodology using explicit principles and AI-generated feedback (RLAIF) to train safer language models. | active | 1 | 1 |
| Direct Preference Optimization | Family of reward-free alignment methods (DPO, KTO, IPO, ORPO, GRPO) that bypass explicit reward model training (see the sketch after this table). | active | 3 | 1 |
| Reward Modeling | Training neural networks on human preference comparisons to provide scalable reward signals for RL fine-tuning. | active | 0 | 1 |
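
As a companion to the Direct Preference Optimization row above, here is a hedged sketch of the DPO objective, which replaces the explicit reward model with implicit rewards derived from policy and reference log-probabilities. Function and argument names (`dpo_loss`, `policy_chosen_logp`, `beta`) are illustrative assumptions; in practice the per-sequence log-probabilities would be computed from the trainable policy and a frozen reference model.

```python
# Sketch: DPO loss on summed per-sequence log-probabilities.
# Names are illustrative assumptions, not a specific library's API.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward for each completion: beta * (log pi_theta - log pi_ref).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up per-sequence log-probabilities for four preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3], requires_grad=True),
                torch.tensor([-13.2, -11.0, -19.8, -9.9], requires_grad=True),
                torch.tensor([-12.5, -10.0, -20.0, -8.0]),
                torch.tensor([-12.8, -10.5, -20.2, -9.0]),
                beta=0.1)
loss.backward()
```

The design trade-off versus classic RLHF: DPO-style methods skip the separate reward-model and RL stages, at the cost of relying on the implicit reward parameterization rather than an independently validated reward signal.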