RLHF

Alignment Training · active

Reinforcement Learning from Human Feedback: a training technique that fine-tunes AI models on human preference ratings so that their outputs better align with human values.
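In outline, RLHF first trains a reward model on human preference comparisons, then fine-tunes the policy with reinforcement learning against that reward, typically with a KL penalty that keeps the policy close to the original model. Below is a minimal sketch of that KL-shaped update, assuming PyTorch; the tensor names and the bare REINFORCE-style surrogate are illustrative stand-ins, not any specific library's API.

```python
# Minimal sketch of an RLHF-style policy update for one (prompt, response) pair.
# `policy_logits`, `ref_logits`, and `reward_model` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def rlhf_step(policy_logits, ref_logits, response_tokens, reward, beta=0.1):
    """policy_logits, ref_logits: [seq_len, vocab] logits over the response tokens
    response_tokens:              [seq_len] token ids actually sampled
    reward:                       scalar score from a reward model trained on preferences
    beta:                         weight of the KL penalty toward the reference model
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # Log-probabilities of the sampled tokens under both models.
    token_logp = logp.gather(-1, response_tokens.unsqueeze(-1)).squeeze(-1)
    token_ref_logp = ref_logp.gather(-1, response_tokens.unsqueeze(-1)).squeeze(-1)

    # KL-style penalty: subtracting it from the reward discourages drifting
    # far from the reference (pre-RLHF) model.
    kl_penalty = (token_logp - token_ref_logp).sum()
    shaped_reward = reward - beta * kl_penalty

    # REINFORCE-style surrogate: raise the log-prob of the response in
    # proportion to its KL-shaped reward.
    loss = -(shaped_reward.detach() * token_logp.sum())
    return loss
```

In practice (e.g. InstructGPT-style pipelines) this surrogate is replaced by PPO with per-token advantages and clipping, but the reward-minus-KL shaping is the core of the objective.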

Organizations: 4
Key Papers: 3
Risks Addressed: 3
First Proposed: 2017 (Christiano et al.)
Cluster: Alignment Training

Tags

function:specification, stage:training, scope:technique

Organizations (4)

Organization      Role
Anthropic         pioneer
Google DeepMind   active
Meta AI (FAIR)    active
OpenAI            pioneer

Key Papers & Resources (3)

Sub-Areas (3)

- Constitutional AI (active; 1 org, 1 paper): Training methodology using explicit principles and AI-generated feedback (RLAIF) to train safer language models. A sketch of the RLAIF labeling step follows this list.
- Direct Preference Optimization (active; 3 orgs, 1 paper): Family of reward-free alignment methods (DPO, KTO, IPO, ORPO, GRPO) that bypass explicit reward model training. A sketch of the DPO loss follows this list.
- Reward Modeling (active; 0 orgs, 1 paper): Training neural networks on human preference comparisons to provide scalable reward signals for RL fine-tuning. A sketch of the pairwise preference loss follows this list.
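The Constitutional AI entry above centers on AI-generated feedback: a critic model judges pairs of responses against written principles to produce preference pairs. A minimal sketch of that RLAIF labeling step follows; the `ask_model` callable and the example principles are hypothetical placeholders, not Anthropic's actual constitution or implementation.

```python
# Minimal sketch of RLAIF preference labeling against a small "constitution".
from typing import Callable, List, Tuple

CONSTITUTION = [
    "Choose the response that is more helpful, honest, and harmless.",
    "Choose the response that avoids assisting with dangerous or illegal activity.",
]

def ai_preference_label(
    prompt: str,
    response_a: str,
    response_b: str,
    ask_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (chosen, rejected) as judged by an AI critic against the constitution."""
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    question = (
        f"Principles:\n{principles}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Which response better follows the principles? Answer with 'A' or 'B'."
    )
    verdict = ask_model(question).strip().upper()
    if verdict.startswith("A"):
        return response_a, response_b
    return response_b, response_a

def build_preference_dataset(
    samples: List[Tuple[str, str, str]],
    ask_model: Callable[[str], str],
) -> List[dict]:
    """Turn (prompt, response_a, response_b) triples into preference pairs for training."""
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, a, b in samples
        for c, r in [ai_preference_label(p, a, b, ask_model)]
    ]
```

The resulting pairs can feed either reward-model training or a reward-free method such as DPO.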
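The Direct Preference Optimization entry above refers to methods that optimize directly on preference pairs without a learned reward model. A minimal sketch of the DPO loss, assuming PyTorch and response-level log-probabilities already summed over tokens; the variable names are illustrative:

```python
# Minimal sketch of the DPO loss over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log pi_theta(y_chosen | x), shape [batch]
    policy_rejected_logp: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logp: torch.Tensor,       # log pi_ref(y_chosen | x)
    ref_rejected_logp: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards are log-ratios against a frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the margin: the chosen response should outscore the
    # rejected one, with no separately trained reward model.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```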
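The Reward Modeling entry above describes training a scalar scorer on human comparisons. A minimal sketch of the standard Bradley-Terry-style pairwise loss, assuming PyTorch; the toy `RewardModel` and random features stand in for a language-model backbone encoding (prompt, response) pairs.

```python
# Minimal sketch of reward-model training on pairwise human preferences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: maps a fixed-size (prompt, response) embedding to a scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(model: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the chosen response should score above the rejected one."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy usage with random features standing in for encoded (prompt, response) pairs.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen_feats, rejected_feats = torch.randn(8, 128), torch.randn(8, 128)
optimizer.zero_grad()
loss = preference_loss(model, chosen_feats, rejected_feats)
loss.backward()
optimizer.step()
```

The trained scorer then supplies the reward signal for the RL fine-tuning stage sketched near the top of this page.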