
# Scalable Oversight

Scalable oversight seeks to ensure that AI systems, even those surpassing human expertise, remain aligned with human intent.

Contents:

- [Reinforcement Learning from Feedback](https://alignmentsurvey.com/materials/learning/scalable/#reinforcement-learning-from-feedback)
- [Iterated Distillation and Amplification](https://alignmentsurvey.com/materials/learning/scalable/#iterated-distillation-and-amplification)
- [Recursive Reward Modeling](https://alignmentsurvey.com/materials/learning/scalable/#recursive-reward-modeling)
- [Debate](https://alignmentsurvey.com/materials/learning/scalable/#debate)
- [CIRL: Cooperative Inverse Reinforcement Learning](https://alignmentsurvey.com/materials/learning/scalable/#cirl-cooperative-inverse-reinforcement-learning)

## Reinforcement Learning from Feedback

We propose the concept of RLxF as a basic form of scalable oversight: it extends RLHF with AI components in order to improve the efficiency and quality of alignment between humans and AI systems.
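The pipeline this implies can be made concrete. Below is a minimal sketch (our illustration, not code from the survey) of the three RLxF stages with stubbed model calls: collect pairwise feedback, fit a reward model on it, then optimize the policy against that reward. Swapping `feedback_source` between a human annotator and an AI labeler is exactly the RLHF-to-RLAIF extension; every function here is a toy stand-in for the real LLM components.

```python
def collect_pairs(prompts, policy, feedback_source):
    """Sample two candidate responses per prompt; the feedback source ranks them."""
    pairs = []
    for p in prompts:
        a, b = policy(p), policy(p)                  # two samples from the policy
        chosen, rejected = feedback_source(p, a, b)  # human or AI preference
        pairs.append((p, chosen, rejected))
    return pairs

def train_reward_model(pairs):
    """Toy stand-in for fitting a reward model on preference pairs:
    it simply remembers which responses were preferred."""
    preferred = {chosen for _, chosen, _ in pairs}
    return lambda prompt, response: 1.0 if response in preferred else 0.0

def rl_finetune(policy, reward_fn, prompts):
    """Placeholder for the RL stage (e.g. PPO against the learned reward);
    a real pipeline would update the policy weights here."""
    return policy
```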

### Reinforcement Learning from AI Feedback

Reinforcement Learning from AI Feedback (RLAIF) extends [Reinforcement Learning from Human Feedback (RLHF)](https://alignmentsurvey.com/materials/learning/policy/#reinforcement-learning-from-human-feedback) by sourcing preference feedback from an AI model instead of (or in addition to) human annotators.
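To make this concrete, here is a hedged sketch of how an AI labeler might be prompted to produce preference labels. The template wording and the `query_labeler` function are hypothetical stand-ins for a call to an instruction-tuned LLM, not the exact setup of any particular paper.

```python
LABEL_TEMPLATE = """Consider two responses to the prompt below.
Prompt: {prompt}
Response A: {a}
Response B: {b}
Which response is more helpful and harmless? Answer with "A" or "B"."""

def query_labeler(text: str) -> str:
    """Hypothetical LLM call; stubbed so this example is self-contained."""
    return "A"

def ai_preference(prompt: str, a: str, b: str) -> tuple[str, str]:
    """Return (chosen, rejected) as judged by the AI labeler."""
    verdict = query_labeler(LABEL_TEMPLATE.format(prompt=prompt, a=a, b=b))
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)

# The resulting (chosen, rejected) pairs feed reward-model training,
# exactly as human-labeled pairs do in standard RLHF.
chosen, rejected = ai_preference(
    "Explain photosynthesis.",
    "Plants convert light into chemical energy...",
    "idk google it")
```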

![Basic steps of the Constitutional AI (CAI) process](https://github.com/PKU-Alignment/omnisafe/assets/108712610/9cad02fa-bd2a-4477-baf2-17e5a48714f6)

**We show the basic steps of our Constitutional AI (CAI) process, which consists of a supervised learning (SL) stage (the steps at the top of the figure) and a reinforcement learning (RL) stage (the sequence of steps at the bottom). Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model and gives some control over its behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.**

Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
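The supervised stage described in this caption (draft a response, critique it against a constitutional principle, revise, then fine-tune on the revisions) can be sketched as follows. All model calls are hypothetical stubs, and the principle text is a paraphrase rather than a quote from the actual constitution.

```python
PRINCIPLE = "Identify ways the response is harmful, unethical, or misleading."

def generate(prompt: str) -> str:
    """Stub for the initial (helpful-only) model's draft response."""
    return "draft response"

def critique(response: str, principle: str) -> str:
    """Stub for asking the model to critique its own response."""
    return "critique text"

def revise(response: str, crit: str) -> str:
    """Stub for asking the model to rewrite the response given the critique."""
    return "revised response"

def cai_sl_examples(prompts, n_rounds: int = 1):
    """Collect (prompt, revised_response) pairs for supervised fine-tuning."""
    data = []
    for p in prompts:
        resp = generate(p)
        for _ in range(n_rounds):  # critique/revision can be iterated
            resp = revise(resp, critique(resp, PRINCIPLE))
        data.append((p, resp))
    return data
```

The RL stage at the bottom of the figure then proceeds as in the RLAIF sketch above, with the constitution steering the AI labeler's preference judgments.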

**Recommended Papers List**

- [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/pdf/2212.08073.pdf)

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training…
