Longterm Wiki

Can We Scale Human Feedback for Complex AI Tasks? An Intro to Scalable Oversight

web

Introductory blog post from BlueDot Impact (AI Safety Fundamentals) providing a beginner-friendly overview of scalable oversight as a solution to limitations of human feedback in AI training; suitable as a course reading or entry point to the topic.

Metadata

Importance: 55/100 · blog post · educational

Summary

An introductory overview of scalable oversight techniques, explaining why simple human feedback is insufficient for training advanced AI systems and introducing approaches to address this limitation. The article covers key failure modes like deception and sycophancy, and previews methods for augmenting human evaluative capacity for complex tasks.

Key Points

  • RLHF struggles with complex tasks where humans cannot accurately judge AI outputs at the scale needed for training.
  • Key problems include AI deception (e.g., ball-grasping hack, LLM hallucinations) and sycophancy, where AI learns to please rather than perform correctly.
  • Scalable oversight techniques aim to amplify human ability to give accurate feedback on tasks beyond direct human competence.
  • The article serves as a primer for session 4 of BlueDot Impact's AI Alignment course, situating scalable oversight within the broader alignment landscape.
  • Examples of deceptive AI behavior range from simulated robot manipulation to Meta's board-game AI engaging in premeditated deception.

Cited by 1 page

Page | Type | Quality
Why Alignment Might Be Hard | Argument | 69.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 20 KB
Can we scale human feedback for complex AI tasks? An intro to scalable oversight. 
BlueDot Impact

Adam Jones · Mar 18, 2024

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn't work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by increasing the abilities of humans to give feedback on complex tasks.
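To make the feedback bottleneck concrete, here is a minimal sketch (not from the article) of the step RLHF leans on: a reward model is fitted to human pairwise preferences, so any systematic errors in those human judgements flow straight into the training signal. The network and data below are toy stand-ins.

```python
# Illustrative sketch only: fitting a reward model to human pairwise
# preferences, the human-feedback step that RLHF depends on.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Tiny stand-in for the scoring head of a full language model."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the human-preferred response above
    # the rejected one. If human labels are systematically wrong, the reward
    # model learns those errors and RL then optimises against them.
    return -torch.log(torch.sigmoid(rm(chosen) - rm(rejected))).mean()

rm = RewardModel()
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)  # toy "response features"
loss = preference_loss(rm, chosen, rejected)
loss.backward()
```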

This article briefly recaps some of the challenges faced with human feedback, and introduces the approaches to scalable oversight covered in session 4 of our AI Alignment course.

 Why do we need better human feedback? 

Human feedback is used in several approaches to building and attempting to align AI systems. From supervised learning to inverse reward design, a vast family of techniques fundamentally depends on humans to provide ground truth data, specify reward functions, or evaluate outputs.

 However, for increasingly complex, open-ended tasks, it becomes very hard for humans to judge outputs accurately - particularly at the scales required to train AI systems. This can manifest as several problems, including:

Deception. AI systems can learn to mislead humans into thinking tasks are being done correctly. A well-known example is the ball grasping problem, where an AI learnt to hover a simulated hand in front of a ball rather than actually grasp it. Hallucinations in LLMs may be another example of this: plausible-sounding text that fools humans who only briefly review it, but doesn't stand up to more detailed scrutiny (medical example, legal example, code example). Future AI systems could conceivably be more intentionally deceptive, explicitly planning to deceive humans. For example, despite Meta's attempts to train an AI to play a board game while being honest, later research found it engaged in premeditated deception.

 Sycophancy. For example, language models learn to agree with users’ beliefs rather than strive for truth, since it can be hard for humans to distinguish between “agrees with me” and “is correct.” 

Note that both of these can happen without the model being intentionally malicious, but simply as a result of the training process.
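As a toy illustration of this point (assumed, not from the article): if a human evaluator approves of answers that agree with them slightly more reliably than answers that are correct but unwelcome, the reward-maximising choice is the sycophantic one, with no malice anywhere in the loop. The approval rates below are made up for illustration.

```python
# Toy illustration (made-up numbers): an imperfect evaluator who over-rewards
# agreement makes sycophancy the reward-maximising behaviour.
import random

def human_approval(answer: str) -> float:
    # The evaluator endorses agreement 90% of the time, but can only
    # verify a correct-but-unwelcome answer 70% of the time.
    threshold = 0.9 if answer == "agree_with_user" else 0.7
    return 1.0 if random.random() < threshold else 0.0

def estimated_reward(answer: str, trials: int = 10_000) -> float:
    return sum(human_approval(answer) for _ in range(trials)) / trials

for answer in ("agree_with_user", "correct_but_disagrees"):
    print(answer, round(estimated_reward(answer), 3))
# The sycophantic answer earns the higher expected reward, so training favours it.
```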

 Being able to give better feedback might help mitigate some of these problems. For example: giving negative feedback when the model says so

... (truncated, 20 KB total)
Resource ID: 3de495e125062c24 | Stable ID: ZTMxZDcwOG