Research from Owain Evans and colleagues

web

Part of The Inside View interview series by Michaël Trazzi, featuring conversations with AI safety researchers; Owain Evans is known for foundational work on honest AI and eliciting latent knowledge at ARC and Oxford.

Metadata

Importance: 58/100homepagecommentary

Summary

An interview with Owain Evans, AI safety researcher known for work on scalable oversight, reward modeling, and value alignment. The discussion likely covers his research agenda on eliciting latent knowledge, honest AI, and approaches to ensuring AI systems behave safely and according to human values.

Key Points

•Owain Evans is a prominent AI safety researcher at Oxford/ARC focused on scalable oversight and honest AI behavior
•His work includes research on reward modeling, eliciting latent knowledge (ELK), and detecting deceptive alignment
•The Inside View series provides in-depth interviews with leading AI safety researchers about their work and perspectives
•Key themes likely include how to verify AI honesty and whether AI systems believe what they say
•Research explores methods for humans to maintain oversight as AI systems become more capable than human evaluators

Cited by 1 page

Page	Type	Quality
AI Safety Technical Pathway Decomposition	Analysis	62.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202684 KB

Owain Evans on Situational Awareness 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 .. 
 
 2024-08-23 
 

 Owain Evans on Situational Awareness

 
 
 
 
 
 
 

 Owain Evans is an AI Alignment researcher, research associate at the Center of Human Compatible AI at UC Berkeley, and now leading a new AI safety research group.

 In this episode we discuss two of his recent papers, “ Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs ” and “ Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data ”, alongside some Twitter questions .

 (Our conversation is ~2h15 long, so feel free to click on any sub-topic of your liking in the Outline below. At any point you can come back by clicking on the up-arrow ⬆ at the end of sections) 

 Contents

 
 Highlighted 

 Me Myself and AI: The Situational Awareness Dataset for LLMs 
 
 Defining Situational Awareness 

 Motivation for the paper in terms of safety 

 Motivation for the Situational Awareness Dataset 

 Risks in Releasing the Dataset 

 Owain’s Reaction to Claude 3 Opus Situational Awareness on the Longform task 

 Connection to the Needle in a Haystack Pizza Experiment 

 The Situating Prompt 

 Connections Between Situational Awareness and Deceptive Alignment 

 Situational Awareness As Almost Necessary To Get Deceptive Alignment 

 Forcing a Distribution Over Two Random Words 

 Discontinuing a Sequence of Fifty 01s 

 GPT-4 Has Non-Zero Performance On The Longform Task 

 There Probably Was Not A Lot Of Human-AI Conversations In GPT-4’s Pretraining Data 

 Are The Questions For The Longform Task Unusual To Ask An Human? 

 When Will The Situational Awareness Dataset Benchmark Be Saturated? 

 Safety And Governance Implications If The Situation Awareness Benchmark Becomes Saturated 

 Implications For Evaluations If The Benchmark Is Saturated 

 Follow-up Work Owain Suggests Doing 

 Should We Remove Chain-Of-Thought Altogether? 

 
 

 Out-of-Context Reasoning 
 
 What Is Out-Of-Context Reasoning 

 Experimental Setup 

 Concrete Example Of Out-Of-Context Reasoning: 3x + 1 

 How Do We Know It’s Not A Simple Mapping From Something Which Already Existed? 

 Motivation For Out-Of-Context Reasoning In Terms Of Safety 

 Are The Out-Of-Context Reasoning Results Surprising At All? 

 The Biased Coin Task 

 Will Out-Of-Context Reasoning Continue To Scale? 

 Checking In-Context Learning Abilities Before Scaling 

 Should We Be Worried About The Mixture-Of-Functions Results? 

 Could Models Infer New Architctures From ArXiv With Out-Of-Context Reasoning? 

 
 

 Twitter Questions 
 
 How Does Owain Come Up With Ideas 

 How Owain’s Background Influenced His Research Style And Taste 

 Should AI Alignment Researchers Aim For Publication 

 How Can We Apply LLM Understanding To Mitigate Deceptive Alignment? 

 Could Owain’s Research Accelerate Capabilities? 

 How Was Owain’s Work Been Received at AI Labs and in Academia 

 Last Message to the Audience

... (truncated, 84 KB total)

Resource ID: f0e47fd7657fd428 | Stable ID: sid_EZTinhoF7W