Back
Research from Owain Evans and colleagues
webtheinsideview.ai·theinsideview.ai/owain
Part of The Inside View interview series by Michaël Trazzi, featuring conversations with AI safety researchers; Owain Evans is known for foundational work on honest AI and eliciting latent knowledge at ARC and Oxford.
Metadata
Importance: 58/100homepagecommentary
Summary
An interview with Owain Evans, AI safety researcher known for work on scalable oversight, reward modeling, and value alignment. The discussion likely covers his research agenda on eliciting latent knowledge, honest AI, and approaches to ensuring AI systems behave safely and according to human values.
Key Points
- •Owain Evans is a prominent AI safety researcher at Oxford/ARC focused on scalable oversight and honest AI behavior
- •His work includes research on reward modeling, eliciting latent knowledge (ELK), and detecting deceptive alignment
- •The Inside View series provides in-depth interviews with leading AI safety researchers about their work and perspectives
- •Key themes likely include how to verify AI honesty and whether AI systems believe what they say
- •Research explores methods for humans to maintain oversight as AI systems become more capable than human evaluators
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Safety Technical Pathway Decomposition | Analysis | 62.0 |
Cached Content Preview
HTTP 200Fetched Apr 9, 202684 KB
Owain Evans on Situational Awareness
..
2024-08-23
Owain Evans on Situational Awareness
Owain Evans is an AI Alignment researcher, research associate at the Center of Human Compatible AI at UC Berkeley, and now leading a new AI safety research group.
In this episode we discuss two of his recent papers, “ Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs ” and “ Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data ”, alongside some Twitter questions .
(Our conversation is ~2h15 long, so feel free to click on any sub-topic of your liking in the Outline below. At any point you can come back by clicking on the up-arrow ⬆ at the end of sections)
Contents
Highlighted
Me Myself and AI: The Situational Awareness Dataset for LLMs
Defining Situational Awareness
Motivation for the paper in terms of safety
Motivation for the Situational Awareness Dataset
Risks in Releasing the Dataset
Owain’s Reaction to Claude 3 Opus Situational Awareness on the Longform task
Connection to the Needle in a Haystack Pizza Experiment
The Situating Prompt
Connections Between Situational Awareness and Deceptive Alignment
Situational Awareness As Almost Necessary To Get Deceptive Alignment
Forcing a Distribution Over Two Random Words
Discontinuing a Sequence of Fifty 01s
GPT-4 Has Non-Zero Performance On The Longform Task
There Probably Was Not A Lot Of Human-AI Conversations In GPT-4’s Pretraining Data
Are The Questions For The Longform Task Unusual To Ask An Human?
When Will The Situational Awareness Dataset Benchmark Be Saturated?
Safety And Governance Implications If The Situation Awareness Benchmark Becomes Saturated
Implications For Evaluations If The Benchmark Is Saturated
Follow-up Work Owain Suggests Doing
Should We Remove Chain-Of-Thought Altogether?
Out-of-Context Reasoning
What Is Out-Of-Context Reasoning
Experimental Setup
Concrete Example Of Out-Of-Context Reasoning: 3x + 1
How Do We Know It’s Not A Simple Mapping From Something Which Already Existed?
Motivation For Out-Of-Context Reasoning In Terms Of Safety
Are The Out-Of-Context Reasoning Results Surprising At All?
The Biased Coin Task
Will Out-Of-Context Reasoning Continue To Scale?
Checking In-Context Learning Abilities Before Scaling
Should We Be Worried About The Mixture-Of-Functions Results?
Could Models Infer New Architctures From ArXiv With Out-Of-Context Reasoning?
Twitter Questions
How Does Owain Come Up With Ideas
How Owain’s Background Influenced His Research Style And Taste
Should AI Alignment Researchers Aim For Publication
How Can We Apply LLM Understanding To Mitigate Deceptive Alignment?
Could Owain’s Research Accelerate Capabilities?
How Was Owain’s Work Been Received at AI Labs and in Academia
Last Message to the Audience
... (truncated, 84 KB total)Resource ID:
f0e47fd7657fd428 | Stable ID: sid_EZTinhoF7W