Longterm Wiki

AI Alignment: Why It's Hard, and Where to Start

Type: web

Author: Eliezer Yudkowsky

Credibility Rating: 3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: MIRI

A foundational introductory talk by Eliezer Yudkowsky (MIRI) presenting the core framing of AI alignment as a technical problem, suitable as an entry point for researchers new to the field.

Metadata

Importance: 78/100
Tags: blog post, educational

Summary

Eliezer Yudkowsky's 2016 Stanford talk introducing the AI alignment problem, covering why coherent advanced AI systems imply utility functions, key technical subproblems (low-impact agents, corrigibility, stable goals under self-modification), and why alignment is both necessary and difficult. The talk also discusses lessons from analogous engineering fields and provides entry points for researchers new to the field.

Key Points

  • Coherent decision-making agents implicitly have utility functions, making goal specification and alignment a fundamental technical challenge (see the money-pump sketch after this list).
  • Key alignment subproblems include low-impact agents, interruptibility/corrigibility (suspend buttons), and maintaining stable goals through self-modification.
  • Alignment is hard because small misspecifications in goals can lead to catastrophic outcomes at high capability levels.
  • Lessons from NASA and cryptography suggest that safety-critical systems require rigorous theoretical foundations before deployment, not just empirical iteration.
  • Provides an accessible overview and reading list for researchers looking to enter the AI alignment field.
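
The first key point is the load-bearing one, and a small worked example helps. As a minimal sketch (hypothetical preferences, not taken from the talk), an agent whose preferences cycle, preferring A to B, B to C, and C to A, can be "money-pumped": it will pay a small fee for every trade up its preference cycle and so can be walked around the cycle indefinitely. Agents immune to this kind of exploitation behave as if they maximize some fixed utility function, which is the sense in which coherent decisions imply one.

```python
# Hypothetical money-pump sketch: an agent with cyclic preferences
# (A preferred to B, B preferred to C, C preferred to A) pays a small
# fee for every trade it strictly prefers, so it can be drained forever.

prefers = {"B": "A", "C": "B", "A": "C"}  # prefers[x] = item preferred over x

SWAP_FEE = 1  # what the agent will pay to trade up to a preferred item

def money_pump(item: str, wealth: int, rounds: int) -> int:
    """Repeatedly offer the agent the item it prefers, charging a fee."""
    for _ in range(rounds):
        item = prefers[item]  # the agent strictly prefers this trade...
        wealth -= SWAP_FEE    # ...so it accepts and pays, every time
    return wealth

print(money_pump("A", wealth=100, rounds=100))  # -> 0: pumped dry
```

An agent maximizing a fixed utility function u can never be cycled this way, since u(A) > u(B) > u(C) > u(A) is impossible; that inexploitability is what the first key point is gesturing at.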

Cited by 1 page

Page | Type | Quality
AI Doomer Worldview | Concept | 38.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 55 KB
AI Alignment: Why It's Hard, and Where to Start - Machine Intelligence Research Institute
AI Alignment: Why It’s Hard, and Where to Start

December 28, 2016

Eliezer Yudkowsky
Back in May, I gave a talk at Stanford University for the Symbolic Systems Distinguished Speaker series, titled “The AI Alignment Problem: Why It’s Hard, And Where To Start.” The video for this talk is now available on YouTube:

We have an approximately complete transcript of the talk and Q&A session here, slides here, and notes and references here. You may also be interested in a shorter version of this talk I gave at NYU in October, “Fundamental Difficulties in Aligning Advanced AI.”

In the talk, I introduce some open technical problems in AI alignment and discuss the bigger picture into which they fit, as well as what it’s like to work in this relatively new field. Below, I’ve provided an abridged transcript of the talk, with some accompanying slides.

Talk outline:

1. Agents and their utility functions
   1.1. Coherent decisions imply a utility function
   1.2. Filling a cauldron
2. Some AI alignment subproblems
   2.1. Low-impact agents
   2.2. Agents with suspend buttons
   2.3. Stable goals in self-modification
3. Why expect difficulty?
   3.1. Why is alignment necessary?
   3.2. Why is alignment hard?
   3.3. Lessons from NASA and cryptography
4. Where we are now
   4.1. Recent topics
   4.2. Older work and basics
   4.3. Where to start

 Agents and their utility functions

In this talk, I’m going to try to answer the frequently asked question, “Just what is it that you do all day long?” We are concerned with the theory of artificial intelligences advanced beyond the present day: systems that make sufficiently high-quality decisions, in the service of whatever goals they were programmed with, to be objects of concern.

 Coherent decisions imply a utility function

 The classic initial stab at this was taken by Isaac Asimov with the Three Laws of Robotics, the first of which is: “A robot may not injure a human being or, through inaction, allow a human being to come to harm.”

And as Peter Norvig observed, the other laws hardly matter, because there will always be some tiny possibility that a human being could come to harm, which keeps the First Law perpetually in play.

 Artificial Intelligence: A Modern Approach has a final chapter that asks, “Well, what if we succeed? What if the AI project actually works?” and observes, “We don’t want our robots to prevent a human from crossing the street because of the non-zero chance of harm.”
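
To make the street-crossing point concrete, here is a minimal sketch with made-up numbers (an illustration, not from the talk): a robot that treats the First Law as an absolute veto refuses any action carrying a non-zero probability of harm, while an expected-utility agent can weigh a tiny risk against the benefit of acting.

```python
# Made-up-numbers sketch contrasting two decision rules for a robot
# deciding whether to let a human cross the street.

def first_law_veto(p_harm: float) -> bool:
    """Asimov-style absolute constraint: act only if harm is impossible."""
    return p_harm == 0.0

def expected_utility_ok(p_harm: float, benefit: float, harm_cost: float) -> bool:
    """Act iff the benefit outweighs the expected cost of harm."""
    return benefit > p_harm * harm_cost

p = 1e-6  # tiny but non-zero chance of harm during the crossing

print(first_law_veto(p))                                    # False: always blocks
print(expected_utility_ok(p, benefit=1.0, harm_cost=1000))  # True: 1.0 > 0.001
```

Under the veto rule the robot never lets anyone cross, which is exactly the failure Russell and Norvig flag; the tradeoff rule is what forces goals onto a common scale, i.e. a utility function.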

 To begin with, I’d like to explain the truly basic reason why the three laws aren’t even on the table—and that is because they’re not

... (truncated, 55 KB total)
Resource ID: 372cee55e4b03787 | Stable ID: sid_pc9U3XwMlH