Longterm Wiki


web
turntrout.com · turntrout.com/research

Alex Turner is a notable AI safety researcher whose work on instrumental convergence and power-seeking AI has been influential; this page serves as a hub for his published research and ongoing projects.

Metadata

Importance: 62/100 · homepage

Summary

This is the research homepage of Alex Turner (TurnTrout), an AI safety researcher known for work on instrumental convergence, power-seeking behavior, and corrigibility. The page likely catalogs his publications and research directions related to understanding and mitigating risks from misaligned AI systems.

Key Points

  • Turner has published influential work on instrumental convergence and why AI systems may develop self-preservation drives
  • Research focuses on power-seeking behavior in AI and formal frameworks for understanding corrigibility
  • Turner has expressed reservations about certain assumptions in mainstream AI alignment approaches
  • Work includes both theoretical analysis and practical implications for building safer AI systems
  • Affiliated with Google DeepMind and the broader AI safety research community

Cited by 1 page

Page | Type | Quality
Power-Seeking AI | Risk | 67.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 32 KB
My Research · The Pond

Tags: AI
Read time: 22 minutes
Published on October 27th, 2024
Updated on March 26th, 2026
 My Research

 By Alex Turner


Table of Contents

• Low-impact AI
  • Defining a new impact measure: AUP
  • Scaling the AUP technique to harder tasks
  • Reflections on impact measures
• A formal theory of power-seeking tendencies
  • Reflections on the power-seeking theory
• Shard theory
  • Looking back on shard theory
• Mechanistic interpretability
• Steering vectors
  • Reflections on steering vector work
• Consistency training (against internal model activations)
• Selected research I’ve mentored
  • Unsupervised capability elicitation
  • Gradient routing
  • Distillation robustifies unlearning
  • Output supervision can obfuscate the CoT
• Footnotes
Over the years, I’ve worked on lots of research problems. Every time, I felt invested in my work. The work felt beautiful. Even though many days have passed since I last daydreamed about instrumental convergence, I’m proud of what I’ve accomplished and discovered.

 While not technically a part of my research, I’ve included a photo of myself anyways. 

As of November 2023, I am a research scientist on Google DeepMind’s scalable alignment team in the Bay Area.1 I lead a MATS mentorship team called “Team Shard.” If you want to break into the alignment field, consider applying to work with me. My Google Scholar is here.

This page is chronological. For my most recent work, navigate to the end of the page!

 Low-impact AI 

 Spring 2018 through June 2022

Impact measures—my first (research) love. The hope was:

• It seemed hard to get AI to do exactly what we want (like cleaning a room).
• It seemed easier to flag down obviously “big deal” actions and penalize those (like making a mess).
• By getting the AI to optimize a “good enough” description of what we want while avoiding impactful actions, we could still get useful work out of the AI.

 
 The question: What does it mean for an action to be a “big deal”? First, I needed to informally answer the question philosophically. Second, I needed to turn the answer into math.

Defining a new impact measure: AUP

After a flawed but fun first stab at the problem, I was itching to notch an AI safety win and find “the correct impact equation.” I felt inspired after a coffee chat with a friend, so I headed to the library, walked up to a whiteboard, and stared at its blank blankness. With Attack on Titan music beating its way through my heart, I stared until inspiration came over me and I simply wrote down a new equation. That new equation went on to become Attainable Utility Preservation (AUP).
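The core of AUP can be sketched in a few lines: the agent’s task reward is reduced by a penalty proportional to how much an action changes its ability to achieve a set of auxiliary goals (its “attainable utilities”). The sketch below is illustrative only, not Turner’s implementation; the function name, the λ value, and the use of plain lists of Q-values are assumptions for a minimal tabular setting.

```python
# Minimal sketch of an AUP-style penalized reward (illustrative, not the
# published implementation). The penalty compares auxiliary Q-values after
# the chosen action against those after a no-op, so "big deal" actions --
# ones that shift what the agent could attain -- are discouraged.

def aup_reward(task_reward, q_aux_action, q_aux_noop, lam=0.1):
    """Return the task reward minus a scaled attainable-utility penalty.

    task_reward  -- R(s, a) for the primary task
    q_aux_action -- auxiliary Q-values Q_i(s, a), one per auxiliary goal i
    q_aux_noop   -- auxiliary Q-values Q_i(s, no-op) for the same goals
    lam          -- penalty strength (hypothetical hyperparameter)
    """
    n = max(len(q_aux_action), 1)  # avoid division by zero with no aux goals
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    return task_reward - lam * penalty / n

# An action that leaves attainable utilities unchanged is unpenalized:
print(aup_reward(1.0, [0.5, 0.2], [0.5, 0.2]))  # 1.0
# An action that shifts them pays a penalty proportional to the total shift:
print(aup_reward(1.0, [1.0, 0.0], [0.5, 0.2]))
```

The design choice to measure change relative to a no-op baseline is what distinguishes this from penalizing raw state change: the agent is free to act, as long as acting does not reshape what it could later attain.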

The key insight involved a frame shift. Existing work formalized impact as change in

... (truncated, 32 KB total)
Resource ID: d773c5dd9ea6b3c3 | Stable ID: sid_r4F5gQkMjS