turntrout.com/research
Alex Turner is a notable AI safety researcher whose work on instrumental convergence and power-seeking AI has been influential; this page serves as a hub for his published research and ongoing projects.
Metadata
Importance: 62/100 · homepage
Summary
This is the research homepage of Alex Turner (TurnTrout), an AI safety researcher known for work on instrumental convergence, power-seeking behavior, and corrigibility. The page likely catalogs his publications and research directions related to understanding and mitigating risks from misaligned AI systems.
Key Points
- Turner has published influential work on instrumental convergence and why AI systems may develop self-preservation drives
- Research focuses on power-seeking behavior in AI and formal frameworks for understanding corrigibility
- Turner has expressed reservations about certain assumptions in mainstream AI alignment approaches
- Work includes both theoretical analysis and practical implications for building safer AI systems
- Associated with DeepMind and the broader AI safety research community
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Power-Seeking AI | Risk | 67.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 32 KB
Tags: AI
About this post
Read time: 22 minutes
Published on October 27th, 2024
Updated on March 26th, 2026
My Research
By Alex Turner
Table of Contents
- Low-impact AI
  - Defining a new impact measure: AUP
  - Scaling the AUP technique to harder tasks
  - Reflections on impact measures
- A formal theory of power-seeking tendencies
  - Reflections on the power-seeking theory
- Shard theory
  - Looking back on shard theory
- Mechanistic interpretability
- Steering vectors
  - Reflections on steering vector work
- Consistency training (against internal model activations)
- Selected research I've mentored
  - Unsupervised capability elicitation
  - Gradient routing
  - Distillation robustifies unlearning
  - Output supervision can obfuscate the CoT
- Footnotes
Over the years, I’ve worked on lots of research problems. Every time, I felt invested in my work. The work felt beautiful. Even though many days have passed since I last daydreamed about instrumental convergence, I’m proud of what I’ve accomplished and discovered.
While not technically a part of my research, I’ve included a photo of myself anyway.
As of November 2023, I am a research scientist on Google DeepMind’s scalable alignment team in the Bay Area.¹ I lead a MATS mentorship team called “Team Shard.” If you want to break into the alignment field, consider applying to work with me. My Google Scholar is here.
This page is chronological. For my most recent work, navigate to the end of the page!
Low-impact AI
Spring 2018 through June 2022
Impact measures—my first (research) love. The hope was:
It seemed hard to get AI to do exactly what we want (like cleaning a room);
It seemed easier to flag down obviously “big deal” actions and penalize those (like making a mess);
By getting the AI to optimize a “good enough” description of what we want while avoiding impactful actions, we could still get useful work out of it.
The question: What does it mean for an action to be a “big deal”? First, I needed to informally answer the question philosophically. Second, I needed to turn the answer into math.
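The hope above can be made concrete with a toy sketch. The code below is an illustrative AUP-style penalty, not Turner’s actual formulation: the function names, the Q-value dictionary, and the specific states and rewards are hypothetical, though the core idea (penalize shifts in attainable utility for auxiliary goals, relative to a no-op baseline) follows the framing described here.

```python
# Hypothetical sketch of an AUP-style impact penalty. All names and the
# data layout are illustrative assumptions, not Turner's code.

def aup_penalty(q_values, state, action, noop, aux_rewards):
    """Average change in attainable utility across auxiliary goals.

    q_values[r][(s, a)] estimates how much of auxiliary reward r the
    agent could still attain after taking action a in state s. Actions
    that shift these estimates away from the no-op baseline count as a
    "big deal" and are penalized.
    """
    return sum(
        abs(q_values[r][(state, action)] - q_values[r][(state, noop)])
        for r in aux_rewards
    ) / len(aux_rewards)

def shaped_reward(task_reward, penalty, lam=1.0):
    """Optimize the (possibly imperfect) task reward, minus a scaled
    penalty for changing what the agent *could* achieve."""
    return task_reward - lam * penalty

# Toy example: action "a" doubles attainable utility for goal r1 but
# leaves r2 untouched, so only r1 contributes to the penalty.
q = {
    "r1": {("s", "a"): 5.0, ("s", "noop"): 3.0},
    "r2": {("s", "a"): 1.0, ("s", "noop"): 1.0},
}
print(aup_penalty(q, "s", "a", "noop", ["r1", "r2"]))  # → 1.0
print(shaped_reward(2.0, 1.0, lam=0.5))                # → 1.5
```

The no-op baseline is what keeps the penalty from punishing the AI for merely existing: only deviations from “doing nothing” register as impact.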
Defining a new impact measure: AUP
After a flawed but fun first stab at the problem, I was itching to notch an AI safety win and find “the correct impact equation.” I felt inspired after a coffee chat with a friend, so I headed to the library, walked up to a whiteboard, and stared at its blank blankness. With Attack on Titan music beating its way through my heart, I stared until inspiration came over me and I simply wrote down a new equation. That new equation went on to become Attainable Utility Preservation (AUP).
The key insight involved a frame shift. Existing work formalized impact as change in
... (truncated, 32 KB total)
Resource ID: d773c5dd9ea6b3c3 | Stable ID: sid_r4F5gQkMjS