Longterm Wiki

Personal website

Web: [jan.leike.name](https://jan.leike.name/)

Data Status

Not fetched

Cited by 3 pages

| Page | Type | Quality |
| --- | --- | --- |
| Jan Leike | Person | 27.0 |
| Scalable Oversight | Safety Agenda | 68.0 |
| Optimistic Alignment Worldview | Concept | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 4 KB
# Jan Leike

Machine learning & alignment researcher

_Optimizing for a post-AGI future where humanity flourishes_

[jan@anthropic.com](mailto:jan@anthropic.com)

[@janleike](https://twitter.com/janleike)

[blog](https://aligned.substack.com/)

[publications](https://jan.leike.name/publications.html)

## About me

I lead the Alignment Science team at Anthropic.
Previously, I co-led the [Superalignment Team](https://openai.com/blog/introducing-superalignment) at OpenAI, where I was involved in the development of [InstructGPT](https://openai.com/research/instruction-following), [ChatGPT](https://openai.com/blog/chatgpt), and the alignment of [GPT-4](https://openai.com/research/gpt-4).
I developed OpenAI’s [approach to alignment research](https://openai.com/blog/our-approach-to-alignment-research/) and co-authored the Superalignment Team’s research roadmap.
Prior to OpenAI, I was an alignment researcher at DeepMind, where I prototyped [reinforcement learning from human feedback](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback).
I hold a PhD in reinforcement learning theory from the Australian National University.
In [2023](https://time.com/collection/time100-ai/6310616/jan-leike/) and [2024](https://time.com/7012867/jan-leike/), TIME magazine listed me as one of the 100 most influential people in AI.

## My Research

My research aims to solve
[the hard problem of alignment](https://aligned.substack.com/p/what-is-alignment):

_How can we train AI systems to follow human intent on tasks that are difficult for humans to evaluate directly?_

My team at Anthropic is researching how to align an [automated alignment researcher](https://aligned.substack.com/p/alignment-mvp), working on [scalable oversight](https://aligned.substack.com/p/ai-assisted-human-feedback), [weak-to-strong generalization](https://openai.com/index/weak-to-strong-generalization/), and robustness to jailbreaks.

Read more:

- [The podcast interview with 80,000 Hours](https://80000hours.org/podcast/episodes/jan-leike-superalignment/) ( [video](https://www.youtube.com/watch?v=ZP_N4q5U3eE)) is currently the best introduction to my thinking in podcast form, especially if you’re coming from machine learning.
- [The podcast interview with Daniel Filan](https://axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html) covers more technical questions and should be interesting to those already somewhat familiar with alignment research.
- [My blog about alignment research](https://aligned.substack.com/).

## Selected Publications

- **[LLM Critics Help Catch LLM Bugs](https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/)**


Nat McAleese, Rai Michael Pokorny, Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike. 2024.
- **[Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision](https://openai.com/index/weak-to-strong-generalization/)**


Collin Burns, Pavel Izmailov, Jan He

... (truncated, 4 KB total)
Resource ID: 2a84eb0982d4de6a | Stable ID: NWNjNGViNm