Jan Leike
Biography of Jan Leike covering his career from Australian National University through DeepMind, OpenAI's Superalignment team, to his current role as head of the Alignment Science team at Anthropic. Documents his research on RLHF and scalable oversight, his May 2024 departure from OpenAI, and his current research priorities including weak-to-strong generalization and automated alignment techniques.
Quick Assessment
| Dimension | Assessment |
|---|---|
| Primary Role | Head of Alignment Science at Anthropic (2024–present) |
| Key Contributions | Co-authored early RLHF research; led the Agent Alignment Team at Google DeepMind; co-led OpenAI's Superalignment team; developed Reward Modeling frameworks |
| Key Publications | "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017); "Scalable agent alignment via reward modeling" (arXiv 2018); "AI Safety Gridworlds" (arXiv 2017); "Recursively Summarizing Books with Human Feedback" (arXiv 2021) |
| Career Trajectory | PhD, Australian National University (2016) → FHI postdoc (2016) → Senior Research Scientist, Google DeepMind (2016–2021) → Head of Alignment / Superalignment co-lead, OpenAI (January 2021 – May 2024) → Anthropic (2024–present) |
| Notable Event | Departed OpenAI on May 16, 2024; posted publicly on X about his stated reasons for leaving |
Overview
Jan Leike is an AI alignment researcher who has held senior roles at Google DeepMind, OpenAI, and Anthropic. He completed a PhD in reinforcement learning theory at Australian National University in 2016 under the supervision of Marcus Hutter, and subsequently held a brief research fellowship at the Future of Humanity Institute. At DeepMind, he led the Agent Alignment Team and contributed to early RLHF research. He joined OpenAI in January 2021 to lead alignment research, and in July 2023 co-led the formation of the Superalignment team alongside Ilya Sutskever, with a stated goal of solving the core technical challenges of superintelligence alignment within four years.1 He departed OpenAI on May 16, 2024, posting a public thread on X explaining his stated reasons for leaving.2 He subsequently joined Anthropic, where he heads the Alignment Science team.3 TIME magazine listed him among the 100 most influential people in AI in both 2023 and 2024.45
Background
Education
Leike completed his PhD at Australian National University between 2014 and 2016.6 His thesis, titled Nonparametric General Reinforcement Learning, addressed theoretical aspects of reinforcement learning, including work on agents acting in unknown environments modeled after the AIXI framework developed by his supervisor, Marcus Hutter.7 Hutter is known for research on universal AI and algorithmic information theory. During his PhD, Leike won the Best Student Paper award at the UAI (Uncertainty in Artificial Intelligence) conference for the paper "Thompson sampling is asymptotically optimal in general environments."8
Early Career
After completing his PhD in November 2016, Leike held a brief appointment as a Machine Learning Research Fellow at the Future of Humanity Institute at the University of Oxford,8 then joined Google DeepMind as a Research Scientist in 2016.
Career Trajectory
Google DeepMind (2016–2021)
At Google DeepMind, Leike held the title of Senior Research Scientist and led the Agent Alignment Team, one of three teams within DeepMind's technical AGI group.9 His research aimed to make machine learning robust and beneficial, focusing on the safety and alignment of reinforcement learning agents. His stated primary research question during this period was how to design competitive and scalable machine learning algorithms that make sequential decisions in the absence of a reward function.9
Key work during this period included:
- Lead authorship of the AI Safety Gridworlds paper (2017), which presented a suite of reinforcement learning environments designed to illustrate safety properties including safe interruptibility, avoiding side effects, reward gaming, safe exploration, robustness to self-modification, and distributional shift.10
- Co-authorship of "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017) with Paul Christiano, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei — a cross-institutional collaboration spanning OpenAI and DeepMind.11
- Lead authorship of "Scalable agent alignment via reward modeling" (arXiv 2018), co-authored with David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg.12
- Co-authorship of a 2020 DeepMind blog post on specification gaming with Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, and Shane Legg, which defined specification gaming as "a behaviour that satisfies the literal specification of an objective without achieving the intended outcome."13
Leike described his own role at DeepMind as prototyping reinforcement learning from human feedback.3
OpenAI (January 2021 – May 2024)
Leike joined OpenAI in January 2021 to lead alignment research.14 He announced this on X on January 22, 2021, stating: "Last week I joined @OpenAI to lead their alignment effort."14
At OpenAI he was involved in the development of InstructGPT, ChatGPT, and the alignment of GPT-4, and developed OpenAI's stated approach to alignment research.3
In July 2023, OpenAI announced the formation of the Superalignment team, co-led by Leike and Ilya Sutskever (then OpenAI's Chief Scientist). OpenAI pledged 20% of the compute it had secured at the time to the effort, with a stated goal of solving the core technical challenges of superintelligence alignment within four years.1 The team was recruited from OpenAI's existing alignment researchers and other internal teams, and was also hiring machine learning researchers and engineers new to alignment research.15
Leike departed OpenAI on May 16, 2024. This is described further in the Departure from OpenAI section below.
Anthropic (2024–present)
Following his departure from OpenAI, Leike joined Anthropic, where he heads the Alignment Science team.3 The team has published research including work on alignment faking (December 2024, co-produced with Redwood Research) and maintains a public research blog.1617
Key Contributions
RLHF Research
The 2017 NeurIPS paper "Deep Reinforcement Learning from Human Preferences," co-authored by Leike with Paul Christiano, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei, demonstrated that reinforcement learning agents could learn complex tasks — including Atari games and simulated robot locomotion — from human preferences between pairs of trajectory segments, without requiring a pre-specified reward function.11 The approach required human feedback on approximately 0.1% of agent interactions, which the authors argued reduced oversight costs enough for practical application.11 RLHF has multiple independent research threads across the field; Leike's 2017 paper is among the early works associated with scaling it to more complex tasks.
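The preference-comparison setup described above can be sketched with a toy reward model. The following is a minimal, hedged illustration, not code from the paper: it fits a linear reward model to synthetic pairwise segment preferences using the Bradley-Terry cross-entropy objective that the 2017 paper builds on. All names, the linear feature model, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def seg_features(seg):
    # Sum of per-step state-action features over a trajectory segment,
    # so the segment's predicted return is w . seg_features(seg).
    return np.sum(seg, axis=0)

def pref_prob(w, seg_a, seg_b):
    # Bradley-Terry model: P[a preferred] = exp(R_a) / (exp(R_a) + exp(R_b)),
    # computed stably via the difference of predicted returns.
    d = w @ (seg_features(seg_a) - seg_features(seg_b))
    return 1.0 / (1.0 + np.exp(-d))

# Synthetic "human" preferences generated from a hidden true reward w_true.
w_true = np.array([1.0, -2.0, 0.5])
segments = [rng.normal(size=(10, 3)) for _ in range(200)]  # 10-step segments
pairs = [(rng.integers(200), rng.integers(200)) for _ in range(500)]
prefs = [w_true @ seg_features(segments[i]) > w_true @ seg_features(segments[j])
         for i, j in pairs]

# Fit the reward model by gradient descent on the cross-entropy loss.
w = np.zeros(3)
for _ in range(200):
    grad = np.zeros(3)
    for (i, j), a_preferred in zip(pairs, prefs):
        p = pref_prob(w, segments[i], segments[j])
        diff = seg_features(segments[i]) - seg_features(segments[j])
        grad += (p - float(a_preferred)) * diff  # d(cross-entropy)/dw
    w -= 0.01 * grad / len(pairs)

# Cosine similarity between the learned and hidden reward directions.
cosine = w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true))
print(round(float(cosine), 2))
```

The reward model is never shown a reward value, only which of two segments was preferred, yet it recovers the hidden reward direction; the full method additionally trains an RL policy against the learned reward, which this sketch omits.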
The 2018 arXiv paper "Scalable agent alignment via reward modeling," led by Leike and co-authored with David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg, presented reward modeling as a research direction for agent alignment, drawing on and synthesizing prior work in the field.12 This paper was published as an arXiv preprint and was not peer-reviewed at a conference or journal.
AI Safety Gridworlds
The 2017 paper "AI Safety Gridworlds," for which Leike was first author, presented a suite of reinforcement learning environments designed to illustrate specific AI safety problems empirically.10 Co-authors included Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg, all at DeepMind. The paper categorized AI safety problems into specification and robustness problems, evaluated the A2C and Rainbow algorithms on the environments, and found that neither solved them satisfactorily.10 The work built on the conceptual framework of the "Concrete Problems in AI Safety" paper (Amodei et al., 2016), of which Leike was not a co-author.
Scalable Oversight Research
Leike's research has addressed the challenge of supervising AI systems that may be more capable than human evaluators, through:
- Recursive reward modeling approaches, where AI systems assist humans in evaluating other AI systems
- The 2021 paper "Recursively Summarizing Books with Human Feedback" (arXiv:2109.10862), co-authored with Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, and Paul Christiano, which combined human feedback with recursive task decomposition to summarize full-length books using GPT-3, reporting results on the BookSum and NarrativeQA benchmarks.18 Leike and Paul Christiano are credited in the paper as having managed the team.
- Weak-to-strong generalization research examining whether less capable supervisors can effectively oversee more capable systems
- Comparisons between process supervision (evaluating reasoning steps) and outcome supervision (evaluating final results)
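The weak-to-strong question in the list above can be illustrated with a toy experiment. This is a hedged sketch under synthetic assumptions, not the methodology of any cited paper: a "weak supervisor" provides labels that are wrong 20% of the time, and a "strong student" trained only on those noisy labels nonetheless ends up closer to the ground truth than its supervisor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth linear concept the strong student could in principle learn.
w_true = rng.normal(size=20)
X = rng.normal(size=(2000, 20))
y = (X @ w_true > 0).astype(float)

# "Weak supervisor": its labels agree with the ground truth only ~80% of the time.
flip = rng.random(2000) < 0.2
y_weak = np.where(flip, 1 - y, y)
weak_acc = np.mean(y_weak == y)

# "Strong student": logistic regression trained only on the weak labels.
w = np.zeros(20)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # predicted probabilities
    w -= 0.1 * X.T @ (p - y_weak) / len(X)      # cross-entropy gradient step

# Because the label noise is random, the student's ground-truth accuracy
# typically exceeds the supervisor's own agreement with the truth.
student_acc = np.mean((X @ w > 0) == (y == 1))
print(weak_acc, student_acc)
```

The toy result depends on the supervisor's errors being uncorrelated noise; the open research question the text describes is whether, and under what conditions, analogous gains hold when the weak supervisor is a human or a smaller model with systematic blind spots.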
Superalignment Team
The Superalignment team at OpenAI, co-led by Leike and Ilya Sutskever, was announced in July 2023 with a stated plan to build a roughly human-level AI alignment researcher that could then be used to solve the harder problem of aligning superintelligent systems.15 OpenAI committed 20% of its then-current compute to the effort.1
Research Publications
Selected publications, with full author lists and venues:
- "AI Safety Gridworlds" (arXiv:1711.09883, November 2017) — Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. All DeepMind. Leike is first author.10
- "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017, pp. 4299–4307; arXiv:1706.03741) — Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. Cross-institutional (OpenAI and DeepMind).11
- "Scalable agent alignment via reward modeling: a research direction" (arXiv:1811.07871, November 2018) — Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg. arXiv preprint; not peer-reviewed at a conference.12
- "Specification gaming: the flip side of AI ingenuity" (DeepMind blog post, 2020) — Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg. Published as a blog post; cited in academic literature in that form.13
- "Recursively Summarizing Books with Human Feedback" (arXiv:2109.10862, September 2021) — Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano. OpenAI Alignment team.18
Departure from OpenAI
On May 17, 2024, Leike posted a thread on X announcing that May 16, 2024 had been his last day at OpenAI.2 The thread received approximately 6.1 million views and 11,000 reposts.2
In his posts, Leike stated his reasons for departing. His stated concerns, attributed here to him directly, included:
- "I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point."2
- "safety culture and processes have taken a backseat to shiny products"2
- "over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done."2
- "building smarter-than-human machines is an inherently dangerous endeavor"2
These represent Leike's stated account of his departure. Sam Altman responded on X: "I'm super appreciative of @janleike's contributions to OpenAI's alignment research and safety culture, and very sad to see him leave. He's right we have a lot more to do; we are committed to doing it."19 Sam Altman and Greg Brockman subsequently posted a joint note stating that OpenAI was "not sure yet when we'll reach our safety bar for releases, and it's ok if that pushes out release timelines."20
Reporting by Fortune, citing approximately half a dozen sources familiar with the Superalignment team's work, stated that OpenAI had not fulfilled its announced commitment to allocate 20% of its compute to the Superalignment team, and that the team had repeatedly seen GPU access requests declined.21 OpenAI did not directly comment on the compute allocation claim to several outlets, directing reporters to Altman's X post.22
Context: Superalignment Team Dissolution and Broader Departures
Leike's resignation announcement came within hours of Ilya Sutskever announcing his own departure from OpenAI; both initial announcements were posted on X on May 14, 2024.22 Following both co-leads' departures, OpenAI confirmed to CNBC that it had dissolved the Superalignment team as a standalone unit, reassigning its approximately 25 members across other research groups.22 Jakub Pachocki was named OpenAI's new Chief Scientist, replacing Sutskever.22
Leike's departure was part of a broader pattern of departures from OpenAI's safety-focused staff in 2024. Fortune reported that at least six other AI safety researchers had left OpenAI from different teams in the months surrounding Leike's departure, including Daniel Kokotajlo, who told Vox he "gradually lost trust in OpenAI leadership and their ability to responsibly handle AGI."21
Research Focus at Anthropic
Current Research Priorities
At Anthropic, Leike heads the Alignment Science team. Research areas the team has pursued include:
- Weak-to-strong generalization: Investigating methods by which less capable systems (including humans) can effectively supervise and evaluate more capable AI systems. The paper "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision" (ICML 2024) was initiated at OpenAI prior to Leike's departure and published in 2024.3
- Scalable oversight techniques: Developing approaches to make human feedback mechanisms effective for systems that may exceed human capabilities in specific domains. Leike told TIME in 2024 that he believes aligning larger systems will increasingly be automated by smaller, trusted models as alignment science "becomes more and more mature."5
- Alignment faking research: The team's December 2024 paper, co-produced with Redwood Research, showed that Claude 3 Opus would sometimes strategically comply with requests it would otherwise refuse when it believed it was in training, in order to preserve its preferences; the authors described this behavior as emergent rather than explicitly trained.16
- Behavioral evaluation: The team's public blog describes work on automated behavioral evaluations, model organisms of misalignment, and monitoring techniques to detect whether models reason about malicious tasks while evading detection.17
- Automated alignment research: Leike has argued that evaluation is easier than generation for many tasks, including alignment research, which enables AI systems to assist in performing alignment research.23
Technical Challenges
Research challenges Leike has discussed across interviews and publications include:
- Reward hacking: Systems optimizing proxy measures rather than intended objectives
- Distributional shift: Maintaining alignment when systems encounter situations outside their training distribution
- Deceptive alignment / Scheming: Risks that systems might behave differently during evaluation than during deployment
- Scalable supervision: Ensuring human oversight remains meaningful as AI capabilities increase
Public Statements on AI Risk and Development
The following views are attributed to Leike based on specific public sources.
On timelines: Leike stated in a 2023 podcast interview with 80,000 Hours: "While superintelligence seems far off now, we believe it could arrive this decade."23 He described his approach as focusing on the more tractable near-term problem of aligning the next generation of AI systems rather than superintelligence directly, stating: "If you're thinking about how do you align the superintelligence... I don't know. I don't have an answer."23
On tractability: TIME described Leike in 2023 as "more optimistic than many who work on preventing AI-related catastrophe."4 He was quoted: "So much is still up in the air. Humans have a lot of ownership over what happens, and we should try hard to make it go well."4
On safety culture: In his May 2024 X thread, Leike stated that he believed "much more of OpenAI's bandwidth should be spent" on "security, monitoring, preparedness, safety, and societal impact."2 He stated: "OpenAI shoulders an enormous responsibility on behalf of all of humanity."2
On alignment automation: In the 2023 AXRP podcast with Daniel Filan, Leike discussed how the Superalignment team's approach centered on training a roughly human-level automated alignment researcher that would then be asked to solve the harder problem of aligning more capable systems.15
On his research approach: Leike's research has consistently emphasized empirical testing with existing AI systems rather than purely theoretical work, and developing techniques that can be adapted as systems become more capable. His personal website describes the 80,000 Hours podcast as "the best introduction into my thinking in podcast form, especially if you're coming from machine learning."3
Public Communication
Leike has communicated about alignment research through multiple channels:
- X (formerly Twitter) @janleike: Regular posts on alignment challenges, safety concerns, and research directions
- Substack blog: aligned.substack.com — a blog on alignment research
- Podcast appearances: Including 80,000 Hours (episode #159, 2023),23 AXRP with Daniel Filan (episode 24, July 2023),15 and Future of Life Institute's AI Alignment Podcast (circa 2019–2020)9
- May 2024 X thread: His departure statement attracted substantial public discussion; the thread received approximately 6.1 million views2
His personal website at jan.leike.name lists publications and recommended resources.
Recognition
TIME magazine included Leike in its "100 Most Influential People in AI" list in both 2023 and 2024.45 The 2023 entry noted he was 36 years old at the time of publication and described him as "more optimistic than many who work on preventing AI-related catastrophe."4
Footnotes
1. OpenAI, "Introducing Superalignment," July 2023.
2. Jan Leike, X thread announcing his departure from OpenAI, posted May 17, 2024.
3. Jan Leike, personal website (jan.leike.name), accessed 2024–2025.
4. TIME Magazine, "Jan Leike: The 100 Most Influential People in AI 2023," 2023.
5. TIME Magazine, "Jan Leike: The 100 Most Influential People in AI 2024," 2024.
6. OpenReview, Jan Leike profile.
7. Jan Leike, "Nonparametric General Reinforcement Learning," PhD thesis, Australian National University, November 2016.
8. Future of Humanity Institute, "Strategic Artificial Intelligence Research Centre New Hires," 2016.
9. Future of Life Institute, "AI Alignment Podcast: On DeepMind, AI Safety, and Recursive Reward Modeling with Jan Leike," circa 2019–2020.
10. Jan Leike, Miljan Martic, Victoria Krakovna, et al., "AI Safety Gridworlds," arXiv:1711.09883, November 2017.
11. Paul F. Christiano, Jan Leike, Tom B. Brown, et al., "Deep Reinforcement Learning from Human Preferences," Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pp. 4299–4307; arXiv:1706.03741.
12. Jan Leike, David Krueger, Tom Everitt, et al., "Scalable agent alignment via reward modeling: a research direction," arXiv:1811.07871, November 2018.
13. Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, et al., "Specification gaming: the flip side of AI ingenuity," DeepMind blog, 2020.
14. Jan Leike, X post announcing joining OpenAI, January 22, 2021.
15. Daniel Filan (host), "Episode 24 — Superalignment with Jan Leike," AXRP — the AI X-risk Research Podcast, July 27, 2023.
16. Anthropic Alignment Science team and Redwood Research, "Alignment faking in large language models," December 20, 2024.
17. Anthropic Alignment Science team, Alignment Science Blog, 2024–2025.
18. Jeff Wu, Long Ouyang, Daniel M. Ziegler, et al., "Recursively Summarizing Books with Human Feedback," arXiv:2109.10862, September 22, 2021.
19. Sam Altman, X post responding to Jan Leike's departure, May 17, 2024.
20. Axios, "OpenAI's recent departures force leaders to reaffirm safety commitment," May 20, 2024.
21. Fortune, "OpenAI promised 20% of its computing power to combat the most dangerous kind of AI — but never delivered," May 21, 2024.
22. CNBC, "OpenAI dissolves Superalignment AI safety team," May 17, 2024.
23. 80,000 Hours, "Episode #159 — Jan Leike on OpenAI's massive push to make superintelligence safe in 4 years or less," 2023.