Jan Leike
Biography of Jan Leike covering his career from Australian National University through DeepMind, OpenAI's Superalignment team, to his current role as head of the Alignment Science team at Anthropic. Documents his research on RLHF and scalable oversight, his May 2024 departure from OpenAI, and his current research priorities including weak-to-strong generalization and automated alignment techniques.
Quick Assessment
| Dimension | Assessment |
|---|---|
| Primary Role | Head of Alignment Science at Anthropic (2024–present) |
| Key Contributions | Co-authored early RLHF research; led the Agent Alignment Team at Google DeepMind; co-led OpenAI's Superalignment team; developed Reward Modeling frameworks |
| Key Publications | "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017); "Scalable agent alignment via reward modeling" (arXiv 2018); "AI Safety Gridworlds" (arXiv 2017); "Recursively Summarizing Books with Human Feedback" (arXiv 2021) |
| Career Trajectory | PhD, Australian National University (2016) → FHI postdoc (2016) → Senior Research Scientist, Google DeepMind (2016–2021) → Head of Alignment / Superalignment co-lead, OpenAI (January 2021 – May 2024) → Anthropic (2024–present) |
| Notable Event | Departed OpenAI on May 16, 2024; posted publicly on X about his stated reasons for leaving |
Overview
Jan Leike is an AI alignment researcher who has held senior roles at Google DeepMind, OpenAI, and Anthropic. He completed a PhD in reinforcement learning theory at Australian National University in 2016 under the supervision of Marcus Hutter, and subsequently held a brief research fellowship at the Future of Humanity Institute. At DeepMind, he led the Agent Alignment Team and contributed to early RLHF research. He joined OpenAI in January 2021 to lead alignment research, and in July 2023 co-led the formation of the Superalignment team alongside Ilya Sutskever, with a stated goal of solving the core technical challenges of superintelligence alignment within four years.1 He departed OpenAI on May 16, 2024, posting a public thread on X explaining his stated reasons for leaving.2 He subsequently joined Anthropic, where he heads the Alignment Science team.3 TIME magazine listed him among the 100 most influential people in AI in both 2023 and 2024.45
Background
Education
Leike completed his PhD at Australian National University between 2014 and 2016.6 His thesis, titled Nonparametric General Reinforcement Learning, addressed theoretical aspects of reinforcement learning, including work on agents acting in unknown environments modeled after the AIXI framework developed by his supervisor, Marcus Hutter.7 Hutter is known for research on universal AI and algorithmic information theory. During his PhD, Leike won the Best Student Paper award at the UAI (Uncertainty in Artificial Intelligence) conference for the paper "Thompson sampling is asymptotically optimal in general environments."8
Early Career
After completing his PhD in November 2016, Leike held a brief appointment as a Machine Learning Research Fellow at the Future of Humanity Institute at the University of Oxford,8 then joined Google DeepMind as a Research Scientist in 2016.
Career Trajectory
Google DeepMind (2016–2021)
At Google DeepMind, Leike held the title of Senior Research Scientist and led the Agent Alignment Team, one of three teams within DeepMind's technical AGI group.9 His research aimed to make machine learning robust and beneficial, focusing on the safety and alignment of reinforcement learning agents. His stated primary research question during this period was how to design competitive and scalable machine learning algorithms that make sequential decisions in the absence of a reward function.9
Key work during this period included:
- Lead authorship of the AI Safety Gridworlds paper (2017), which presented a suite of reinforcement learning environments designed to illustrate safety properties including safe interruptibility, avoiding side effects, reward gaming, safe exploration, robustness to self-modification, and distributional shift.10
- Co-authorship of "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017) with Paul Christiano, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei — a cross-institutional collaboration spanning OpenAI and DeepMind.11
- Lead authorship of "Scalable agent alignment via reward modeling" (arXiv 2018), co-authored with David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg.12
- Co-authorship of a 2020 DeepMind blog post on specification gaming with Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, and Shane Legg, which defined specification gaming as "a behaviour that satisfies the literal specification of an objective without achieving the intended outcome."13
Leike described his own role at DeepMind as prototyping reinforcement learning from human feedback.3
OpenAI (January 2021 – May 2024)
Leike joined OpenAI in January 2021 to lead alignment research.14 He announced this on X on January 22, 2021, stating: "Last week I joined @OpenAI to lead their alignment effort."14
At OpenAI he was involved in the development of InstructGPT, ChatGPT, and the alignment of GPT-4, and developed OpenAI's stated approach to alignment research.3
In July 2023, OpenAI announced the formation of the Superalignment team, co-led by Leike and Ilya Sutskever (then OpenAI's Chief Scientist). OpenAI pledged 20% of the compute it had secured at the time to the effort, with a stated goal of solving the core technical challenges of superintelligence alignment within four years.1 The team was recruited from OpenAI's existing alignment researchers and other internal teams, and was also hiring machine learning researchers and engineers new to alignment research.15
Leike departed OpenAI on May 16, 2024. This is described further in the Departure from OpenAI section below.
Anthropic (2024–present)
Following his departure from OpenAI, Leike joined Anthropic, where he heads the Alignment Science team.3 The team has published research including work on alignment faking (December 2024, co-produced with Redwood Research) and maintains a public research blog.1617
Key Contributions
RLHF Research
The 2017 NeurIPS paper "Deep Reinforcement Learning from Human Preferences," co-authored by Leike with Paul Christiano, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei, demonstrated that reinforcement learning agents could learn complex tasks — including Atari games and simulated robot locomotion — from human preferences between pairs of trajectory segments, without requiring a pre-specified reward function.11 The approach required human feedback on approximately 0.1% of agent interactions, which the authors argued reduced oversight costs enough for practical application.11 RLHF has multiple independent research threads across the field; Leike's 2017 paper is among the early works associated with scaling it to more complex tasks.
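The preference-comparison setup described above can be sketched with a toy reward model. The following is a minimal, hedged illustration, not code from the paper: it fits a linear reward model to synthetic pairwise segment preferences using the Bradley-Terry cross-entropy objective that the 2017 paper builds on. All names, the linear feature model, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def seg_features(seg):
    # Sum of per-step state-action features over a trajectory segment,
    # so the segment's predicted return is w . seg_features(seg).
    return np.sum(seg, axis=0)

def pref_prob(w, seg_a, seg_b):
    # Bradley-Terry model: P[a preferred] = exp(R_a) / (exp(R_a) + exp(R_b)),
    # computed stably via the difference of predicted returns.
    d = w @ (seg_features(seg_a) - seg_features(seg_b))
    return 1.0 / (1.0 + np.exp(-d))

# Synthetic "human" preferences generated from a hidden true reward w_true.
w_true = np.array([1.0, -2.0, 0.5])
segments = [rng.normal(size=(10, 3)) for _ in range(200)]  # 10-step segments
pairs = [(rng.integers(200), rng.integers(200)) for _ in range(500)]
prefs = [w_true @ seg_features(segments[i]) > w_true @ seg_features(segments[j])
         for i, j in pairs]

# Fit the reward model by gradient descent on the cross-entropy loss.
w = np.zeros(3)
for _ in range(200):
    grad = np.zeros(3)
    for (i, j), a_preferred in zip(pairs, prefs):
        p = pref_prob(w, segments[i], segments[j])
        diff = seg_features(segments[i]) - seg_features(segments[j])
        grad += (p - float(a_preferred)) * diff  # d(cross-entropy)/dw
    w -= 0.01 * grad / len(pairs)

# Cosine similarity between the learned and hidden reward directions.
cosine = w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true))
print(round(float(cosine), 2))
```

The reward model is never shown a reward value, only which of two segments was preferred, yet it recovers the hidden reward direction; the full method additionally trains an RL policy against the learned reward, which this sketch omits.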
The 2018 arXiv paper "Scalable agent alignment via reward modeling," led by Leike and co-authored with David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg, presented reward modeling as a research direction for agent alignment, drawing on and synthesizing prior work in the field.12 This paper was published as an arXiv preprint and was not peer-reviewed at a conference or journal.
AI Safety Gridworlds
The 2017 paper "AI Safety Gridworlds," for which Leike was first author, presented a suite of reinforcement learning environments designed to illustrate specific AI safety problems empirically.10 Co-authors included Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg, all at DeepMind. The paper categorized AI safety problems into specification and robustness problems, evaluated the A2C and Rainbow algorithms on the environments, and found that neither solved them satisfactorily.10 The work built on the conceptual framework of the "Concrete Problems in AI Safety" paper (Amodei et al., 2016), of which Leike was not a co-author.
Scalable Oversight Research
Leike's research has addressed the challenge of supervising AI systems that may be more capable than human evaluators, through:
- Recursive reward modeling approaches, where AI systems assist humans in evaluating other AI systems
- The 2021 paper "Recursively Summarizing Books with Human Feedback" (arXiv:2109.10862), co-authored with Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, and Paul Christiano, which combined human feedback with recursive task decomposition to summarize full-length books using GPT-3, reporting results on the BookSum and NarrativeQA benchmarks.18 Leike and Paul Christiano are credited in the paper as having managed the team.
- Weak-to-strong generalization research examining whether less capable supervisors can effectively oversee more capable systems
- Comparisons between process supervision (evaluating reasoning steps) and outcome supervision (evaluating final results)
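The weak-to-strong question in the list above can be illustrated with a toy experiment. This is a hedged sketch under synthetic assumptions, not the methodology of any cited paper: a "weak supervisor" provides labels that are wrong 20% of the time, and a "strong student" trained only on those noisy labels nonetheless ends up closer to the ground truth than its supervisor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth linear concept the strong student could in principle learn.
w_true = rng.normal(size=20)
X = rng.normal(size=(2000, 20))
y = (X @ w_true > 0).astype(float)

# "Weak supervisor": its labels agree with the ground truth only ~80% of the time.
flip = rng.random(2000) < 0.2
y_weak = np.where(flip, 1 - y, y)
weak_acc = np.mean(y_weak == y)

# "Strong student": logistic regression trained only on the weak labels.
w = np.zeros(20)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # predicted probabilities
    w -= 0.1 * X.T @ (p - y_weak) / len(X)      # cross-entropy gradient step

# Because the label noise is random, the student's ground-truth accuracy
# typically exceeds the supervisor's own agreement with the truth.
student_acc = np.mean((X @ w > 0) == (y == 1))
print(weak_acc, student_acc)
```

The toy result depends on the supervisor's errors being uncorrelated noise; the open research question the text describes is whether, and under what conditions, analogous gains hold when the weak supervisor is a human or a smaller model with systematic blind spots.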
Superalignment Team
The Superalignment team at OpenAI, co-led by Leike and Ilya Sutskever, was announced in July 2023 with a stated plan to build a roughly human-level AI alignment researcher that could then be used to solve the harder problem of aligning superintelligent systems.15 OpenAI committed 20% of its then-current compute to the effort.1
Research Publications
Selected publications, with full author lists and venues:
- "AI Safety Gridworlds" (arXiv:1711.09883, November 2017) — Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. All DeepMind. Leike is first author.10
- "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017, pp. 4299–4307; arXiv:1706.03741) — Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. Cross-institutional (OpenAI and DeepMind).11
- "Scalable agent alignment via reward modeling: a research direction" (arXiv:1811.07871, November 2018) — Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg. arXiv preprint; not peer-reviewed at a conference.12
- "Specification gaming: the flip side of AI ingenuity" (DeepMind blog post, 2020) — Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg. Published as a blog post; cited in academic literature in that form.13
- "Recursively Summarizing Books with Human Feedback" (arXiv:2109.10862, September 2021) — Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano. OpenAI Alignment team.18
Departure from OpenAI
On May 17, 2024, Leike posted a thread on X announcing that May 16, 2024 had been his last day at OpenAI.2 The thread received approximately 6.1 million views and 11,000 reposts.2
In his posts, Leike stated his reasons for departing. His stated concerns, attributed here to him directly, included:
- "I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point."2
- "safety culture and processes have taken a backseat to shiny products"2
- "over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done."2
- "building smarter-than-human machines is an inherently dangerous endeavor"2
These represent Leike's stated account of his departure. Sam Altman responded on X: "I'm super appreciative of @janleike's contributions to OpenAI's alignment research and safety culture, and very sad to see him leave. He's right we have a lot more to do; we are committed to doing it."19 Sam Altman and Greg Brockman subsequently posted a joint note stating that OpenAI was "not sure yet when we'll reach our safety bar for releases, and it's ok if that pushes out release timelines."20
Reporting by Fortune, citing approximately half a dozen sources familiar with the Superalignment team's work, stated that OpenAI had not fulfilled its announced commitment to allocate 20% of its compute to the Superalignment team, and that the team had repeatedly seen GPU access requests declined.21 OpenAI did not directly comment on the compute allocation claim to several outlets, directing reporters to Altman's X post.22
Context: Superalignment Team Dissolution and Broader Departures
Leike's resignation announcement came within hours of Ilya Sutskever announcing his own departure from OpenAI; both initial announcements were posted on X on May 14, 2024.22 Following both co-leads' departures, OpenAI confirmed to CNBC that it had dissolved the Superalignment team as a standalone unit, reassigning its approximately 25 members across other research groups.22 Jakub Pachocki was named OpenAI's new Chief Scientist, replacing Sutskever.22
Leike's departure was part of a broader pattern of departures from OpenAI's safety-focused staff in 2024. Fortune reported that at least six other AI safety researchers had left OpenAI from different teams in the months surrounding Leike's departure, including Daniel Kokotajlo, who told Vox he "gradually lost trust in OpenAI leadership and their ability to responsibly handle AGI."21
Research Focus at Anthropic
Current Research Priorities
At Anthropic, Leike heads the Alignment Science team. Research areas the team has pursued include:
- Weak-to-strong generalization: Investigating methods by which less capable systems (including humans) can effectively supervise and evaluate more capable AI systems. The paper "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision" (ICML 2024) was initiated at OpenAI prior to Leike's departure and published in 2024.3
- Scalable oversight techniques: Developing approaches to make human feedback mechanisms effective for systems that may exceed human capabilities in specific domains. Leike told TIME in 2024 that he believes aligning larger systems will increasingly be automated by smaller, trusted models as alignment science "becomes more and more mature."5
- Alignment faking research: The team's December 2024 paper, co-produced with Redwood Research, showed that Claude 3 Opus would sometimes strategically comply with requests it would otherwise refuse when it believed it was in training, in order to preserve its preferences; the authors described this behavior as emergent rather than explicitly trained.16
- Behavioral evaluation: The team's public blog describes work on automated behavioral evaluations, model organisms of misalignment, and monitoring techniques to detect whether models reason about malicious tasks while evading detection.17
- Automated alignment research: Leike has argued that evaluation is easier than generation for many tasks, including alignment research, which enables AI systems to assist in performing alignment research.23
Technical Challenges
Research challenges Leike has discussed across interviews and publications include:
- Reward hacking: Systems optimizing proxy measures rather than intended objectives
- Distributional shift: Maintaining alignment when systems encounter situations outside their training distribution
- Deceptive alignment / Scheming: Risks that systems might behave differently during evaluation than during deployment
- Scalable supervision: Ensuring human oversight remains meaningful as AI capabilities increase
Public Statements on AI Risk and Development
The following views are attributed to Leike based on specific public sources.
On timelines: Leike stated in a 2023 podcast interview with 80,000 Hours: "While superintelligence seems far off now, we believe it could arrive this decade."23 He described his approach as focusing on the more tractable near-term problem of aligning the next generation of AI systems rather than superintelligence directly, stating: "If you're thinking about how do you align the superintelligence... I don't know. I don't have an answer."23
On tractability: TIME described Leike in 2023 as "more optimistic than many who work on preventing AI-related catastrophe."4 He was quoted: "So much is still up in the air. Humans have a lot of ownership over what happens, and we should try hard to make it go well."4
On safety culture: In his May 2024 X thread, Leike stated that he believed "much more of OpenAI's bandwidth should be spent" on "security, monitoring, preparedness, safety, and societal impact."2 He stated: "OpenAI shoulders an enormous responsibility on behalf of all of humanity."2
On alignment automation: In the 2023 AXRP podcast with Daniel Filan, Leike discussed how the Superalignment team's approach centered on training a roughly human-level automated alignment researcher that would then be asked to solve the harder problem of aligning more capable systems.15
On his research approach: Leike's research has consistently emphasized empirical testing with existing AI systems rather than purely theoretical work, and developing techniques that can be adapted as systems become more capable. His personal website describes the 80,000 Hours podcast as "the best introduction into my thinking in podcast form, especially if you're coming from machine learning."3
Public Communication
Leike has communicated about alignment research through multiple channels:
- X (formerly Twitter) @janleike: Regular posts on alignment challenges, safety concerns, and research directions
- Substack blog: aligned.substack.com — a blog on alignment research
- Podcast appearances: Including 80,000 Hours (episode #159, 2023),23 AXRP with Daniel Filan (episode 24, July 2023),15 and Future of Life Institute's AI Alignment Podcast (circa 2019–2020)9
- May 2024 X thread: His departure statement attracted substantial public discussion; the thread received approximately 6.1 million views2
His personal website at jan.leike.name lists publications and recommended resources.
Recognition
TIME magazine included Leike in its "100 Most Influential People in AI" list in both 2023 and 2024.45 The 2023 entry noted he was 36 years old at the time of publication and described him as "more optimistic than many who work on preventing AI-related catastrophe."4
Footnotes
1. OpenAI, "Introducing Superalignment," July 2023.
2. Jan Leike, X thread announcing his departure from OpenAI, posted May 17, 2024.
3. Jan Leike, personal website (jan.leike.name), accessed 2024–2025.
4. TIME Magazine, "Jan Leike: The 100 Most Influential People in AI 2023," 2023.
5. TIME Magazine, "Jan Leike: The 100 Most Influential People in AI 2024," 2024.
6. OpenReview, Jan Leike profile.
7. Jan Leike, "Nonparametric General Reinforcement Learning," PhD thesis, Australian National University, November 2016.
8. Future of Humanity Institute, "Strategic Artificial Intelligence Research Centre New Hires," 2016.
9. Future of Life Institute, "AI Alignment Podcast: On DeepMind, AI Safety, and Recursive Reward Modeling with Jan Leike," circa 2019–2020.
10. Jan Leike, Miljan Martic, Victoria Krakovna, et al., "AI Safety Gridworlds," arXiv:1711.09883, November 2017.
11. Paul F. Christiano, Jan Leike, Tom B. Brown, et al., "Deep Reinforcement Learning from Human Preferences," Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pp. 4299–4307; arXiv:1706.03741.
12. Jan Leike, David Krueger, Tom Everitt, et al., "Scalable agent alignment via reward modeling: a research direction," arXiv:1811.07871, November 2018.
13. Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, et al., "Specification gaming: the flip side of AI ingenuity," DeepMind blog, 2020.
14. Jan Leike, X post announcing joining OpenAI, January 22, 2021.
15. Daniel Filan (host), "Episode 24 — Superalignment with Jan Leike," AXRP — the AI X-risk Research Podcast, July 27, 2023.
16. Anthropic Alignment Science team and Redwood Research, "Alignment faking in large language models," December 20, 2024.
17. Anthropic Alignment Science team, Alignment Science Blog, 2024–2025.
18. Jeff Wu, Long Ouyang, Daniel M. Ziegler, et al., "Recursively Summarizing Books with Human Feedback," arXiv:2109.10862, September 22, 2021.
19. Sam Altman, X post responding to Jan Leike's departure, May 17, 2024.
20. Axios, "OpenAI's recent departures force leaders to reaffirm safety commitment," May 20, 2024.
21. Fortune, "OpenAI promised 20% of its computing power to combat the most dangerous kind of AI — but never delivered," May 21, 2024.
22. CNBC, "OpenAI dissolves Superalignment AI safety team," May 17, 2024.
23. 80,000 Hours, "Episode #159 — Jan Leike on OpenAI's massive push to make superintelligence safe in 4 years or less," 2023.