Longterm Wiki

Victoria Krakovna – Research Publications

web
vkrakovna.wordpress.com/research/

Victoria Krakovna is a senior DeepMind safety researcher; this page indexes her body of work and is a useful entry point for understanding DeepMind's technical safety research agenda across side effects, power-seeking, and frontier model evaluation.

Metadata

Importance: 72/100 · homepage

Summary

This is the research publications page of Victoria Krakovna (DeepMind safety researcher), listing her papers spanning side effects avoidance, power-seeking, goal misgeneralization, tampering incentives, dangerous capabilities evaluation, and stealth/situational awareness in frontier models. The page serves as a comprehensive index of her work, from her 2016 interpretability research through her 2025 frontier-model safety evaluations.

Key Points

  • Covers foundational work on side effects penalties and relative reachability, including the influential AI Safety Gridworlds benchmark (2017); a minimal sketch of the relative reachability idea follows this list.
  • Includes research on power-seeking tendencies in trained agents and quantifying stability of non-power-seeking behavior.
  • Features papers on tampering incentives (REALab, Decoupled Approval) and goal misgeneralization in RL agents.
  • Recent work (2024-2025) focuses on evaluating frontier models for dangerous capabilities, stealth, and situational/scheming awareness.
  • Several papers are products of MATS (ML Alignment Theory Scholars) mentorship projects, reflecting Krakovna's role in the safety research community.
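The relative reachability line of work noted above penalizes an agent for making states unreachable relative to an inaction baseline, which distinguishes irreversible side effects (breaking a vase) from reversible ones (walking around it). Below is a minimal Python sketch of that idea on a toy deterministic environment; the environment, state names, and helper functions are hypothetical illustrations for this page, not code from the listed papers.

```python
# Minimal sketch of a relative reachability side-effect penalty,
# in the spirit of "Penalizing Side Effects Using Stepwise Relative
# Reachability". The toy MDP and names below are hypothetical.

from collections import deque


def reachable_states(start, transitions):
    """Set of states reachable from `start` under any action sequence."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        for nxt in transitions.get(s, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def relative_reachability_penalty(current, baseline, transitions, all_states):
    """Fraction of states reachable from the baseline state but no longer
    reachable from the current state (binary, undiscounted reachability)."""
    reach_current = reachable_states(current, transitions)
    reach_baseline = reachable_states(baseline, transitions)
    lost = [s for s in all_states if s in reach_baseline and s not in reach_current]
    return len(lost) / len(all_states)


if __name__ == "__main__":
    # Hypothetical 3-state world: breaking the vase is irreversible,
    # stepping aside is not.
    transitions = {
        "vase_on_table": {"break": "vase_broken", "step_aside": "agent_aside"},
        "agent_aside": {"step_back": "vase_on_table"},
        "vase_broken": {},  # no action restores the vase
    }
    states = ["vase_on_table", "agent_aside", "vase_broken"]

    # Stepwise inaction baseline: the state that results from doing nothing.
    baseline = "vase_on_table"

    for action, nxt in transitions["vase_on_table"].items():
        penalty = relative_reachability_penalty(nxt, baseline, transitions, states)
        print(f"action={action!r}: relative reachability penalty = {penalty:.2f}")
```

Running this prints a positive penalty for the irreversible "break" action and zero for the reversible "step_aside" action; in the papers, a scaled version of such a penalty is subtracted from the task reward.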

Cited by 1 page

Page | Type | Quality
Corrigibility Failure | Risk | 62.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 6 KB
Research | Victoria Krakovna

Papers

Evaluating Frontier Models for Stealth and Situational Awareness. Mary Phuong, Roland Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah. arXiv, 2025. (blog post)

Evaluating Frontier Models for Dangerous Capabilities. Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Toby Shevlane, et al. arXiv, 2024.

Limitations of Agents Simulated by Predictive Models. Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna (MATS project). arXiv, 2024.

Quantifying stability of non-power-seeking in artificial agents. Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna (MATS project). arXiv, 2024.

Power-seeking can be probable and predictive for trained agents. Victoria Krakovna and Janos Kramar. arXiv, 2023. (blog post)

Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals. Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton. arXiv, 2022.

Avoiding Side Effects By Considering Future Tasks. Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg. Neural Information Processing Systems, 2020. (arXiv, code, AN summary)

Avoiding Tampering Incentives in Deep RL via Decoupled Approval. Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. arXiv, 2020. (blog post, AN summary)

REALab: An Embedded Perspective on Tampering. Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg. arXiv, 2020. (blog post)

Modeling AGI Safety Frameworks with Causal Influence Diagrams. Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg. IJCAI AI Safety workshop, 2019. (AN summary)

Penalizing Side Effects Using Stepwise Relative Reachability. Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg. IJCAI AI Safety workshop, 2019 (version 2), 2018 (version 1). (arXiv, version 2 blog post, version 1 blog post, code, AN summary of version 1)

AI Safety Gridworlds. Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. arXiv, 2017. (arXiv, blog post, code)

Reinforcement Learning with a Corrupted Reward Channel. Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg. IJCAI AI and Autonomy track, 2017. (arXiv, demo, code)

Building Interpretable Models: From Bayesian Networks to Neural Networks. Victoria Krakovna (PhD thesis), 2016.

Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models. Victoria Krakovna and Finale Doshi-Velez. ICML Workshop on Human Interpretability 2016 (arXiv), NeurIPS Works

... (truncated, 6 KB total)
Resource ID: 45af23d90ccfc785 | Stable ID: sid_2rouDlRG3a