Longterm Wiki

Victoria Krakovna – Research Publications

web
vkrakovna.wordpress.com/research/

Victoria Krakovna is a senior DeepMind safety researcher; this page indexes her body of work and is a useful entry point for understanding DeepMind's technical safety research agenda across side effects, power-seeking, and frontier model evaluation.

Metadata

Importance: 72/100 · homepage

Summary

This is the research publications page of Victoria Krakovna (DeepMind safety researcher), listing her papers spanning side effects avoidance, power-seeking, goal misgeneralization, tampering incentives, dangerous capabilities evaluation, and stealth/situational awareness in frontier models. The page serves as a comprehensive index of her work, from her 2016 interpretability research through her 2025 frontier-model safety evaluations.

Key Points

  • Covers foundational work on side effects penalties and relative reachability, including the influential AI Safety Gridworlds benchmark (2017); a minimal sketch of the relative reachability idea follows this list.
  • Includes research on power-seeking tendencies in trained agents and quantifying stability of non-power-seeking behavior.
  • Features papers on tampering incentives (REALab, Decoupled Approval) and goal misgeneralization in RL agents.
  • Recent work (2024-2025) focuses on evaluating frontier models for dangerous capabilities, stealth, and situational/scheming awareness.
  • Several papers are products of MATS (ML Alignment Theory Scholars) mentorship projects, reflecting Krakovna's role in the safety research community.
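The relative reachability line of work noted above penalizes an agent for making states unreachable relative to an inaction baseline, which distinguishes irreversible side effects (breaking a vase) from reversible ones (walking around it). Below is a minimal Python sketch of that idea on a toy deterministic environment; the environment, state names, and helper functions are hypothetical illustrations for this page, not code from the listed papers.

```python
# Minimal sketch of a relative reachability side-effect penalty,
# in the spirit of "Penalizing Side Effects Using Stepwise Relative
# Reachability". The toy MDP and names below are hypothetical.

from collections import deque


def reachable_states(start, transitions):
    """Set of states reachable from `start` under any action sequence."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        for nxt in transitions.get(s, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def relative_reachability_penalty(current, baseline, transitions, all_states):
    """Fraction of states reachable from the baseline state but no longer
    reachable from the current state (binary, undiscounted reachability)."""
    reach_current = reachable_states(current, transitions)
    reach_baseline = reachable_states(baseline, transitions)
    lost = [s for s in all_states if s in reach_baseline and s not in reach_current]
    return len(lost) / len(all_states)


if __name__ == "__main__":
    # Hypothetical 3-state world: breaking the vase is irreversible,
    # stepping aside is not.
    transitions = {
        "vase_on_table": {"break": "vase_broken", "step_aside": "agent_aside"},
        "agent_aside": {"step_back": "vase_on_table"},
        "vase_broken": {},  # no action restores the vase
    }
    states = ["vase_on_table", "agent_aside", "vase_broken"]

    # Stepwise inaction baseline: the state that results from doing nothing.
    baseline = "vase_on_table"

    for action, nxt in transitions["vase_on_table"].items():
        penalty = relative_reachability_penalty(nxt, baseline, transitions, states)
        print(f"action={action!r}: relative reachability penalty = {penalty:.2f}")
```

Running this prints a positive penalty for the irreversible "break" action and zero for the reversible "step_aside" action; in the papers, a scaled version of such a penalty is subtracted from the task reward.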

Cited by 1 page

Page | Type | Quality
Corrigibility Failure | Risk | 62.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 6 KB
Research | Victoria Krakovna

Papers

Evaluating Frontier Models for Stealth and Situational Awareness. Mary Phuong, Roland Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah. arXiv, 2025. (blog post)

Evaluating Frontier Models for Dangerous Capabilities. Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Toby Shevlane, et al. arXiv, 2024.

Limitations of Agents Simulated by Predictive Models. Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna (MATS project). arXiv, 2024.

Quantifying stability of non-power-seeking in artificial agents. Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna (MATS project). arXiv, 2024.

Power-seeking can be probable and predictive for trained agents. Victoria Krakovna and Janos Kramar. arXiv, 2023. (blog post)

Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals. Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton. arXiv, 2022.

Avoiding Side Effects By Considering Future Tasks. Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg. Neural Information Processing Systems, 2020. (arXiv, code, AN summary)

Avoiding Tampering Incentives in Deep RL via Decoupled Approval. Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. arXiv, 2020. (blog post, AN summary)

REALab: An Embedded Perspective on Tampering. Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg. arXiv, 2020. (blog post)

Modeling AGI Safety Frameworks with Causal Influence Diagrams. Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg. IJCAI AI Safety workshop, 2019. (AN summary)

Penalizing Side Effects Using Stepwise Relative Reachability. Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg. IJCAI AI Safety workshop, 2019 (version 2), 2018 (version 1). (arXiv, version 2 blog post, version 1 blog post, code, AN summary of version 1)

AI Safety Gridworlds. Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. arXiv, 2017. (arXiv, blog post, code)

Reinforcement Learning with a Corrupted Reward Channel. Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg. IJCAI AI and Autonomy track, 2017. (arXiv, demo, code)

Building Interpretable Models: From Bayesian Networks to Neural Networks. Victoria Krakovna (PhD thesis), 2016.

Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models. Victoria Krakovna and Finale Doshi-Velez. ICML Workshop on Human Interpretability 2016 (arXiv), NeurIPS Works

... (truncated, 6 KB total)
Resource ID: 45af23d90ccfc785 | Stable ID: sid_2rouDlRG3a