Victoria Krakovna – Research Publications
vkrakovna.wordpress.com/research/
Victoria Krakovna is a senior DeepMind safety researcher; this page indexes her body of work and is a useful entry point for understanding DeepMind's technical safety research agenda across side effects, power-seeking, and frontier model evaluation.
Metadata
Importance: 72/100 · homepage
Summary
This is the research publications page of Victoria Krakovna (DeepMind safety researcher), listing her papers spanning side effects avoidance, power-seeking, goal misgeneralization, tampering incentives, dangerous capabilities evaluation, and stealth/situational awareness in frontier models. The page serves as a comprehensive index of her contributions to technical AI safety research from 2017 to 2025.
Key Points
- Covers foundational work on side effects penalties and relative reachability, including the influential AI Safety Gridworlds benchmark (2017).
- Includes research on power-seeking tendencies in trained agents and quantifying the stability of non-power-seeking behavior.
- Features papers on tampering incentives (REALab, Decoupled Approval) and goal misgeneralization in RL agents.
- Recent work (2024-2025) focuses on evaluating frontier models for dangerous capabilities, stealth, and situational/scheming awareness.
- Several papers are products of MATS (ML Alignment Theory Scholars) mentorship projects, reflecting Krakovna's role in the safety research community.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Corrigibility Failure | Risk | 62.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 6 KB
Research | Victoria Krakovna
Papers
Evaluating Frontier Models for Stealth and Situational Awareness. Mary Phuong, Roland Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah. arXiv, 2025. (blog post)
Evaluating Frontier Models for Dangerous Capabilities. Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Toby Shevlane, et al. arXiv, 2024.
Limitations of Agents Simulated by Predictive Models. Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna (MATS project). arXiv, 2024.
Quantifying stability of non-power-seeking in artificial agents. Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna (MATS project). arXiv, 2024.
Power-seeking can be probable and predictive for trained agents. Victoria Krakovna and Janos Kramar. arXiv, 2023. (blog post)
Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals. Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton. arXiv, 2022.
Avoiding Side Effects By Considering Future Tasks. Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg. Neural Information Processing Systems, 2020. (arXiv, code, AN summary)
Avoiding Tampering Incentives in Deep RL via Decoupled Approval. Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. arXiv, 2020. (blog post, AN summary)
REALab: An Embedded Perspective on Tampering. Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg. arXiv, 2020. (blog post)
Modeling AGI Safety Frameworks with Causal Influence Diagrams. Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg. IJCAI AI Safety workshop, 2019. (AN summary)
Penalizing Side Effects Using Stepwise Relative Reachability. Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg. IJCAI AI Safety workshop, 2019 (version 2), 2018 (version 1). (arXiv, version 2 blog post, version 1 blog post, code, AN summary of version 1)
AI Safety Gridworlds. Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. arXiv, 2017. (arXiv, blog post, code)
Reinforcement Learning with a Corrupted Reward Channel. Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg. IJCAI AI and Autonomy track, 2017. (arXiv, demo, code)
Building Interpretable Models: From Bayesian Networks to Neural Networks. Victoria Krakovna (PhD thesis), 2016.
Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models. Victoria Krakovna and Finale Doshi-Velez. ICML Workshop on Human Interpretability, 2016 (arXiv), NeurIPS Works
... (truncated, 6 KB total)
Resource ID: 45af23d90ccfc785 | Stable ID: sid_2rouDlRG3a