Palisade Research, 2025
Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find that reasoning models like OpenAI o3 and DeepSeek R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work before they hack. We improve upon prior work (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI's (2024) o1 Docker escape during cyber capabilities testing.
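The experiment pits an LLM agent against a chess engine, and "hacking" means the agent exploits its environment instead of winning through legal moves. Below is a minimal sketch of such a setup, assuming a local Stockfish binary and the python-chess library; the `ask_llm` helper is hypothetical and stands in for the agent loop, and the paper's actual harness gives models shell access rather than a Python API.

```python
import chess
import chess.engine

def ask_llm(fen: str) -> str:
    """Hypothetical helper: query an LLM agent for a move in UCI
    notation (e.g. "e2e4") given the current position as a FEN
    string. Illustrative stand-in only; not the paper's harness."""
    raise NotImplementedError

board = chess.Board()
# Assumes a Stockfish binary is available on PATH.
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    while not board.is_game_over():
        # The LLM agent plays one side via the harness...
        board.push_uci(ask_llm(board.fen()))
        if board.is_game_over():
            break
        # ...and the chess engine answers for the other.
        result = engine.play(board, chess.engine.Limit(time=0.1))
        board.push(result.move)
finally:
    engine.quit()
print(board.result())
```

In a loop like this, specification gaming would mean the agent tampering with the game state (or the harness itself) rather than submitting the legal moves the interface expects.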
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
| Goal Misgeneralization | Risk | 63.0 |
Cached Content Preview
[2502.13295] Demonstrating specification gaming in reasoning models
[Submitted on 18 Feb 2025 (v1), last revised 27 Aug 2025 (this version, v3)]
Title: Demonstrating specification gaming in reasoning models
Authors: Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish
Comments: Updated with o3 results, fixed fonts
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2502.13295 [cs.AI] (or arXiv:2502.13295v3 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2502.13295 (arXiv-issued DOI via DataCite)
Submission history
From: Alexander Bondarenko
[v1] Tue, 18 Feb 2025 21:32:24 UTC (548 KB)
[v2] Thu, 15 May 2025 13:42:18 UTC (383 KB)
[v3] Wed, 27 Aug 2025 11:15:11 UTC (403 KB)
... (truncated, 5 KB total)