Palisade Research, 2025
Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Abstract
We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find that reasoning models like OpenAI o3 and DeepSeek R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work before they hack. We improve upon prior work (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI's (2024) o1 Docker escape during cyber capabilities testing.
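The experiment pits an LLM agent against a chess engine, and "hacking" means the agent exploits its environment instead of winning through legal moves. Below is a minimal sketch of such a setup, assuming a local Stockfish binary and the python-chess library; the `ask_llm` helper is hypothetical and stands in for the agent loop, and the paper's actual harness gives models shell access rather than a Python API.

```python
import chess
import chess.engine

def ask_llm(fen: str) -> str:
    """Hypothetical helper: query an LLM agent for a move in UCI
    notation (e.g. "e2e4") given the current position as a FEN
    string. Illustrative stand-in only; not the paper's harness."""
    raise NotImplementedError

board = chess.Board()
# Assumes a Stockfish binary is available on PATH.
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    while not board.is_game_over():
        # The LLM agent plays one side via the harness...
        board.push_uci(ask_llm(board.fen()))
        if board.is_game_over():
            break
        # ...and the chess engine answers for the other.
        result = engine.play(board, chess.engine.Limit(time=0.1))
        board.push(result.move)
finally:
    engine.quit()
print(board.result())
```

In a loop like this, specification gaming would mean the agent tampering with the game state (or the harness itself) rather than submitting the legal moves the interface expects.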
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
| Goal Misgeneralization | Risk | 63.0 |
Cached Content Preview
[2502.13295] Demonstrating specification gaming in reasoning models
[Submitted on 18 Feb 2025 (v1), last revised 27 Aug 2025 (this version, v3)]
Title: Demonstrating specification gaming in reasoning models
Authors: Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish
Comments: Updated with o3 results, fixed fonts
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2502.13295 [cs.AI] (or arXiv:2502.13295v3 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2502.13295 (arXiv-issued DOI via DataCite)
Submission history
From: Alexander Bondarenko
[v1] Tue, 18 Feb 2025 21:32:24 UTC (548 KB)
[v2] Thu, 15 May 2025 13:42:18 UTC (383 KB)
[v3] Wed, 27 Aug 2025 11:15:11 UTC (403 KB)
... (truncated, 5 KB total)