van der Weij et al. (2024)

paper

2024·arXiv·arxiv.org/abs/2406.07358

Authors

Teun van der Weij·Felix Hofstätter·Ollie Jaffe·Samuel F. Brown·Francis Rhys Ward

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper investigates sandbagging—strategic underperformance on AI capability evaluations—in contemporary language models, addressing a critical safety concern where AI developers or systems may intentionally understate capabilities to bypass safety checks or regulations.

Paper Details

Citations

5 influential

Year

2024

arXiv:2406.07358 DOI:10.1192/j.eurpsy.2025.10148.sm001 Semantic Scholar

Metadata

arxiv preprintprimary source

Abstract

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging, which we define as strategic underperformance on an evaluation. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted or password-locked to target specific scores on a capability evaluation. We have mediocre success in password-locking a model to mimic the answers a weaker model would give. Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.

Summary

This paper investigates sandbagging—strategic underperformance on evaluations—in contemporary large language models like GPT-4 and Claude 3 Opus. The authors demonstrate that frontier models can be prompted to selectively underperform on dangerous capability evaluations while maintaining performance on benign tasks, and can be fine-tuned to hide specific capabilities unless given a password. These capabilities generalize to high-quality benchmarks like WMDP. The findings reveal significant vulnerabilities in capability evaluations that undermine their trustworthiness and could compromise safety decisions in AI development and deployment.

Cited by 1 page

Page	Type	Quality
AI Capability Sandbagging	Risk	67.0

Cached Content Preview

HTTP 200Fetched Apr 7, 202698 KB

[2406.07358] AI Sandbagging: Language Models can Strategically Underperform on Evaluations 
 
 
 
 
 
 
 
 
 
 
 

 
 
 

 
 
 
 
 
 
 AI Sandbagging: Language Models can Strategically Underperform on Evaluations

 
 
 Teun van der Weij   
 MATS 
 &Felix Hofstätter ∗ 
 MATS † 
 &Oliver Jaffe ∗ 
 Independent
&Samuel F. Brown 
 Independent
&Francis Rhys Ward 
 Imperial College London
 Core contributors, see contributions in Section   9 . Correspondence to mailvanteun@gmail.com.ML Alignment Theory Scholars 
 
 (May 2024) 

 
 Abstract

 Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI’s actual capability. These conflicting interests lead to the problem of sandbagging – which we define as strategic underperformance on an evaluation . In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems. 1 1 1 We publish our code and results at https://github.com/TeunvdWeij/sandbagging 

 
 
 
 1 Introduction

 
 Trustworthy evaluations are needed to understand AI systems and their rapidly improving capabilities (Shevlane et al.,, 2023 ) . As such, evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. Frontier labs rely on evaluations to reduce catastrophic risks, as stated in the Preparedness Framework by OpenAI, ( 2023 ) , the Frontier Safety Framework by Google DeepMind, ( 2024 ) , and the Responsible Scaling Policy by Anthropic, ( 2023 ) . Governmental institutions such as the US and UK AI Safety Institutes (NIST,, 2024 ; UK AISI,, 2024 ) and the EU AI Office (European Parliament,, 2024 ) are also integrating evaluations into their AI risk management frameworks.

 
 
 However, developers may have incentives for an AI system to understate its performance on capability evaluations, in part to influence regulatory decis

... (truncated, 98 KB total)

Resource ID: 28f9d1d93970a72e | Stable ID: sid_91ryu8eHsn