
Tice et al. 2024

paper

Authors

Cameron Tice·Philipp Alexander Kreer·Nathan Helm-Burger·Prithviraj Singh Shahani·Fedor Ryzhenkov·Fabien Roger·Clement Neo·Jacob Haimes·Felix Hofstätter·Teun van der Weij

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper addresses sandbagging—where AI models deliberately underperform during evaluations—and proposes detection methods, which is critical for ensuring reliable capability assessments of frontier AI systems.

Paper Details

Citations: 9 (0 influential)
Year: 2024

Metadata

arXiv preprint · primary source

Abstract

Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or "sandbagging", where models deliberately underperform during evaluation. Sandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model architectures, sizes, and sandbagging techniques, we establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, we find noise-injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so. Our findings provide a practical tool for AI evaluation and oversight, addressing a challenge in ensuring accurate capability assessment of frontier AI systems.

Summary

This paper addresses the problem of sandbagging—where AI models deliberately or unintentionally underperform during capability evaluations—by proposing a novel detection method based on weight noise injection. The authors demonstrate that sandbagging models exhibit anomalous performance improvements when subjected to increasing levels of noise, in contrast to non-sandbagging models which show predictable degradation. This distinctive response pattern serves as a model-agnostic signal for detecting sandbagging across various architectures and sizes. The method also shows promise in eliciting full performance from models that underperform without explicit instruction, providing a practical tool for AI evaluation and oversight.
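The core mechanic is simple enough to sketch. Below is a minimal, hypothetical Python illustration of a noise-injection sweep (not the authors' code): Gaussian noise of increasing magnitude is added to a copy of the model's weights, and benchmark accuracy is recorded at each level. The `evaluate_accuracy` scorer is an assumed placeholder for any multiple-choice benchmark harness.

```python
import copy
import torch

def noise_sweep(model, evaluate_accuracy,
                magnitudes=(0.0, 1e-4, 3e-4, 1e-3, 3e-3), seed=0):
    """For each magnitude, perturb a copy of the model's weights with
    i.i.d. Gaussian noise and record benchmark accuracy."""
    results = {}
    for sigma in magnitudes:
        torch.manual_seed(seed)  # reuse the same noise direction across magnitudes
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for param in noisy.parameters():
                param.add_(torch.randn_like(param) * sigma)
        results[sigma] = evaluate_accuracy(noisy)  # assumed scorer, returns accuracy in [0, 1]
    return results
```

Under the paper's reported pattern, a non-sandbagging model's accuracy should degrade roughly monotonically as sigma grows, while a sandbagging model's accuracy can rise above its noiseless baseline at intermediate magnitudes.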

Cited by 1 page

Page                       Type  Quality
AI Capability Sandbagging  Risk  67.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 48 KB
[2412.01784] Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

 
 
Cameron Tice · Philipp Alexander Kreer · Nathan Helm-Burger · Prithviraj Singh Shahani · Fedor Ryzhenkov · Jacob Haimes · Felix Hofstätter · Teun van der Weij

 
 Abstract

Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance or “sandbagging.” We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that introducing Gaussian noise into the weights of models either prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise-injected sandbagging models show performance improvements compared to standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be immediately implemented by frontier labs or regulatory bodies with access to weights to improve the trustworthiness of capability evaluations.
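Reading the abstract's classifier as a decision rule suggests one minimal form (an assumption on our part; the paper's actual classifier may differ): flag a model as sandbagging if any noised variant outperforms its noiseless baseline by more than some margin. Both the rule and the 0.05 threshold below are illustrative.

```python
def flags_sandbagging(results, margin=0.05):
    """Given `results` mapping noise magnitude -> accuracy, with key 0.0 as the
    noiseless baseline, flag sandbagging if any noised run beats the baseline
    by more than `margin` (threshold chosen for illustration)."""
    baseline = results[0.0]
    best_noised = max(acc for sigma, acc in results.items() if sigma > 0.0)
    return best_noised > baseline + margin
```

Feeding this the output of a sweep like the one sketched in the Summary above would give a simple unsupervised detector of the kind the abstract describes.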

 
Keywords: AI Safety, AI Governance, Sandbagging, Capability Evaluations, Noise Injection

1 Introduction

 
Capability evaluations are central to ensuring the safe deployment and continued development of frontier AI systems. More precisely, capability evaluations are used by AI companies (DeepMind, 2024; OpenAI, 2023; Anthropic, 2023) and governments (NIST, 2024; UK AISI, 2024; European Parliament, 2024) to assess the risk level of models. These risk assessments guide the formulation of policies concerning the development, deployment, and monitoring of AI systems (Shenk, 2024).

 
 
Despite their importance, capability evaluations face a fundamental challenge: current incentive structures (UK Department for Science, 2024) may encourage suppressing capabilities to avoid deployment-restricting regulation, a behavior known as sandbagging. We define sandbagging as strategic underperformance during evaluation (van der Weij et al., 2024; Benton et al., 2024). Here, strategic implies that the underperformance was not accidental but was initiated by the developers of the AI system or by the AI system itself, and underperformance implies that the capabilities exhibited during evaluation are meaningfully beneath the actual capabilities.

 
 
This work may be particularly beneficial for detecting sandbagging behavior that results from pre-deployment training (pre-training, post-training reinforcement learning, or fine-tuning). Although evidence for this behavior remains unclear (Carlsmith, 2023), there are conceivable paths to its attainment (Hubinger

... (truncated, 48 KB total)
Resource ID: a0a238c7e56a938f | Stable ID: sid_mjqpu42A66