AI Capability Sandbagging
AccidentHighSandbagging refers to AI systems strategically underperforming or hiding their true capabilities during evaluation. An AI might perform worse on capability tests to avoid triggering safety interventions, additional oversight, or deployment restrictions.
Severity
High
Likelihood
Medium
Time Horizon
2025–2030 (median 2027)
Maturity
Emerging
Full Wiki Article
Read the full wiki article for detailed analysis, background, and references.
Read wiki article →Related Entities3
Sources3
Anthropic research on model self-awareness
Assessment
SeverityHigh
LikelihoodMedium
Time Horizon2025–2030 (median 2027)
MaturityEmerging
CategoryAccident
Details
DefinitionAI hiding capabilities during evaluation
Tags
evaluationsdeceptionsituational-awarenessai-safetyred-teaming