AI Sandbagging: Language Models can Strategically Underperform on Evaluations

resource

Metadata

`id`	4a1e5f2bded4d079
`url`	arxiv.org/html/2406.07358v4
`title`	AI Sandbagging: Language Models can Strategically Underperform on Evaluations
`type`	paper
`summary`	—
`review`	—
`abstract`	—
`keyPoints`	—
`publicationId`	—
`authors`	—
`authorEntityIds`	—
`publishedDate`	—
`tags`	[]
`localFilename`	—
`credibilityOverride`	—
`fetchedAt`	—
`contentHash`	—
`stableId`	sid_X7Ria9U1MA
`fetchStatus`	—
`lastFetchedAt`	—
`archiveUrl`	—
`stance`	—
`contextNote`	—
`resourcePurpose`	—
`resourceSubtype`	—
`typeMetadata`	—
`publisherEntityId`	—
`relatedEntityIds`	—
`enrichmentStatus`	—
`enrichmentDate`	—
`importanceScore`	—
`contentLifecycle`	—
