Longterm Wiki

GPQA Diamond

Reasoning

The Diamond subset of GPQA (Graduate-Level Google-Proof Q&A): extremely difficult questions in physics, chemistry, and biology that even domain experts struggle with.

Models Tested: 8
Best Score: 84%
Median Score: 79.85%
Scoring: accuracy
Introduced: 2023-11
Maintainer: David Rein et al.

Leaderboard (8 models)

#    Model            Developer         Score
🥇   Gemini 2.5 Pro   Google DeepMind   84%
🥈   o3               OpenAI            83.3%
🥉   o4-mini          OpenAI            81.4%
4    Grok-3           xAI               80%
5    o3-mini          OpenAI            79.7%
6    o1               OpenAI            79.2%
7    o1-preview       OpenAI            78%
8    DeepSeek R1      DeepSeek          71.5%
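
The summary figures at the top of the page (models tested, best score, median score) can be checked against the leaderboard rows. The following is a minimal sketch, assuming the scores are exactly the percentages listed above; the variable names are illustrative and not part of the wiki.

```python
from statistics import median

# Accuracy scores in percent, copied from the leaderboard rows above.
scores = {
    "Gemini 2.5 Pro": 84.0,
    "o3": 83.3,
    "o4-mini": 81.4,
    "Grok-3": 80.0,
    "o3-mini": 79.7,
    "o1": 79.2,
    "o1-preview": 78.0,
    "DeepSeek R1": 71.5,
}

models_tested = len(scores)

# Best score: the highest accuracy and the model that achieved it.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# With an even number of entries, the median is the mean of the
# two middle values after sorting.
median_score = median(scores.values())

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score:g}% ({best_model})")
print(f"Median score: {median_score:.2f}%")
```

With eight entries, the two middle scores are 79.7% and 80%, so the median is their mean, 79.85%, which agrees with the figure reported on the page.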