Longterm Wiki

GPQA Diamond

Category: Reasoning

The Diamond subset of the Graduate-Level Google-Proof Q&A (GPQA) benchmark: extremely difficult questions in physics, chemistry, and biology that even domain experts struggle with.

Models Tested: 38
Best Score: 91.3%
Median Score: 62.5%
Scoring: accuracy
Introduced: 2023-11
Maintainer: David Rein et al.

Leaderboard (38 models)

#    Model               Developer         Score
🥇   Claude Opus 4.6     Anthropic         91.3%
🥈   Claude Opus 4.5     Anthropic         87%
🥉   Gemini 2.5 Pro      Google DeepMind   84%
4    Claude Sonnet 4.5   Anthropic         83.4%
5    o3                  OpenAI            83.3%
6    Gemini 2.5 Flash    Google DeepMind   82.8%
7    o4-mini             OpenAI            81.4%
8    Claude Opus 4.1     Anthropic         81%
9    Grok-3              xAI               80%
10   o3-mini             OpenAI            79.7%
11   o1                  OpenAI            79.2%
12   o1-preview          OpenAI            78%
13   Claude Opus 4       Anthropic         74.1%
14   Claude Sonnet 4.6   Anthropic         74.1%
15   DeepSeek R1         DeepSeek          71.5%
16   Claude Sonnet 4     Anthropic         70.3%
17   Llama 4 Maverick    Meta AI (FAIR)    69.8%
18   Claude 3.7 Sonnet   Anthropic         68%
19   Claude 3.5 Sonnet   Anthropic         65%
20   o1-mini             OpenAI            60%
21   DeepSeek V3         DeepSeek          59.1%
22   Llama 4 Scout       Meta AI (FAIR)    57.2%
23   Gemini 2.0 Flash    Google DeepMind   57%
24   GPT-4.1             OpenAI            56.4%
25   Grok-2              xAI               56.4%
26   GPT-4o              OpenAI            53.6%
27   Llama 3.1           Meta AI (FAIR)    50.7%
28   Claude 3 Opus       Anthropic         50.4%
29   GPT-4 Turbo         OpenAI            49.3%
30   Llama 3.3           Meta AI (FAIR)    49.2%
31   Mistral Large 2     Mistral AI        43.9%
32   Claude 3.5 Haiku    Anthropic         41.6%
33   Claude 3 Sonnet     Anthropic         40.4%
34   GPT-4o mini         OpenAI            39.8%
35   Llama 3             Meta AI (FAIR)    39.5%
36   GPT-4               OpenAI            35.7%
37   Gemini 1.0 Ultra    Google DeepMind   35.4%
38   Claude 3 Haiku      Anthropic         33.3%
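
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard scores. The following is a minimal Python sketch, assuming the median is the ordinary statistical median of the 38 listed scores; the page does not say how it is computed, and the variable names are illustrative only.

```python
from statistics import median

# Scores (percent) copied from the leaderboard, in rank order.
scores = [
    91.3, 87, 84, 83.4, 83.3, 82.8, 81.4, 81, 80, 79.7,
    79.2, 78, 74.1, 74.1, 71.5, 70.3, 69.8, 68, 65, 60,
    59.1, 57.2, 57, 56.4, 56.4, 53.6, 50.7, 50.4, 49.3, 49.2,
    43.9, 41.6, 40.4, 39.8, 39.5, 35.7, 35.4, 33.3,
]

models_tested = len(scores)    # number of leaderboard entries
best_score = max(scores)       # top-ranked score
median_score = median(scores)  # even count: mean of the 19th and 20th scores

print(f"Models Tested: {models_tested}")
print(f"Best Score: {best_score}%")
print(f"Median Score: {median_score}%")
```

With an even number of entries, the median falls between rank 19 (65%) and rank 20 (60%), which is why it does not match any single model's score.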