GPQA Diamond
Reasoning
Diamond subset of GPQA (Graduate-level Google-Proof Q&A): extremely difficult questions in physics, chemistry, and biology that even domain experts struggle with.
Models Tested: 8
Best Score: 84%
Median Score: 79.85%
Scoring: accuracy
Introduced: 2023-11
Maintainer: David Rein et al.
Leaderboard (8 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Gemini 2.5 Pro | Google DeepMind | 84% |
| 🥈 | o3 | OpenAI | 83.3% |
| 🥉 | o4-mini | OpenAI | 81.4% |
| 4 | Grok-3 | xAI | 80% |
| 5 | o3-mini | OpenAI | 79.7% |
| 6 | o1 | OpenAI | 79.2% |
| 7 | o1-preview | OpenAI | 78% |
| 8 | DeepSeek R1 | DeepSeek | 71.5% |
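
The summary figures at the top of the page (models tested, best score, median score) can be re-derived from the leaderboard rows. The following is a minimal sketch, assuming the scores are percent accuracy exactly as listed in the table; the variable names are illustrative and not part of any official tooling.

```python
from statistics import median

# Percent-accuracy scores copied from the leaderboard table.
scores = {
    "Gemini 2.5 Pro": 84.0,
    "o3": 83.3,
    "o4-mini": 81.4,
    "Grok-3": 80.0,
    "o3-mini": 79.7,
    "o1": 79.2,
    "o1-preview": 78.0,
    "DeepSeek R1": 71.5,
}

models_tested = len(scores)                # number of leaderboard rows
best_model = max(scores, key=scores.get)   # model with the highest score
best_score = scores[best_model]
# With an even number of entries, the median is the mean of the two middle values.
median_score = median(scores.values())

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score:g}% ({best_model})")
print(f"Median score: {median_score:.2f}%")
```

With eight entries, the median falls between the fourth and fifth rows (Grok-3 and o3-mini), which is why it has more decimal places than any individual score.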