MATH
Math
A dataset of 12,500 competition mathematics problems that tests mathematical reasoning across difficulty levels 1-5.
Models Tested: 10
Best Score: 99.2%
Median Score: 96.85%
Scoring: accuracy
Introduced: 2021-03
Maintainer: Dan Hendrycks et al.
Leaderboard (10 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 99.2% |
| 🥈 | o4-mini | OpenAI | 98.5% |
| 🥉 | o3-mini | OpenAI | 97.9% |
| 4 | Gemini 2.5 Pro | Google DeepMind | 97.3% |
| 5 | DeepSeek R1 | DeepSeek | 97.3% |
| 6 | o1 | OpenAI | 96.4% |
| 7 | Grok-3 | xAI | 95% |
| 8 | o1-preview | OpenAI | 94.8% |
| 9 | DeepSeek V3 | DeepSeek | 90.2% |
| 10 | o1-mini | OpenAI | 90% |
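
The Best Score and Median Score figures can be reproduced from the table above. The following is a minimal sketch, assuming the summary statistics are computed directly over the ten listed scores; the variable names are illustrative and not part of any leaderboard tooling.

```python
from statistics import median

# Scores (percent accuracy) copied from the leaderboard table above.
scores = {
    "o3": 99.2,
    "o4-mini": 98.5,
    "o3-mini": 97.9,
    "Gemini 2.5 Pro": 97.3,
    "DeepSeek R1": 97.3,
    "o1": 96.4,
    "Grok-3": 95.0,
    "o1-preview": 94.8,
    "DeepSeek V3": 90.2,
    "o1-mini": 90.0,
}

# Best score: the maximum over all listed models.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median: with an even number of models (10), this is the mean of the
# two middle values, here 97.3 (rank 5) and 96.4 (rank 6).
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best score: {best_score:.1f}% ({best_model})")
print(f"Median score: {median_score:.2f}%")
```

Because the list has an even number of entries, the median falls between two rows of the table, which is why it carries two decimal places while the individual scores carry one.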