Longterm Wiki

MATH

A dataset of 12,500 competition mathematics problems that tests mathematical reasoning across five difficulty levels (1 to 5).

Models Tested: 10
Best Score: 99.2%
Median Score: 96.85%
Scoring: accuracy
Introduced: 2021-03
Maintainer: Dan Hendrycks et al.

Leaderboard (10 models)

#    Model            Developer         Score
1    o3               OpenAI            99.2%
2    o4-mini          OpenAI            98.5%
3    o3-mini          OpenAI            97.9%
4    Gemini 2.5 Pro   Google DeepMind   97.3%
5    DeepSeek R1      DeepSeek          97.3%
6    o1               OpenAI            96.4%
7    Grok-3           xAI               95%
8    o1-preview       OpenAI            94.8%
9    DeepSeek V3      DeepSeek          90.2%
10   o1-mini          OpenAI            90%
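
The summary figures (models tested, best score, median score) can be rechecked from the leaderboard rows above. The following is a minimal Python sketch, not the wiki's own tooling: the `scores` dictionary and the `summarize` helper are names introduced here for illustration, and the values are copied from the table. With an even number of entries, the median is the mean of the two middle scores (here 97.3 and 96.4).

```python
from statistics import median

# Accuracy scores (percent) copied from the leaderboard rows above.
# The dictionary layout is an assumption made for this illustration.
scores = {
    "o3": 99.2,
    "o4-mini": 98.5,
    "o3-mini": 97.9,
    "Gemini 2.5 Pro": 97.3,
    "DeepSeek R1": 97.3,
    "o1": 96.4,
    "Grok-3": 95.0,
    "o1-preview": 94.8,
    "DeepSeek V3": 90.2,
    "o1-mini": 90.0,
}


def summarize(scores):
    """Return the model count, best score, and median score (hypothetical helper)."""
    values = list(scores.values())
    return {
        "models": len(values),
        "best": max(values),
        # Round to two decimals to hide floating-point noise in the average.
        "median": round(median(values), 2),
    }


summary = summarize(scores)
print(summary)
```

Running this should reproduce the three headline numbers listed at the top of the page.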