
MMLU

Category: Knowledge

Massive Multitask Language Understanding — a multiple-choice benchmark covering 57 academic subjects from STEM to humanities.

Models Tested: 18
Best Score: 92.2%
Median Score: 87.15%
Scoring: accuracy
Introduced: 2021-01
Maintainer: Dan Hendrycks et al.
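Since the benchmark is multiple-choice and scored by plain accuracy, the scoring rule is simple to state in code. This is a minimal sketch with made-up answer keys, not the official evaluation harness; `accuracy`, `answers`, and `predictions` are illustrative names.

```python
# Minimal sketch of MMLU-style accuracy scoring (hypothetical data,
# not the official evaluation code).
def accuracy(predictions, answers):
    """Fraction of questions where the predicted choice matches the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Each question has exactly one correct choice among A-D.
answers     = ["B", "D", "A", "C"]
predictions = ["B", "D", "C", "C"]
print(f"{accuracy(predictions, answers):.1%}")  # → 75.0%
```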

Leaderboard (18 models)

| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Llama 4 Maverick | Meta AI (FAIR) | 92.2% |
| 🥈 | Gemini 1.0 Ultra | Google DeepMind | 90.0% |
| 🥉 | Gemini 2.0 Flash | Google DeepMind | 89.7% |
| 4 | Llama 4 Scout | Meta AI (FAIR) | 89.3% |
| 5 | Claude 3.5 Sonnet | Anthropic | 88.7% |
| 6 | GPT-4o | OpenAI | 88.7% |
| 7 | Llama 3.1 | Meta AI (FAIR) | 88.6% |
| 8 | DeepSeek V3 | DeepSeek | 88.5% |
| 9 | Grok-2 | xAI | 87.5% |
| 10 | Claude 3 Opus | Anthropic | 86.8% |
| 11 | GPT-4 | OpenAI | 86.4% |
| 12 | Llama 3.3 | Meta AI (FAIR) | 86.0% |
| 13 | Gemini 1.5 Pro | Google DeepMind | 85.9% |
| 14 | Mistral Large 2 | Mistral AI | 84.0% |
| 15 | GPT-4o mini | OpenAI | 82.0% |
| 16 | Llama 3 | Meta AI (FAIR) | 82.0% |
| 17 | Mixtral 8x7B | Mistral AI | 70.6% |
| 18 | Llama 2 | Meta AI (FAIR) | 68.9% |
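The summary statistics above follow directly from the table: with 18 entries, the median is the mean of the 9th and 10th scores. A short sketch reproducing them from the listed scores:

```python
from statistics import median

# The 18 leaderboard scores from the table above, in percent.
scores = [92.2, 90.0, 89.7, 89.3, 88.7, 88.7, 88.6, 88.5, 87.5,
          86.8, 86.4, 86.0, 85.9, 84.0, 82.0, 82.0, 70.6, 68.9]

# Best score, and median = mean of the two middle values (87.5 and 86.8).
print(f"Best:   {max(scores)}%")     # 92.2%
print(f"Median: {median(scores)}%")  # 87.15%
```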