MMLU
Category: Knowledge
Massive Multitask Language Understanding — a multiple-choice benchmark covering 57 academic subjects, from STEM to the humanities.
Models tested: 18
Best score: 92.2%
Median score: 87.15%
Scoring: accuracy
Introduced: 2021-01
Maintainer: Dan Hendrycks et al.
Leaderboard (18 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Llama 4 Maverick | Meta AI (FAIR) | 92.2% |
| 🥈 | Gemini 1.0 Ultra | Google DeepMind | 90.0% |
| 🥉 | Gemini 2.0 Flash | Google DeepMind | 89.7% |
| 4 | Llama 4 Scout | Meta AI (FAIR) | 89.3% |
| 5 | Claude 3.5 Sonnet | Anthropic | 88.7% |
| 6 | GPT-4o | OpenAI | 88.7% |
| 7 | Llama 3.1 | Meta AI (FAIR) | 88.6% |
| 8 | DeepSeek V3 | DeepSeek | 88.5% |
| 9 | Grok-2 | xAI | 87.5% |
| 10 | Claude 3 Opus | Anthropic | 86.8% |
| 11 | GPT-4 | OpenAI | 86.4% |
| 12 | Llama 3.3 | Meta AI (FAIR) | 86.0% |
| 13 | Gemini 1.5 Pro | Google DeepMind | 85.9% |
| 14 | Mistral Large 2 | Mistral AI | 84.0% |
| 15 | GPT-4o mini | OpenAI | 82.0% |
| 16 | Llama 3 | Meta AI (FAIR) | 82.0% |
| 17 | Mixtral 8x7B | Mistral AI | 70.6% |
| 18 | Llama 2 | Meta AI (FAIR) | 68.9% |
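As a sanity check, the summary figures above (best and median score) can be recomputed directly from the leaderboard. A minimal sketch in Python, with the scores transcribed from the table:

```python
# Recompute the summary stats from the 18 leaderboard scores above.
import statistics

scores = [92.2, 90.0, 89.7, 89.3, 88.7, 88.7, 88.6, 88.5, 87.5,
          86.8, 86.4, 86.0, 85.9, 84.0, 82.0, 82.0, 70.6, 68.9]

best = max(scores)
# With an even count (18), the median is the mean of the 9th- and
# 10th-ranked scores: (87.5 + 86.8) / 2 = 87.15.
median = statistics.median(scores)

print(f"Models tested: {len(scores)}")   # Models tested: 18
print(f"Best score: {best}%")            # Best score: 92.2%
print(f"Median score: {median}%")        # Median score: 87.15%
```

The 87.15% median falls between Grok-2 (#9) and Claude 3 Opus (#10), which is why it matches no single row in the table.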