Longterm Wiki

MMLU-Pro Benchmark Leaderboard – Artificial Analysis

web

MMLU-Pro is an enhanced AI benchmark leaderboard tracking model performance on graduate-level reasoning tasks, relevant to AI safety as it provides more discriminative evaluation of advanced model capabilities beyond saturated benchmarks.

Metadata

Importance: 45/100 · tool page · reference

Summary

This page presents the MMLU-Pro benchmark leaderboard hosted by Artificial Analysis, featuring independently conducted evaluations of language models on 12,000 graduate-level questions spanning 14 subjects, each with ten answer choices. MMLU-Pro addresses saturation in the original MMLU by emphasizing deeper reasoning over knowledge recall; models score 16–33% lower on it than on MMLU. It also demonstrates greater prompt stability and rewards Chain-of-Thought reasoning, making it a more discriminative tool for tracking AI progress.

Key Points

  • MMLU-Pro expands the original MMLU to 12,000 graduate-level questions with 10 answer choices instead of 4, reducing model saturation.
  • Models score 16–33% lower on MMLU-Pro than MMLU, better differentiating advanced language model capabilities.
  • Prompt sensitivity drops from 4–5% in MMLU to ~2% in MMLU-Pro, indicating more robust and stable evaluation.
  • Chain-of-Thought reasoning outperforms direct answering on MMLU-Pro, confirming the benchmark's focus on complex reasoning (a prompt-and-scoring sketch follows this list).
  • Evaluations are independently conducted by Artificial Analysis, providing a third-party leaderboard for model comparison.
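
As a rough illustration of how a ten-option, Chain-of-Thought evaluation of this kind can be run (the page does not show the Artificial Analysis harness itself), the sketch below formats an MMLU-Pro item as a CoT prompt, extracts the final answer letter, and computes accuracy. The `query_model` callable is hypothetical, and the `TIGER-Lab/MMLU-Pro` Hugging Face dataset name refers to the paper authors' public release rather than anything stated on this page.

```python
# Minimal sketch of a 10-option, CoT-style MMLU-Pro evaluation loop.
# Assumes a hypothetical `query_model(prompt) -> str` callable and the
# publicly released TIGER-Lab/MMLU-Pro dataset on Hugging Face.
import re
from datasets import load_dataset

LETTERS = "ABCDEFGHIJ"  # ten answer options instead of MMLU's four

def format_prompt(item: dict) -> str:
    """Render one MMLU-Pro item as a Chain-of-Thought prompt."""
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item["options"]))
    return (
        f"Question: {item['question']}\n{options}\n"
        "Think step by step, then end with 'The answer is (X)'."
    )

def extract_choice(completion: str) -> str | None:
    """Pull the final answer letter out of a model completion."""
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None

def evaluate(query_model, split: str = "test", limit: int = 100) -> float:
    """Accuracy of the hypothetical `query_model` on a slice of MMLU-Pro."""
    data = load_dataset("TIGER-Lab/MMLU-Pro", split=split).select(range(limit))
    correct = sum(
        extract_choice(query_model(format_prompt(item))) == item["answer"]
        for item in data
    )
    return correct / limit
```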

Cited by 1 page

Eval Saturation & The Evals Gap (type: Approach, quality: 65.0)

Cached Content Preview

HTTP 200 · Fetched Apr 24, 2026 · 11 KB
Artificial Analysis · All evaluations · MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Background

An enhanced version of the original MMLU benchmark that addresses model saturation by expanding to 12,000 graduate-level questions with ten answer choices instead of four. MMLU-Pro emphasizes deeper reasoning over knowledge recall, creating a more challenging evaluation that better discriminates between advanced language models.

Methodology

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (view on arXiv)

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen.

Abstract

 In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
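
The prompt-stability result in the abstract (a roughly 2% spread on MMLU-Pro versus 4–5% on MMLU across 24 prompt styles) boils down to measuring how much accuracy moves when only the prompt template changes. A minimal sketch of that spread calculation, using made-up accuracy values purely to show the shape of the computation, might look like:

```python
# Toy illustration of the prompt-sensitivity measure described in the abstract:
# evaluate the same model under several prompt templates and report the spread.
# The accuracy values below are placeholders, not results from this leaderboard.

def prompt_sensitivity(accuracies_by_prompt: list[float]) -> float:
    """Spread (max minus min accuracy) across prompt variants."""
    return max(accuracies_by_prompt) - min(accuracies_by_prompt)

mmlu_scores = [0.862, 0.845, 0.871, 0.838]       # hypothetical, 4 of 24 styles
mmlu_pro_scores = [0.631, 0.642, 0.625, 0.637]   # hypothetical

print(f"MMLU spread:     {prompt_sensitivity(mmlu_scores):.1%}")      # ~3%
print(f"MMLU-Pro spread: {prompt_sensitivity(mmlu_pro_scores):.1%}")  # ~2%
```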


... (truncated, 11 KB total)
Resource ID: 3b388d601f25992d | Stable ID: sid_6TUchEcl5w