MMLU Benchmark Overview - Stanford CRFM
crfm.stanford.edu/2024/05/01/helm-mmlu.html
Relevant for AI safety researchers tracking capability progression and evaluation methodology; MMLU is a widely-cited benchmark whose results inform debates about model capability and safety readiness.
Metadata
Importance: 52/100 · blog post · reference
Summary
Stanford CRFM's analysis of the Massive Multitask Language Understanding (MMLU) benchmark within the HELM evaluation framework, examining how frontier language models perform across 57 academic subjects. The resource provides standardized evaluation methodology and comparative results to help researchers assess LLM capabilities reliably and reproducibly.
Key Points
- MMLU tests LLMs across 57 subjects spanning STEM, the humanities, the social sciences, and professional domains to assess breadth of knowledge.
- Stanford CRFM integrates MMLU into HELM (Holistic Evaluation of Language Models) for standardized, reproducible benchmarking of frontier models.
- Highlights how benchmark results can vary significantly with prompting methodology, scoring approach, and evaluation setup (see the sketch after this list).
- Serves as a key reference point for tracking the capability progression of large language models over time.
- Raises concerns about benchmark saturation as top models approach ceiling performance on MMLU.
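To make the methodology point concrete, here is a minimal, hypothetical sketch (not code from HELM or any model creator) of two answer-grading rules an evaluator might choose between; applied to the same raw completions, they can yield different measured accuracies:

```python
# Illustrative sketch only -- not HELM's grading code. Two plausible rules
# for grading a model's raw completion against an MMLU gold letter.
import re

def grade_strict(completion: str, gold: str) -> bool:
    """Correct only if the first non-whitespace character is the gold letter."""
    stripped = completion.strip()
    return bool(stripped) and stripped[0].upper() == gold.upper()

def grade_lenient(completion: str, gold: str) -> bool:
    """Correct if the gold letter appears anywhere as a standalone token."""
    return re.search(rf"\b{gold.upper()}\b", completion.upper()) is not None

completion = "The answer is (B), because nitrogen dominates."
print(grade_strict(completion, "B"))   # False -- prose precedes the letter
print(grade_lenient(completion, "B"))  # True  -- letter found inside the text
```

Divergences like this one, multiplied across thousands of items, are enough to shift a headline MMLU score by several points.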
Review
The HELM MMLU project critically examines the current landscape of MMLU benchmark evaluations, highlighting significant methodological inconsistencies in how model creators report performance. By introducing a comprehensive, standardized evaluation framework, the researchers aim to make assessments of language model capabilities across 57 academic subjects more reliable and comparable.
The project's key contribution lies in its emphasis on transparency, standardized prompting, and open-source evaluation. By using the HELM framework, the researchers were able to reveal discrepancies between model creators' reported scores and their independent evaluations, with some scores differing by up to 5 percentage points. This approach not only provides a more rigorous assessment of language models but also promotes reproducibility and accountability in AI research, potentially helping to address concerns about inflated or non-comparable performance claims.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Minimal Scaffolding | Capability | 52.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 8 KB
Stanford CRFM
Massive Multitask Language Understanding (MMLU) on HELM
Authors: Yifan Mai and Percy Liang
Despite the prominence of the Massive Multitask Language Understanding (MMLU) benchmark, MMLU scores reported by model creators are frequently produced in inconsistent or problematic ways that hinder their comparability. To address this, we introduce HELM MMLU, a leaderboard with results from evaluating various language models on MMLU using HELM. Our evaluation results feature simple and standardized prompts, an accuracy breakdown for each of the 57 subjects, and full transparency of all raw prompts and predictions.
Motivation
Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) is a multiple-choice question answering test that covers 57 tasks, including elementary mathematics, US history, computer science, law, and more. MMLU scores are reported prominently in the evaluation of virtually all language models, as well as on many leaderboards, including the Open LLM Leaderboard, HELM Classic, and HELM Lite.
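As an illustration, the prompt layout popularized by the original MMLU release presents each question with four lettered options and an "Answer:" cue. A minimal sketch follows; the helper names are our own, not from the MMLU or HELM codebases:

```python
# Minimal sketch of the widely used MMLU prompt layout (the format from
# the original benchmark release).
LETTERS = "ABCD"

def format_example(question, choices, answer=None):
    """Render one item: question, lettered options, and an 'Answer:' cue."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, shots, test_q, test_choices):
    """Subject header, k solved examples, then the unanswered test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    parts = [header] + [format_example(q, c, a) for q, c, a in shots]
    parts.append(format_example(test_q, test_choices))
    return "\n\n".join(parts)

# Zero-shot usage (shots=[]); 5-shot evaluation passes five solved examples,
# conventionally drawn from the subject's dev split.
print(build_prompt("elementary mathematics", [],
                   "What is 7 x 8?", ["54", "56", "63", "64"]))
```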
Despite the prominence of the MMLU benchmark, there have been a number of issues with recent MMLU evaluations that hinder the comparability of MMLU scores:
Model creators have reported MMLU scores using non-standard prompting techniques. For instance, a Google blog post on Gemini compared Gemini Ultra's MMLU score using "Chain-of-Thought with uncertainty routing" with GPT-4's MMLU score using regular 5-shot in-context learning. This was not a controlled comparison, because Gemini Ultra's MMLU score was significantly boosted by this method relative to regular 5-shot in-context learning (90.04% vs. 83.7%).
Third-party researchers have reported lower MMLU scores for models than the scores provided by their creators. For example, Akter et al. (2023) reported a 5-shot MMLU score of 65.22%, whereas the Gemini paper reported a 5-shot MMLU score of 71.8%.
Model creators provided insufficient information about prompting templates. Evaluation results for language models can be highly sensitive to variations in prompting, including task instructions and the choice of in-context examples. Yet many model papers do not provide enough information about the prompts for a third-party researcher to recreate them.
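To see how little it takes to break reproducibility, compare two hypothetical templates for the same item; neither is claimed to be any creator's actual template, and differences this small (option markers, instruction wording, answer cue) are exactly the kind of detail that goes unreported:

```python
# Two hypothetical templates for the same MMLU item -- illustrative only.
question = "Which gas makes up most of Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Argon", "Carbon dioxide"]

template_a = "\n".join(
    [question] + [f"{l}. {c}" for l, c in zip("ABCD", choices)] + ["Answer:"]
)
template_b = "\n".join(
    [f"Question: {question}"]
    + [f"({l}) {c}" for l, c in zip("ABCD", choices)]
    + ["The correct answer is"]
)
print(template_a, template_b, sep="\n\n")
```

Without the exact template published, a third party cannot know which variant was scored, and such variations can measurably shift accuracy.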
Model creators did not use open-source evaluation frameworks. When evaluations are performed using an open-source evaluation framework such as HELM, LM Evaluation Harness, or Unitxt, the results can be directly compared with results for a different model on the same framework. Furthermore, a third-party researcher can use the framework to recreate the precise prompts used for evaluation. Most model creators did not report using an open-source evaluation framework for their MMLU evaluations.
Model creators sometimes performed evaluations
... (truncated, 8 KB total)
Resource ID: 0f91a062039eabb8 | Stable ID: sid_vPCc459JFY