MMLU

In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models.

Benchmark

The MMLU consists of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing large language models, with over 100 million downloads as of July 2024.^[1]^[2]

The MMLU was released by Dan Hendrycks and a team of researchers in 2020.^[3] It was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed near the level of random chance (25%, since each question has four answer options), with the best-performing GPT-3 model achieving 43.9% accuracy.^[3] The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy.^[3] As of 2024, some of the most powerful language models, such as o1, Gemini, and Claude 3, were reported to achieve scores around 90%.^[4]^[5]
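
The 25% chance level follows from each question offering four answer options. As a minimal sketch, assuming a score is simply the fraction of questions answered correctly, accuracy and a random-guess baseline could be computed as below. The function names and the synthetic answer key are illustrative assumptions, not part of the benchmark's own tooling.

```python
import random

CHOICES = ("A", "B", "C", "D")  # each question has four options


def accuracy(predictions, answers):
    """Return the fraction of predicted letters that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(answers, seed=0):
    """Score a guesser that picks one of the four letters uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.choice(CHOICES) for _ in answers]
    return accuracy(guesses, answers)


# Synthetic answer key (an assumption for illustration, not real MMLU data).
key_rng = random.Random(1)
answer_key = [key_rng.choice(CHOICES) for _ in range(16_000)]

print(accuracy(["B", "B", "C"], ["B", "A", "C"]))  # two of three correct
print(round(random_baseline(answer_key), 2))       # close to 0.25
```

With roughly 16,000 questions, a uniform guesser lands very close to 0.25, which is why scores near 25% are read as chance-level performance.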

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively.^[3] The correct answer to each is (B):

Find all $c$ in $\mathbb {Z} _{3}$ such that $\mathbb {Z} _{3}[x]/(x^{2}+c)$ is a field.
(A) 0 (B) 1 (C) 2 (D) 3

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
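
The answer to the Abstract Algebra example can be checked by brute force. The quotient $\mathbb {Z} _{3}[x]/(x^{2}+c)$ is a field exactly when $x^{2}+c$ is irreducible over $\mathbb {Z} _{3}$, and a quadratic is irreducible over a field exactly when it has no root there. The sketch below tests every candidate; the helper name `field_constants` is an illustrative assumption.

```python
def field_constants(p=3):
    """Return every c in Z_p for which x^2 + c has no root in Z_p.

    For a prime p, a quadratic without roots is irreducible, so the
    quotient Z_p[x]/(x^2 + c) is a field for exactly these values of c.
    """
    return [
        c
        for c in range(p)
        if all((x * x + c) % p != 0 for x in range(p))
    ]


print(field_constants())  # [1]
```

Only $c = 1$ survives: $x^{2}$ has the root 0 and $x^{2}+2$ has the root 1, while $x^{2}+1$ takes the values 1, 2, 2 on $\mathbb {Z} _{3}$. This corresponds to option (B).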

Leaderboard

Organisation	LLM	MMLU score (%)
OpenAI	o1	90.8^[4]
Rubik's AI	Nova-Pro	88.8
Anthropic	Claude 3.5 Sonnet	88.7
Meta	Llama-3.1 405B	88.6
xAI	Grok-2	87.5
Anthropic	Claude 3 Opus	86.8
Meta	Llama-3.1 70B	86.0
Google	Gemini-1.5 Pro	85.9
Inflection	Inflection-2.5	85.5
Mistral	Mistral Large 2	84.0
Reka	Reka Core	83.2
AI21	Jamba-1.5 Large	81.2

References

[1] Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
[2] "MMLU Dataset". Hugging Face. 24 July 2024.
[3] Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2020). "Measuring Massive Multitask Language Understanding". arXiv:2009.03300 [cs.CY].
[4] "OpenAI o1 System Card". OpenAI. p. 33. Retrieved 13 September 2024.
[5] "Multi-task Language Understanding on MMLU | Leaderboard". Papers with Code. Retrieved 10 October 2024.
