Overview of Major Benchmark Datasets for Evaluating Large Language Models
This article provides a comprehensive overview of major benchmark datasets—including CMMLU, MMLU, C‑Eval, GSM8K, Gaokao‑Bench, AGIEval, MATH, BBH, HumanEval, and MBPP—used to evaluate large language models' knowledge, reasoning, and coding abilities, and summarizes related leaderboards and evaluation tools.
Introduction
At the 2023 Cloud Expo, Alibaba Cloud released the trillion-parameter large model Tongyi Qianwen 2.0. According to figures presented at the event, Tongyi Qianwen 2.0 outperformed GPT‑3.5 overall across ten authoritative benchmarks and is closing the gap with GPT‑4. The table below shows its scores on MMLU, C‑Eval, GSM8K, HumanEval, MATH and other benchmarks.
The results show that Tongyi Qianwen 2.0 generally surpasses Meta's Llama‑2‑70B, wins nine of the ten comparisons against OpenAI's GPT‑3.5, and loses four of ten to GPT‑4, indicating a further narrowing of the gap (source: Sina Finance).
What are these benchmark datasets and what aspects do they focus on?
Benchmark Dataset Introduction
CMMLU
CMMLU is a Chinese‑focused benchmark for assessing the knowledge and reasoning of large language models, jointly created by MBZUAI, Shanghai Jiao Tong University and Microsoft Research Asia. It contains 67 subjects spanning the natural sciences, social sciences, engineering and humanities, and is among the most authoritative Chinese‑language evaluations.
Paper: CMMLU: Measuring massive multitask language understanding in Chinese
Data, code and latest leaderboard: https://github.com/haonan-li/CMMLU
MMLU
MMLU (Massive Multitask Language Understanding) was introduced by Hendrycks et al. in the paper “Measuring Massive Multitask Language Understanding”. It evaluates pretrained models in zero‑shot and few‑shot settings across a wide range of subjects.
Website: https://paperswithcode.com/dataset/mmlu
Paper: MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING
Leaderboard: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
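MMLU (like CMMLU and C‑Eval) is scored as multiple‑choice accuracy, typically with a few solved "dev" examples prepended to each test question. A minimal sketch of how such a harness might build the prompt and score predictions; the exact template varies between implementations, and the field names here are illustrative:

```python
# Illustrative MMLU-style evaluation helpers (not any harness's exact code).

def format_question(q: dict) -> str:
    """Render one question with lettered choices, MMLU-style."""
    letters = "ABCD"
    lines = [q["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(q["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples: list, test_q: dict) -> str:
    """Few-shot prompt: solved dev examples followed by the test question."""
    shots = [format_question(ex) + " " + "ABCD"[ex["answer"]] for ex in dev_examples]
    return "\n\n".join(shots + [format_question(test_q)])

def accuracy(preds: list, golds: list) -> float:
    """Fraction of questions where the predicted letter matches the gold letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

The model's continuation after the final "Answer:" (or the highest‑probability letter among A–D) is compared against the gold choice.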
C‑Eval
C‑Eval is a comprehensive Chinese evaluation suite built by Tsinghua University, Shanghai Jiao Tong University and the University of Edinburgh. It covers 52 disciplines, containing 13,948 multiple‑choice questions at four difficulty levels, and is one of the leading Chinese LLM benchmarks.
Paper: C‑Eval: A Multi‑Level Multi‑Discipline Chinese Evaluation Suite for Foundation Models
Website: https://cevalbenchmark.com/
GitHub: https://github.com/hkust-nlp/ceval/
Leaderboard: (link)
GSM8K
GSM8K is an OpenAI‑released benchmark for mathematical reasoning, consisting of 8.5K high‑quality elementary‑school word problems (7.5K training, 1K test). Each problem requires 2–8 reasoning steps using basic arithmetic operations (+, −, ×, ÷).
First released in October 2021, it is one of the two best‑known math‑reasoning benchmarks (the other being MATH) and remains challenging.
Background: Large models such as GPT‑3 exhibit impressive abilities yet struggle with tasks requiring precise multi‑step reasoning, such as elementary math word problems. OpenAI created GSM8K to evaluate and improve this capability.
Paper: Training Verifiers to Solve Math Word Problems
Project: https://github.com/openai/grade-school-math
Blog: https://openai.com/research/solving-math-word-problems
Gaokao‑Bench
Gaokao‑Bench uses Chinese college‑entrance exam questions (2010‑2022) to evaluate language understanding and logical reasoning of LLMs. It contains 1,781 multiple‑choice, 218 fill‑in‑the‑blank and 812 answer‑type questions, with both automated objective scoring and expert‑rated subjective scoring.
Website: https://github.com/OpenLMLab/GAOKAO-Bench
Paper: Evaluating the Performance of Large Language Models on GAOKAO Benchmark
AGIEval
Released by Microsoft in April 2023, AGIEval assesses the general intelligence and problem‑solving abilities of LLMs using 20 official, public, high‑standard admission and qualification exams administered in multiple languages.
Paper: AGIEval: A Human‑Centric Benchmark for Evaluating Foundation Models
Data: https://github.com/microsoft/AGIEval
MATH
MATH is a UC Berkeley benchmark for mathematical problem solving, containing 12,500 high‑school competition problems with detailed step‑by‑step solutions, used to train models to generate reasoning traces. It remains highly challenging for current models.
Project: https://github.com/hendrycks/math
Paper: Measuring Mathematical Problem Solving With the MATH Dataset
BBH
BIG‑bench Hard (BBH) is a subset of BIG‑bench focusing on tasks where current LLMs perform worse than humans, highlighting current limitations.
BIG‑bench is a collaborative benchmark covering 204 tasks across linguistics, child development, math, common‑sense reasoning, biology, physics, social bias, software development and more; with sufficient scale and few‑shot prompting, models reach near‑human performance on roughly 65 % of its tasks.
Paper: Challenging BIG‑Bench Tasks and Whether Chain‑of‑Thought Can Solve Them
GitHub: https://github.com/suzgunmirac/BIG-Bench-Hard
HumanEval
HumanEval measures functional correctness of code generated from docstrings. It consists of 164 programming problems covering language understanding, algorithms and basic math, comparable to simple software interview questions.
Paper: https://arxiv.org/abs/2107.03374
GitHub: https://github.com/openai/human-eval
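The HumanEval paper scores models with pass@k: sample n completions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k random samples passes. Its unbiased estimator, 1 − C(n−c, k)/C(n, k), can be computed with a numerically stable running product:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval/Codex paper:
    n samples per problem, c of which pass the unit tests."""
    if n - c < k:
        return 1.0  # too few failures for k samples to all fail
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Computing the estimator this way avoids the overflow that naive factorials would cause for large n.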
MBPP
MBPP contains about 1,000 crowd‑sourced Python programming problems aimed at beginner programmers, covering basic coding knowledge and standard library usage. Each entry includes a task description, solution code, and three automated test cases, reflecting LLM code understanding and generation abilities.
Paper: Program Synthesis with Large Language Models
GitHub: https://github.com/.../mbpp
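Scoring MBPP amounts to executing a candidate solution against the entry's three assert statements. A bare sketch of that check; real harnesses run untrusted code in a sandbox with timeouts, which is omitted here:

```python
def check_candidate(candidate_code: str, test_cases: list) -> bool:
    """MBPP-style check: execute the candidate, then its assert statements.
    WARNING: exec on untrusted code is unsafe; harnesses sandbox this step."""
    env: dict = {}
    try:
        exec(candidate_code, env)      # define the candidate function
        for assertion in test_cases:
            exec(assertion, env)       # each test is an `assert ...` line
        return True
    except Exception:                  # wrong answer, crash, or syntax error
        return False
```

A problem counts as solved when all of its test cases pass, so the benchmark measures functional correctness rather than textual similarity.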
Appendix
Leaderboards
UC Berkeley‑led LLM Ranking
LMSYS Org, initiated by UC Berkeley researchers, runs the Chatbot Arena, a ranking tournament in which LLMs battle head‑to‑head and are ranked by Elo scores.
Website: https://lmsys.org/projects/
Online demo: https://chat.lmsys.org/
The tournament uses MT‑Bench as the chat‑bot evaluation benchmark.
One of the founders, Ying Sheng, a CS PhD student at Stanford, co‑authored FlexGen, a system that can run 175B‑parameter models on a single GPU and has earned over 8,000 GitHub stars. The other founders are Lianmin Zheng and Hao Zhang.
AlpacaEval
GitHub: https://github.com/tatsu-lab/alpaca_eval
Leaderboard: Alpaca Eval Leaderboard
OpenCompass
Website: https://opencompass.org.cn
Leaderboard: https://opencompass.org.cn/leaderboard-llm
MT‑Bench
MT‑Bench is a carefully designed benchmark containing 80 high‑quality multi‑turn questions (two turns each, so 160 prompts in total) across eight categories: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM) and Knowledge II (Humanities).
The 2023 LMSYS Org leaderboard presents results as a radar chart over the eight MT‑Bench categories (Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, Humanities).
MathVista
MathVista, a multimodal math‑reasoning benchmark from Microsoft and academic collaborators, is accompanied by a 112‑page evaluation report and focuses on the mathematical abilities of large multimodal models. It remains challenging even for state‑of‑the‑art models such as GPT‑4V.
Paper: https://arxiv.org/abs/2310.02255
Project: https://mathvista.github.io/
HF dataset: https://huggingface.co/datasets/AI4Math/MathVista
Visualization: https://mathvista.github.io/#visualization
Leaderboard: https://mathvista.github.io/#leaderboard
Survey Paper on LLM Evaluation
Paper: A Survey on Evaluation of Large Language Models
Contributions and additional resources are welcome.
References
https://blog.csdn.net/qq_18846849/article/details/127547883
https://baijiahao.baidu.com/s?id=1782446277193428846&wfr=spider&for=pc
https://zhuanlan.zhihu.com/p/643086466?utm_id=0
https://opencompass.org.cn/ability
(There may be omissions; contributions are welcome.)