Overview of Major Benchmark Datasets for Evaluating Large Language Models
This article provides a comprehensive overview of major benchmark datasets—including CMMLU, MMLU, C‑Eval, GSM8K, Gaokao‑Bench, AGIEval, MATH, BBH, HumanEval, and MBPP—used to evaluate large language models' knowledge, reasoning, and coding abilities, and summarizes related leaderboards and evaluation tools.
Introduction
At the 2023 Cloud Expo, Alibaba Cloud released the trillion-parameter large model Tongyi Qianwen 2.0. According to figures presented at the event, Tongyi Qianwen 2.0 outperformed GPT‑3.5 overall across ten authoritative benchmarks and is closing the gap with GPT‑4. The table below shows its scores on MMLU, C‑Eval, GSM8K, HumanEval, MATH and other benchmarks.
The results show that Tongyi Qianwen 2.0 generally surpasses Meta's Llama‑2‑70B, wins nine of the ten comparisons against OpenAI's GPT‑3.5, and loses four of ten to GPT‑4, indicating a further narrowing of the gap (source: Sina Finance).
What are these benchmark datasets and what aspects do they focus on?
Benchmark Dataset Introduction
CMMLU
CMMLU is a Chinese‑focused benchmark for assessing the knowledge and reasoning of large language models, jointly created by MBZUAI, Shanghai Jiao Tong University and Microsoft Research Asia. It contains 67 subjects spanning the natural sciences, social sciences, engineering and humanities, and is among the most authoritative Chinese‑language evaluations.
Paper: CMMLU: Measuring massive multitask language understanding in Chinese
Data, code and latest leaderboard: https://github.com/haonan-li/CMMLU
MMLU
MMLU (Massive Multitask Language Understanding) was introduced by Hendrycks et al. in the paper “Measuring Massive Multitask Language Understanding”. It evaluates pretrained models in zero‑shot and few‑shot settings across a wide range of subjects.
Website: https://paperswithcode.com/dataset/mmlu
Paper: MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING
Leaderboard: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
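MMLU (like CMMLU and C‑Eval) is scored as multiple‑choice accuracy, typically with a few solved "dev" examples prepended to each test question. A minimal sketch of how such a harness might build the prompt and score predictions; the exact template varies between implementations, and the field names here are illustrative:

```python
# Illustrative MMLU-style evaluation helpers (not any harness's exact code).

def format_question(q: dict) -> str:
    """Render one question with lettered choices, MMLU-style."""
    letters = "ABCD"
    lines = [q["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(q["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples: list, test_q: dict) -> str:
    """Few-shot prompt: solved dev examples followed by the test question."""
    shots = [format_question(ex) + " " + "ABCD"[ex["answer"]] for ex in dev_examples]
    return "\n\n".join(shots + [format_question(test_q)])

def accuracy(preds: list, golds: list) -> float:
    """Fraction of questions where the predicted letter matches the gold letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

The model's continuation after the final "Answer:" (or the highest‑probability letter among A–D) is compared against the gold choice.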
C‑Eval
C‑Eval is a comprehensive Chinese evaluation suite built by Tsinghua University, Shanghai Jiao Tong University and the University of Edinburgh. It covers 52 disciplines, containing 13,948 multiple‑choice questions at four difficulty levels, and is one of the leading Chinese LLM benchmarks.
Paper: C‑Eval: A Multi‑Level Multi‑Discipline Chinese Evaluation Suite for Foundation Models
Website: https://cevalbenchmark.com/
GitHub: https://github.com/hkust-nlp/ceval/
Leaderboard: (link)
GSM8K
GSM8K is an OpenAI‑released benchmark for mathematical reasoning, consisting of 8.5K high‑quality elementary‑school word problems (7.5K training, 1K test). Each problem requires 2–8 reasoning steps using basic arithmetic operations (+, −, ×, ÷).
First released in October 2021, it is one of the two best‑known math‑reasoning benchmarks (the other being MATH) and remains challenging.
Background: Large models such as GPT‑3 exhibit impressive abilities yet struggle with tasks requiring precise multi‑step reasoning, such as elementary math word problems. OpenAI created GSM8K to evaluate and improve this capability.
Paper: Training Verifiers to Solve Math Word Problems
Project: https://github.com/openai/grade-school-math
Blog: https://openai.com/research/solving-math-word-problems
Gaokao‑Bench
Gaokao‑Bench uses Chinese college‑entrance exam questions (2010‑2022) to evaluate language understanding and logical reasoning of LLMs. It contains 1,781 multiple‑choice, 218 fill‑in‑the‑blank and 812 answer‑type questions, with both automated objective scoring and expert‑rated subjective scoring.
Website: https://github.com/OpenLMLab/GAOKAO-Bench
Paper: Evaluating the Performance of Large Language Models on GAOKAO Benchmark
AGIEval
Released by Microsoft in April 2023, AGIEval assesses the general intelligence and problem‑solving abilities of LLMs using 20 official, public, high‑standard admission and qualification exams administered in multiple languages.
Paper: AGIEval: A Human‑Centric Benchmark for Evaluating Foundation Models
Data: https://github.com/microsoft/AGIEval
MATH
MATH is a UC Berkeley benchmark for mathematical problem solving, containing 12,500 high‑school competition problems with detailed step‑by‑step solutions, used to train models to generate reasoning traces. It remains highly challenging for current models.
Project: https://github.com/hendrycks/math
Paper: Measuring Mathematical Problem Solving With the MATH Dataset
BBH
BIG‑bench Hard (BBH) is a subset of BIG‑bench focusing on tasks where current LLMs perform worse than humans, highlighting current limitations.
BIG‑bench is a collaborative benchmark covering 204 tasks across linguistics, child development, math, common‑sense reasoning, biology, physics, social bias, software development and more; with sufficient scale and few‑shot prompting, models reach near‑human performance on roughly 65 % of its tasks.
Paper: Challenging BIG‑Bench Tasks and Whether Chain‑of‑Thought Can Solve Them
GitHub: https://github.com/suzgunmirac/BIG-Bench-Hard
HumanEval
HumanEval measures functional correctness of code generated from docstrings. It consists of 164 programming problems covering language understanding, algorithms and basic math, comparable to simple software interview questions.
Paper: https://arxiv.org/abs/2107.03374
GitHub: https://github.com/openai/human-eval
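The HumanEval paper scores models with pass@k: sample n completions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k random samples passes. Its unbiased estimator, 1 − C(n−c, k)/C(n, k), can be computed with a numerically stable running product:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval/Codex paper:
    n samples per problem, c of which pass the unit tests."""
    if n - c < k:
        return 1.0  # too few failures for k samples to all fail
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Computing the estimator this way avoids the overflow that naive factorials would cause for large n.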
MBPP
MBPP contains about 1,000 crowd‑sourced Python programming problems aimed at beginner programmers, covering basic coding knowledge and standard library usage. Each entry includes a task description, solution code, and three automated test cases, reflecting LLM code understanding and generation abilities.
Paper: Program Synthesis with Large Language Models
GitHub: https://github.com/.../mbpp
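Scoring MBPP amounts to executing a candidate solution against the entry's three assert statements. A bare sketch of that check; real harnesses run untrusted code in a sandbox with timeouts, which is omitted here:

```python
def check_candidate(candidate_code: str, test_cases: list) -> bool:
    """MBPP-style check: execute the candidate, then its assert statements.
    WARNING: exec on untrusted code is unsafe; harnesses sandbox this step."""
    env: dict = {}
    try:
        exec(candidate_code, env)      # define the candidate function
        for assertion in test_cases:
            exec(assertion, env)       # each test is an `assert ...` line
        return True
    except Exception:                  # wrong answer, crash, or syntax error
        return False
```

A problem counts as solved when all of its test cases pass, so the benchmark measures functional correctness rather than textual similarity.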
Appendix
Leaderboards
UC Berkeley‑led LLM Ranking
LMSYS Org, initiated by UC Berkeley researchers, runs the Chatbot Arena, a ranking tournament in which LLMs battle head‑to‑head and are ranked by Elo scores.
Website: https://lmsys.org/projects/
Online demo: https://chat.lmsys.org/
The tournament uses MT‑Bench as the chat‑bot evaluation benchmark.
One of the founders, Ying Sheng, a CS PhD student at Stanford, co‑authored FlexGen, a system that can run 175B‑parameter models on a single GPU and has earned over 8,000 GitHub stars. The other founders are Lianmin Zheng and Hao Zhang.
AlpacaEval
GitHub: https://github.com/tatsu-lab/alpaca_eval
Leaderboard: Alpaca Eval Leaderboard
OpenCompass
Website: https://opencompass.org.cn
Leaderboard: https://opencompass.org.cn/leaderboard-llm
MT‑Bench
MT‑Bench is a carefully designed benchmark containing 80 high‑quality multi‑turn questions (two turns each, so 160 prompts in total) across eight categories: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM) and Knowledge II (Humanities).
The 2023 LMSYS Org leaderboard presents results as a radar chart over the eight MT‑Bench categories (Writing, Roleplay, Extraction, Reasoning, Math, Coding, STEM, Humanities).
MathVista
MathVista, a multimodal math‑reasoning benchmark from Microsoft and academic collaborators, is accompanied by a 112‑page evaluation report and focuses on the mathematical abilities of large multimodal models. It remains challenging even for state‑of‑the‑art models such as GPT‑4V.
Paper: https://arxiv.org/abs/2310.02255
Project: https://mathvista.github.io/
HF dataset: https://huggingface.co/datasets/AI4Math/MathVista
Visualization: https://mathvista.github.io/#visualization
Leaderboard: https://mathvista.github.io/#leaderboard
Survey Paper on LLM Evaluation
Paper: A Survey on Evaluation of Large Language Models
Contributions and additional resources are welcome.
References
https://blog.csdn.net/qq_18846849/article/details/127547883
https://baijiahao.baidu.com/s?id=1782446277193428846&wfr=spider&for=pc
https://zhuanlan.zhihu.com/p/643086466?utm_id=0
https://opencompass.org.cn/ability
(There may be omissions; contributions are welcome.)