Can AI Models Pass the Chinese Math Gaokao? A Fair, Objective Test
The author conducts a transparent, objective assessment of several large language models on the 2025 Chinese national math exam, converting all questions to LaTeX, applying strict Gaokao scoring rules, and revealing each model's strengths and weaknesses across single‑choice, multiple‑choice, and fill‑in‑the‑blank items.
Recently, many media outlets have reported on AI models taking the Gaokao. I decided to conduct my own fair, objective test of large language models on the 2025 national math paper.
Test Rules
1. Open‑ended (free‑response) questions were excluded, because they cannot be graded objectively against the answer key.
2. All question screenshots were converted to LaTeX text using a LaTeX editor before feeding to the models, ensuring the test measures mathematical reasoning rather than OCR accuracy.
3. Question 6 (the only one with a diagram) was removed to avoid ambiguity.
4. Scoring follows Gaokao rules: single‑choice (7 questions, 5 points each), multiple‑choice (3 questions, 6 points each, full credit only), fill‑in‑the‑blank (3 questions, 5 points each).
5. Each question was run three times per model; the final score is the question's point value weighted by the fraction of correct runs, which averages out occasional hallucinations. Example: OpenAI o3 answered a single‑choice question correctly twice and incorrectly once, earning 5 × 2/3 ≈ 3.3 points.
6. Only reasoning was allowed; prompts, internet access, and code execution were disabled.
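The scoring scheme above can be sketched in a few lines. This is an illustrative reconstruction, not the author's actual tooling; the section breakdown and run-averaging rule come from the rules listed here, while all names are my own:

```python
# Section layout from the test rules: (question count, points each).
SECTIONS = {
    "single_choice": (7, 5),
    "multiple_choice": (3, 6),   # full credit only, no partial marks
    "fill_in_blank": (3, 5),
}

def total_points() -> int:
    """Maximum score across all sections."""
    return sum(count * pts for count, pts in SECTIONS.values())

def question_score(points: int, correct_runs: int, runs: int = 3) -> float:
    """Point value weighted by the fraction of correct runs."""
    return points * correct_runs / runs

print(total_points())                  # 68
print(round(question_score(5, 2), 1)) # o3's example: 2 of 3 runs correct -> 3.3
```

Running every question three times triples the API cost but makes a single lucky or unlucky generation much less likely to decide a model's rank.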
Models Tested
The models evaluated were OpenAI o3, Gemini 2.5 Pro, DeepSeek R1, Doubao 1.5‑thinking‑pro, Yuanbao (HunYuan T1), Qwen‑3, and Xunfei Spark X1, all run in reasoning mode.
Testing was performed overnight (2 am–4 am) by manually copying LaTeX into the APIs.
Results
The test comprised 7 single‑choice, 3 multiple‑choice, and 3 fill‑in‑the‑blank questions, total 68 points.
Gemini 2.5 Pro answered every question correctly in all runs and topped the ranking.
Doubao, Yuanbao, and Spark formed the second tier, missing one option on question 9.
DeepSeek R1 gave a partially correct answer on one multiple‑choice question, losing 0.7 points and ranking fifth.
Qwen‑3 and OpenAI o3 each missed one fill‑in‑the‑blank question, placing at the bottom.
The experiment shows that most current reasoning models can handle the Gaokao math paper with only minor hallucinations, and that conversion to LaTeX eliminates OCR‑related errors.
For example, the set complement notation \(\complement_{U} A\) was misread by OCR as "CuA".
Overall, this fair and objective AI math Gaokao test demonstrates the current capabilities and limits of large language models in mathematical reasoning.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.