Can AI Models Pass the Chinese Math Gaokao? A Fair, Objective Test
The author conducts a transparent, objective assessment of several large language models on the 2025 Chinese national math exam, converting all questions to LaTeX, applying strict Gaokao scoring rules, and revealing each model's strengths and weaknesses across single‑choice, multiple‑choice, and fill‑in‑the‑blank items.
Recently, many media outlets have reported on AI models taking the Gaokao. I decided to conduct my own fair, objective test of large language models on the 2025 national math paper.
Test Rules
1. Open‑ended (free‑response) questions were excluded, because they cannot be graded objectively against the answer key.
2. All question screenshots were converted to LaTeX text using a LaTeX editor before feeding to the models, ensuring the test measures mathematical reasoning rather than OCR accuracy.
3. Question 6 (the only one with a diagram) was removed to avoid ambiguity.
4. Scoring follows Gaokao rules: single‑choice (7 questions, 5 points each), multiple‑choice (3 questions, 6 points each, full credit only), fill‑in‑the‑blank (3 questions, 5 points each).
5. Each question was run three times per model; the final score is the question's point value weighted by the fraction of correct runs, which averages out occasional hallucinations. Example: OpenAI o3 answered a single‑choice question correctly twice and incorrectly once, earning 5 × 2/3 ≈ 3.3 points.
6. Only reasoning was allowed; prompts, internet access, and code execution were disabled.
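The scoring scheme above can be sketched in a few lines. This is an illustrative reconstruction, not the author's actual tooling; the section breakdown and run-averaging rule come from the rules listed here, while all names are my own:

```python
# Section layout from the test rules: (question count, points each).
SECTIONS = {
    "single_choice": (7, 5),
    "multiple_choice": (3, 6),   # full credit only, no partial marks
    "fill_in_blank": (3, 5),
}

def total_points() -> int:
    """Maximum score across all sections."""
    return sum(count * pts for count, pts in SECTIONS.values())

def question_score(points: int, correct_runs: int, runs: int = 3) -> float:
    """Point value weighted by the fraction of correct runs."""
    return points * correct_runs / runs

print(total_points())                  # 68
print(round(question_score(5, 2), 1)) # o3's example: 2 of 3 runs correct -> 3.3
```

Running every question three times triples the API cost but makes a single lucky or unlucky generation much less likely to decide a model's rank.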
Models Tested
The models evaluated were OpenAI o3, Gemini 2.5 Pro, DeepSeek R1, Doubao 1.5‑thinking‑pro, Yuanbao (HunYuan T1), Qwen‑3, and Xunfei Spark X1, all run in reasoning mode.
Testing was performed overnight (2 am–4 am) by manually copying LaTeX into the APIs.
Results
The test comprised 7 single‑choice, 3 multiple‑choice, and 3 fill‑in‑the‑blank questions, total 68 points.
Gemini 2.5 Pro answered every question correctly in all runs and topped the ranking.
Doubao, Yuanbao, and Spark formed the second tier, missing one option on question 9.
DeepSeek R1 gave a partially correct answer on one multiple‑choice question, losing 0.7 points and ranking fifth.
Qwen‑3 and OpenAI o3 each missed one fill‑in‑the‑blank question, placing at the bottom.
The experiment shows that most current reasoning models can handle the Gaokao math paper with only minor hallucinations, and that conversion to LaTeX eliminates OCR‑related errors.
For example, the set complement notation \(\complement_{U} A\) was misread by OCR as "CuA".
Overall, this fair and objective AI math Gaokao test demonstrates the current capabilities and limits of large language models in mathematical reasoning.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.