LLMs Misjudge Simple Number Comparison: 9.11 vs 9.9
Recent tests reveal that popular large language models, including GPT‑4o, Gemini Advanced, and Claude 3.5, often claim 9.11 is larger than 9.9. The likely culprit is tokenization, which splits each number into fragments. Rephrasing the question, zero‑shot chain‑of‑thought prompting, or asking the model to treat the values as floating‑point numbers usually corrects the mistake; Chinese models show the same error to varying degrees.
Recent observations show that mainstream large language models (GPT‑4o, Gemini Advanced, Claude 3.5 Sonnet) incorrectly answer that 9.11 is larger than 9.9.
Prompt engineer Riley Goodside surfaced the issue by asking “9.11 and 9.9 – which is bigger?” and found that most models gave the wrong answer.
The error persists across different phrasings, but reordering the prompt (placing the numbers before the question) often restores the correct answer.
Experiments with Chinese models (Kimi, ChatGLM, Tencent Yuanbao, ByteDance Doubao, Wenxin Yiyan) show a mix of failures and successes; Yuanbao and Doubao answer correctly, while others either give wrong conclusions or invoke web search.
Analysis suggests that tokenizers split “9.11” into the tokens “9”, “.”, and “11”; the token “11” carries a higher id than “9”, and the model appears to compare the fragments as integers (11 > 9), mistakenly concluding that 9.11 > 9.9.
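The flawed reading can be sketched in a few lines. This is not an actual tokenizer, just a hypothetical model of the failure mode: comparing the fragments after the decimal point as whole integers, which ignores place value.

```python
def naive_fragment_compare(a: str, b: str) -> str:
    """Compare two decimal strings the way a fragment-level reading might:
    integer part first, then the fractional fragment as a whole integer."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b
    # Flawed step: treats "11" vs "9" as 11 > 9, ignoring place value.
    return a if int(a_frac) > int(b_frac) else b

print(naive_fragment_compare("9.11", "9.9"))  # → 9.11 (the wrong answer)
print(float("9.11") > float("9.9"))           # → False (9.9 is larger)
```

The fragment comparison reproduces the models' mistake, while ordinary numeric comparison gets it right.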
Presenting the values as double‑precision floating‑point numbers or using zero‑shot chain‑of‑thought prompting can guide the model to the right answer.
A recent Reuters report mentions OpenAI’s internal “Strawberry” model scoring over 90 % on the MATH benchmark, but it is unclear whether it can solve the 9.11 vs 9.9 question without extra prompting.
The case illustrates how subtle tokenization and prompt design affect LLM reasoning on seemingly trivial arithmetic.
Java Tech Enthusiast