LLMs Misjudge Simple Number Comparison: 9.11 vs 9.9
Recent tests reveal that popular large language models, including GPT‑4o, Gemini Advanced, and Claude 3.5, often claim 9.11 is larger than 9.9. The likely culprit is tokenization, which splits each number into fragments. Rephrasing the question, zero‑shot chain‑of‑thought prompting, or asking the model to treat the values as floating‑point numbers usually corrects the mistake; Chinese models show the same error to varying degrees.
Recent observations show that mainstream large language models (GPT‑4o, Gemini Advanced, Claude 3.5 Sonnet) incorrectly answer that 9.11 is larger than 9.9.
Prompt engineer Riley Goodside surfaced the issue by asking “9.11 and 9.9 – which is bigger?” and found that most models gave the wrong answer.
The error persists across different phrasings, but reordering the prompt (placing the numbers before the question) often restores the correct answer.
Experiments with Chinese models (Kimi, ChatGLM, Tencent Yuanbao, ByteDance Doubao, Wenxin Yiyan) show a mix of failures and successes; Yuanbao and Doubao answer correctly, while others either give wrong conclusions or invoke web search.
Analysis suggests that tokenizers split “9.11” into the tokens “9”, “.”, and “11”; the token “11” carries a higher id than “9”, and the model appears to compare the fragments as integers (11 > 9), mistakenly concluding that 9.11 > 9.9.
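The flawed reading can be sketched in a few lines. This is not an actual tokenizer, just a hypothetical model of the failure mode: comparing the fragments after the decimal point as whole integers, which ignores place value.

```python
def naive_fragment_compare(a: str, b: str) -> str:
    """Compare two decimal strings the way a fragment-level reading might:
    integer part first, then the fractional fragment as a whole integer."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b
    # Flawed step: treats "11" vs "9" as 11 > 9, ignoring place value.
    return a if int(a_frac) > int(b_frac) else b

print(naive_fragment_compare("9.11", "9.9"))  # → 9.11 (the wrong answer)
print(float("9.11") > float("9.9"))           # → False (9.9 is larger)
```

The fragment comparison reproduces the models' mistake, while ordinary numeric comparison gets it right.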
Presenting the values as double‑precision floating‑point numbers or using zero‑shot chain‑of‑thought prompting can guide the model to the right answer.
A recent Reuters report mentions OpenAI’s internal “Strawberry” model scoring over 90 % on the MATH benchmark, but it is unclear whether it can solve the 9.11 vs 9.9 question without extra prompting.
The case illustrates how subtle tokenization and prompt design affect LLM reasoning on seemingly trivial arithmetic.
Java Tech Enthusiast