
Large Language Models Lack Formal Reasoning Ability: Five Pieces of Evidence from the GSM‑Symbolic Benchmark

Recent research from Iman Mirzadeh's team at Apple introduces the GSM‑Symbolic benchmark, revealing that large language models, despite high scores on GSM8K, suffer significant performance drops when a problem's numbers, names, or clauses change, indicating a lack of true formal reasoning ability.

Cognitive Technology Team

In recent years, large language models (LLMs) have achieved impressive results on many tasks, prompting the question of whether they truly possess logical reasoning capabilities or merely rely on sophisticated pattern matching. To investigate this, Iman Mirzadeh and his team at Apple proposed a new benchmark called GSM‑Symbolic and used it to evaluate both open‑source models (Llama, Phi, Gemma, Mistral) and closed‑source models (GPT‑4o and the o1 series).
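The core idea of GSM‑Symbolic is to turn a fixed GSM8K question into a symbolic template whose names and numbers can be resampled, so that each model is tested on many surface variants of the same underlying problem. The following sketch illustrates that mechanism; the template text, name pool, and value ranges are hypothetical stand-ins, not taken from the actual benchmark:

```python
import random

# Hypothetical GSM8K-style template; {name}, {x}, {y} are symbolic slots
# that a GSM-Symbolic-style generator fills with fresh values per instance.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

NAMES = ["Sophie", "Liam", "Mia", "Noah"]

def generate_variant(rng: random.Random) -> tuple[str, int]:
    """Sample one surface variant and its ground-truth answer."""
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y)
    return question, x + y  # the answer follows from the template's logic

rng = random.Random(0)
variants = [generate_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

Because every variant shares identical logical structure, any accuracy spread across variants must come from superficial changes alone, which is exactly what the paper measures.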

The evaluation uncovered five key pieces of evidence that LLMs do not exhibit formal reasoning:

1. Unreliable GSM8K accuracy – Model performance on GSM8K fluctuates widely (e.g., Llama 8B varies between 70%‑80%, Phi‑3 between 75%‑90%), showing that high scores on this benchmark do not guarantee genuine reasoning ability.

2. Sensitivity to name and number changes – Altering a proper name or numeric value in a problem can cause accuracy to drop by up to 10%, demonstrating extreme brittleness to superficial variations.

3. Performance degradation with increased difficulty – Removing one clause (GSM‑M1) or adding one or two clauses (GSM‑P1, GSM‑P2) relative to the base GSM‑Symb templates leads to sharp declines in accuracy and higher variance, indicating that models struggle as problem complexity grows.

4. Large impact of irrelevant clauses – Adding a seemingly related but actually irrelevant sentence (the GSM_NoOp experiment) causes all models, including the advanced o1 series, to suffer notable performance drops, further evidencing reliance on pattern matching rather than understanding.

5. Scaling does not solve the core issue – Expanding data, model size, or compute improves scores modestly but merely produces better pattern matchers; it does not endow models with true symbolic or logical inference.
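The GSM_NoOp manipulation from point 4 can be illustrated with a toy case: a clause is appended that mentions a plausible-sounding quantity yet has no bearing on the computation. The sentences and numbers below are illustrative, not drawn from the benchmark itself:

```python
# Toy illustration of a GSM_NoOp-style distractor: the appended clause
# introduces a number that is irrelevant to the answer, so the correct
# result is unchanged -- yet pattern-matching models often fold it in.
base_question = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
                 "How many kiwis does he have?")
noop_clause = "Five of the kiwis he picked are a bit smaller than average."

question_with_noop = base_question + " " + noop_clause

correct_answer = 44 + 58              # 102: the size remark changes nothing
pattern_matched_answer = 44 + 58 - 5  # 97: a typical failure, subtracting
                                      # the irrelevant number anyway
```

A model that genuinely parses the problem's logical structure would ignore the distractor; the paper's finding is that even strong models frequently behave like the second line.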

The authors argue that these findings have important implications for AI safety, education, healthcare, and decision‑making systems that demand reliable reasoning. Developing more robust evaluation methods and moving beyond pattern‑matching approaches are essential for the next generation of LLMs.

Tags: large language models, benchmark, reasoning, AI safety, GSM‑Symbolic, mathematical reasoning
Written by

Cognitive Technology Team

Cognitive Technology Team regularly publishes the latest IT news, original content, programming tutorials, and experience write‑ups.
