Wallstreetcn
2024.10.13 03:08

AI reasoning ability "crashes"! Apple's latest paper: LLM is just complex pattern matching, not true logical reasoning

Apple researcher Mehrdad Farajtabar and others published a paper questioning the reasoning ability of Large Language Models (LLMs), arguing that they are only capable of complex pattern matching and lack true logical reasoning. Despite the performance gains of LLMs, Farajtabar believes this does not necessarily indicate an improvement in reasoning ability. The team developed the GSM-Symbolic tool to test the limits of LLMs' mathematical reasoning, and the results show that reported GSM8K accuracy is unreliable, with a model's performance varying widely across equivalent problem variants.

Apple researcher Mehrdad Farajtabar and others recently published a paper sharply questioning the reasoning ability of Large Language Models (LLMs). He argues that the "reasoning" ability of LLMs is actually just complex pattern matching, which does not amount to genuine logical reasoning.

The authors of the paper studied open-source models such as Llama, Phi, Gemma, and Mistral, as well as closed-source models like GPT-4o and the o1 series. It should be noted that in the three years since OpenAI released GSM8K, performance on the benchmark has improved dramatically, from 35% for GPT-3 (175B) to over 85% for current models with only 3 billion parameters, with larger models exceeding 95%. However, Farajtabar argues that this does not prove that the reasoning ability of LLMs has truly improved.

To test the limits of LLMs' mathematical reasoning abilities, Farajtabar and his team developed a new tool called GSM-Symbolic, which can create symbolic templates based on the GSM8K test set to generate a large number of instances and design controlled experiments. They generated 50 unique sets of GSM-Symbolic, which are essentially like GSM8K examples but with different values and names.
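To make the template idea concrete, here is a minimal sketch of how such symbolic instantiation could work. It is not the authors' actual tooling; the template text, name list, and value ranges are illustrative assumptions.

```python
import random

# Hypothetical GSM8K-style symbolic template: the name and the numbers are
# placeholders that get re-sampled for every generated instance.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} gives {z} apples to a friend. How many apples does {name} have left?"
)

def instantiate(seed: int) -> dict:
    """Sample one concrete question/answer pair from the symbolic template."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Mia", "Noah"])
    x, y = rng.randint(5, 30), rng.randint(5, 30)
    z = rng.randint(1, x + y)  # keep the final answer non-negative
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    return {"question": question, "answer": x + y - z}

# Generate many variants of the "same" problem with different names and values,
# analogous to the 50 GSM-Symbolic sets described above.
instances = [instantiate(seed) for seed in range(50)]
print(instances[0]["question"], "->", instances[0]["answer"])
```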

GSM8K stands for "Grade School Math 8K," a dataset used to evaluate mathematical problem-solving ability. It consists of roughly 8,000 grade-school-level math word problems and is commonly used to train and test machine learning models, particularly to measure how natural language processing models handle and solve math problems.
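For readers who want to inspect the benchmark itself, GSM8K is publicly available. A minimal sketch of loading it, assuming the Hugging Face `datasets` library and its public "gsm8k" dataset id:

```python
from datasets import load_dataset

# Load the public GSM8K benchmark (splits: "train" ~7.5k, "test" ~1.3k problems).
gsm8k = load_dataset("gsm8k", "main")

sample = gsm8k["test"][0]
print(sample["question"])
print(sample["answer"])  # step-by-step solution ending in "#### <final number>"
```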

The experimental results are astonishing:

1. Current GSM8K accuracy is not reliable! For a given model, performance varies widely across the GSM-Symbolic instantiations: for example, Llama 8B scores anywhere between 70% and 80%, Phi-3 between 75% and 90%, and so on. For most models, average performance on GSM-Symbolic is lower than average performance on GSM8K.

2. The so-called reasoning ability of LLMs does not hold up! LLMs are highly sensitive to changes in proper names and numbers, indicating that they do not truly understand the underlying mathematical concepts. If we only changed the names in an elementary school math test, would a student's score drop by 10%? Obviously not.

3. Three new variants of GSM-Symbolic are introduced to study model behavior: deleting one clause (GSM-M1), adding one clause (GSM-P1), or adding two clauses (GSM-P2). As question difficulty increases (GSM-M1 → GSM-Symbolic → GSM-P1 → GSM-P2), performance declines and variance grows, indicating that the models' reliability deteriorates.

4. With the introduction of GSM-NoOp, model performance plummets! GSM-NoOp builds on GSM-Symbolic by adding a clause that seems relevant but has no bearing on the reasoning required (a sketch of this kind of perturbation follows this list). All models, including the o1 models, show significant performance declines. This indicates that even the powerful o1 model does not truly understand the logical structure of mathematical problems.

5. Even OpenAI's o1 series models cannot completely avoid these issues. Although o1-preview has some improvements, it still makes some basic errors, such as failing to understand the difference between "now" and "last year," which may be due to the presence of "inflation" patterns in the training data, with the model simply mimicking these patterns.
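To illustrate a GSM-NoOp-style perturbation and the clean-versus-perturbed accuracy comparison described above, here is a rough sketch. The `ask_model` stub and the appended clause are hypothetical placeholders, not the paper's implementation; `instances` refers to the template sketch earlier in this article.

```python
def ask_model(question: str) -> int:
    """Hypothetical stub: query an LLM and parse its final numeric answer."""
    raise NotImplementedError

def add_noop_clause(question: str) -> str:
    # GSM-NoOp-style perturbation: append a statement that sounds relevant
    # but does not change the arithmetic required (illustrative wording).
    return question + " Five of the apples were slightly smaller than the rest."

def accuracy(problems: list[dict], perturb=None) -> float:
    """Fraction of problems the model answers correctly, optionally perturbed."""
    correct = 0
    for p in problems:
        q = perturb(p["question"]) if perturb else p["question"]
        correct += int(ask_model(q) == p["answer"])
    return correct / len(problems)

# Usage, reusing `instances` from the earlier template sketch:
# clean = accuracy(instances)
# noop = accuracy(instances, perturb=add_noop_clause)
# print(f"clean: {clean:.2%}  with NoOp: {noop:.2%}  drop: {clean - noop:.2%}")
```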

Farajtabar believes that the performance of LLMs is better explained as complex pattern matching rather than true logical reasoning. Even with more data, parameters, and compute, or with better training data, the result is only a "better pattern matcher," not a "better reasoner."

Denny Zhou, the head of the LLM reasoning team at Google DeepMind, also joined the discussion. He pointed out: "A key finding of this work is that adding irrelevant context to GSM8K problems causes LLMs to fail to solve them, as demonstrated in our ICML 2023 paper 'Large language models are easily distracted by irrelevant context.' The differences in how the prompts are constructed are still interesting to me."

Yuandong Tian, the Director of Research Science at Meta AI, also expressed his views: "The core issue is: 1️⃣ With our domain knowledge, we can construct weights to enable LLM to reason well on specific problems; 2️⃣ However, gradient descent may not be able to learn such weights; 3️⃣ We still rely on gradient descent because it brings magic to many fields - if it becomes stupid in other fields, we are also helpless."

Conclusion

Overall, the research in this paper found no evidence of genuine reasoning in language models, whether open-source models like Llama, Phi, Gemma, and Mistral or recent leading closed-source models like OpenAI's GPT-4o and o1 series. Their behavior is better explained by complex pattern matching - so fragile that merely changing names shifts results by about 10%! We can scale up data, parameters, and compute - or use better training data for Phi-4, Llama-4, or GPT-5. But this will likely only produce "better pattern matchers," not "better reasoners."

Author: opencat, Source: AI Hanwuji, Original Title: "AI Reasoning Ability Crashes! Apple's Latest Paper: LLM is Just Complex Pattern Matching, Not True Logical Reasoning"