Evaluating Logical Reasoning Ability of Large Language Models
Abstract
Large language models (LLMs) such as ChatGPT and DeepSeek have recently made significant progress in natural language processing, demonstrating reasoning ability close to human intelligence. This has sparked considerable research interest, since reasoning is a hallmark of human intelligence that is widely considered to be missing from artificial intelligence systems. Due to the large size of these models, evaluation of LLMs' reasoning ability is largely empirical, and creating datasets for such evaluation is an active research area. A key open question is whether LLMs reason or simply recite memorized text encountered during their training phase. This work conducts simple experiments using Cheryl's Birthday Puzzle and Cheryl's Age Puzzle to investigate whether LLMs recite or reason, and discovers that LLMs tend to recite memorized answers to well-known questions that appear frequently on the internet. Consequently, to accurately evaluate the reasoning ability of LLMs, it is essential to create new datasets that ensure LLMs truly use their reasoning ability to generate responses to the presented questions. In view of this finding, this work proposes a new dataset comprising questions that require semantic and deductive logical reasoning skills to elicit reasoning ability from LLMs. The proposed evaluation framework has several desirable properties, including resilience to training data contamination, ease of response verification, extensibility, usefulness, and automated test case generation. This work applies the proposed dataset to evaluate the reasoning ability of state-of-the-art LLMs, including GPT-3, GPT-4, Llama-3.1, Gemini-1.5, Claude-3.5, and DeepSeek-V3. A significant observation is that most LLMs achieve performance that is largely independent of question complexity, suggesting that they reason more like an algorithm than like human intelligence. In contrast, DeepSeek-V3 most closely resembles human reasoning behaviour among all the tested LLMs. Finally, an algorithm to automatically generate the dataset of logical reasoning questions is presented.