Evaluating Logical Reasoning Ability of Large Language Models
Abstract
Large language models (LLMs) such as ChatGPT and DeepSeek have recently made significant progress in natural language processing, demonstrating reasoning ability close to human intelligence. This has sparked considerable research interest, since reasoning is a hallmark of human intelligence that is widely considered to be missing from artificial intelligence systems. Due to the large size of these models, evaluation of LLMs' reasoning ability is largely empirical, and creating datasets for such evaluation is an active research area. A key open question is whether LLMs reason or simply recite memorized text encountered during their training phase. This work conducts simple experiments using Cheryl's Birthday Puzzle and Cheryl's Age Puzzle to investigate whether LLMs recite or reason, and discovers that LLMs tend to recite memorized answers to well-known questions that appear frequently on the internet. Consequently, to accurately evaluate the reasoning ability of LLMs, it is essential to create new datasets that ensure LLMs truly use their reasoning ability to generate responses to the presented questions. In view of this finding, this work proposes a new dataset comprising questions that require semantic and deductive logical reasoning skills to elicit reasoning ability from LLMs. The proposed evaluation framework has several desirable properties, including resilience to training data contamination, ease of response verification, extensibility, usefulness, and automated test case generation. This work applies the proposed dataset to evaluate the reasoning ability of state-of-the-art LLMs, including GPT-3, GPT-4, Llama-3.1, Gemini-1.5, Claude-3.5, and DeepSeek-V3. A significant observation is that most LLMs achieve performance that is largely independent of question complexity, suggesting that they reason more like an algorithm than like human intelligence. In contrast, DeepSeek-V3 most closely resembles human reasoning behaviour among all the tested LLMs. Finally, an algorithm to automatically generate the dataset of logical reasoning questions is presented.