Assessing Reasoning Capabilities of Commercial LLMs: A Comparative Study of Inductive and Deductive Tasks

Abstract

Artificial intelligence has revolutionized many fields through its ability to process and generate human-like text, driving significant advances in tasks that require language comprehension and generation. However, evaluating the fundamental reasoning abilities of commercial large language models (LLMs), specifically inductive and deductive reasoning, remains crucial for understanding their cognitive capabilities and limitations. This research provides a comprehensive assessment of ChatGPT, Gemini, and Claude, using a carefully designed set of reasoning tasks to evaluate their performance. The methodology involved selecting diverse datasets, designing complex reasoning tasks, and implementing a robust automated testing framework. Statistical analyses, including ANOVA and regression techniques, were employed to rigorously compare the models' performance across tasks. Results indicated that ChatGPT consistently outperformed the other models, particularly excelling in tasks requiring high precision and recall, while Gemini and Claude exhibited greater variability in their reasoning capabilities. The study highlights the strengths and weaknesses of each model, offering insights into their relative performance and potential areas for improvement. The implications for AI development are significant, emphasizing the need for tailored model designs and continued innovation in training techniques to enhance reasoning abilities. This research contributes to a broader understanding of AI reasoning and provides a foundation for developing more capable and reliable intelligent systems.
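
As an illustration of the kind of statistical comparison the abstract describes, the sketch below runs a one-way ANOVA over per-task accuracy scores for the three models. It is not the authors' framework: the model names come from the study, but the score distributions, group sizes, and random seed are placeholder assumptions used only to show the analysis pattern.

    # Minimal sketch of a one-way ANOVA comparing per-task accuracy across
    # three models. The score arrays are synthetic placeholders, not the
    # results reported in the paper.
    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(0)

    # Hypothetical per-task accuracy scores (one value per reasoning task).
    scores = {
        "ChatGPT": rng.normal(loc=0.85, scale=0.05, size=30),
        "Gemini":  rng.normal(loc=0.80, scale=0.07, size=30),
        "Claude":  rng.normal(loc=0.78, scale=0.06, size=30),
    }

    # One-way ANOVA: does mean accuracy differ significantly between models?
    f_stat, p_value = f_oneway(*scores.values())
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

A significant p-value here would only indicate that at least one model's mean accuracy differs; pairwise post-hoc tests or the regression analyses mentioned in the abstract would be needed to attribute the difference to a specific model.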
