From Computation to Adjudication: Evaluating Large Language Model Judges on Mathematical Reasoning and Precision Calculation

Dominic Yanid
Augustus Davenport
Xavier Carmichael
Nikolai Thompson

Abstract

Recent progress in language models has sparked interest in applications beyond natural language tasks, including domains that require precise mathematical reasoning. Using a language model as a judge for mathematical tasks offers a way to probe both its computational precision and its logical reasoning. This study evaluates GPT-Neo on a series of mathematical reasoning and calculation tasks, ranging from basic arithmetic to complex multi-step problems. The model achieves high accuracy on basic operations, but its performance declines markedly as task complexity increases, particularly on tasks involving abstract reasoning and symbolic manipulation. Analysis of error patterns reveals limitations in the model's processing mechanisms and points to a reliance on learned patterns rather than a deeper understanding of mathematical principles. The findings clarify the capabilities and constraints of language models in mathematical contexts and provide a foundational assessment to inform future artificial intelligence systems designed for complex reasoning tasks.
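
The kind of evaluation the abstract describes, accuracy tracked across tasks of increasing complexity, could be scored roughly as in the sketch below. This is an illustration only, not the authors' method: the tier names, the `SAMPLE_RECORDS` data, and the helpers `answer_matches` and `accuracy_by_tier` are hypothetical, and the toy records are invented for the example and do not reproduce any result from the preprint. Answers are compared exactly with `fractions.Fraction` so that floating-point rounding does not blur a precision check.

```python
from collections import defaultdict
from fractions import Fraction


def answer_matches(model_answer: str, expected: Fraction) -> bool:
    """Return True if the model's answer parses to exactly the expected value.

    Hypothetical helper: answers that do not parse as a number
    (for example free text or a symbolic expression) count as wrong.
    """
    try:
        return Fraction(model_answer.strip()) == expected
    except (ValueError, ZeroDivisionError):
        return False


def accuracy_by_tier(records):
    """Compute accuracy per complexity tier.

    `records` is an iterable of (tier, model_answer, expected) tuples.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tier, model_answer, expected in records:
        totals[tier] += 1
        hits[tier] += answer_matches(model_answer, expected)
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Invented toy data, NOT results from the preprint.
SAMPLE_RECORDS = [
    ("arithmetic", "12", Fraction(12)),
    ("arithmetic", "0.5", Fraction(1, 2)),
    ("multi_step", "7/3", Fraction(7, 3)),
    ("multi_step", "about 2", Fraction(2)),
    ("symbolic", "x+1", Fraction(3)),
]

if __name__ == "__main__":
    for tier, acc in accuracy_by_tier(SAMPLE_RECORDS).items():
        print(f"{tier}: {acc:.2f}")
```

Grouping by tier rather than reporting a single overall score is what would make a decline with complexity visible; a real study would also need a more forgiving answer parser than the exact-match one assumed here.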

Version published to 10.31219/osf.io/8y3km on OSF Preprints
Sep 19, 2024

Related articles

Reasoning in Large Language Models: A Survey
Authors: Yu Fu, Yongqi Kang, Yong Zhao
No evaluations. Latest version: Oct 14, 2025

The role of syntax in numerical and mathematical processing
Authors: Dror Dotan, Noa Handelsman
No evaluations. Latest version: Oct 23, 2025

Towards Interpretable and Consistent Multi-Step Mathematical Reasoning in Large Language Models
Authors: Xinyue Huang, Zeyu Wang, Xin Liu, Yueqi Tian, Qian Leng
No evaluations. Latest version: Oct 8, 2025
