SCALER: A Procedurally Generated, Leakage-Resistant Benchmark for Evaluating Multi-Step Reasoning in Large Language Models

Abstract

The evaluation of multi-step reasoning capabilities in Large Language Models (LLMs) faces three fundamental challenges that threaten the validity of current benchmarking paradigms: pervasive benchmark contamination through training data leakage, ceiling effects that eliminate discriminative power on static test sets, and the absence of principled methods to systematically characterize how reasoning accuracy degrades with task complexity. We introduce SCALER (Systematic Complexity Assessment for Language-based Evaluation of Reasoning), a comprehensive theoretical framework and practical system for procedurally generating reasoning tasks with precisely controlled, mathematically grounded difficulty parameters. Our framework leverages established computational libraries, including SymPy for symbolic computation and NetworkX for graph-theoretic operations, to produce problems across six diverse reasoning domains: arithmetic chains, symbolic algebra, graph traversal, logical deduction, constraint satisfaction, and compositional reasoning. Each generated instance comes with a mathematically verified solution and a complete ground-truth reasoning trace derived from the generation process itself. We formalize task difficulty through a multi-dimensional complexity metric $\mathcal{C}(T)$ that decomposes reasoning demands into interpretable components: reasoning depth (chain length), branching factor (search-space size), working memory load (cognitive state requirements), and domain-specific parameters. This metric is calibrated through a novel model-agnostic optimization protocol that ensures objectivity and cross-model validity. Through extensive evaluation of seven frontier LLMs (including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B) on 12,000 unique instances generated on demand, we establish robust empirical scaling relationships between task complexity and model accuracy. Our results reveal several critical findings: (1) reasoning accuracy follows a sigmoidal decay pattern characterized by a critical complexity threshold $\mathcal{C}^*$ beyond which performance collapses; (2) this threshold varies significantly by reasoning domain, with compositional tasks showing the steepest degradation; and (3) there is a substantial "fidelity gap" between outcome accuracy and procedural correctness, indicating that models often reach correct answers through flawed reasoning. We introduce Reasoning Trace Fidelity (RTF), a novel process-based evaluation metric that quantifies alignment between model-generated reasoning chains and ground-truth derivations. Through detailed case studies and failure-mode analysis, we identify specific architectural and algorithmic limitations that manifest at different complexity scales.
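To make the generation-and-evaluation pipeline concrete, the sketch below shows how an arithmetic-chain instance with a controlled reasoning depth might be produced alongside its ground-truth trace, with the final answer verified via SymPy, and how a sigmoidal accuracy curve with a critical threshold could be parameterized. This is a minimal illustration under stated assumptions, not the authors' released SCALER code: the generator, the sigmoid_accuracy form, and all numeric values are hypothetical choices made for exposition.

    # Minimal sketch (assumption, not the SCALER implementation): an
    # arithmetic-chain generator where "reasoning depth" is the number of
    # chained operations, plus one plausible sigmoidal accuracy model.
    import math
    import random

    import sympy

    OPS = ["+", "-", "*"]  # small operator set keeps the branching factor fixed

    def generate_arithmetic_chain(depth: int, seed: int = 0):
        """Generate one arithmetic-chain instance with a ground-truth trace."""
        rng = random.Random(seed)
        expr = str(rng.randint(1, 9))
        value = int(expr)
        trace = []  # step-by-step derivation recorded during generation
        for _ in range(depth):
            op = rng.choice(OPS)
            operand = rng.randint(1, 9)
            expr = f"({expr} {op} {operand})"
            value = eval(f"{value} {op} {operand}")  # running ground truth (small generated expressions only)
            trace.append(f"{expr} = {value}")
        # Independent verification of the final answer with SymPy.
        assert sympy.sympify(expr) == value
        return {"problem": expr, "answer": value, "trace": trace, "depth": depth}

    def sigmoid_accuracy(c: float, c_star: float, k: float) -> float:
        """Hypothetical accuracy model: sigmoidal decay around a critical complexity c_star."""
        return 1.0 / (1.0 + math.exp(k * (c - c_star)))

    if __name__ == "__main__":
        task = generate_arithmetic_chain(depth=5, seed=42)
        print(task["problem"], "=", task["answer"])
        for step in task["trace"]:
            print("  ", step)
        print("predicted accuracy at C=8:", round(sigmoid_accuracy(8, c_star=6.0, k=1.2), 3))

Because the ground-truth trace is a byproduct of generation rather than a post-hoc annotation, a process-level metric such as RTF can compare a model's stated steps against it step by step; the example values above (c_star=6.0, k=1.2) are placeholders, not fitted parameters from the paper.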
