SCALER: A Procedurally Generated, Leakage-Resistant Benchmark for Evaluating Multi-Step Reasoning in Large Language Models

Abstract

The evaluation of multi-step reasoning capabilities in Large Language Models (LLMs) faces three fundamental challenges that threaten the validity of current benchmarking paradigms: pervasive benchmark contamination through training data leakage, ceiling effects that eliminate discriminative power on static test sets, and the absence of principled methods to systematically characterize how reasoning accuracy degrades with task complexity. We introduce SCALER (Systematic Complexity Assessment for Language-based Evaluation of Reasoning), a comprehensive theoretical framework and practical system for procedurally generating reasoning tasks with precisely controlled, mathematically grounded difficulty parameters. Our framework leverages established computational libraries, including SymPy for symbolic computation and NetworkX for graph-theoretic operations, to produce problems across six diverse reasoning domains: arithmetic chains, symbolic algebra, graph traversal, logical deduction, constraint satisfaction, and compositional reasoning. Each generated instance comes with a mathematically verified solution and a complete ground-truth reasoning trace derived from the generation process itself. We formalize task difficulty through a multi-dimensional complexity metric $\mathcal{C}(T)$ that decomposes reasoning demands into interpretable components: reasoning depth (chain length), branching factor (search-space size), working memory load (cognitive state requirements), and domain-specific parameters. This metric is calibrated through a novel model-agnostic optimization protocol that ensures objectivity and cross-model validity. Through extensive evaluation of seven frontier LLMs (including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B) on 12,000 unique instances generated on demand, we establish robust empirical scaling relationships between task complexity and model accuracy. Our results reveal several critical findings: (1) reasoning accuracy follows a sigmoidal decay pattern characterized by a critical complexity threshold $\mathcal{C}^*$ beyond which performance collapses; (2) this threshold varies significantly by reasoning domain, with compositional tasks showing the steepest degradation; and (3) there is a substantial "fidelity gap" between outcome accuracy and procedural correctness, indicating that models often reach correct answers through flawed reasoning. We introduce Reasoning Trace Fidelity (RTF), a novel process-based evaluation metric that quantifies alignment between model-generated reasoning chains and ground-truth derivations. Through detailed case studies and failure-mode analysis, we identify specific architectural and algorithmic limitations that manifest at different complexity scales.
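To make the generation-and-evaluation pipeline concrete, the sketch below shows how an arithmetic-chain instance with a controlled reasoning depth might be produced alongside its ground-truth trace, with the final answer verified via SymPy, and how a sigmoidal accuracy curve with a critical threshold could be parameterized. This is a minimal illustration under stated assumptions, not the authors' released SCALER code: the generator, the sigmoid_accuracy form, and all numeric values are hypothetical choices made for exposition.

    # Minimal sketch (assumption, not the SCALER implementation): an
    # arithmetic-chain generator where "reasoning depth" is the number of
    # chained operations, plus one plausible sigmoidal accuracy model.
    import math
    import random

    import sympy

    OPS = ["+", "-", "*"]  # small operator set keeps the branching factor fixed

    def generate_arithmetic_chain(depth: int, seed: int = 0):
        """Generate one arithmetic-chain instance with a ground-truth trace."""
        rng = random.Random(seed)
        expr = str(rng.randint(1, 9))
        value = int(expr)
        trace = []  # step-by-step derivation recorded during generation
        for _ in range(depth):
            op = rng.choice(OPS)
            operand = rng.randint(1, 9)
            expr = f"({expr} {op} {operand})"
            value = eval(f"{value} {op} {operand}")  # running ground truth (small generated expressions only)
            trace.append(f"{expr} = {value}")
        # Independent verification of the final answer with SymPy.
        assert sympy.sympify(expr) == value
        return {"problem": expr, "answer": value, "trace": trace, "depth": depth}

    def sigmoid_accuracy(c: float, c_star: float, k: float) -> float:
        """Hypothetical accuracy model: sigmoidal decay around a critical complexity c_star."""
        return 1.0 / (1.0 + math.exp(k * (c - c_star)))

    if __name__ == "__main__":
        task = generate_arithmetic_chain(depth=5, seed=42)
        print(task["problem"], "=", task["answer"])
        for step in task["trace"]:
            print("  ", step)
        print("predicted accuracy at C=8:", round(sigmoid_accuracy(8, c_star=6.0, k=1.2), 3))

Because the ground-truth trace is a byproduct of generation rather than a post-hoc annotation, a process-level metric such as RTF can compare a model's stated steps against it step by step; the example values above (c_star=6.0, k=1.2) are placeholders, not fitted parameters from the paper.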
