CausalReasonBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models


Abstract

Causal reasoning—the ability to identify cause-effect relationships, predict downstream effects, construct counterfactuals, and trace causal chains—is a fundamental requirement for trustworthy AI systems deployed in expert domains. Despite the remarkable performance of large language models (LLMs) on a wide range of natural language understanding tasks, their capacity for rigorous causal reasoning remains poorly understood and insufficiently benchmarked. We introduce CausalReasonBench, a comprehensive benchmark evaluating LLMs on four causal reasoning task types across four real-world domains (physical, social, biological, technological) at three difficulty levels, yielding 384 controlled evaluation instances per model. We propose three automatic metrics: Causal Identification Rate (CIR), Causal Logic Precision (CLP), and Counterfactual Coherence Ratio (CCR), combined into the Composite Assessment of Reasoning (CAR). Extensive experiments with three open LLMs under four prompting strategies reveal that: (i) Socratic prompting outperforms chain-of-thought on counterfactual tasks by +0.09 CAR; (ii) biological and social domains exhibit the largest performance gaps across models, indicating domain-specific knowledge deficiencies; (iii) adversarial scenarios expose systematic over-attribution of causality in all models, with correlation-as-causation error rates exceeding 40%. We provide detailed analyses of sub-question quality, error taxonomies, and domain-specific failure modes. CausalReasonBench is released to support rigorous causal reasoning evaluation and to guide the development of more causally competent language models.