CausalReasonBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models


Abstract

Causal reasoning—the ability to identify cause-effect relationships, predict downstream effects, construct counterfactuals, and trace causal chains—is a fundamental requirement for trustworthy AI systems deployed in expert domains. Despite the remarkable performance of large language models (LLMs) on a wide range of natural language understanding tasks, their capacity for rigorous causal reasoning remains poorly understood and insufficiently benchmarked. We introduce CausalReasonBench, a comprehensive benchmark evaluating LLMs on four causal reasoning task types across four real-world domains (physical, social, biological, technological) at three difficulty levels, yielding 384 controlled evaluation instances per model. We propose three automatic metrics: Causal Identification Rate (CIR), Causal Logic Precision (CLP), and Counterfactual Coherence Ratio (CCR), combined into the Composite Assessment of Reasoning (CAR). Extensive experiments with three open LLMs under four prompting strategies reveal that: (i) Socratic prompting outperforms chain-of-thought on counterfactual tasks by +0.09 CAR; (ii) biological and social domains exhibit the largest performance gaps across models, indicating domain-specific knowledge deficiencies; (iii) adversarial scenarios expose systematic over-attribution of causality in all models, with correlation-as-causation error rates exceeding 40%. We provide detailed analyses of sub-question quality, error taxonomies, and domain-specific failure modes. CausalReasonBench is released to support rigorous causal reasoning evaluation and to guide the development of more causally competent language models.