DMES: Information-Equivalent Evaluation Reveals the Physical Reasoning Gap Between World Models and Language Models


Abstract

Evaluating physical reasoning across fundamentally different architectures—large language models (LLMs) and vision-based world models—is confounded by the fact that text and image inputs carry naturally non-equivalent information. We propose Dual-Mode Equivalent Stimulus (DMES), an evaluation framework that constructs text-image pairs with rigorously equal information content (H(T) = H(I) = H(S)), eliminating modality confounds from cross-architecture comparisons. Applying DMES to classical mechanics, we create DMES-PC, a 140-sample benchmark spanning 7 physics categories, and evaluate three LLMs (GLM-4.7, Qwen3-Coder-Next, Gemma 4; 26B–355B parameters) against V-JEPA 2 (300M parameters) under three evaluation paradigms. Our results reveal a clear parameter efficiency paradox: V-JEPA 2 achieves both higher prediction accuracy (mean score 0.692, SD = 0.032) and substantially lower variance than all tested LLMs (mean 0.320–0.600, SD = 0.355–0.383). V-JEPA 2 significantly outperforms GLM-4.7 (Cohen’s d = 1.27, p < 0.001) and Gemma 4 (d = 1.40, p < 0.001), while the comparison with the strongest LLM, Qwen3-Coder-Next, does not reach significance (d = −0.34, p_adj = 0.573)—a consequence of that model’s high variance overlapping with V-JEPA 2’s tight score range. A three-paradigm analysis (direct prediction, code simulation, hybrid) further reveals a “knowing vs. simulating” gap: LLMs generate near-perfect physics code when they succeed (score ≈ 1.0), but their direct predictions remain unreliable (score ≈ 0.5), with low internal consistency (0.32–0.69). These findings suggest that scaling LLMs alone is insufficient for physical world understanding—architectures with predictive coding inductive biases are needed.
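The framework's core constraint is Shannon-entropy equality across modalities, H(T) = H(I) = H(S). A minimal sketch of how such an equality check might look over a discrete state distribution is shown below; the function names and the four-configuration example are illustrative assumptions, not the paper's implementation:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def is_information_equivalent(h_text, h_image, h_state, tol=1e-9):
    """Check the DMES constraint H(T) = H(I) = H(S) up to a tolerance."""
    return abs(h_text - h_state) < tol and abs(h_image - h_state) < tol

# Hypothetical example: a physical state S with four equally likely
# configurations carries H(S) = 2 bits; an information-equivalent
# text/image stimulus pair must each encode exactly that many bits.
state_probs = [0.25, 0.25, 0.25, 0.25]
h_state = shannon_entropy(state_probs)
print(h_state)                                            # → 2.0
print(is_information_equivalent(h_state, h_state, h_state))  # → True
```

In this reading, a candidate text-image pair is rejected whenever either modality's encoding adds or removes information relative to the underlying state.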
