DMES: Information-Equivalent Evaluation Reveals the Physical Reasoning Gap Between World Models and Language Models


Abstract

Evaluating physical reasoning across fundamentally different architectures—large language models (LLMs) and vision-based world models—is confounded by the fact that text and image inputs carry naturally non-equivalent information. We propose Dual-Mode Equivalent Stimulus (DMES), an evaluation framework that constructs text-image pairs with rigorously equal information content (H(T) = H(I) = H(S)), eliminating modality confounds from cross-architecture comparisons. Applying DMES to classical mechanics, we create DMES-PC, a 140-sample benchmark spanning 7 physics categories, and evaluate three LLMs (GLM-4.7, Qwen3-Coder-Next, Gemma 4; 26B–355B parameters) against V-JEPA 2 (300M parameters) under three evaluation paradigms. Our results reveal a clear parameter efficiency paradox: V-JEPA 2 achieves both higher prediction accuracy (mean score 0.692, SD = 0.032) and substantially lower variance than all tested LLMs (mean 0.320–0.600, SD = 0.355–0.383). V-JEPA 2 significantly outperforms GLM-4.7 (Cohen’s d = 1.27, p < 0.001) and Gemma 4 (d = 1.40, p < 0.001), while the comparison with the strongest LLM, Qwen3-Coder-Next, does not reach significance (d = −0.34, p_adj = 0.573)—a consequence of that model’s high variance overlapping with V-JEPA 2’s tight score range. A three-paradigm analysis (direct prediction, code simulation, hybrid) further reveals a “knowing vs. simulating” gap: LLMs generate near-perfect physics code when they succeed (score ≈ 1.0), but their direct predictions remain unreliable (score ≈ 0.5), with low internal consistency (0.32–0.69). These findings suggest that scaling LLMs alone is insufficient for physical world understanding—architectures with predictive coding inductive biases are needed.
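The framework's core constraint is Shannon-entropy equality across modalities, H(T) = H(I) = H(S). A minimal sketch of how such an equality check might look over a discrete state distribution is shown below; the function names and the four-configuration example are illustrative assumptions, not the paper's implementation:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def is_information_equivalent(h_text, h_image, h_state, tol=1e-9):
    """Check the DMES constraint H(T) = H(I) = H(S) up to a tolerance."""
    return abs(h_text - h_state) < tol and abs(h_image - h_state) < tol

# Hypothetical example: a physical state S with four equally likely
# configurations carries H(S) = 2 bits; an information-equivalent
# text/image stimulus pair must each encode exactly that many bits.
state_probs = [0.25, 0.25, 0.25, 0.25]
h_state = shannon_entropy(state_probs)
print(h_state)                                            # → 2.0
print(is_information_equivalent(h_state, h_state, h_state))  # → True
```

In this reading, a candidate text-image pair is rejected whenever either modality's encoding adds or removes information relative to the underlying state.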
