Exploration of Stability Judgments: Assessing Multimodal LLMs in Game-Inspired Physical Reasoning Tasks

Abstract

This study extends our previous investigation into whether multimodal large language models (MLLMs) can perform physical reasoning, using a game environment as the testbed. Stability served as a foundational scenario for probing model understanding of physical principles. We evaluated twelve models, combining those from the earlier study with six additional open-weight models, across three tasks designed to capture different aspects of reasoning. Human participants were included as a reference point and consistently achieved the highest accuracy, underscoring the gap between model and human performance. Among MLLMs, the GPT series continued to perform strongly, with GPT-4o showing reliable results in image-based tasks, while the Qwen2.5-VL series reached the highest overall scores in this extended study and in some cases surpassed its commercial counterparts. Simpler binary tasks yielded balanced performance across modalities, suggesting that models can capture certain basic aspects of stability reasoning, whereas more complex multiple-choice tasks led to sharp declines in accuracy. Structured inputs such as XML improved results in the prediction task, where Qwen2.5-VL outperformed the GPT variants evaluated in our earlier work. These findings demonstrate progress in scaling and modality design for physical reasoning, while reaffirming that human participants remain superior across all tasks.