Exploration of Stability Judgments: From Multimodal LLMs to Human Insights

Abstract

This study extends our previous investigation into whether multimodal large language models (MLLMs) can perform physical reasoning, using a game environment as the testbed. Stability judgments served as the foundational scenario for probing this capability. We evaluated twelve models, combining those from the earlier study with six additional open-weight models, and compared them with human participants across three tasks that capture different aspects of reasoning. Humans consistently achieved the highest accuracy, underscoring the gap between model and human performance. Among MLLMs, the GPT series continued to perform strongly, with GPT-4o showing reliable results in image-based tasks, while the Qwen2.5VL series reached the highest overall scores in this extended study and in some cases surpassed its commercial counterparts. Simpler binary tasks yielded balanced performance across modalities, suggesting that models can capture certain basic aspects of physical reasoning, whereas more complex multiple-choice tasks led to sharp declines in accuracy. Structured inputs such as XML improved results in the prediction task, where Qwen2.5VL outperformed the GPT variants from our earlier work. These findings demonstrate progress in scaling and modality design for physical reasoning, while reaffirming that human participants remain superior across all tasks.