Can Vision Language Models Follow Human Gaze?

Abstract

Gaze understanding is suggested as a precursor to inferring intentions and engaging in joint attention, core capacities for a theory of mind, social learning, and language acquisition. As Vision Language Models (VLMs) become increasingly promising in interactive applications, assessing whether they master this foundational socio-cognitive skill becomes vital. Rather than creating a benchmark, we aim to probe the behavioral features of the underlying gaze understanding. We curated a set of images with systematically controlled difficulty and variability, evaluated 111 VLMs on their ability to infer gaze referents, and analyzed their performance using mixed-effects models. Only 20 VLMs performed above chance, and even their overall accuracy remained low. We further analyzed 4 of these top-tier VLMs and found that their performance declined with increasing task difficulty but varied only slightly with the specific prompt and gazer. While their gaze understanding remains far from mature, these patterns suggest that their inferences differ markedly from mere stochastic parroting. This early progress highlights the need for mechanistic investigations of the underlying emergent inference.
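The abstract reports a mixed-effects analysis of trial-level accuracy. The snippet below is a minimal, hypothetical sketch of what such an analysis might look like in Python with statsmodels; the file name, column names (correct, difficulty, prompt, gazer, model), and model structure are assumptions for illustration, not details taken from the paper.

# Illustrative sketch only: relating trial-level VLM gaze-inference accuracy
# to task difficulty, prompt, and gazer, with a random intercept per VLM.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level results: one row per (VLM, image, prompt) trial.
df = pd.read_csv("gaze_trials.csv")  # assumed file; columns are placeholders

# mixedlm fits a linear mixed-effects model; for binary accuracy data a
# binomial GLMM (e.g., lme4's glmer in R) would be the more standard choice.
model = smf.mixedlm("correct ~ difficulty + prompt + gazer",
                    data=df, groups=df["model"])
result = model.fit()
print(result.summary())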

Article activity feed