Modeling dynamic social vision highlights gaps between deep learning and humans

Abstract

Deep learning models trained on computer vision tasks are widely considered the most successful models of human vision to date. The majority of work that supports this idea evaluates how accurately these models predict brain and behavioral responses to static images of objects and natural scenes. Real-world vision, however, is highly dynamic, and far less work has focused on evaluating the accuracy of deep learning models in predicting responses to stimuli that move, and that involve more complicated, higher-order phenomena like social interactions. Here, we present a dataset of natural videos and captions involving complex multi-agent interactions, and we benchmark 350+ image, video, and language models on behavioral and neural responses to the videos. As with prior work, we find that many vision models reach the noise ceiling in predicting visual scene features and responses along the ventral visual stream (often considered the primary neural substrate of object and scene recognition). In contrast, image models poorly predict human action and social interaction ratings and neural responses in the lateral stream (a neural pathway increasingly theorized as specializing in dynamic, social vision). Language models (given human sentence captions of the videos) predict action and social ratings better than either image or video models, but they still perform poorly at predicting neural responses in the lateral stream. Together these results identify a major gap in AI's ability to match human social vision and highlight the importance of studying vision in dynamic, natural contexts.
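As a rough illustration of the kind of evaluation the abstract describes (not the authors' actual pipeline), the sketch below fits a cross-validated ridge-regression encoding model from model features to simulated neural responses and compares prediction accuracy to a split-half noise ceiling. All sizes, variable names, and the synthetic data are assumptions for illustration only.

```python
# Minimal sketch, assuming the standard encoding-model setup: predict responses
# from model features with cross-validated ridge regression, then normalize
# prediction accuracy by a split-half noise ceiling. Synthetic data only.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_videos, n_features, n_voxels = 200, 512, 100   # illustrative sizes

features = rng.standard_normal((n_videos, n_features))          # model activations per video
true_weights = rng.standard_normal((n_features, n_voxels)) / np.sqrt(n_features)
signal = features @ true_weights                                 # simulated "true" tuning
responses_a = signal + rng.standard_normal((n_videos, n_voxels)) # measurement split A
responses_b = signal + rng.standard_normal((n_videos, n_voxels)) # measurement split B
responses = (responses_a + responses_b) / 2                      # mean response across splits

def column_corr(x, y):
    """Pearson correlation computed independently for each column (voxel)."""
    xz = (x - x.mean(0)) / x.std(0)
    yz = (y - y.mean(0)) / y.std(0)
    return (xz * yz).mean(0)

# Noise ceiling: split-half reliability, Spearman-Brown corrected to the full data.
split_r = column_corr(responses_a, responses_b)
noise_ceiling = 2 * split_r / (1 + split_r)

# Cross-validated encoding model: ridge regression from features to each voxel.
pred = np.zeros_like(responses)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
    model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(features[train], responses[train])
    pred[test] = model.predict(features[test])

score = column_corr(pred, responses)
print("median prediction r:", np.median(score))
print("median noise-ceiling-normalized r:", np.median(score / noise_ceiling))
```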
