Reassessing Multimodal Pathways for Learning Action Meaning
Abstract
The semantic interpretation of actions is deeply intertwined with how change unfolds over time, space, and interaction. Prior theoretical and computational work has suggested that explicitly modeling three-dimensional motion, including object positions and orientations evolving through time, should offer a privileged pathway for encoding fine-grained verb meaning, especially for distinctions such as "roll" versus "slide". At the same time, the vast majority of multimodal language models rely almost exclusively on two-dimensional visual inputs, implicitly assuming that such projections suffice to ground linguistic meaning. In this work, we revisit this assumption through a systematic, tightly controlled comparison of visual and motion-based modalities. We train self-supervised encoders on both 2D video observations and 3D trajectory data, and probe the resulting representations for their capacity to discriminate verb-level semantic categories. Contrary to prevailing intuition, our empirical analysis shows that representations learned from 2D visual streams are competitive with, and in some cases indistinguishable from, those derived from explicit 3D trajectories. These findings complicate the widely held belief that richer environmental encodings automatically yield superior semantic representations, and suggest that the relationship between perceptual fidelity and linguistic abstraction is more nuanced than often assumed. Our study offers early evidence that effective verb representations can emerge from multiple perceptual pathways, motivating a rethinking of how embodiment and modality interact in multimodal language learning.
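As an illustration of the probing setup described in the abstract, the following is a minimal sketch, not the authors' implementation, of how frozen encoder features from the two modalities could be compared with a linear probe on verb labels. The feature arrays, dimensionality, verb categories, and dataset size are hypothetical placeholders; only the general pattern (frozen features, linear classifier, cross-validated accuracy) is intended.

```python
# Minimal linear-probe sketch: compare frozen encoder features
# (2D video vs. 3D trajectory) on verb-category discrimination.
# All arrays, sizes, and verb labels below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

N, D = 600, 256                       # number of clips, feature dimensionality (assumed)
verbs = ["roll", "slide", "bounce"]   # example verb categories
y = rng.integers(len(verbs), size=N)  # verb-class label for each clip

# Stand-ins for frozen, pre-extracted representations of the same clips.
# In practice these would come from the 2D video encoder and the 3D trajectory encoder.
feats_2d = rng.normal(size=(N, D))
feats_3d = rng.normal(size=(N, D))

for name, X in [("2D video features", feats_2d), ("3D trajectory features", feats_3d)]:
    probe = LogisticRegression(max_iter=1000)        # linear probe on frozen features
    acc = cross_val_score(probe, X, y, cv=5).mean()  # cross-validated verb accuracy
    print(f"{name}: verb-probe accuracy = {acc:.3f}")
```

Under this kind of setup, comparable probe accuracies for the two feature sets would correspond to the paper's reported finding that 2D visual representations are competitive with explicit 3D trajectory representations.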