A Naturalistic Embodied Human Multimodal Interaction Dataset: Systematically Annotated Behavioural Visuo-Auditory Cue and Attention Data

Abstract

Understanding naturalistic human behavior in real-world settings requires capturing its inherent complexity, where interactions are shaped by dynamic visual, spatial, and auditory cues. These multimodal signals—such as gaze, hand actions, and speech—encode critical referential and social information that shapes how humans attend to and perceive ongoing events. However, existing datasets often lack the granularity needed to systematically examine the interplay of these naturalistic multimodal cues, particularly in ways that support both behavioral and computational research. To address this, we present an extensive and detailed dataset comprising acted, day-to-day human interactions captured in short event scenarios, totaling 27 scenes across 9 story contexts and featuring over 32 individuals in varied roles. Each scene includes a range of visuo-spatial and auditory cues, present in both controlled (manipulated) and naturally occurring segments. In addition to the scenes, we provide visual attention data directed towards scene elements, collected via eye-tracking from 90 participants. The dataset is meticulously annotated through a combination of manual and semi-automated processes—characterizing scene elements, event structures, frame-by-frame bounding boxes of individuals and their body parts, as well as both low- and high-level visual attention predicates. We position the scenes, events, and attention annotations with a focus on uncovering the complexities of human behavior in real-world settings—particularly with respect to what unfolds during an event and where individuals direct their visual attention—making the dataset well suited for hypothesis testing and for benchmarking naturalistic multimodal human actions and events. By integrating narrative-driven scenarios with rich visuo-spatial and auditory annotations, this dataset bridges the gap between lab-controlled paradigms and ecologically valid settings, offering a versatile resource for both behavioral studies (e.g., initiating joint attention) and computational applications (e.g., validating attention models). Its dual utility extends to domains such as human-robot interaction, cognitive media design, and adaptive virtual agents, where interpreting human interactions is critical to advancing technologies reliant on real-world behavioral semantics.
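
To give a concrete sense of how the frame-by-frame annotations described above might be consumed programmatically, the following is a minimal Python sketch of a loader. The file layout, field names (e.g. frame_index, boxes, attention_predicates), and the use of JSON are assumptions made purely for illustration; the abstract does not specify the release format, so the actual dataset may differ.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    # Pixel-space box for an individual or a body part in one frame.
    label: str      # e.g. "person_1" or "person_1/right_hand" (hypothetical labels)
    x: float
    y: float
    width: float
    height: float


@dataclass
class FrameAnnotation:
    # One annotated frame of a scene: scene elements plus the
    # attention predicates that hold at that moment.
    scene_id: str                     # e.g. "story03_scene02" (hypothetical ID scheme)
    frame_index: int
    boxes: List[BoundingBox]
    attention_predicates: List[str]   # e.g. "attends(viewer_12, person_1)"


def load_frame_annotations(path: str) -> List[FrameAnnotation]:
    """Parse a per-scene JSON export of frame annotations (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)

    frames: List[FrameAnnotation] = []
    for entry in raw["frames"]:
        boxes = [BoundingBox(**box) for box in entry["boxes"]]
        frames.append(FrameAnnotation(
            scene_id=raw["scene_id"],
            frame_index=entry["frame_index"],
            boxes=boxes,
            attention_predicates=entry.get("attention_predicates", []),
        ))
    return frames
```

Under these assumptions, a call such as load_frame_annotations("story03_scene02.json") would yield one FrameAnnotation per frame, which could then be filtered by attention predicate or aligned with the eye-tracking fixation data described above.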
