A Naturalistic Embodied Human Multimodal Interaction Dataset: Systematically Annotated Behavioural Visuo-Auditory Cue and Attention Data

Abstract

Understanding naturalistic human behavior in real-world settings requires capturing its inherent complexity, where interactions are shaped by dynamic visual, spatial, and auditory cues. These multimodal signals—such as gaze, hand actions, and speech—encode critical referential and social information that shapes how humans attend to and perceive ongoing events. However, existing datasets often lack the granularity needed to systematically examine the interplay of these naturalistic multimodal cues, particularly in ways that support both behavioral and computational research. To address this, we present an extensive and detailed dataset comprising acted, day-to-day human interactions captured in short event scenarios, totaling 27 scenes across 9 story contexts and featuring over 32 individuals in varied roles. Each scene includes a range of visuo-spatial and auditory cues, present in both controlled (manipulated) and naturally occurring segments. In addition to the scenes, we provide visual attention data directed towards scene elements, collected via eye-tracking from 90 participants. The dataset is meticulously annotated through a combination of manual and semi-automated processes—characterizing scene elements, event structures, frame-by-frame bounding boxes of individuals and their body parts, as well as both low- and high-level visual attention predicates. We position the scenes, events, and attention annotations with a focus on uncovering the complexities of human behavior in real-world settings—particularly with respect to what unfolds during an event and where individuals direct their visual attention—making the dataset well suited for hypothesis testing and for benchmarking naturalistic multimodal human actions and events. By integrating narrative-driven scenarios with rich visuo-spatial and auditory annotations, this dataset bridges the gap between lab-controlled paradigms and ecologically valid settings, offering a versatile resource for both behavioral studies (e.g., initiating joint attention) and computational applications (e.g., validating attention models). Its dual utility extends to domains such as human-robot interaction, cognitive media design, and adaptive virtual agents, where interpreting human interactions is critical to advancing technologies reliant on real-world behavioral semantics.
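
To give a concrete sense of how the frame-by-frame annotations described above might be consumed programmatically, the following is a minimal Python sketch of a loader. The file layout, field names (e.g. frame_index, boxes, attention_predicates), and the use of JSON are assumptions made purely for illustration; the abstract does not specify the release format, so the actual dataset may differ.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    # Pixel-space box for an individual or a body part in one frame.
    label: str      # e.g. "person_1" or "person_1/right_hand" (hypothetical labels)
    x: float
    y: float
    width: float
    height: float


@dataclass
class FrameAnnotation:
    # One annotated frame of a scene: scene elements plus the
    # attention predicates that hold at that moment.
    scene_id: str                     # e.g. "story03_scene02" (hypothetical ID scheme)
    frame_index: int
    boxes: List[BoundingBox]
    attention_predicates: List[str]   # e.g. "attends(viewer_12, person_1)"


def load_frame_annotations(path: str) -> List[FrameAnnotation]:
    """Parse a per-scene JSON export of frame annotations (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)

    frames: List[FrameAnnotation] = []
    for entry in raw["frames"]:
        boxes = [BoundingBox(**box) for box in entry["boxes"]]
        frames.append(FrameAnnotation(
            scene_id=raw["scene_id"],
            frame_index=entry["frame_index"],
            boxes=boxes,
            attention_predicates=entry.get("attention_predicates", []),
        ))
    return frames
```

Under these assumptions, a call such as load_frame_annotations("story03_scene02.json") would yield one FrameAnnotation per frame, which could then be filtered by attention predicate or aligned with the eye-tracking fixation data described above.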
