Integrated Cross-Modal Learning for Interactive Video Conversation
Abstract
Interactive video conversation is a complex multimodal task that requires the simultaneous interpretation of dynamic visual scenes, textual dialogue, and, when available, auditory signals. In recent years, transformer-based language models have driven significant progress and established new performance benchmarks. However, many of these systems rely heavily on textual features and underutilize the rich visual cues present in video data. To address this imbalance, we propose a novel cross-modal framework, CIMT, which integrates 3D convolutional neural networks (3D-CNNs) with transformer-based architectures in a unified visual encoder. The encoder extracts robust semantic representations by learning local temporal features and contextualizing them through self-attention. The resulting visual features are fused with text and audio representations in an end-to-end trained architecture. Experimental results on established interactive video conversation benchmarks show that CIMT significantly outperforms baseline models on both generative and retrieval tasks, highlighting the benefits of integrated visual-textual learning.
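As a rough illustration of the architecture described above, the sketch below shows one way a unified visual encoder (a 3D-CNN stem followed by a self-attention encoder) and a cross-modal fusion step could be wired up in PyTorch. All module names, layer sizes, and the attention-based fusion strategy are illustrative assumptions rather than the authors' actual implementation.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Hypothetical CIMT-style visual branch: 3D-CNN stem + transformer encoder."""

    def __init__(self, in_channels=3, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Shallow 3D-CNN stem that learns local temporal features from raw clips.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                      stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep the temporal axis
        )
        # Transformer encoder that contextualizes the local features via self-attention.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        feats = self.cnn3d(clip)                  # (batch, d_model, T', 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (batch, T', d_model) token sequence
        return self.transformer(feats)            # contextualized visual tokens


class CrossModalFusion(nn.Module):
    """Illustrative fusion: visual tokens attend over concatenated text/audio tokens."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual, text, audio=None):
        # Audio is optional, mirroring the "when available" setting in the abstract.
        context = text if audio is None else torch.cat([text, audio], dim=1)
        fused, _ = self.cross_attn(query=visual, key=context, value=context)
        return self.norm(visual + fused)


if __name__ == "__main__":
    encoder, fusion = VisualEncoder(), CrossModalFusion()
    clip = torch.randn(2, 3, 16, 112, 112)   # two 16-frame RGB clips
    text = torch.randn(2, 20, 512)           # pre-encoded dialogue tokens (assumed)
    audio = torch.randn(2, 30, 512)          # pre-encoded audio tokens (assumed)
    out = fusion(encoder(clip), text, audio)
    print(out.shape)                         # torch.Size([2, 8, 512])
```

In this sketch the fused visual tokens would feed a downstream generative or retrieval head; the 3D convolutions capture short-range motion, while the transformer layers provide the long-range temporal context attributed to the unified encoder.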