Hierarchical Cross-lingual Representation Learning for Diverse Video Contexts
Abstract
This study investigates self-supervised audio-visual models that learn from multilingual instructional videos. Prior research demonstrated that such models can associate spoken language and auditory cues with visual content when trained on large English-language video datasets, but their application has remained largely restricted to English. To address this limitation, we introduce a hierarchical framework, the Cross-Lingual Audio-Visual Learning Framework (CLAVLearn), which leverages models pre-trained on English as a starting point for adaptation to other languages such as Japanese and Hindi. With this approach, we observe a tenfold improvement in video-to-audio and audio-to-video retrieval accuracy on Japanese instructional videos compared to language-specific training alone. We further extend the framework to multilingual spoken captions for images, achieving state-of-the-art results on Japanese and Hindi datasets. These findings underscore the potential of pre-trained English audio-visual models to drive multilingual innovation, and our comprehensive experiments provide insight into the scalability of self-supervised multilingual audio-visual learning.
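The retrieval accuracy discussed above can be illustrated with a minimal sketch (the function name, embedding dimensions, and toy data here are our own illustration, not the paper's implementation): given audio and video embeddings in a shared space, recall@k measures how often each audio clip's true matching video appears among its top-k nearest neighbors by cosine similarity.

```python
import numpy as np

def recall_at_k(audio_emb, video_emb, k=1):
    """Fraction of audio clips whose paired video (same row index)
    ranks in the top-k videos by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = a @ v.T                     # (n_audio, n_video) similarity matrix
    ranks = np.argsort(-sims, axis=1)  # best-matching video first
    hits = [i in ranks[i, :k] for i in range(len(a))]
    return float(np.mean(hits))

# Toy example: paired audio/video embeddings are small perturbations
# of a shared vector, so cross-modal retrieval should succeed
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 16))
audio = base + 0.01 * rng.normal(size=(5, 16))
video = base + 0.01 * rng.normal(size=(5, 16))
print(recall_at_k(audio, video, k=1))
```

In practice the two embedding matrices would come from the audio and visual branches of the trained model; the metric itself is model-agnostic, which is what makes it comparable across the English-pretrained and language-adapted settings.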