Cross-modal Attention and Bidirectional LSTM for Audio-Visual Generalized Zero-shot Learning
Abstract
The goal of audio-visual zero-shot learning (ZSL) is to classify videos from unseen categories using only the visual and audio data of videos from seen categories for training. However, existing audio-visual ZSL methods do not effectively capture accurate temporal information. To address this, we propose a novel approach that leverages the natural alignment between the audio and visual modalities in video by combining cross-modal attention with a bidirectional LSTM (Bi-LSTM) to improve audio-visual generalized zero-shot learning (GZSL). By integrating the Bi-LSTM in parallel with the cross-attention block, we capture the temporal dependencies in both the audio and visual streams. Specifically, during cross-modal processing we add a parallel Bi-LSTM channel and fuse its output with that of the cross-attention channel. This design captures temporal dependencies in audio and video more effectively, improves model robustness, and reduces information loss. Our proposed CAB-LSTM outperforms most state-of-the-art approaches on three popular audio-visual zero-shot learning datasets.
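To make the parallel-channel idea concrete, the following is a minimal PyTorch sketch, not the paper's exact architecture: the module name `CrossAttentionBiLSTMBlock`, the feature dimensions, and the concatenation-plus-linear fusion are all illustrative assumptions. It only shows the general pattern described in the abstract, a cross-modal attention channel and a Bi-LSTM channel running in parallel over the same temporal features, with their outputs fused.

```python
import torch
import torch.nn as nn

class CrossAttentionBiLSTMBlock(nn.Module):
    """Illustrative block: cross-modal attention in parallel with a Bi-LSTM
    channel, followed by fusion of the two channels (not the paper's exact design)."""

    def __init__(self, dim=512, num_heads=8, lstm_hidden=256):
        super().__init__()
        # Cross-modal attention: one modality attends to the other.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Parallel bidirectional LSTM channel for temporal dependencies.
        self.bilstm = nn.LSTM(dim, lstm_hidden, batch_first=True, bidirectional=True)
        # Fuse the two channels back to the embedding dimension (assumed fusion: concat + linear).
        self.fuse = nn.Linear(dim + 2 * lstm_hidden, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        # query_seq, context_seq: (batch, time, dim) features of the two modalities.
        attn_out, _ = self.cross_attn(query_seq, context_seq, context_seq)
        lstm_out, _ = self.bilstm(query_seq)           # (batch, time, 2 * lstm_hidden)
        fused = self.fuse(torch.cat([attn_out, lstm_out], dim=-1))
        return self.norm(fused + query_seq)            # residual connection

# Example: enhance audio features with visual context, and vice versa.
audio = torch.randn(4, 10, 512)   # (batch, time, dim)
video = torch.randn(4, 10, 512)
block = CrossAttentionBiLSTMBlock()
audio_enhanced = block(audio, video)   # audio attends to video
video_enhanced = block(video, audio)   # video attends to audio
```

In practice the two directions would likely use separate parameters and a task-specific fusion scheme; the sketch shares one block only to keep the example short.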