Zero-Shot Action Recognition through Multimodal Learning
Abstract
In this paper, we propose a novel approach to zero-shot action recognition by leveraging the semantic embeddings derived from the Stories dataset. Stories provides detailed textual descriptions of actions in popular video benchmarks such as UCF101 and HMDB51, making it an ideal source for bridging the gap between video action classes and their semantic representations. We introduce a framework that embeds these textual descriptions into a shared semantic space, allowing for the transfer of knowledge across action categories without the need for direct labeled training data. Our method employs a multi-modal learning approach, utilizing both visual features from video frames and semantic embeddings from Stories to perform action recognition. The core idea is to train a model that can match unseen actions with their corresponding semantic embeddings from Stories, enabling accurate recognition of novel action classes. Extensive experiments on the UCF101 and HMDB51 datasets demonstrate that our method achieves competitive performance in zero-shot recognition, outperforming previous state-of-the-art methods. This work opens up new possibilities for scaling action recognition systems to unseen classes by tapping into rich, high-level textual descriptions.
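To make the matching step concrete, the sketch below shows one common way such a shared-space approach can be realized: visual embeddings of videos are compared against per-class text embeddings (here, one per Stories description) by cosine similarity, and each video is assigned to the nearest unseen class. This is a minimal illustration under assumed encoder outputs, not the paper's actual implementation; the encoders, embedding dimension, and the `zero_shot_classify` helper are hypothetical.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_features, class_text_embeddings):
    """Assign each video to the unseen class whose text embedding is closest.

    video_features:        (num_videos, d) visual embeddings from a video encoder
    class_text_embeddings: (num_classes, d) semantic embeddings, one per unseen class
    Returns a (num_videos,) tensor of predicted class indices.
    """
    v = F.normalize(video_features, dim=-1)          # unit-norm visual embeddings
    t = F.normalize(class_text_embeddings, dim=-1)   # unit-norm text embeddings
    similarity = v @ t.T                             # cosine similarity matrix
    return similarity.argmax(dim=-1)                 # nearest class in the shared space

# Toy usage with random tensors standing in for real encoder outputs.
videos = torch.randn(8, 512)     # e.g. features from a video backbone
classes = torch.randn(51, 512)   # e.g. one embedding per HMDB51 action description
predictions = zero_shot_classify(videos, classes)
```

Because classification reduces to nearest-neighbor search in the shared space, new action classes can be added at test time simply by embedding their textual descriptions, without retraining on labeled video examples.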