Swin-CATPN: A Context-Enhanced Temporal Pyramid Network with Swin Transformer for Action Recognition
Abstract
Action detection aims to identify the category and the temporal boundaries (start and end times) of each action in a video sequence, and has wide applications in areas such as human-computer interaction. With the advancement of deep learning, the accuracy of action detection has improved significantly. Challenges remain, however: consecutive actions are difficult to segment precisely, and ambiguous boundaries between subjects and background can lead to inaccurate predictions. In this work, we propose a novel Transformer-based action detection framework, termed Swin-CATPN, which integrates a Contextual Attention Memory (CAM) module and a Deformable Convolution Network (DCN) to improve both detection precision and robustness. Extensive experiments on three benchmark datasets (Kinetics-400, Something-Something V1/V2, and EPIC-Kitchens) demonstrate that our model consistently outperforms existing state-of-the-art approaches in Top-1 accuracy.
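The abstract only names the framework's components (a Swin Transformer backbone, a CAM module, and a DCN head) without describing their internals. As a rough illustration of how such pieces could be wired together, the following PyTorch sketch composes a placeholder backbone with an assumed attention-over-memory CAM and a deformable-convolution classification head; all module names, shapes, and hyper-parameters here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Swin-CATPN-style composition.
# ContextualAttentionMemory, DeformableHead, and every hyper-parameter
# below are assumptions; the abstract does not specify the internals.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class ContextualAttentionMemory(nn.Module):
    """Assumed CAM: clip tokens attend over a small learned memory bank."""
    def __init__(self, dim: int, memory_slots: int = 16, heads: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) temporal tokens from the backbone
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        ctx, _ = self.attn(query=x, key=mem, value=mem)
        return self.norm(x + ctx)


class DeformableHead(nn.Module):
    """Assumed DCN head: deformable convolution over the temporal feature map."""
    def __init__(self, dim: int, num_classes: int, k: int = 3):
        super().__init__()
        # Offsets are predicted by a plain conv and fed to DeformConv2d.
        self.offset = nn.Conv2d(dim, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(dim, dim, kernel_size=k, padding=k // 2)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) -> treat time as a 1xT "spatial" map for the 2-D deform conv
        f = x.transpose(1, 2).unsqueeze(2)      # (B, C, 1, T)
        f = self.dcn(f, self.offset(f))         # deformable temporal convolution
        f = f.squeeze(2).mean(dim=-1)           # global temporal pooling -> (B, C)
        return self.cls(f)


class SwinCATPN(nn.Module):
    """Assumed top-level wiring: backbone -> CAM -> deformable head."""
    def __init__(self, backbone: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                # e.g. a (video) Swin Transformer
        self.cam = ContextualAttentionMemory(dim)
        self.head = DeformableHead(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(clip)            # expected shape (B, T, C)
        return self.head(self.cam(tokens))
```

In this sketch the backbone is kept abstract (any module returning per-frame tokens of shape (B, T, C) would do); the actual temporal-pyramid structure and boundary-localization logic implied by the paper's title are not reproduced here.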