Swin-CATPN: A Context-Enhanced Temporal Pyramid Network with Swin Transformer for Action Recognition
Abstract
Action detection aims to identify the category and the temporal boundaries (start and end times) of each action in a video sequence, and has wide applications in areas such as human-computer interaction. With the advancement of deep learning, the accuracy of action detection has improved significantly. Challenges remain, however: consecutive actions are difficult to segment precisely, and ambiguous boundaries between subjects and background can lead to inaccurate predictions. In this work, we propose a novel Transformer-based action detection framework, termed Swin-CATPN, which integrates a Contextual Attention Memory (CAM) module and a Deformable Convolution Network (DCN) to improve both detection precision and robustness. Extensive experiments on three benchmark datasets (Kinetics-400, Something-Something V1/V2, and EPIC-Kitchens) demonstrate that our model consistently outperforms existing state-of-the-art approaches in Top-1 accuracy.
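The abstract only names the framework's components (a Swin Transformer backbone, a CAM module, and a DCN head) without describing their internals. As a rough illustration of how such pieces could be wired together, the following PyTorch sketch composes a placeholder backbone with an assumed attention-over-memory CAM and a deformable-convolution classification head; all module names, shapes, and hyper-parameters here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Swin-CATPN-style composition.
# ContextualAttentionMemory, DeformableHead, and every hyper-parameter
# below are assumptions; the abstract does not specify the internals.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class ContextualAttentionMemory(nn.Module):
    """Assumed CAM: clip tokens attend over a small learned memory bank."""
    def __init__(self, dim: int, memory_slots: int = 16, heads: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) temporal tokens from the backbone
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        ctx, _ = self.attn(query=x, key=mem, value=mem)
        return self.norm(x + ctx)


class DeformableHead(nn.Module):
    """Assumed DCN head: deformable convolution over the temporal feature map."""
    def __init__(self, dim: int, num_classes: int, k: int = 3):
        super().__init__()
        # Offsets are predicted by a plain conv and fed to DeformConv2d.
        self.offset = nn.Conv2d(dim, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(dim, dim, kernel_size=k, padding=k // 2)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) -> treat time as a 1xT "spatial" map for the 2-D deform conv
        f = x.transpose(1, 2).unsqueeze(2)      # (B, C, 1, T)
        f = self.dcn(f, self.offset(f))         # deformable temporal convolution
        f = f.squeeze(2).mean(dim=-1)           # global temporal pooling -> (B, C)
        return self.cls(f)


class SwinCATPN(nn.Module):
    """Assumed top-level wiring: backbone -> CAM -> deformable head."""
    def __init__(self, backbone: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                # e.g. a (video) Swin Transformer
        self.cam = ContextualAttentionMemory(dim)
        self.head = DeformableHead(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(clip)            # expected shape (B, T, C)
        return self.head(self.cam(tokens))
```

In this sketch the backbone is kept abstract (any module returning per-frame tokens of shape (B, T, C) would do); the actual temporal-pyramid structure and boundary-localization logic implied by the paper's title are not reproduced here.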