Enhancing Action Recognition via Dynamic Cross-Frame Differential Modeling

Abstract

In action recognition, the dynamic changes of key human body parts across consecutive frames encapsulate the core semantic information of actions. Traditional approaches often prioritize single-frame static features or perform only simplistic temporal modeling, overlooking the capacity of multi-scale frame differences to characterize nuanced local action details. This paper introduces DCDNet, an action recognition method grounded in dynamic cross-frame differences, designed to explicitly enhance spatiotemporal difference perception through a multi-branch temporal modeling architecture. The method proposes, for the first time, a strict alignment mechanism for cross-frame differentials that directly links the dilation rate d of each dilated convolution branch to the frame interval used in the differential calculation. Specifically, for a branch with dilation rate d=n, the differential operation is constrained to the t-th and (t+n)-th frames. This design addresses the decoupling problem between temporal perception and differential calculation found in existing methods, enabling accurate modeling of multi-scale motion patterns. Through hierarchical feature fusion, DCDNet achieves state-of-the-art performance on the HMDB51, UCF101, and INCLUDE datasets, with accuracy rates of 74.01%, 92.99%, and 92.94%, respectively. Visualization results corroborate DCDNet's capability to precisely localize fine-grained action segments, such as punching trajectories and gesture transitions, substantiating its advantages in decoupling spatiotemporal features and focusing on dynamic regions.
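The alignment rule described in the abstract, where a branch with dilation rate d=n differences the t-th and (t+n)-th frames, can be illustrated with a minimal sketch. The function below is a hypothetical illustration of multi-scale frame differencing, not the authors' implementation; the dilation rates (1, 2, 4) and the tensor layout (T, H, W, C) are assumptions for the example.

```python
import numpy as np

def cross_frame_differences(frames, dilation_rates=(1, 2, 4)):
    """Sketch of the abstract's strict alignment rule: for each branch
    with dilation rate d=n, compute differentials between frame t and
    frame t+n. This is an illustrative simplification, not DCDNet itself.

    frames: array of shape (T, H, W, C)
    returns: dict mapping dilation rate n -> differences of shape (T-n, H, W, C)
    """
    diffs = {}
    for n in dilation_rates:
        # Frame (t+n) minus frame t, for every valid t in [0, T-n)
        diffs[n] = frames[n:] - frames[:-n]
    return diffs

# Toy usage: 8 frames of a 4x4 single-channel "video" where frame t
# has constant pixel value t, so the n-frame difference is exactly n.
video = np.arange(8, dtype=np.float32)[:, None, None, None] * np.ones((8, 4, 4, 1), np.float32)
out = cross_frame_differences(video)
print({n: d.shape for n, d in out.items()})
```

In an actual multi-branch network, each branch's dilated convolution would then operate on the difference tensor matching its own dilation rate, so temporal receptive field and differential interval stay coupled by construction.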
