Enhancing Action Recognition via Dynamic Cross-Frame Differential Modeling

Abstract

In action recognition, the dynamic changes of key human body parts across consecutive frames encapsulate the core semantic information of actions. Traditional approaches often prioritize single-frame static features or perform only simplistic temporal modeling, overlooking the capacity of multi-scale frame differences to characterize nuanced local action details. This paper introduces DCDNet, an action recognition method grounded in dynamic cross-frame differences, designed to explicitly enhance spatiotemporal difference perception through a multi-branch temporal modeling architecture. The method proposes, for the first time, a strict alignment mechanism for cross-frame differentials that directly links the dilation rate d of each dilated convolution branch to the frame interval used in the differential calculation. Specifically, for a branch with dilation rate d=n, the differential operation is constrained to the t-th and (t+n)-th frames. This design addresses the decoupling problem between temporal perception and differential calculation found in existing methods, enabling accurate modeling of multi-scale motion patterns. Through hierarchical feature fusion, DCDNet achieves state-of-the-art performance on the HMDB51, UCF101, and INCLUDE datasets, with accuracy rates of 74.01%, 92.99%, and 92.94%, respectively. Visualization results corroborate DCDNet's capability to precisely localize fine-grained action segments, such as punching trajectories and gesture transitions, substantiating its advantages in decoupling spatiotemporal features and focusing on dynamic regions.
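The alignment rule described in the abstract, where a branch with dilation rate d=n differences the t-th and (t+n)-th frames, can be illustrated with a minimal sketch. The function below is a hypothetical illustration of multi-scale frame differencing, not the authors' implementation; the dilation rates (1, 2, 4) and the tensor layout (T, H, W, C) are assumptions for the example.

```python
import numpy as np

def cross_frame_differences(frames, dilation_rates=(1, 2, 4)):
    """Sketch of the abstract's strict alignment rule: for each branch
    with dilation rate d=n, compute differentials between frame t and
    frame t+n. This is an illustrative simplification, not DCDNet itself.

    frames: array of shape (T, H, W, C)
    returns: dict mapping dilation rate n -> differences of shape (T-n, H, W, C)
    """
    diffs = {}
    for n in dilation_rates:
        # Frame (t+n) minus frame t, for every valid t in [0, T-n)
        diffs[n] = frames[n:] - frames[:-n]
    return diffs

# Toy usage: 8 frames of a 4x4 single-channel "video" where frame t
# has constant pixel value t, so the n-frame difference is exactly n.
video = np.arange(8, dtype=np.float32)[:, None, None, None] * np.ones((8, 4, 4, 1), np.float32)
out = cross_frame_differences(video)
print({n: d.shape for n, d in out.items()})
```

In an actual multi-branch network, each branch's dilated convolution would then operate on the difference tensor matching its own dilation rate, so temporal receptive field and differential interval stay coupled by construction.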
