Integrating Adaptive Spatio-Temporal and Motion Features in a Unified 2D Network for Video Action Recognition

Abstract

Video action recognition is an important computer vision task that aims to classify observed activities into predefined categories. Spatio-temporal and motion features are both essential and complementary for recognizing these actions. However, many existing methods do not fully exploit this complementarity: they extract the different features in separate branches and then merely sum them for fusion. To address this issue, we propose the Spatio-Temporal Adaptive and Motion Feature Extraction Module (STAM). STAM consists of two key components: the Spatio-Temporal Adaptive Convolution Module (STAC) and the Motion Feature Extraction Module (MFEM). First, STAC generates adaptive convolution kernels from global features and adaptive weights from local features, and uses the resulting adaptive convolution to produce spatio-temporal features. Second, MFEM captures motion features by analyzing differences between consecutive frames and further enhances the spatio-temporal features. Additionally, STAM introduces a novel multi-feature fusion strategy that merges motion and spatio-temporal features, fully exploiting the complementarity of the different features. Finally, STAM achieved competitive results in evaluations on the Kinetics-400 and Something-Something V1 and V2 datasets.
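To make the two mechanisms concrete, the sketch below shows plausible PyTorch versions of the ideas the abstract describes: an MFEM-style block that derives motion from consecutive-frame differences, and a STAC-style block that generates per-sample convolution kernels from globally pooled features and per-position gates from local features. Every class name, tensor shape, and design choice here (depthwise dynamic kernels, sigmoid gating, residual fusion) is an assumption for illustration, not the paper's actual implementation.

```python
# Minimal sketch of MFEM- and STAC-style blocks; all details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameDifferenceMotion(nn.Module):
    """Hypothetical MFEM-style block: motion as consecutive-frame differences."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # Differences between adjacent frames approximate motion.
        diff = x[:, 1:] - x[:, :-1]                   # (b, t-1, c, h, w)
        # Zero-pad the temporal dimension so the output length matches input.
        diff = F.pad(diff, (0, 0, 0, 0, 0, 0, 0, 1))  # (b, t, c, h, w)
        motion = self.proj(diff.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Residual addition is one plausible fusion; the paper proposes a
        # richer multi-feature fusion strategy.
        return x + motion


class AdaptiveConv(nn.Module):
    """Hypothetical STAC-style block: per-sample kernels from global context."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Global average pooling summarizes the input; a linear layer turns
        # the summary into one depthwise kernel per channel (a common
        # dynamic-convolution recipe, assumed here for illustration).
        self.kernel_gen = nn.Linear(channels, channels * kernel_size ** 2)
        # Local features produce per-position gating weights.
        self.weight_gen = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) -- one frame (or a pooled clip).
        b, c, h, w = x.shape
        ctx = x.mean(dim=(2, 3))                      # global feature, (b, c)
        kernels = self.kernel_gen(ctx).view(b * c, 1, self.k, self.k)
        # Grouped conv applies each sample's own depthwise kernels.
        out = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        out = out.reshape(b, c, h, w)
        gate = torch.sigmoid(self.weight_gen(x))      # adaptive local weights
        return out * gate


# Usage with made-up shapes: 2 clips, 8 frames, 64 channels, 56x56 maps.
clip = torch.randn(2, 8, 64, 56, 56)
motion_feats = FrameDifferenceMotion(64)(clip)       # (2, 8, 64, 56, 56)
frame_feats = AdaptiveConv(64)(clip[:, 0])           # (2, 64, 56, 56)
```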
