Weakly Supervised Temporal Action Localization Based on Feature Enhancement

Abstract

Weakly-supervised Temporal Action Localization (WTAL) aims to accurately localize and classify action instances in untrimmed long videos using only video-level annotations. Although most existing WTAL methods leverage pre-trained feature extractors to obtain RGB and optical flow features, thereby reducing computational costs, this strategy suffers from two critical limitations: (1) limited temporal receptive fields, resulting in inadequate exploitation of contextual information; and (2) interference from irrelevant background content, which degrades overall performance. To address these issues, we propose a Feature-Enhanced Network (FE-Net), which comprises three key components: the Local Feature Expansion and Enhancement Module (LF-EEM), the Cross-modal Fusion Enhancement Module (CEM), and the Cross-temporal Gated Feature Fusion Module (CGFF). Specifically, LF-EEM expands the temporal receptive field to better capture complete action instances. CEM leverages the complementary nature of auxiliary and primary modalities to suppress background noise in the primary modality through cross-modal fusion. Furthermore, CGFF replaces simple concatenation with a cross-temporal gating mechanism during feature fusion, emphasizing salient changes across time. Extensive experiments conducted on two large-scale benchmark datasets, THUMOS-14 and ActivityNet v1.2, demonstrate that FE-Net significantly enhances the performance of existing WTAL methods. These results validate the effectiveness of our proposed modules and provide new insights for advancing temporal action localization under weak supervision.
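The abstract gives no implementation details, but the idea of gating fused features by temporal change can be sketched concretely. The following PyTorch fragment is a minimal illustration, not the authors' code: the module name, tensor shapes, and layer choices are our assumptions.

    import torch
    import torch.nn as nn

    class CrossTemporalGatedFusion(nn.Module):
        # Illustrative sketch (assumed design, not the paper's implementation):
        # fuse RGB and optical-flow snippet features with a gate driven by
        # temporal differences, rather than plain concatenation.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Conv1d(2 * dim, dim, kernel_size=1)  # project fused modalities
            self.gate = nn.Sequential(
                nn.Conv1d(2 * dim, dim, kernel_size=1),
                nn.Sigmoid(),                                   # gate values in (0, 1)
            )

        def forward(self, rgb, flow):
            # rgb, flow: (batch, dim, T) snippet-level features from a frozen extractor
            fused = torch.cat([rgb, flow], dim=1)               # (batch, 2*dim, T)
            # first-order temporal difference highlights salient changes across time
            # (torch.roll wraps at the boundary; acceptable for a sketch)
            diff = fused - torch.roll(fused, shifts=1, dims=2)
            return self.gate(diff) * self.proj(fused)           # gated fusion output

    # Usage with assumed I3D-style feature dimensions:
    rgb = torch.randn(2, 1024, 750)
    flow = torch.randn(2, 1024, 750)
    out = CrossTemporalGatedFusion(1024)(rgb, flow)             # (2, 1024, 750)

In this sketch the gate is computed from first-order temporal differences, so snippets where the fused features change sharply receive higher weights than static background, matching the intuition the abstract attributes to CGFF.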
