Weakly Supervised Temporal Action Localization Based on Feature Enhancement
Abstract
Weakly-supervised Temporal Action Localization (WTAL) aims to accurately localize and classify action instances in untrimmed long videos using only video-level annotations. Although most existing WTAL methods leverage pre-trained feature extractors to obtain RGB and optical flow features—thereby reducing computational costs—this strategy suffers from two critical limitations: (1) limited temporal receptive fields, resulting in inadequate exploitation of contextual information; and (2) interference from irrelevant background content, which degrades overall performance. To address these issues, we propose a Feature-Enhanced Network (FE-Net), which comprises three key components: the Local Feature Expansion and Enhancement Module (LF-EEM), the Cross-modal Fusion Enhancement Module (CEM), and the Cross-temporal Gated Feature Fusion Module (CGFF). Specifically, LF-EEM expands the temporal receptive field to better capture complete action instances. CEM leverages the complementary nature of auxiliary and primary modalities to suppress background noise in the primary modality through cross-modal fusion. Furthermore, CGFF employs a cross-temporal gating mechanism during feature fusion to emphasize salient changes across time, replacing simple concatenation. Extensive experiments conducted on two large-scale benchmark datasets, THUMOS-14 and ActivityNet v1.2, demonstrate that FE-Net significantly enhances the performance of existing WTAL methods. These results validate the effectiveness of our proposed modules and provide new insights for advancing temporal action localization under weak supervision.
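To make the idea of cross-temporal gated fusion concrete, the following is a minimal NumPy sketch of one plausible formulation: a per-snippet gate driven by temporal feature differences that blends RGB and optical-flow features, in place of plain concatenation. The exact gating function, parameters (`w`, `b`), and use of temporal differences are illustrative assumptions, not the paper's published equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_temporal_gated_fusion(rgb, flow, w, b):
    """Blend two modality streams with a gate that reacts to temporal change.

    rgb, flow: (T, D) snippet-level features from the two modalities.
    w: (2*D,) hypothetical gate weights; b: scalar bias.
    Returns a (T, D) fused feature sequence.
    """
    # Temporal differences highlight salient changes across adjacent snippets;
    # prepending the first row keeps the output length equal to T.
    delta_rgb = np.diff(rgb, axis=0, prepend=rgb[:1])
    delta_flow = np.diff(flow, axis=0, prepend=flow[:1])
    # One scalar gate in [0, 1] per snippet, computed from both streams' changes.
    gate = sigmoid(np.concatenate([delta_rgb, delta_flow], axis=1) @ w + b)
    # Convex per-snippet combination of the two modalities.
    return gate[:, None] * rgb + (1.0 - gate[:, None]) * flow
```

With zero weights the gate is 0.5 everywhere and the result is a simple average of the two streams; learned weights would instead shift the balance toward whichever modality is more informative at moments of salient change.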