TCFNet: An End-to-End Framework for Multimodal Action Quality Assessment via Temporal Enhancement and Contrastive Fusion

Abstract

Existing Action Quality Assessment (AQA) methods have notable limitations, including over-reliance on a single modality, inadequate long-term temporal modeling, and modality alignment biases in multimodal models. To address these issues, we propose TCFNet, an AQA approach built around multimodal fusion. Unlike previous methods, TCFNet integrates complementary information from multiple modalities, namely RGB, optical flow, and audio, and incorporates specialized modules to capture long-range temporal dependencies and enhance rhythmic consistency. In the single-modality processing stage, a Temporal Feature Enhancement Module (TFEM) first captures sequential dependencies, followed by a three-layer pyramid network that extracts multi-scale features. The resulting RGB, optical-flow, and audio features are then fed into an isomorphic multimodal fusion network. We further incorporate a cross-trimodal Information Noise Contrastive Estimation (InfoNCE) loss into the training objective, which promotes feature-similarity alignment and alleviates semantic and temporal discrepancies across modalities. Experimental results demonstrate that, compared to state-of-the-art AQA methods, our approach achieves average improvements in Spearman's rank correlation coefficient of 4.3% on the RG dataset and 1.8% on the Fis-V dataset.
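The abstract does not spell out how the cross-trimodal InfoNCE term is computed. Below is a minimal PyTorch-style sketch under the assumption that it averages pairwise InfoNCE losses between clip-level RGB, optical-flow, and audio embeddings; the function names, the temperature value, and the pairwise-averaging choice are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE between two batches of feature vectors.

    anchor, positive: (B, D) tensors; matching batch indices are treated
    as positive pairs, all other entries in the batch serve as negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def cross_trimodal_nce(rgb, flow, audio, temperature=0.07):
    """Hypothetical cross-trimodal contrastive term: the average of the
    three pairwise InfoNCE losses (RGB-flow, RGB-audio, flow-audio)."""
    return (info_nce(rgb, flow, temperature)
            + info_nce(rgb, audio, temperature)
            + info_nce(flow, audio, temperature)) / 3.0
```

In this reading, the contrastive term would be added to the main score-regression loss with a weighting coefficient, pulling temporally aligned RGB, flow, and audio features toward a shared embedding space while pushing apart features from different clips.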
