TCFNet: An End-to-End Framework for Multimodal Action Quality Assessment via Temporal Enhancement and Contrastive Fusion

Abstract

Existing Action Quality Assessment (AQA) methods have notable limitations, including over-reliance on a single modality, inadequate long-term temporal modeling, and modality alignment biases in multimodal models. To address these issues, we propose TCFNet, an AQA approach built around multimodal fusion. Unlike previous methods, TCFNet integrates complementary information from multiple modalities, namely RGB, optical flow, and audio, and incorporates specialized modules to capture long-range temporal dependencies and enhance rhythmic consistency. In the single-modality processing stage, a Temporal Feature Enhancement Module (TFEM) first captures sequential dependencies, followed by a three-layer pyramid network that extracts multi-scale features. The resulting RGB, optical-flow, and audio features are then fed into an isomorphic multimodal fusion network. We further incorporate a cross-trimodal Information Noise Contrastive Estimation (InfoNCE) loss into the training objective, which promotes feature-similarity alignment and alleviates semantic and temporal discrepancies across modalities. Experimental results demonstrate that, compared to state-of-the-art AQA methods, our approach achieves average improvements in Spearman's rank correlation coefficient of 4.3% on the RG dataset and 1.8% on the Fis-V dataset.
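The abstract does not spell out how the cross-trimodal InfoNCE term is computed. Below is a minimal PyTorch-style sketch under the assumption that it averages pairwise InfoNCE losses between clip-level RGB, optical-flow, and audio embeddings; the function names, the temperature value, and the pairwise-averaging choice are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE between two batches of feature vectors.

    anchor, positive: (B, D) tensors; matching batch indices are treated
    as positive pairs, all other entries in the batch serve as negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def cross_trimodal_nce(rgb, flow, audio, temperature=0.07):
    """Hypothetical cross-trimodal contrastive term: the average of the
    three pairwise InfoNCE losses (RGB-flow, RGB-audio, flow-audio)."""
    return (info_nce(rgb, flow, temperature)
            + info_nce(rgb, audio, temperature)
            + info_nce(flow, audio, temperature)) / 3.0
```

In this reading, the contrastive term would be added to the main score-regression loss with a weighting coefficient, pulling temporally aligned RGB, flow, and audio features toward a shared embedding space while pushing apart features from different clips.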
