TransMODAL: A Dual-Stream Transformer with Adaptive Co-Attention for Efficient Human Action Recognition
Abstract
Human Action Recognition has seen significant advances through Transformer-based architectures, yet achieving nuanced understanding often requires fusing multiple data modalities. Standard models relying solely on RGB video can struggle with actions defined by subtle motion cues rather than appearance. This paper introduces TransMODAL, a novel dual-stream Transformer that synergistically fuses spatiotemporal appearance features from a pre-trained VideoMAE backbone with explicit skeletal kinematics from a state-of-the-art pose estimation pipeline (RT-DETR + ViTPose++). We propose two key architectural innovations to enable effective and efficient fusion: a CoAttentionFusion module that facilitates deep, iterative cross-modal feature exchange between the RGB and pose streams, and an AdaptiveSelector mechanism that dynamically prunes less-informative spatiotemporal tokens to reduce computational overhead. Evaluated on three challenging benchmarks, TransMODAL demonstrates robust generalization, achieving accuracies of 98.5% on KTH, 96.9% on UCF101, and 84.2% on HMDB51. These results significantly outperform a strong VideoMAE-only baseline and are competitive with state-of-the-art methods, demonstrating the profound impact of explicit pose guidance. By providing a fully reproducible implementation and strong benchmark results, TransMODAL offers a powerful and efficient paradigm for composing pre-trained foundation models to tackle complex video understanding tasks.
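To make the two mechanisms concrete, below is a minimal PyTorch sketch of one cross-modal exchange step and one token-pruning step. The class names CoAttentionFusion and AdaptiveSelector are taken from the abstract; every other detail (hidden size, head count, keep ratio, and the linear saliency score) is an illustrative assumption rather than the authors' exact implementation.

```python
# Minimal sketch of the dual-stream fusion described above. Only the class
# names come from the paper; shapes, layer sizes, and the top-k pruning rule
# are assumptions for illustration.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    """One iteration of cross-modal exchange: RGB tokens attend to pose tokens
    and vice versa, each followed by a residual feed-forward block."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.rgb_to_pose = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pose_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_rgb = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                     nn.GELU(), nn.Linear(dim * 4, dim))
        self.ffn_pose = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                      nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, rgb, pose):
        rgb = rgb + self.rgb_to_pose(rgb, pose, pose)[0]   # RGB queries pose
        pose = pose + self.pose_to_rgb(pose, rgb, rgb)[0]  # pose queries RGB
        return rgb + self.ffn_rgb(rgb), pose + self.ffn_pose(pose)


class AdaptiveSelector(nn.Module):
    """Keeps only the top-k highest-scoring spatiotemporal tokens to reduce the
    cost of subsequent fusion layers (one plausible reading of 'dynamic pruning')."""

    def __init__(self, dim: int = 768, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-token saliency score
        self.keep_ratio = keep_ratio

    def forward(self, tokens):
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)            # (B, N) saliency
        idx = scores.topk(k, dim=1).indices                # indices of kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return torch.gather(tokens, 1, idx)


if __name__ == "__main__":
    rgb = torch.randn(2, 1568, 768)      # e.g. VideoMAE patch tokens
    pose = torch.randn(2, 16 * 17, 768)  # e.g. 16 frames x 17 keypoints, embedded
    rgb = AdaptiveSelector()(rgb)        # prune RGB tokens before fusion
    rgb, pose = CoAttentionFusion()(rgb, pose)
    print(rgb.shape, pose.shape)
```

In this reading, pruning the RGB stream before fusion is what keeps the repeated cross-attention affordable; the actual ordering and scoring rule used by TransMODAL are described in the paper's method section.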