Adaptive Sparse Multimodal Transformer for Efficient Action Recognition on Resource-Constrained Edge Devices

Abstract

Multimodal sensing platforms such as wearable devices, mobile robots, and smart environments increasingly require real-time interpretation of visual, acoustic, and inertial data under stringent computational and energy constraints. Although Transformer-based architectures provide strong representational capacity, their quadratic attention complexity limits practical deployment on resource-constrained systems. This paper presents the Adaptive Sparse Multimodal Transformer (ASMT), a content-adaptive sparse attention framework designed for efficient multimodal action recognition. ASMT introduces a lightweight token-importance gating module that selects a compact subset of informative tokens across modalities, enabling attention computation on a reduced sequence while preserving cross-modal dependencies. Unlike fixed-pattern sparsity methods, ASMT dynamically adapts token selection to input characteristics, improving efficiency without degrading accuracy. Experiments on two widely used multimodal benchmarks, MMAct and UTD-MHAD, demonstrate that ASMT achieves accuracy comparable to state-of-the-art multimodal Transformers while reducing attention FLOPs by up to 63 percent and lowering total inference latency by 41 percent on edge-oriented hardware. These results indicate that ASMT provides a practical and scalable architecture for real-time multimodal inference in embedded and mobile applications.
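The paper's gating mechanism is only described at a high level above; the following is a minimal sketch of a content-adaptive token-selection step of the kind the abstract describes, assuming a small per-token scoring MLP and top-k selection over the fused multimodal sequence. Names such as TokenGate and keep_ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Hypothetical content-adaptive token-importance gate (illustrative only).

    Scores each token with a small MLP, keeps the top-k most informative
    tokens across all modalities, and returns the reduced sequence so that
    attention is computed on fewer tokens.
    """

    def __init__(self, dim: int, keep_ratio: float = 0.4):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim), e.g. visual + acoustic + inertial tokens concatenated
        scores = self.score(tokens).squeeze(-1)             # (batch, num_tokens) importance scores
        k = max(1, int(tokens.shape[1] * self.keep_ratio))  # number of tokens to keep
        idx = scores.topk(k, dim=1).indices                 # indices of the most informative tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return torch.gather(tokens, 1, idx), scores         # reduced sequence + raw scores


# Usage sketch: gate the fused multimodal sequence, then attend over the kept subset,
# so the quadratic attention cost scales with the shorter sequence.
dim = 256
gate = TokenGate(dim, keep_ratio=0.4)
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

fused = torch.randn(2, 300, dim)   # e.g. 300 tokens pooled from three modalities
kept, _ = gate(fused)              # attention now runs on ~40% of the tokens
out, _ = attn(kept, kept, kept)
```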
