Adaptive Sparse Multimodal Transformer for Efficient Action Recognition on Resource-Constrained Edge Devices

Abstract

Multimodal sensing platforms such as wearable devices, mobile robots, and smart environments increasingly require real-time interpretation of visual, acoustic, and inertial data under stringent computational and energy constraints. Although Transformer-based architectures provide strong representational capacity, their quadratic attention complexity limits practical deployment on resource-constrained systems. This paper presents the Adaptive Sparse Multimodal Transformer (ASMT), a content-adaptive sparse attention framework designed for efficient multimodal action recognition. ASMT introduces a lightweight token-importance gating module that selects a compact subset of informative tokens across modalities, enabling attention computation on a reduced sequence while preserving cross-modal dependencies. Unlike fixed-pattern sparsity methods, ASMT dynamically adapts token selection to input characteristics, improving efficiency without degrading accuracy. Experiments on two widely used multimodal benchmarks, MMAct and UTD-MHAD, demonstrate that ASMT achieves accuracy comparable to state-of-the-art multimodal Transformers while reducing attention FLOPs by up to 63 percent and lowering total inference latency by 41 percent on edge-oriented hardware. These results indicate that ASMT provides a practical and scalable architecture for real-time multimodal inference in embedded and mobile applications.
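The paper's gating mechanism is only described at a high level above; the following is a minimal sketch of a content-adaptive token-selection step of the kind the abstract describes, assuming a small per-token scoring MLP and top-k selection over the fused multimodal sequence. Names such as TokenGate and keep_ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Hypothetical content-adaptive token-importance gate (illustrative only).

    Scores each token with a small MLP, keeps the top-k most informative
    tokens across all modalities, and returns the reduced sequence so that
    attention is computed on fewer tokens.
    """

    def __init__(self, dim: int, keep_ratio: float = 0.4):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim), e.g. visual + acoustic + inertial tokens concatenated
        scores = self.score(tokens).squeeze(-1)             # (batch, num_tokens) importance scores
        k = max(1, int(tokens.shape[1] * self.keep_ratio))  # number of tokens to keep
        idx = scores.topk(k, dim=1).indices                 # indices of the most informative tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return torch.gather(tokens, 1, idx), scores         # reduced sequence + raw scores


# Usage sketch: gate the fused multimodal sequence, then attend over the kept subset,
# so the quadratic attention cost scales with the shorter sequence.
dim = 256
gate = TokenGate(dim, keep_ratio=0.4)
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

fused = torch.randn(2, 300, dim)   # e.g. 300 tokens pooled from three modalities
kept, _ = gate(fused)              # attention now runs on ~40% of the tokens
out, _ = attn(kept, kept, kept)
```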
