XTinyHAR: A Tiny Inertial Transformer for Human Activity Recognition via Multimodal Knowledge Distillation and Explainable AI
Abstract
Human Activity Recognition (HAR) is essential for applications such as healthcare monitoring, fitness tracking, and smart environments, yet deploying accurate and interpretable models on resource-constrained devices remains challenging. In this paper, we propose XTinyHAR, a lightweight, transformer-based unimodal framework trained via cross-modal knowledge distillation from a multimodal teacher. Our model incorporates temporal positional embeddings and attention rollout to enhance sequential feature extraction and interpretability. Evaluated on the UTD-MHAD and MM-Fit datasets, XTinyHAR achieves state-of-the-art performance with test accuracies of 98.71% and 98.55%, F1-scores of 98.71% and 98.55%, and Cohen’s Kappa scores above 0.98, while maintaining a compact footprint of 2.45 MB, low inference latency (3.1 ms on CPU, 1.2 ms on GPU), and low computational cost (11.3M FLOPs). Extensive ablation studies confirm the contribution of each component, and subject-wise evaluations demonstrate strong generalization across users. These results highlight XTinyHAR’s potential as a high-performance, interpretable, and deployable solution for real-time HAR on edge devices.
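
For readers unfamiliar with the interpretability mechanism named above, the sketch below shows attention rollout in its standard form (Abnar and Zuidema, 2020): per-layer attention maps are averaged over heads, augmented with the identity matrix to account for residual connections, row-normalized, and multiplied across layers to trace how information flows from input timesteps to the output. This is a minimal PyTorch illustration of the general technique; the function name and tensor layout are our assumptions, and XTinyHAR's exact implementation is not specified in the abstract.

    import torch

    def attention_rollout(attentions):
        """Compute attention rollout from a list of per-layer attention
        maps, each of shape (batch, num_heads, seq_len, seq_len).
        Returns a (batch, seq_len, seq_len) rollout matrix."""
        rollout = None
        for attn in attentions:
            # Average over heads, add the identity to model the residual
            # connection, then renormalize each row to sum to 1.
            a = attn.mean(dim=1)
            a = a + torch.eye(a.size(-1), device=a.device)
            a = a / a.sum(dim=-1, keepdim=True)
            # Chain the per-layer maps by matrix multiplication.
            rollout = a if rollout is None else torch.bmm(a, rollout)
        return rollout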
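Likewise, cross-modal knowledge distillation from a multimodal teacher to a unimodal inertial student is conventionally trained with a blend of a softened teacher-matching term and the hard-label loss. The sketch below is a hypothetical rendering of that standard objective: the temperature, weighting coefficient, and any feature-level terms XTinyHAR may use are assumptions, not details taken from the abstract.

    import torch.nn.functional as F

    def cross_modal_kd_loss(student_logits, teacher_logits, labels,
                            temperature=4.0, alpha=0.5):
        """Standard distillation objective: KL divergence between the
        student's and the (frozen) teacher's temperature-softened
        distributions, blended with cross-entropy on the hard labels.
        temperature and alpha are illustrative values, not the paper's."""
        t = temperature
        soft = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)  # rescale gradients per Hinton et al. (2015)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard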