XTinyHAR: A Tiny Inertial Transformer for Human Activity Recognition via Multimodal Knowledge Distillation and Explainable AI

Abstract

Human Activity Recognition (HAR) is essential for applications such as healthcare monitoring, fitness tracking, and smart environments, yet deploying accurate and interpretable models on resource-constrained devices remains challenging. In this paper, we propose XTinyHAR, a lightweight, transformer-based unimodal framework trained via cross-modal knowledge distillation from a multimodal teacher. Our model incorporates temporal positional embeddings and attention rollout to enhance sequential feature extraction and interpretability. Evaluated on the UTD-MHAD and MM-Fit datasets, XTinyHAR achieves state-of-the-art performance with test accuracies of 98.71% and 98.55%, F1-scores of 98.71% and 98.55%, and Cohen’s Kappa scores above 0.98, while maintaining a compact footprint of 2.45 MB, low inference latency (3.1 ms on CPU, 1.2 ms on GPU), and low computational cost (11.3M FLOPs). Extensive ablation studies confirm the contribution of each component, and subject-wise evaluations demonstrate strong generalization across users. These results highlight XTinyHAR’s potential as a high-performance, interpretable, and deployable solution for real-time HAR on edge devices.
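The abstract names attention rollout as the interpretability mechanism. For reference, the sketch below shows the standard rollout procedure (Abnar & Zuidema, 2020), which aggregates per-layer attention maps into a single token-to-token relevance matrix; the tensor shapes and the `attention_rollout` helper are illustrative assumptions, not the authors' implementation.

```python
import torch

def attention_rollout(attentions):
    """Minimal sketch of attention rollout (Abnar & Zuidema, 2020).

    attentions: list of per-layer attention tensors, each of shape
    (batch, heads, seq_len, seq_len), e.g. collected from a transformer
    encoder via hooks or an `output_attentions`-style flag.
    """
    result = None
    for attn in attentions:
        # Average over heads, add the residual (identity) connection,
        # and renormalize each row to sum to 1.
        attn = attn.mean(dim=1)                              # (batch, seq, seq)
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = 0.5 * attn + 0.5 * eye
        attn = attn / attn.sum(dim=-1, keepdim=True)
        # Propagate relevance through successive layers by matrix product.
        result = attn if result is None else attn @ result
    return result  # (batch, seq, seq): cumulative token-to-token relevance
```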
