XTinyHAR: A Tiny Inertial Transformer for Human Activity Recognition via Multimodal Knowledge Distillation and Explainable AI
Abstract
Human Activity Recognition (HAR) is essential for applications such as healthcare monitoring, fitness tracking, and smart environments, yet deploying accurate and interpretable models on resource-constrained devices remains challenging. In this paper, we propose XTinyHAR, a lightweight, transformer-based unimodal framework trained via cross-modal knowledge distillation from a multimodal teacher. Our model incorporates temporal positional embeddings and attention rollout to enhance sequential feature extraction and interpretability. Evaluated on the UTD-MHAD and MM-Fit datasets, XTinyHAR achieves state-of-the-art performance with test accuracies of 98.71% and 98.55%, F1-scores of 98.71% and 98.55%, and Cohen’s Kappa scores above 0.98, while maintaining a compact footprint of 2.45 MB, low inference latency (3.1 ms on CPU, 1.2 ms on GPU), and low computational cost (11.3M FLOPs). Extensive ablation studies confirm the contribution of each component, and subject-wise evaluations demonstrate strong generalization across users. These results highlight XTinyHAR’s potential as a high-performance, interpretable, and deployable solution for real-time HAR on edge devices.
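
For readers unfamiliar with the interpretability mechanism named above, the sketch below shows attention rollout in its standard form (Abnar and Zuidema, 2020): per-layer attention maps are averaged over heads, augmented with the identity matrix to account for residual connections, row-normalized, and multiplied across layers to trace how information flows from input timesteps to the output. This is a minimal PyTorch illustration of the general technique; the function name and tensor layout are our assumptions, and XTinyHAR's exact implementation is not specified in the abstract.

    import torch

    def attention_rollout(attentions):
        """Compute attention rollout from a list of per-layer attention
        maps, each of shape (batch, num_heads, seq_len, seq_len).
        Returns a (batch, seq_len, seq_len) rollout matrix."""
        rollout = None
        for attn in attentions:
            # Average over heads, add the identity to model the residual
            # connection, then renormalize each row to sum to 1.
            a = attn.mean(dim=1)
            a = a + torch.eye(a.size(-1), device=a.device)
            a = a / a.sum(dim=-1, keepdim=True)
            # Chain the per-layer maps by matrix multiplication.
            rollout = a if rollout is None else torch.bmm(a, rollout)
        return rollout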
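Likewise, cross-modal knowledge distillation from a multimodal teacher to a unimodal inertial student is conventionally trained with a blend of a softened teacher-matching term and the hard-label loss. The sketch below is a hypothetical rendering of that standard objective: the temperature, weighting coefficient, and any feature-level terms XTinyHAR may use are assumptions, not details taken from the abstract.

    import torch.nn.functional as F

    def cross_modal_kd_loss(student_logits, teacher_logits, labels,
                            temperature=4.0, alpha=0.5):
        """Standard distillation objective: KL divergence between the
        student's and the (frozen) teacher's temperature-softened
        distributions, blended with cross-entropy on the hard labels.
        temperature and alpha are illustrative values, not the paper's."""
        t = temperature
        soft = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)  # rescale gradients per Hinton et al. (2015)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard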