SPARK: Sparse-Perception Action Recognition with Keyframes for Quadruped Robots

Abstract

This study introduces a lightweight Human Action Recognition (HAR) model designed for computational efficiency and real-world deployment. Faced with the challenge of processing large-scale video data, the proposed approach strategically selects only the most informative keyframes, thereby significantly reducing data redundancy. The model leverages a high-performing pre-trained DaViT backbone for feature extraction, combined with a Temporal Transformer that captures both spatial detail and temporal dynamics from the sparse keyframes. This design avoids the high computational cost of traditional architectures such as 3D CNNs and LSTMs, which process every single frame. To validate its practical utility, the model was deployed on a quadruped robot, establishing an efficient inference pipeline in which the robot captures video and performs on-device action recognition. The proposed method demonstrates a significant step towards applying complex HAR tasks in resource-constrained robotic environments.
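The abstract does not specify how the informative keyframes are chosen. As a minimal sketch of one plausible criterion, the snippet below scores each frame by its mean absolute change from the preceding frame and keeps the top-k; the function name and the differencing heuristic are illustrative assumptions, not the paper's actual selector.

```python
import numpy as np

def select_keyframes(frames: np.ndarray, k: int) -> np.ndarray:
    """Pick k 'informative' frames from a clip of shape (T, H, W, C).

    Hypothetical criterion (not from the paper): frames with the
    largest mean absolute change relative to their predecessor.
    Returns the sorted indices of the selected keyframes.
    """
    # Per-frame change scores: scores[j] measures frame j+1 vs frame j.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    scores = diffs.reshape(diffs.shape[0], -1).mean(axis=1)  # shape (T-1,)
    # Always keep frame 0 by giving it the highest score.
    scores = np.concatenate([[scores.max() + 1.0], scores])  # shape (T,)
    # Top-k scores, returned in temporal order.
    return np.sort(np.argsort(scores)[::-1][:k])
```

Only the selected keyframes would then be passed through the DaViT backbone and the Temporal Transformer, which is where the claimed savings over per-frame architectures come from.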
