Multistream Fusion and Relational Network for Recognizing Actions from Skeleton Data
Abstract
This paper presents a novel multi-stream fusion and relational network model for action recognition from skeleton data, called the attention relational long short-term memory network (ARN-LSTM), designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences. Traditional methods often focus solely on spatial or temporal features, which limits their ability to fully model complex human activities. Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture. Specifically, it comprises a joint stream for extracting skeleton features, a temporal stream for capturing dynamic temporal features, and an ARN-LSTM block that uses Time-Distributed Long Short-Term Memory (TD-LSTM) layers followed by an Attention Relational Network (ARN) to model temporal relations. The outputs of these streams are fused in a fully connected layer to produce the final action prediction. Evaluations on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate the superior performance of our model, particularly in group activity recognition.
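The abstract specifies the three streams and their fusion but no implementation details. The sketch below illustrates, in PyTorch, one plausible reading of that architecture; the hidden sizes, the simple softmax attention over time steps, the last-hidden-state readout for the joint and motion streams, and the default joint count are all assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn

class ARNLSTMSketch(nn.Module):
    """Minimal sketch of the three-stream fusion described in the abstract.

    All layer sizes and the attention formulation are assumptions; the
    abstract only names the streams and says their outputs are fused in
    a fully connected layer.
    """
    def __init__(self, num_joints=25, coord_dim=3, hidden=128, num_classes=60):
        super().__init__()
        feat = num_joints * coord_dim
        # Joint stream: per-frame skeleton (joint position) features.
        self.joint_lstm = nn.LSTM(feat, hidden, batch_first=True)
        # Temporal stream: frame-to-frame motion (joint differences).
        self.motion_lstm = nn.LSTM(feat, hidden, batch_first=True)
        # ARN-LSTM block: LSTM outputs kept at every time step ("time-distributed"),
        # then attention over time as a stand-in for the relational module.
        self.td_lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)           # scores each time step
        # Fusion: concatenate the stream outputs, classify with one FC layer.
        self.fc = nn.Linear(hidden * 3, num_classes)

    def forward(self, x):                           # x: (batch, time, joints*coords)
        motion = x[:, 1:] - x[:, :-1]               # temporal differences
        j, _ = self.joint_lstm(x)
        m, _ = self.motion_lstm(motion)
        t, _ = self.td_lstm(x)
        # Attention-weighted sum of the per-step TD-LSTM outputs.
        w = torch.softmax(self.attn(t), dim=1)      # (batch, time, 1)
        t_att = (w * t).sum(dim=1)                  # (batch, hidden)
        fused = torch.cat([j[:, -1], m[:, -1], t_att], dim=-1)
        return self.fc(fused)

# Example: 2 clips, 30 frames, 25 joints with 3D coordinates.
logits = ARNLSTMSketch()(torch.randn(2, 30, 75))    # -> (2, 60) class scores
```

The defaults (25 joints, 60 classes) merely mirror the NTU RGB+D 60 setup mentioned in the abstract; the published model's relational network is richer than the single attention layer used here.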