STGSFormer: A 3D Human Pose Estimation Model That Integrates GCN and Self-Attention in the Spatio-Temporal Domain

Abstract

Most methods that combine Transformers and graph convolutional networks (GCNs) for 3D human pose estimation (HPE) overlook the feature disparity between the two branches during fusion. In addition, GCNs are typically limited to capturing spatial relationships between local joints and cannot fully capture the temporal dependencies between adjacent frames. To address these problems, we propose STGSFormer, a network that integrates GCN with the self-attention of the Transformer. STGSFormer injects the global dependencies captured by self-attention across different joints or frames into the GCN, enabling it to account for global relations while processing local information and thereby alleviating the feature-disparity issue. Furthermore, we propose a dynamic temporal GCN block (DFGCN) that incorporates temporal distance information to enhance the feature representation capability of the temporal GCN. STGSFormer is evaluated on the Human3.6M and MPI-INF-3DHP datasets using the mean per-joint position error (MPJPE) metric, achieving 40.8 mm and 17.3 mm, respectively. These results demonstrate the superior performance of the proposed model.
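To make the fusion idea concrete, the sketch below shows one plausible way of injecting the global joint dependencies produced by self-attention into a spatial GCN layer. It is a minimal illustration under assumed details, not the authors' implementation: the module name `AttentionGuidedGCN`, the tensor shapes, and the additive mixing of the attention matrix with the skeletal adjacency are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidedGCN(nn.Module):
    """Hypothetical sketch: fuse self-attention's global joint
    dependencies into a spatial GCN layer (details assumed)."""
    def __init__(self, dim, num_joints):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5
        # learnable adjacency, initialised as identity as a stand-in
        # for the kinematic skeleton graph
        self.adj = nn.Parameter(torch.eye(num_joints))
        self.gcn_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, num_joints, dim) joint features of one frame
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # global dependencies between all joint pairs
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # mix the attention map with the local skeletal adjacency so the
        # graph convolution sees both local and global relations
        mixed_adj = self.adj.unsqueeze(0) + attn        # (B, J, J)
        x_gcn = mixed_adj @ self.gcn_proj(x)            # graph convolution
        x_attn = attn @ v                               # standard attention output
        return self.out_proj(x_gcn + x_attn)

# Usage with an assumed 17-joint (Human3.6M-style) skeleton and 256-dim features
layer = AttentionGuidedGCN(dim=256, num_joints=17)
feats = torch.randn(2, 17, 256)
out = layer(feats)  # (2, 17, 256)
```

The same pattern could, in principle, be applied along the temporal axis, with frames in place of joints and a temporal-distance-aware adjacency in place of the skeleton graph.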
