HandPointSelfFusion: Combining Self-Supervised Point Cloud Pretraining and 2D Skeleton Fine-Tuning for Accurate Hand Pose Estimation
Abstract
With the increasing demand for natural human-computer interaction, 3D hand pose estimation has become a critical research focus in computer vision. Despite recent advances, existing datasets remain limited, and challenges such as occlusion, noise, and complex environments continue to impede the accuracy and robustness of current methods. To address these issues, we propose HandPointSelfFusion, a self-supervised learning framework for 3D hand pose estimation that integrates point cloud Transformers with 2D hand skeleton priors.

HandPointSelfFusion adopts a masking-based modeling strategy during pretraining, enabling the model to extract robust global and local features from unlabeled point cloud data. This self-supervised paradigm markedly improves performance in sparse and noisy scenarios. After pretraining, a 2D hand skeleton task layer is introduced during fine-tuning to incorporate structural priors of hand joints, further refining pose estimation.

Comprehensive experiments on the MSRA, ICVL, and NYU datasets demonstrate that HandPointSelfFusion achieves state-of-the-art performance, with average errors of 6.98 mm, 6.21 mm, and 8.67 mm, respectively. Ablation studies further validate the efficacy of the proposed masking strategies and the integration of 2D skeletal priors. Overall, HandPointSelfFusion offers an accurate and efficient framework for 3D hand pose estimation, with substantial promise for applications in robotics, augmented reality, and human-computer interaction.
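To make the two-stage design concrete, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: a point cloud Transformer backbone, a masking-based reconstruction objective for self-supervised pretraining, and a fine-tuning head with an auxiliary 2D skeleton task layer. All module names, layer sizes, the mask ratio, the joint count, and the auxiliary loss weight are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class HandPointBackbone(nn.Module):
    """Point cloud Transformer backbone shared by both stages.
    Sketch only: per-point embedding plus a Transformer encoder;
    the real model likely adds point grouping and positional encoding."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(3, dim)  # per-point xyz -> token
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):          # tokens: (B, N, dim)
        return self.encoder(tokens)

class MaskedPretrainer(nn.Module):
    """Stage 1: masking-based self-supervised pretraining.
    A random fraction of point tokens is replaced by a learned mask
    token; the backbone must reconstruct the masked coordinates.
    The mask ratio is an assumed hyperparameter."""
    def __init__(self, backbone, dim=256, mask_ratio=0.6):
        super().__init__()
        self.backbone = backbone
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.recon_head = nn.Linear(dim, 3)  # predict xyz

    def forward(self, points):          # points: (B, N, 3)
        B, N, _ = points.shape
        tokens = self.backbone.embed(points)
        mask = torch.rand(B, N, device=points.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        recon = self.recon_head(self.backbone(tokens))
        # Reconstruction loss only on masked positions.
        return ((recon - points) ** 2)[mask].mean()

class PoseFineTuner(nn.Module):
    """Stage 2: fine-tuning with an auxiliary 2D skeleton task layer.
    A pooled feature predicts 3D joints; an auxiliary head predicts
    2D joint locations so the skeletal prior shapes the features."""
    def __init__(self, backbone, dim=256, num_joints=21, aux_weight=0.5):
        super().__init__()
        self.backbone = backbone
        self.aux_weight = aux_weight
        self.head_3d = nn.Linear(dim, num_joints * 3)
        self.head_2d = nn.Linear(dim, num_joints * 2)  # 2D skeleton head

    def forward(self, points, joints_3d, joints_2d):
        feats = self.backbone(self.backbone.embed(points)).mean(dim=1)
        pred_3d = self.head_3d(feats).view(-1, joints_3d.shape[1], 3)
        pred_2d = self.head_2d(feats).view(-1, joints_2d.shape[1], 2)
        loss = nn.functional.smooth_l1_loss(pred_3d, joints_3d)
        loss = loss + self.aux_weight * nn.functional.smooth_l1_loss(pred_2d, joints_2d)
        return loss

# Usage sketch: pretrain on unlabeled clouds, then fine-tune with labels.
backbone = HandPointBackbone()
pretrain_loss = MaskedPretrainer(backbone)(torch.randn(2, 1024, 3))
finetune_loss = PoseFineTuner(backbone)(
    torch.randn(2, 1024, 3), torch.randn(2, 21, 3), torch.randn(2, 21, 2))

The key design point the abstract emphasizes is that the same backbone carries over from the reconstruction objective to the supervised stage, so features learned from unlabeled point clouds transfer to pose regression, while the 2D skeleton head acts purely as an auxiliary training signal.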