HandPointSelfFusion: Combining Self-Supervised Point Cloud Pretraining and 2D Skeleton Fine-Tuning for Accurate Hand Pose Estimation
Abstract
With the increasing demand for natural human-computer interaction, 3D hand pose estimation has become a critical research focus in computer vision. Despite recent advances, existing datasets remain limited, and challenges such as occlusion, noise, and complex environments continue to impede the accuracy and robustness of current methods. To address these issues, we propose HandPointSelfFusion, a self-supervised learning framework for 3D hand pose estimation that integrates point cloud Transformers with 2D hand skeleton priors.

HandPointSelfFusion adopts a masking-based modeling strategy during pretraining, enabling the model to extract robust global and local features from unlabeled point cloud data. This self-supervised paradigm markedly improves performance in sparse and noisy scenarios. After pretraining, a 2D hand skeleton task layer is introduced during fine-tuning to incorporate structural priors of hand joints, further refining pose estimation.

Comprehensive experiments on the MSRA, ICVL, and NYU datasets demonstrate that HandPointSelfFusion achieves state-of-the-art performance, with average errors of 6.98 mm, 6.21 mm, and 8.67 mm, respectively. Ablation studies further validate the efficacy of the proposed masking strategies and the integration of 2D skeletal priors. Overall, HandPointSelfFusion offers an accurate and efficient framework for 3D hand pose estimation, with substantial promise for applications in robotics, augmented reality, and human-computer interaction.
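To make the two-stage design concrete, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: a point cloud Transformer backbone, a masking-based reconstruction objective for self-supervised pretraining, and a fine-tuning head with an auxiliary 2D skeleton task layer. All module names, layer sizes, the mask ratio, the joint count, and the auxiliary loss weight are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class HandPointBackbone(nn.Module):
    """Point cloud Transformer backbone shared by both stages.
    Sketch only: per-point embedding plus a Transformer encoder;
    the real model likely adds point grouping and positional encoding."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(3, dim)  # per-point xyz -> token
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):          # tokens: (B, N, dim)
        return self.encoder(tokens)

class MaskedPretrainer(nn.Module):
    """Stage 1: masking-based self-supervised pretraining.
    A random fraction of point tokens is replaced by a learned mask
    token; the backbone must reconstruct the masked coordinates.
    The mask ratio is an assumed hyperparameter."""
    def __init__(self, backbone, dim=256, mask_ratio=0.6):
        super().__init__()
        self.backbone = backbone
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.recon_head = nn.Linear(dim, 3)  # predict xyz

    def forward(self, points):          # points: (B, N, 3)
        B, N, _ = points.shape
        tokens = self.backbone.embed(points)
        mask = torch.rand(B, N, device=points.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        recon = self.recon_head(self.backbone(tokens))
        # Reconstruction loss only on masked positions.
        return ((recon - points) ** 2)[mask].mean()

class PoseFineTuner(nn.Module):
    """Stage 2: fine-tuning with an auxiliary 2D skeleton task layer.
    A pooled feature predicts 3D joints; an auxiliary head predicts
    2D joint locations so the skeletal prior shapes the features."""
    def __init__(self, backbone, dim=256, num_joints=21, aux_weight=0.5):
        super().__init__()
        self.backbone = backbone
        self.aux_weight = aux_weight
        self.head_3d = nn.Linear(dim, num_joints * 3)
        self.head_2d = nn.Linear(dim, num_joints * 2)  # 2D skeleton head

    def forward(self, points, joints_3d, joints_2d):
        feats = self.backbone(self.backbone.embed(points)).mean(dim=1)
        pred_3d = self.head_3d(feats).view(-1, joints_3d.shape[1], 3)
        pred_2d = self.head_2d(feats).view(-1, joints_2d.shape[1], 2)
        loss = nn.functional.smooth_l1_loss(pred_3d, joints_3d)
        loss = loss + self.aux_weight * nn.functional.smooth_l1_loss(pred_2d, joints_2d)
        return loss

# Usage sketch: pretrain on unlabeled clouds, then fine-tune with labels.
backbone = HandPointBackbone()
pretrain_loss = MaskedPretrainer(backbone)(torch.randn(2, 1024, 3))
finetune_loss = PoseFineTuner(backbone)(
    torch.randn(2, 1024, 3), torch.randn(2, 21, 3), torch.randn(2, 21, 2))

The key design point the abstract emphasizes is that the same backbone carries over from the reconstruction objective to the supervised stage, so features learned from unlabeled point clouds transfer to pose regression, while the 2D skeleton head acts purely as an auxiliary training signal.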