A Scalable Masked Image Modeling based Self-Supervised Approach for Hand Mesh Estimation


Abstract

With an enormous number of hand images generated over time, unleashing pose knowledge from unlabeled images to support supervised hand mesh estimation is an emerging yet challenging topic. Semi-supervised and self-supervised approaches have been proposed to alleviate this issue, but they are limited by their reliance on high-quality, fine-grained keypoint detection models or on conventional ResNet backbones. In this paper, inspired by the rapid progress of Masked Image Modeling (MIM) and Vision Transformers (ViT) in visual classification tasks, we propose a novel self-supervised pre-training strategy for regressing 3D hand mesh parameters. Our approach combines a unified, multi-granularity strategy with a pseudo-keypoint alignment module within a teacher-student framework to learn pose-aware semantic class tokens. For patch tokens that capture detailed local structure, we adopt self-distillation between the teacher and student networks on top of MIM pre-training. To better fit low-level regression tasks, we also incorporate a masked pixel reconstruction task for multi-level representation learning. Additionally, we design a strong pose estimation baseline that uses a simple vanilla ViT as the backbone with a Pyramidal Mesh Alignment Feedback (PyMAF) head attached for mesh regression. Extensive experiments demonstrate that our proposed approach, named HandMIM, achieves state-of-the-art (SOTA) performance on various datasets. Notably, HandMIM outperforms specially optimized architectures, achieving an 8.00 mm PAVPE (Procrustes Alignment Vertex-Point-Error) on the challenging HO3Dv2 test set, thereby establishing a new state-of-the-art record in 3D hand mesh estimation.
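To make the pre-training recipe described above more concrete, the sketch below illustrates one plausible reading of it: a student ViT encoder sees a patch-masked image, an exponential-moving-average (EMA) teacher sees the full image, patch tokens are distilled at the masked positions, and a pixel head reconstructs the masked patches. This is a minimal, hedged sketch, not the authors' implementation; the encoder size, mask ratio, loss weighting, cosine-distance distillation, and EMA momentum are all illustrative assumptions, and the paper's class-token and pseudo-keypoint alignment objectives are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: 224x224 images, 16x16 patches -> 14*14 = 196 patch tokens.
PATCH, DIM, N_PATCHES = 16, 256, 196

class TinyViT(nn.Module):
    """Minimal ViT-style encoder standing in for the backbone (sizes are assumptions)."""
    def __init__(self, dim=DIM, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=PATCH, stride=PATCH)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, N_PATCHES + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)            # (B, N, D)
        tokens = torch.cat([self.cls_token.expand(len(x), -1, -1), tokens], dim=1)
        return self.encoder(tokens + self.pos_embed)                       # (B, N+1, D)

class MIMSelfDistill(nn.Module):
    """Teacher-student MIM: student encodes a masked image, the EMA teacher encodes
    the full image; masked patch tokens are distilled and their pixels reconstructed."""
    def __init__(self):
        super().__init__()
        self.student, self.teacher = TinyViT(), TinyViT()
        self.teacher.load_state_dict(self.student.state_dict())
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.pixel_head = nn.Linear(DIM, 3 * PATCH * PATCH)                # masked pixel reconstruction

    def forward(self, img, mask_ratio=0.6):
        B = img.size(0)
        mask = torch.rand(B, N_PATCHES, device=img.device) < mask_ratio    # True = masked patch
        # Student branch: replace masked patch embeddings with a learnable mask token.
        s_tok = self.student.patch_embed(img).flatten(2).transpose(1, 2)
        s_tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(s_tok), s_tok)
        s_tok = torch.cat([self.student.cls_token.expand(B, -1, -1), s_tok], dim=1)
        s_out = self.student.encoder(s_tok + self.student.pos_embed)
        with torch.no_grad():                                               # teacher sees the full image
            t_out = self.teacher(img)
        # Token-level self-distillation on masked positions (cosine distance, an assumption).
        s_patch, t_patch = s_out[:, 1:][mask], t_out[:, 1:][mask]
        distill = 1 - F.cosine_similarity(s_patch, t_patch, dim=-1).mean()
        # Masked pixel reconstruction against the ground-truth patch pixels.
        patches = F.unfold(img, PATCH, stride=PATCH).transpose(1, 2)        # (B, N, 3*P*P)
        recon = F.l1_loss(self.pixel_head(s_patch), patches[mask])
        return distill + recon

    @torch.no_grad()
    def ema_update(self, momentum=0.996):
        """Update the teacher as an exponential moving average of the student."""
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

if __name__ == "__main__":
    model = MIMSelfDistill()
    loss = model(torch.randn(2, 3, 224, 224))
    loss.backward()
    model.ema_update()
```

After pre-training along these lines, the student encoder would be reused as the ViT backbone and fine-tuned with a mesh-regression head (PyMAF in the paper) for the downstream 3D hand mesh estimation task.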
