SBAHGNet: 3D Human Pose Estimation via Skeleton-Biased Attention and High-Frequency Enhanced Graph Convolution
Abstract
Monocular 3D human pose estimation is challenged by depth ambiguity and complex articulation, which complicate feature modeling and demand robust spatio-temporal representations. Although existing methods have advanced spatio-temporal modeling, two limitations remain. First, graph convolutional networks (GCNs) exhibit low-pass behavior that, as the network deepens, attenuates high-frequency geometric details in joint trajectories and thus degrades depth accuracy. Second, standard self-attention does not explicitly encode skeletal topology, so bone connectivity is modeled only indirectly. To address these issues, we propose SBAHGNet, a dual-branch spatio-temporal feature-fusion network. In the GCN branch, a Multi-Scale High-Frequency Enhancement (MSHFE) module, applied after feature aggregation, recovers high-frequency geometric cues lost to GCN smoothing, improving fine-grained depth representation. In the attention branch, a Skeletal-Biased Attention (SBA) module injects a learnable skeletal bias into spatial attention to explicitly encode skeletal topology and strengthen structural modeling. Complementary features from both branches are adaptively fused for final 3D pose regression. Extensive experiments on Human3.6M and MPI-INF-3DHP validate our approach. With detected 2D keypoints, SBAHGNet attains 37.24 mm MPJPE (P1) and 31.57 mm PA-MPJPE (P2) on Human3.6M (12.38 mm with ground-truth 2D), and 13.83 mm MPJPE, 99.02% PCK@150mm, and 88.22 AUC on MPI-INF-3DHP. With only 18.3M parameters, the model achieves a favorable accuracy–efficiency trade-off and outperforms many comparable methods.
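To illustrate what injecting a skeletal bias into spatial attention can look like, the following is a minimal sketch only, not the SBA module itself, since the abstract does not specify its form. It assumes the bias is added to the attention logits before the softmax and is initialized from bone adjacency. The function names `skeleton_biased_attention` and `adjacency_bias`, the `strength` parameter, and the 3-joint chain are hypothetical and chosen purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def adjacency_bias(num_joints, bones, strength=1.0):
    """Build a symmetric J x J bias that favors joints linked by a bone.

    Assumption: in a trained model these entries would be learnable
    parameters; here they are fixed values for illustration.
    """
    bias = [[0.0] * num_joints for _ in range(num_joints)]
    for a, b in bones:
        bias[a][b] = strength
        bias[b][a] = strength
    return bias


def skeleton_biased_attention(q, k, v, bias):
    """Single-head attention over joints with an additive logit bias.

    q, k, v: lists of J feature vectors. bias: J x J matrix.
    Returns (outputs, attention_weights).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, weights = [], []
    for i in range(len(q)):
        # Scaled dot-product logit plus the (assumed additive) skeletal bias.
        logits = [
            scale * sum(a * b for a, b in zip(q[i], k[j])) + bias[i][j]
            for j in range(len(k))
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights


if __name__ == "__main__":
    # Hypothetical 3-joint chain: joint 0 - joint 1 - joint 2.
    bias = adjacency_bias(3, [(0, 1), (1, 2)])
    feats = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # identical queries/keys
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    _, attn = skeleton_biased_attention(feats, feats, values, bias)
    for row in attn:
        print([round(x, 3) for x in row])
```

With identical queries and keys, the content term contributes nothing, so any difference in the attention weights comes from the bias alone: each joint attends more strongly to joints it shares a bone with. In a real model the content term and the learned bias would combine, letting the bias act as a structural prior rather than a hard mask.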