A Spatiotemporal Bidirectional Mamba Network with Global–Local Skeletal Enhancement for 3D Human Pose Estimation
Abstract
3D human pose estimation (HPE) is a cornerstone task in computer vision with diverse applications, and lifting 2D pose sequences to 3D representations has attracted significant interest. Transformer-based approaches have demonstrated robust performance but are hampered by quadratic computational complexity and insufficient bidirectional modeling capabilities. The recently introduced Mamba model mitigates these limitations through state-space models (SSMs), which offer linear complexity and model long-range dependencies effectively; however, it falls short in modeling the local skeletal interactions essential for human motion. To address this, we present BSTMamba, a bidirectional spatiotemporal SSM architecture designed specifically for monocular 3D HPE. BSTMamba integrates efficient global sequence modeling with localized convolutions and dynamic gating mechanisms to capture intricate spatiotemporal dependencies. For enhanced robustness and generalization, we introduce DisruptEnhance, a residual-compensated joint-order perturbation module that randomly disrupts joint orders at both global (full-skeleton) and local (body-part) scales, followed by feature compensation via a lightweight residual subnet. Comprehensive evaluations on the Human3.6M and MPI-INF-3DHP datasets show that BSTMamba attains state-of-the-art accuracy while requiring fewer parameters and fewer multiply-accumulate operations (MACs) than prior methods.
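To make the idea of joint-order perturbation at two scales concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a 17-joint skeleton and a hypothetical grouping of joint indices into body parts (`BODY_PARTS`); the function names are illustrative, and the residual compensation subnet described in the abstract is not modeled here.

```python
import numpy as np

# Hypothetical body-part grouping for a 17-joint skeleton.
# This index layout is an assumption for illustration only.
BODY_PARTS = [
    [0, 7, 8, 9, 10],   # torso and head
    [1, 2, 3],          # right leg
    [4, 5, 6],          # left leg
    [11, 12, 13],       # left arm
    [14, 15, 16],       # right arm
]

def global_permutation(num_joints, rng):
    """Shuffle the order of all joints across the full skeleton."""
    return rng.permutation(num_joints)

def local_permutation(num_joints, parts, rng):
    """Shuffle joints only within each body part; every part keeps its own slots."""
    perm = np.arange(num_joints)
    for part in parts:
        idx = np.asarray(part)
        perm[idx] = rng.permutation(idx)
    return perm

def disrupt(x, perm):
    """Reorder the joint axis of x, which has shape (frames, joints, channels)."""
    return x[:, perm, :]

def restore(x, perm):
    """Undo a joint reordering produced by disrupt."""
    inverse = np.argsort(perm)
    return x[:, inverse, :]

rng = np.random.default_rng(0)
poses = rng.standard_normal((4, 17, 2))  # 4 frames, 17 joints, 2D coordinates

global_perm = global_permutation(17, rng)
local_perm = local_permutation(17, BODY_PARTS, rng)

shuffled = disrupt(poses, local_perm)
recovered = restore(shuffled, local_perm)
```

Because an SSM processes joints as an ordered sequence, changing the scan order changes which joints are adjacent in that sequence. Keeping the inverse permutation available lets the perturbed features be mapped back to the canonical joint order before any downstream layer that depends on it.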