A Multi-Granular Joint Tracing Transformer for Video-Based 3D Human Pose Estimation

Abstract

Human pose estimation from monocular images captured by motion capture cameras is a crucial task with a wide range of downstream applications, e.g., action recognition, motion transfer, and movie making. However, previous methods have not effectively addressed the depth blur problem while jointly considering the temporal correlations of individual body joints and of multiple joints together. We address this issue by simultaneously exploiting temporal information at both single-joint and multiple-joint granularities. Inspired by the observation that different body joints follow different moving trajectories and can be correlated with one another, we propose an approach called the Multi-granularity jOint Tracing Transformer (MOTT). MOTT consists of two main components: (1) a spatial transformer that encodes each frame to obtain spatial embeddings of all joints, and (2) a multi-granularity temporal transformer that includes both a holistic temporal transformer, which handles the temporal correlation between all joints in consecutive frames, and a joint tracing temporal transformer, which processes the temporal embedding of each particular joint. The outputs of the two branches are fused to produce accurate 3D human poses. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that MOTT effectively encodes the spatial and temporal dependencies between body joints and outperforms previous methods in terms of mean per-joint position error (MPJPE).
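To make the described pipeline concrete, below is a minimal PyTorch sketch of how the three components could fit together: a per-frame spatial transformer, a holistic temporal branch over all joints of all frames, and a joint tracing branch over each joint's own trajectory. The module names, layer sizes, omission of positional encodings, and the simple averaging fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MOTTSketch(nn.Module):
    """Illustrative sketch of the MOTT pipeline described in the abstract.

    All hyperparameters and the averaging fusion are assumptions.
    """

    def __init__(self, in_dim=2, embed_dim=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, embed_dim)  # lift 2D keypoints to tokens

        def encoder():
            layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)

        self.spatial = encoder()   # attends across joints within each frame
        self.holistic = encoder()  # attends over all joints of all frames
        self.tracing = encoder()   # attends over time for each single joint
        self.head = nn.Linear(embed_dim, 3)  # regress 3D joint coordinates

    def forward(self, x):  # x: (B, T, J, 2) 2D keypoints over T frames
        B, T, J, _ = x.shape
        tok = self.embed(x)  # (B, T, J, C)
        # Spatial transformer: per-frame spatial embeddings of all joints.
        s = self.spatial(tok.reshape(B * T, J, -1)).reshape(B, T, J, -1)
        # Holistic temporal branch: one sequence of all joints in all frames.
        h = self.holistic(s.reshape(B, T * J, -1)).reshape(B, T, J, -1)
        # Joint tracing branch: each joint's own trajectory across frames.
        p = s.permute(0, 2, 1, 3).reshape(B * J, T, -1)
        p = self.tracing(p).reshape(B, J, T, -1).permute(0, 2, 1, 3)
        fused = (h + p) / 2  # fuse the two branches (assumed: averaging)
        return self.head(fused)  # (B, T, J, 3) 3D poses


# Toy usage: a 9-frame clip of 17 joints with 2D coordinates.
clip = torch.randn(1, 9, 17, 2)
print(MOTTSketch()(clip).shape)  # torch.Size([1, 9, 17, 3])
```

Reshaping the shared spatial embeddings into two different sequence layouts, (B, T·J, C) versus (B·J, T, C), is what separates the holistic correlation between all joints from the per-joint trajectory modeling in this sketch.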
