PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from the predicted 2D input pose to the 3D predictions, and inherent difficulty in handling self-occlusion. In this paper, we propose PandaPose, a 3D human pose lifting approach that propagates the 2D pose prior into a 3D anchor space serving as a unified intermediate representation. Specifically, our 3D anchor space comprises: (1) joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors that mitigate 2D pose estimation inaccuracies; (2) depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities; and (3) an anchor-feature interaction decoder that combines the 3D anchors with the lifted features to generate unified anchor queries encapsulating the joint-wise 3D anchor set, visual cues, and geometric depth information. The anchor queries are further employed for anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (Human3.6M, MPI-INF-3DHP, and 3DPW) demonstrate the superiority of our approach. A substantial 14.7% error reduction over state-of-the-art methods under the challenging conditions of Human3.6M, together with qualitative comparisons, further showcases the effectiveness and robustness of our approach.


💡 Research Summary

PandaPose tackles the long‑standing problem of lifting a 3D human pose from a single RGB image, a task that is notoriously difficult due to the inherent ambiguity of depth and the frequent occurrence of self‑occlusion. Existing image‑based lifting methods typically follow a direct joint‑to‑joint mapping: they either regress 3D joint coordinates directly from predicted 2D keypoints or augment this regression with image features extracted from a 2D pose estimator. While these approaches have achieved impressive results, they suffer from two fundamental drawbacks. First, any error in the upstream 2D pose prediction propagates directly to the 3D output, making the system fragile to noisy or partially missing keypoints. Second, relying solely on 2D image features provides insufficient depth cues, which hampers the ability to resolve depth ambiguities and to recover joints that are occluded in the image plane.

The core contribution of PandaPose is the introduction of a 3D anchor space as an intermediate representation that bridges the 2D pose prior and the final 3D pose. The anchor space consists of two complementary components: (1) Joint‑wise adaptive 3D anchors that are generated on‑the‑fly from the normalized 2D keypoints, and (2) a set of global fixed anchors that provide a stable reference frame. For each joint, a small cluster of local anchors is produced by applying a learned linear transformation to the 2D coordinates, yielding K 3‑D offset vectors that are added to the (x, y, 0) base position. This adaptive anchor generation dramatically reduces the average offset distance (from ~154 mm with fixed anchors to ~70 mm), thereby offering a robust coarse prior that is less sensitive to 2D noise.
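A minimal NumPy sketch of what this adaptive anchor generation could look like. The joint count `J`, anchors-per-joint `K`, and the weight matrix `W` are hypothetical stand-ins for the paper's learned parameters; this is a sketch of the mechanism, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

J, K = 17, 8                             # joints, anchors per joint (assumed sizes)
kpts_2d = rng.uniform(-1, 1, (J, 2))     # normalized 2D keypoints

# Learned linear transform (random weights stand in for trained ones).
W = rng.normal(0.0, 0.1, (2, K * 3))
b = np.zeros(K * 3)

def adaptive_anchors(kpts_2d, W, b):
    """Produce K joint-wise 3D anchors per joint from its 2D coordinates."""
    # K 3-D offset vectors per joint via the learned linear map.
    offsets = (kpts_2d @ W + b).reshape(-1, K, 3)                 # (J, K, 3)
    # Base position (x, y, 0) for each joint, then add the offsets.
    base = np.concatenate([kpts_2d, np.zeros((len(kpts_2d), 1))], axis=1)
    return base[:, None, :] + offsets                             # (J, K, 3)

anchors = adaptive_anchors(kpts_2d, W, b)
```

Because the offsets are conditioned on the joint's own 2D location, each anchor cluster starts near the true joint, which is what shrinks the average offset distance relative to a fixed global anchor set.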

To address depth ambiguity, PandaPose predicts a joint‑wise depth distribution instead of a single dense depth map. For every joint a separate depth probability map is estimated, supervised by the ground‑truth joint depth values (sparse supervision). This design enables the network to distinguish between joints that overlap in the 2D projection but lie at different depths, a scenario that commonly leads to failure in conventional methods. The depth maps are embedded into a feature vector (F_D) that is later fused with visual cues.
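One way to picture the joint-wise depth distribution is as a per-joint softmax over discrete depth bins, supervised only at the J ground-truth joint depths rather than as a dense map. The bin count, depth range, and logits below are illustrative assumptions, not values from the paper.

```python
import numpy as np

J, B = 17, 64                      # joints, depth bins (assumed)
z_min, z_max = 0.0, 8.0            # assumed depth range in metres
bin_centers = np.linspace(z_min, z_max, B)

rng = np.random.default_rng(1)
logits = rng.normal(size=(J, B))   # stand-in for per-joint network outputs

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

probs = softmax(logits)            # (J, B): one depth distribution per joint
z_hat = probs @ bin_centers        # expected depth per joint

# Sparse supervision: an L1 loss on J scalar depths only,
# not on a dense depth map over the whole image.
z_gt = rng.uniform(z_min, z_max, J)
loss = np.abs(z_hat - z_gt).mean()
```

Two joints that project to the same 2D location can still carry distinct depth distributions, which is what lets the model disambiguate overlapping joints.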

The visual features themselves are extracted efficiently using a 2D‑pose‑guided sampling strategy. Rather than processing the entire feature pyramid, the method samples a small spatial window around each predicted 2D joint, which both reduces memory consumption and filters out background clutter. The sampled visual features (F_I) are concatenated with the depth embedding (F_D) and lifted into a depth‑aware 3D feature space.
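The pose-guided sampling can be sketched as cropping a small window of the backbone feature map around each predicted 2D joint. The feature-map size, window side `S`, and integer pixel coordinates are simplifying assumptions (a real implementation would likely use sub-pixel interpolation).

```python
import numpy as np

C, H, W = 32, 64, 64               # feature-map channels and size (assumed)
J, S = 17, 5                       # joints, window side length (assumed)
feat = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

rng = np.random.default_rng(2)
kpts_px = rng.integers(S, H - S, (J, 2))   # joint pixel coords (x, y)

def sample_windows(feat, kpts_px, S):
    """Extract an SxS feature window centred on each 2D joint."""
    r = S // 2
    wins = [feat[:, y - r:y + r + 1, x - r:x + r + 1] for x, y in kpts_px]
    return np.stack(wins)          # (J, C, S, S) sampled visual features F_I

F_I = sample_windows(feat, kpts_px, S)
```

Sampling J small windows instead of attending over the full H×W map keeps memory proportional to J·S², and background regions never enter the lifting stage.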

All anchors are encoded as learnable queries (Q_anchor) and fed into a Transformer‑style decoder that performs anchor‑feature interaction. Through cross‑attention, each anchor query aggregates information from the lifted 3D features, resulting in a unified representation that encodes the original 3D anchor position, visual context, and depth cues. The decoder outputs, for each joint, a set of anchor‑to‑joint offsets (Δ_j,k) and weights (w_j,k). The final 3D joint position is computed as a weighted sum of the offset‑augmented anchors: $\hat{\mathbf{p}}_j = \sum_{k=1}^{K} w_{j,k}\,(\mathbf{a}_{j,k} + \Delta_{j,k})$.
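The decoder step and the ensemble prediction can be sketched end to end with a single-head scaled dot-product cross-attention in NumPy. The dimensions, the random query/feature tensors, and the two linear heads are hypothetical stand-ins for the trained modules; a real decoder would use multi-head attention with layer norms and feed-forward blocks.

```python
import numpy as np

rng = np.random.default_rng(3)
J, K, D = 17, 8, 64                # joints, anchors per joint, feature dim (assumed)

Q = rng.normal(size=(J * K, D))    # anchor queries Q_anchor, one per (joint, anchor)
F = rng.normal(size=(J * 32, D))   # lifted depth-aware 3D features

def cross_attention(Q, F):
    """Single-head scaled dot-product cross-attention (minimal sketch)."""
    scores = Q @ F.T / np.sqrt(Q.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ F                # queries updated with feature context

Q_out = cross_attention(Q, F)

# Prediction heads for per-anchor offsets and ensemble weights (random stand-ins).
W_off = rng.normal(0.0, 0.01, (D, 3))
W_w = rng.normal(0.0, 0.01, (D, 1))
offsets = (Q_out @ W_off).reshape(J, K, 3)             # offsets Delta_{j,k}
w = (Q_out @ W_w).reshape(J, K)
w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)   # softmax over K anchors

anchors = rng.normal(size=(J, K, 3))                   # from the anchor-generation step
P = (w[..., None] * (anchors + offsets)).sum(axis=1)   # final 3D joints, (J, 3)
```

The softmax over K anchors makes the final joint a convex combination of K candidate positions, so a single badly placed anchor can be down-weighted rather than corrupting the prediction.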

