Multi-modal 3D Pose and Shape Estimation with Computed Tomography
In perioperative care, precise in-bed 3D patient pose and shape estimation (PSE) can be vital for optimizing patient positioning during preoperative planning, enabling accurate overlay of medical images for augmented reality-based surgical navigation, and mitigating the risks of prolonged immobility during recovery. Conventional PSE methods that rely on modalities such as RGB-D, infrared, or pressure maps often struggle with occlusions caused by bedding and complex patient positioning, leading to inaccurate estimates that can affect clinical outcomes. To address these challenges, we present the first multi-modal in-bed patient 3D PSE network, mPSE-CT, which fuses detailed geometric features extracted from routinely acquired computed tomography (CT) scans with depth maps. mPSE-CT comprises a shape estimation module based on probabilistic correspondence alignment, a pose estimation module with a refined neural network, and a final parameters mixing module. This multi-modal network robustly reconstructs occluded body regions and improves the accuracy of the estimated 3D human mesh. We validated mPSE-CT in clinical scenarios using proprietary whole-body rigid phantom and volunteer datasets. mPSE-CT outperformed the best-performing prior method by 23% in pose estimation and 49.16% in shape estimation, demonstrating its potential to improve clinical outcomes in challenging perioperative environments.
💡 Research Summary
The paper introduces mPSE‑CT, a framework that fuses patient‑specific volumetric imaging (CT or MRI) with depth images to achieve highly accurate 3D pose and shape estimation (PSE) for patients lying in bed. Traditional in‑bed PSE approaches rely on depth, RGB, infrared, or pressure maps, which suffer from severe occlusions caused by bedding and surgical drapes, leading to unreliable reconstruction of hidden body parts. Moreover, existing domain‑adaptation methods use generic population priors that cannot capture atypical anatomies, limiting their clinical applicability.
mPSE‑CT addresses these gaps by treating the volumetric scan as a fixed, occlusion‑free shape anchor. A skin‑surface point cloud (~5,000 points) is extracted from the CT/MRI volume, encoded with PointNet++, and used in a probabilistic correspondence alignment (PCA) module. The module predicts soft assignments of each point to SMPL template vertices along with an outlier probability, forming a Gaussian‑mixture likelihood. By iteratively updating the posterior match probabilities and minimizing the negative log‑likelihood with respect to the shape coefficients β (while keeping the pose fixed), the method obtains a patient‑specific shape that aligns tightly with the true torso surface; a minimal sketch of this fitting loop follows.
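The sketch below illustrates the EM‑style alternation described above, assuming a differentiable SMPL layer (the stand‑in `smpl_vertices`) and replacing the network‑predicted soft assignments with distance‑based responsibilities; all names and hyperparameters are illustrative, not the paper's implementation.

```python
import math
import torch

def fit_shape(points, smpl_vertices, beta_dim=10, sigma2=1e-4,
              outlier_w=0.1, iters=50, lr=1e-2):
    """EM-style shape fit. points: (N, 3) skin-surface cloud from CT/MRI.
    smpl_vertices: hypothetical differentiable SMPL layer, pose held fixed."""
    beta = torch.zeros(beta_dim, requires_grad=True)
    opt = torch.optim.Adam([beta], lr=lr)
    for _ in range(iters):
        verts = smpl_vertices(beta)                 # (V, 3) template surface
        d2 = torch.cdist(points, verts).pow(2)      # (N, V) squared distances
        # E-step: posterior match probabilities under a Gaussian mixture
        # with a uniform outlier component (normalization constants are
        # folded into outlier_w for brevity).
        V = d2.shape[1]
        log_in = -d2 / (2 * sigma2) + math.log((1 - outlier_w) / V)
        log_out = torch.full((d2.shape[0], 1), math.log(outlier_w))
        gamma = torch.softmax(torch.cat([log_in, log_out], dim=1), dim=1)[:, :V]
        # M-step: gradient step on the expected negative log-likelihood
        # w.r.t. beta, with responsibilities frozen.
        loss = (gamma.detach() * d2).sum() / points.shape[0]
        opt.zero_grad(); loss.backward(); opt.step()
    return beta.detach()
```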
For pose estimation, a modified BodyMap network processes only the top‑view depth map. After median filtering and resizing to 128 × 54, the depth image is passed through a ResNet backbone followed by an MLP that regresses the joint rotations θ (axis‑angle) and the global translation t. Height information derived from the volumetric scan is injected as an auxiliary input to resolve the scale ambiguity inherent in depth‑only estimation.
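A minimal sketch of such a depth‑only pose branch, using a standard torchvision ResNet‑18 as a stand‑in backbone (the summary does not pin down the exact variant) and SMPL's 24 joints; the module structure is hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DepthPoseNet(nn.Module):
    """Regresses SMPL pose theta and translation t from a depth map.
    Input depth is assumed already median-filtered and resized to 128 x 54."""
    def __init__(self, n_joints=24):
        super().__init__()
        backbone = resnet18(weights=None)
        # Depth maps are single-channel, so swap the RGB stem.
        backbone.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()                 # expose 512-d features
        self.backbone = backbone
        # +1 input: scan-derived patient height, which resolves the
        # scale ambiguity of depth-only estimation.
        self.head = nn.Sequential(
            nn.Linear(512 + 1, 256), nn.ReLU(),
            nn.Linear(256, 3 * n_joints + 3))       # theta (axis-angle) + t

    def forward(self, depth, height):
        feat = self.backbone(depth)                 # depth: (B, 1, 128, 54)
        out = self.head(torch.cat([feat, height[:, None]], dim=1))
        return out[:, :-3], out[:, -3:]             # theta, t
```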
The two modality streams are merged by a lightweight cross‑modal residual fusion module (the final parameters mixing module of the abstract). It computes confidence weights (the outlier probability for shape, a depth confidence map for pose) and applies them to the respective feature vectors. Residual corrections are then exchanged between the streams, allowing high‑confidence shape features to refine pose predictions and vice versa. This dynamic, confidence‑aware fusion avoids the naïve concatenation strategies of prior multimodal works, which treat all sensors equally regardless of occlusion.
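A minimal sketch of confidence‑weighted residual fusion between the two feature streams; reducing the outlier probability and the depth confidence map to per‑sample scalar weights is an illustrative simplification.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Exchanges confidence-gated residual corrections between streams."""
    def __init__(self, dim):
        super().__init__()
        self.shape_to_pose = nn.Linear(dim, dim)   # residual correction maps
        self.pose_to_shape = nn.Linear(dim, dim)

    def forward(self, f_shape, f_pose, w_shape, w_pose):
        # w_*: per-sample confidences in [0, 1], e.g. 1 - mean outlier
        # probability for shape, mean depth confidence for pose.
        f_pose_out = f_pose + w_shape[:, None] * self.shape_to_pose(f_shape)
        f_shape_out = f_shape + w_pose[:, None] * self.pose_to_shape(f_pose)
        return f_shape_out, f_pose_out
```

With this gating, an occluded depth view (low `w_pose`) contributes little correction to the shape stream, while the occlusion‑free CT‑derived shape features still refine the pose.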
Experimental validation comprises three parts. (1) A large‑scale MRI‑based simulation dataset (HIT, N = 300) demonstrates statistical robustness: mPSE‑CT achieves an average vertex‑to‑vertex (V2V) error of 0.38 cm, a 49% improvement over the previous state of the art. (2) A real CT‑based phantom study (N = 1) shows clinical feasibility: the torso V2V error drops to 0.26 cm, satisfying the ≤0.5 cm accuracy requirement for augmented‑reality surgical navigation (e.g., pedicle screw placement). (3) In vivo evaluation on six volunteers in realistic supine and lateral positions, under heavy bedding occlusion, yields a mean pose MAE of 7.2° and a shape MAE of 0.12 cm, confirming consistent performance across diverse poses.
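For reference, a minimal sketch of the V2V metric cited above, assuming predicted and ground‑truth SMPL meshes share the same vertex ordering.

```python
import numpy as np

def v2v_error_cm(pred_verts, gt_verts):
    """Mean Euclidean distance between corresponding vertices.
    pred_verts, gt_verts: (V, 3) arrays in metres, same vertex order."""
    return 100.0 * np.mean(np.linalg.norm(pred_verts - gt_verts, axis=1))
```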
The contributions are threefold: (i) the first in‑bed PSE pipeline that explicitly fuses a patient‑specific volumetric shape prior with depth‑based pose estimation; (ii) a confidence‑guided residual fusion module that dynamically balances the two modalities; (iii) the release of a high‑fidelity phantom dataset containing paired CT volumes, depth images, surface scans, and SMPL ground truth, establishing a benchmark for future multimodal PSE research. Limitations include reliance on pre‑acquired CT/MRI scans (i.e., the shape prior is static) and the need for low‑dose or rapid volumetric imaging to enable real‑time clinical workflows. Future work will explore integration of ultra‑low‑dose CT or 3D ultrasound as on‑the‑fly shape priors and scaling the system to multi‑patient, multi‑bed environments.
In summary, mPSE‑CT demonstrates that leveraging routinely acquired volumetric imaging as a patient‑specific anatomical anchor, combined with a confidence‑aware residual fusion strategy, can dramatically improve 3D pose and shape estimation under severe occlusion, opening the door to more reliable AR‑guided surgery, pressure‑injury prevention, and postoperative monitoring in perioperative care.