HiFECap: Monocular High-Fidelity and Expressive Capture of Human Performances

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Monocular 3D human performance capture is indispensable for many applications in computer graphics and vision that enable immersive experiences. However, detailed capture of humans requires tracking multiple aspects, including the skeletal pose, the dynamic surface including clothing, hand gestures, and facial expressions. No existing monocular method allows joint tracking of all these components. To this end, we propose HiFECap, a new neural human performance capture approach that simultaneously captures human pose, clothing, facial expression, and hands from just a single RGB video. We demonstrate that our proposed network architecture, the carefully designed training strategy, and the tight integration of parametric face and hand models into a template mesh enable the capture of all these individual aspects. Importantly, our method also captures high-frequency details, such as deforming wrinkles on the clothes, better than previous works. Furthermore, we show that HiFECap outperforms state-of-the-art human performance capture approaches qualitatively and quantitatively while, for the first time, capturing all aspects of the human.


💡 Research Summary

HiFECap introduces a novel monocular 3D human performance capture system that simultaneously recovers full‑body pose, dynamic clothing deformations, hand gestures, and facial expressions from a single RGB video. The authors identify a gap in existing monocular approaches, which focus on body pose, hand pose, or facial expression in isolation; none captures all components together, and in particular none recovers high‑frequency clothing details such as wrinkles. To address this, HiFECap is built around a three‑stage coarse‑to‑fine pipeline.

The first stage, PoseNet, is a ResNet‑50 based network that predicts a 27‑dimensional joint‑angle vector together with the global rotation and translation. Supervision comes from multi‑view 2D keypoint detections (OpenPose) via a 2D keypoint loss, ensuring accurate skeletal articulation.
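As a rough illustration of this kind of supervision (not the paper's actual implementation), a confidence‑weighted 2D keypoint loss compares projected 3D joints against detected keypoints. The function names and the simple pinhole camera model below are assumptions for the sketch:

```python
import numpy as np

def project(points_3d, K):
    """Pinhole projection of Nx3 camera-space points with intrinsics K (3x3)."""
    p = points_3d @ K.T          # N x 3 homogeneous pixel coordinates
    return p[:, :2] / p[:, 2:3]  # perspective divide -> N x 2 pixels

def keypoint_loss_2d(joints_3d, keypoints_2d, confidences, K):
    """Confidence-weighted squared 2D error between projected skeleton
    joints and detected keypoints (e.g. from OpenPose)."""
    diff = project(joints_3d, K) - keypoints_2d   # N x 2 residuals
    return float(np.sum(confidences * np.sum(diff**2, axis=1)))
```

Weighting by detector confidence downweights joints that OpenPose localizes unreliably, which is a common choice for this style of loss.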

The second stage, Embedded Deformation Network (EDefNet), also uses a ResNet‑50 backbone to predict the parameters of an embedded deformation graph (node rotations A and translations T). This graph captures coarse, piece‑wise rigid deformations of the body and clothing. A per‑vertex rigidity weight r_i, computed offline, distinguishes near‑rigid regions (skin, shoes) from highly deformable cloth, allowing the network to apply stronger constraints where appropriate.
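The embedded deformation graph itself follows the classic formulation (Sumner et al. style): each vertex is deformed by a skinning‑weighted blend of per‑node rigid transforms. A minimal NumPy sketch, with all names hypothetical, might look like:

```python
import numpy as np

def embedded_deformation(vertices, nodes, rotations, translations, weights):
    """Deform vertices with an embedded deformation graph:
        v' = sum_k w_ik * (A_k @ (v_i - g_k) + g_k + t_k)
    vertices: Vx3, graph nodes g: Kx3, node rotations A: Kx3x3,
    node translations t: Kx3, skinning weights w: VxK (rows sum to 1)."""
    out = np.zeros_like(vertices)
    for k in range(weights.shape[1]):
        # Transform all vertices rigidly about node k, then blend.
        local = (vertices - nodes[k]) @ rotations[k].T + nodes[k] + translations[k]
        out += weights[:, k:k+1] * local
    return out
```

Because each node contributes a full rigid transform, the graph captures coarse, piece‑wise rigid motion of body and clothing while leaving fine wrinkles to the later per‑vertex displacement stage.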

The third stage, DisplaceNet, is the core of high‑frequency detail recovery. An image encoder (U‑Net, called DUNet) extracts a 256×256×32 feature map from the input frame. These features are projected onto the currently posed and coarsely deformed mesh using a visibility‑aware rasterization function: visible vertices receive the exact pixel‑level feature at their projected (u, v) location, while occluded vertices receive the average image feature, enabling the network to infer plausible deformations for hidden parts. The projected features serve as node attributes for a graph convolutional network (DGCN), which outputs a per‑vertex displacement vector d_i in canonical space. A rigidity mask M zeroes out displacements for vertices marked as rigid, preserving the intended physical behavior.
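The visibility‑aware feature lookup described above can be sketched as follows. This is a simplified nearest‑pixel version (the function name and interface are assumptions, and a real implementation would use the rasterizer's differentiable sampling):

```python
import numpy as np

def gather_vertex_features(feature_map, uv, visible):
    """Visibility-aware per-vertex feature lookup.
    feature_map: HxWxC image features (e.g. from the U-Net encoder),
    uv: Vx2 projected pixel coordinates, visible: length-V bool mask.
    Visible vertices sample the feature at their projected pixel;
    occluded vertices fall back to the spatial mean feature."""
    H, W, _ = feature_map.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    feats = feature_map[v, u]                        # V x C pixel lookup (copy)
    feats[~visible] = feature_map.mean(axis=(0, 1))  # average-feature fallback
    return feats
```

The mean‑feature fallback gives the graph network a coherent global signal for hidden regions, which is what lets it hallucinate plausible deformations on the unseen back side.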

Training is weakly supervised because ground‑truth per‑frame 3D geometry is unavailable for arbitrary clothing. The authors collect multi‑view video of the target actor in a studio, reconstruct per‑frame point clouds with multi‑view stereo (Agisoft Metashape), and use these as supervision. The loss suite includes:

  • Silhouette loss (L_sil) aligning projected mesh silhouettes with multi‑view masks.
  • 2D landmark loss (L_mk) aligning projected skeleton joints with detected keypoints.
  • Dense rendering loss (L_dr) that renders the textured mesh under estimated spherical‑harmonics lighting and penalizes pixel‑wise differences from the input images, encouraging correct surface shading and fine detail.
  • Chamfer loss (L_cf) between the deformed mesh and the stereo‑reconstructed point cloud, providing depth‑direction supervision.
  • Regularizers: as‑rigid‑as‑possible (ARAP) on the deformation graph, Laplacian smoothness, and isometry constraints, all weighted by material‑aware rigidity factors.
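Of the losses above, the Chamfer term is the simplest to make concrete. A brute‑force NumPy sketch (fine for small point sets; real pipelines would use a KD‑tree or GPU nearest‑neighbour search) of a symmetric Chamfer distance between the deformed mesh vertices and the stereo point cloud:

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (Nx3, Mx3):
    mean squared distance from each point to its nearest neighbour in
    the other set, summed over both directions."""
    # Pairwise squared distances, N x M.
    d2 = np.sum((points_a[:, None, :] - points_b[None, :, :])**2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Because the stereo point cloud constrains geometry along the viewing depth, this term complements the silhouette loss, which only constrains the image‑plane outline.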

Training proceeds in stages: PoseNet is first trained (or pre‑trained) and then frozen. EDefNet is trained with L_sil, L_mk, and ARAP. Afterwards DisplaceNet is trained with the full loss suite, allowing the network to learn high‑frequency wrinkles.
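The staged schedule can be summarized as a small configuration table. The stage and loss names below mirror the summary's notation, but the exact loss assignments per stage are a hypothetical reading of the description:

```python
# Hypothetical sketch of the staged schedule: each stage freezes the
# earlier networks and enables a growing subset of the losses.
TRAINING_STAGES = [
    {"train": "PoseNet",     "frozen": [],
     "losses": ["L_mk"]},
    {"train": "EDefNet",     "frozen": ["PoseNet"],
     "losses": ["L_sil", "L_mk", "L_arap"]},
    {"train": "DisplaceNet", "frozen": ["PoseNet", "EDefNet"],
     "losses": ["L_sil", "L_mk", "L_dr", "L_cf", "L_arap", "L_lap", "L_iso"]},
]
```

Freezing earlier stages keeps the coarse pose and graph deformation stable while the displacement network learns the residual high‑frequency detail.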

A further contribution is the integration of parametric face (FLAME) and hand (MANO) models. The original template’s face and hand regions are replaced with these models, and a dedicated network predicts the corresponding expression and hand pose parameters from the same input image. This yields consistent, high‑resolution facial expressions and articulated finger motions within the same global mesh.

Quantitative experiments on a variety of clothing types (t‑shirts, skirts, dresses) and motions show that HiFECap reduces average vertex error by 20‑30 % compared to state‑of‑the‑art monocular methods such as MonoPerfCap, LiveCap, and DeepCap. Qualitative results demonstrate clear reconstruction of garment folds, facial nuances, and finger articulations that were previously missing. Inference runs at roughly 30 ms per frame on a modern GPU, enabling near‑real‑time applications.

In summary, HiFECap advances monocular performance capture by (1) a coarse‑to‑fine multi‑stage architecture, (2) a visibility‑ and rigidity‑aware graph‑based vertex displacement network for high‑frequency detail, (3) seamless incorporation of parametric face and hand models, and (4) a carefully designed weakly supervised loss formulation that leverages multi‑view silhouettes, dense rendering, and stereo point clouds. This combination allows a single RGB camera to produce temporally coherent, high‑fidelity 3D reconstructions of the entire human performer, opening new possibilities for film, gaming, AR/VR, and telepresence.

