Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image
Simon Giebenhain1, Tobias Kirschstein1, Liam Schoneveld2, Davide Davoli3∗, Zhe Chen2, Matthias Nießner1
1Technical University of Munich  2Woven by Toyota  3Toyota Motor Europe
∗Providing contracted services for Toyota
Figure 1. Pix2NPHM is a feed-forward network that predicts NPHM [12] latent codes from a single image. The latent codes can be further optimized at test time to obtain more detailed 3D reconstructions. Here, we show mesh overlays showcasing well-aligned fittings of diverse head shapes and expressions under strong lighting conditions and occlusions. Website: https://simongiebenhain.github.io/Pix2NPHM/
Abstract
Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3D morphable models (3DMMs) that facilitate high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters given a single image as input. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and more accurate facial expressions. For broad generalization, we exploit domain-specific ViTs as backbones, which are pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including a total of over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground-truth geometry. Pix2NPHM not only allows for 3D reconstructions at interactive frame rates; geometric fidelity can also be improved by a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that can run at scale on in-the-wild data.
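To make the single-image regression setup concrete, the following is a minimal sketch of a ViT-based latent-code regressor in PyTorch. The backbone (a generic ImageNet ViT), the feature dimension, the latent-code dimensions, and the head design are illustrative assumptions; Pix2NPHM itself relies on domain-specific ViT backbones pretrained on geometric prediction tasks, which are not reproduced here.

```python
# Illustrative sketch only, not the authors' architecture: a ViT backbone with
# small MLP heads that regress NPHM-style identity and expression latent codes
# from a single RGB image. Backbone choice and all dimensions are placeholders.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


class NPHMLatentRegressor(nn.Module):
    def __init__(self, id_dim: int = 512, expr_dim: int = 200):  # placeholder dims
        super().__init__()
        self.backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.backbone.heads = nn.Identity()  # expose the 768-d class-token feature
        self.id_head = nn.Sequential(nn.Linear(768, 512), nn.GELU(), nn.Linear(512, id_dim))
        self.expr_head = nn.Sequential(nn.Linear(768, 512), nn.GELU(), nn.Linear(512, expr_dim))

    def forward(self, image: torch.Tensor):
        # image: (B, 3, 224, 224), normalized as the backbone expects
        feat = self.backbone(image)  # (B, 768)
        return self.id_head(feat), self.expr_head(feat)


model = NPHMLatentRegressor().eval()
with torch.no_grad():
    z_id, z_expr = model(torch.randn(1, 3, 224, 224))  # one feed-forward pass
```

Because the prediction is a single forward pass, reconstruction runs at interactive frame rates; the predicted codes can then serve as initialization for the optional inference-time optimization described above.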
1. Introduction
Reconstructing faces in 3D, tracking facial movements, and ultimately extracting animation signals for virtual avatars are fundamental problems in many domains, such as computer games, the movie industry, telecommunication, and AR/VR. Arguably the most relevant sub-task is 3D face reconstruction from a single image, due to the vast availability of image collections as well as straightforward extensions to sequential tracking.
In order to solve the underconstrained reconstruction problem, 3D morphable models (3DMMs) [2] have evolved into the industry and research standard due to their concise low-dimensional parametric representation, which has led to a plethora of algorithms built on top of them. With the advancement of deep-learning methods, photometric tracking [45] approaches have been augmented with additional priors, such as facial landmark detection, or with direct 3DMM parameter regression from RGB input [8, 10, 39, 56, 63]. Recently, additional priors such as dense landmarks [41, 52] and surface normals [15] have further improved reconstructions. Because such methods enable fitting in even the most challenging scenarios, 3DMMs have become an essential component of photo-realistic avatars [14, 33], generalized avatars [4, 23, 24], and even controllable generative diffusion models for faces [22, 32, 42, 43, 60].
While 3DMMs have achieved great success in these domains, we argue that their concise parametric representation comes at the cost of geometric expressiveness: modern 3DMMs, such as FLAME [25], are unable to model high-fidelity geometric detail. Therefore, a more recent line of work has developed neural parametric head models (NPHMs) [12, 13, 54, 57] for increased representational capacity, as shown in Fig. 2. This increased model capacity, however, makes image-based reconstruction challenging due to its expressive parameter space. MonoNPHM [13] has attempted to reconstruct NPHM parameters from a single image; however, its purely photometric fitting approach remained slow and brittle in real-world applications.
To this end, we propose a robust and high-fidelity fitting framework, yielding a first-class tool for face reconstruction and tracking based on NPHM [12, 13].
Our approach addresses the two main challenges of neural parametric model fitting: underconstrained optimization and reconstruction speed. We achieve this by tailoring a transformer-based feed-forward predictor of NPHM parameters from a single image. Since our approach is highly data-driven, large-scale, high-quality training data is essential. To this end, we curated a large collection of publicly available 3D face datasets and fitted MonoNPHM against them, resulting in a total of 102K registrations, which will be shared with the research community. Despite these efforts, we find that training on large-scale 2D video datasets using a self-supervised geometric loss based on estima
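For intuition, the snippet below sketches one plausible form of such a self-supervised geometric loss: a masked angular-consistency term between normal maps rendered from the predicted geometry and pseudo-ground-truth normals produced by an off-the-shelf estimator. The function name, tensor layout, and exact loss formulation are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sketch of a masked normal-consistency loss (not the paper's exact loss).
import torch
import torch.nn.functional as F


def normal_consistency_loss(pred_normals: torch.Tensor,
                            pseudo_gt_normals: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """pred_normals / pseudo_gt_normals: (B, 3, H, W) normal maps in a shared frame;
    mask: (B, 1, H, W) valid-pixel mask, e.g., the face foreground."""
    pred = F.normalize(pred_normals, dim=1)
    gt = F.normalize(pseudo_gt_normals, dim=1)
    cos = (pred * gt).sum(dim=1, keepdim=True)  # per-pixel cosine similarity
    return ((1.0 - cos) * mask).sum() / mask.sum().clamp(min=1.0)
```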