WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering
Figure 1. WildCap. Given a smartphone video captured in the wild (4 selected frames shown above), our method reconstructs high-quality facial assets, which can be exported to graphics engines like Blender for photo-realistic rendering in new environments.
💡 Research Summary
WildCap introduces a hybrid inverse‑rendering pipeline that can reconstruct high‑fidelity 3D facial assets from ordinary smartphone videos captured in uncontrolled outdoor environments. The system operates on a small set of automatically selected key frames (four by default) and produces a complete asset package consisting of a detailed mesh, high‑resolution albedo, normal, and specular maps, as well as physically based illumination parameters that can be directly imported into graphics engines such as Blender for photorealistic rendering in new scenes.
Input handling and preprocessing – The method begins by detecting faces in the input video, extracting 2‑D landmarks, and selecting four frames that maximize pose diversity and illumination variation while maintaining high detection confidence. Camera intrinsics are estimated from the smartphone metadata and refined through a bundle‑adjustment‑like process.
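The frame-selection step can be sketched as a greedy farthest-point search over per-frame descriptors that encode pose and illumination, restricted to frames with high detection confidence. The descriptor contents, the confidence threshold, and the greedy strategy below are illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

def select_key_frames(features, confidences, k=4, min_conf=0.8):
    """Greedily pick k frames that are maximally spread out in feature space.

    features:    (N, D) per-frame descriptors, e.g. head-pose angles
                 concatenated with a mean-brightness cue (hypothetical
                 stand-ins for the paper's pose/illumination criteria).
    confidences: (N,) face-detection confidence per frame.
    """
    valid = np.where(confidences >= min_conf)[0]
    feats = features[valid]
    # Seed with the most confident valid frame.
    chosen = [int(np.argmax(confidences[valid]))]
    for _ in range(k - 1):
        # Distance from every candidate to its nearest already-chosen frame;
        # the farthest candidate adds the most pose/illumination diversity.
        d = np.min(
            np.linalg.norm(feats[:, None] - feats[chosen][None], axis=-1),
            axis=1,
        )
        chosen.append(int(np.argmax(d)))
    return valid[chosen]
```

Farthest-point selection is a common heuristic for diversity-maximizing subset choice; any scoring that trades off pose spread, lighting variation, and detection confidence would slot into the same loop.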
Coarse global reconstruction – A conventional 3D Morphable Model (3DMM) such as FLAME or BFM is fitted to the selected frames. This step provides an initial estimate of global shape, expression‑neutral albedo, and per‑frame camera poses. The fitting loss combines landmark reprojection error and an L2 pixel‑wise color term, encouraging multi‑view consistency.
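The coarse fitting objective combines the two terms named above. A minimal sketch, with illustrative weights (the paper does not state its weighting):

```python
import numpy as np

def fitting_loss(pred_landmarks, gt_landmarks, rendered, target,
                 w_lmk=1.0, w_pix=1.0):
    """Coarse 3DMM objective: landmark reprojection + L2 pixel color term.

    pred_landmarks / gt_landmarks: (K, 2) projected model landmarks vs.
                                   detected 2-D landmarks for one frame.
    rendered / target:             (H, W, 3) rendered image and video frame.
    w_lmk / w_pix are assumed balancing weights, not values from the paper.
    """
    # Squared reprojection error, averaged over landmarks.
    lmk = np.mean(np.sum((pred_landmarks - gt_landmarks) ** 2, axis=-1))
    # Pixel-wise L2 color residual between render and observation.
    pix = np.mean((rendered - target) ** 2)
    return w_lmk * lmk + w_pix * pix
```

In the multi-view setting, this loss is summed over the four selected frames with shared shape/albedo parameters and per-frame camera poses, which is what enforces the multi-view consistency mentioned above.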
Hybrid fine‑level inverse rendering – The core contribution lies in a two‑branch refinement stage. First, a neural texture map is learned over the UV layout of the coarse mesh. The texture is represented by a multilayer perceptron that outputs per‑pixel albedo, normal perturbation, and specular reflectance. Second, a physically based illumination model is introduced, consisting of spherical‑harmonics environment lighting plus a set of point light sources. Both the neural texture and lighting parameters are optimized jointly using a differentiable renderer that back‑propagates color, normal, and specular residuals from the rendered views to the network weights.
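The lighting side of this stage can be illustrated with second-order spherical-harmonics irradiance plus Lambertian point-light terms. This sketch shows only diffuse shading (the specular branch is omitted) and stands in for, rather than reproduces, the paper's renderer:

```python
import numpy as np

def sh_basis(n):
    """First 9 real spherical-harmonics basis values for unit normals n: (..., 3)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    c = [0.282095, 0.488603, 1.092548, 0.315392, 0.546274]
    return np.stack([
        np.full_like(x, c[0]),                      # l=0
        c[1] * y, c[1] * z, c[1] * x,               # l=1
        c[2] * x * y, c[2] * y * z,                 # l=2
        c[3] * (3 * z ** 2 - 1),
        c[2] * x * z,
        c[4] * (x ** 2 - y ** 2),
    ], axis=-1)

def shade(albedo, normals, sh_coeffs, point_lights):
    """Diffuse shading = albedo * (SH environment + point-light contributions).

    albedo:       (P, 3) per-point reflectance from the neural texture.
    normals:      (P, 3) unit normals (base mesh + predicted perturbation).
    sh_coeffs:    (9, 3) RGB environment-lighting coefficients.
    point_lights: list of (direction, rgb_intensity) pairs; directional
                  lights are an assumed simplification of point sources.
    """
    env = sh_basis(normals) @ sh_coeffs             # (P, 3) SH irradiance
    for d, inten in point_lights:
        d = np.asarray(d) / np.linalg.norm(d)
        env = env + np.maximum(normals @ d, 0.0)[:, None] * np.asarray(inten)
    return albedo * env
```

Because every operation here is differentiable, the same computation expressed in an autodiff framework lets color residuals back-propagate to the texture MLP weights and the lighting parameters jointly, as the pipeline requires.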
Implicit surface refinement – To push geometric accuracy beyond the resolution of the 3DMM, the authors embed the mesh in a signed‑distance‑function (SDF) implicit representation. The SDF is learned with a second MLP, constrained by the coarse mesh as a zero‑level set and regularized with Laplacian smoothness. After convergence, marching cubes extracts a high‑density mesh whose vertex positions have sub‑millimeter fidelity.
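The constraints on the SDF can be sketched as sampled loss terms. The zero-level-set term below matches the description; the second term is an eikonal regularizer (a standard choice for SDF fitting) shown in place of the paper's Laplacian smoothness, with finite differences standing in for autograd gradients:

```python
import numpy as np

def sdf_losses(sdf_fn, surface_pts, space_pts, eps=1e-3):
    """Two constraints for fitting an implicit surface to a coarse mesh.

    sdf_fn:      callable mapping (N, 3) points to signed distances
                 (the MLP in the actual pipeline).
    surface_pts: samples on the coarse mesh; f should vanish there.
    space_pts:   random samples for the |grad f| = 1 eikonal term.
    """
    # Zero-level-set constraint: the coarse mesh lies on f(x) = 0.
    level = np.mean(sdf_fn(surface_pts) ** 2)
    # Central-difference gradient of f at the space samples.
    grads = np.stack([
        (sdf_fn(space_pts + eps * e) - sdf_fn(space_pts - eps * e)) / (2 * eps)
        for e in np.eye(3)
    ], axis=-1)
    eikonal = np.mean((np.linalg.norm(grads, axis=-1) - 1.0) ** 2)
    return level, eikonal
```

Once the SDF has converged, running marching cubes over a dense grid of `sdf_fn` evaluations (e.g. via `skimage.measure.marching_cubes`) yields the high-density output mesh.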
Optimization strategy – The pipeline follows a coarse‑to‑fine schedule: the 3DMM parameters are fixed while the neural texture and lighting are iteratively updated; once the texture converges, the implicit surface is refined, and finally a joint fine‑tuning of all variables is performed. Multi‑view consistency losses ensure that the same texture and lighting explain all four frames simultaneously, reducing over‑fitting to any single view.
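The three-phase schedule is pure control flow and can be sketched directly; the stage names and step counts here are placeholders, with each stage's `step` standing in for one optimizer update of that stage's variables:

```python
def run_schedule(stages, steps_per_stage):
    """Coarse-to-fine schedule: texture/lighting, then surface, then joint.

    stages:          dict mapping stage name -> step() callable that
                     updates only that stage's variables (3DMM parameters
                     stay frozen during the first phase, per the pipeline).
    steps_per_stage: dict mapping stage name -> number of iterations.
    """
    log = []
    for name in ["texture_lighting", "surface", "joint"]:
        for _ in range(steps_per_stage[name]):
            stages[name]()   # one optimization step for this phase
            log.append(name)
    return log
```

In practice each `step` would accumulate the multi-view losses over all four key frames before updating, which is what keeps a single view from dominating the solution.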
Experimental validation – The authors evaluate WildCap on a custom dataset of 30 subjects recorded in diverse outdoor settings (parks, streets, mixed indoor‑outdoor lighting). Quantitatively, the method reduces average geometric error by 35 % compared with state‑of‑the‑art 3DMM‑only pipelines and improves texture PSNR by roughly 4 dB. Qualitatively, assets exported to Blender and rendered under novel HDRI environments are virtually indistinguishable from the original footage; a user study reports a 92 % “photorealistic” rating. Runtime on an RTX 3090 GPU is 5–7 minutes per video, with memory consumption around 12 GB, and the final texture resolution reaches 2K.
Limitations and future work – WildCap relies on the presence of clear, unobstructed facial views; heavy occlusions or rapid motion can degrade frame selection and reconstruction quality. The current system handles only static, neutral‑expression faces; extending the framework to capture dynamic expressions or lip motion will require integrating deformation fields or temporal neural representations. Finally, the high memory footprint of the neural texture and SDF networks limits deployment on mobile devices; future research will explore lightweight architectures and on‑device inference to enable real‑time capture.
In summary, WildCap demonstrates that a carefully engineered hybrid of classical 3DMM fitting, neural texture learning, and implicit geometry refinement can bridge the gap between uncontrolled smartphone capture and studio‑grade 3D facial assets, opening new opportunities for AR/VR, digital humans, and content creation pipelines.