From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.


💡 Research Summary

SuperHead addresses the problem of enhancing low‑resolution, animatable 3D head avatars—commonly produced from consumer‑grade video or image captures—by generating high‑fidelity geometry and textures while preserving identity across motions. The authors observe that existing 2D image super‑resolution, video super‑resolution, and static 3D super‑resolution methods either introduce temporal flicker, fail to enforce multi‑view consistency, or cannot handle dynamic deformations required for talking heads.

The proposed framework leverages the rich priors of a pre‑trained 3D‑aware generative adversarial network (3D‑GAN), specifically GSGAN, to perform a dynamics‑aware 3D inversion that directly upsamples the avatar in 3D space. The pipeline consists of three main stages:

  1. Multi-view 3D GAN Inversion – From the low-resolution (LR) avatar, a set of neutral-expression renderings is sampled uniformly around the head. Each rendering is upscaled with an off-the-shelf 2D super-resolution model, producing high-resolution (HR) images and corresponding depth maps. The latent code w in the GAN's W+ space and the generator weights g are then jointly optimized to minimize a combined loss of pixel-wise L2 reconstruction, perceptual LPIPS, and depth consistency, which forces the generated 3D volume to match both appearance and geometry across multiple viewpoints (a minimal optimization sketch follows this list).

  2. 3D Gaussian Splatting (3DGS) Rigging – The high-resolution 3D-GAN output is converted into a 3D Gaussian splatting representation. Because the original LR avatar's underlying FLAME mesh may be geometrically inaccurate, the authors first refine the global shape parameters β by aligning 2D facial landmarks from the HR views with the projected 3D landmarks of the FLAME mesh. Each Gaussian primitive is then attached to its nearest face of the corrected mesh, with local coordinates stored so that the Gaussians follow the mesh as pose and expression change, enabling realistic animation (see the binding sketch after this list).

  3. Dynamics-aware 3D Refinement – To guarantee consistency beyond the neutral view, a diverse set of anchor images covering various expressions, poses, and camera angles is collected from the LR avatar, upscaled, and fed back into the 3D-GAN inversion. The same latent code and generator parameters are shared across all anchors, with the loss accumulated over every view (as in the optimization sketch below). This multi-condition optimization ensures that the final avatar retains fine details (e.g., teeth, eyeballs) and remains temporally stable under arbitrary facial motions.
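
Since the summary includes no reference implementation, the following is a minimal PyTorch-style sketch of the joint, multi-view inversion objective described in items 1 and 3. The `generator.render` interface stands in for GSGAN's actual rendering API, and the anchor-view layout and loss weights are illustrative assumptions rather than the paper's hyperparameters.

```python
# Hypothetical sketch of the dynamics-aware multi-view inversion loss (stages 1 and 3).
# `generator.render(w_plus, cam)` is an assumed interface standing in for GSGAN;
# the loss weights are illustrative, not the paper's values.
import torch
import torch.nn.functional as F
import lpips  # perceptual LPIPS loss (pip install lpips)

perceptual = lpips.LPIPS(net="vgg")

def inversion_step(w_plus, generator, anchors, optimizer,
                   w_lpips=0.8, w_depth=0.1):
    """One optimization step over all anchor views.

    Each anchor holds (camera, hr_image, hr_depth): an upscaled HR rendering of
    the LR avatar plus its depth map. The same latent `w_plus` (in W+ space) and
    the generator weights are shared across every view, as described above.
    """
    optimizer.zero_grad()
    total = 0.0
    for cam, hr_image, hr_depth in anchors:
        rgb, depth = generator.render(w_plus, cam)                  # assumed rendering call
        loss = F.mse_loss(rgb, hr_image)                            # pixel-wise L2 term
        loss = loss + w_lpips * perceptual(rgb, hr_image).mean()    # perceptual LPIPS term
        loss = loss + w_depth * F.l1_loss(depth, hr_depth)          # depth-consistency term
        total = total + loss                                        # accumulate over anchors
    total.backward()
    optimizer.step()
    return float(total.detach())
```

In such a setup the optimizer would cover both the latent code and the generator, e.g. `torch.optim.Adam([w_plus, *generator.parameters()], lr=1e-3)`, mirroring the joint optimization of w and the generator weights; stage 3 then simply enlarges `anchors` with expressive, multi-pose views while keeping the shared latent.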

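A similarly hedged sketch of the stage-2 binding: each Gaussian center is expressed in a local frame of its nearest FLAME triangle, so the Gaussians follow the mesh when pose or expression parameters change. Nearest-centroid assignment and per-triangle orthonormal frames are simplifications chosen for illustration; the paper's exact local parameterization may differ.

```python
# Hypothetical binding of 3DGS primitives to a FLAME mesh (stage 2 sketch).
import torch
import torch.nn.functional as F

def build_face_frames(vertices, faces):
    """Per-triangle origin (centroid) and orthonormal basis, shapes (F, 3) and (F, 3, 3)."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    e1 = F.normalize(v1 - v0, dim=-1)
    n = F.normalize(torch.cross(v1 - v0, v2 - v0, dim=-1), dim=-1)
    e2 = torch.cross(n, e1, dim=-1)
    return (v0 + v1 + v2) / 3.0, torch.stack([e1, e2, n], dim=-1)

def bind_gaussians(centers, vertices, faces):
    """Attach each Gaussian center to its nearest face and store local coordinates."""
    origin, basis = build_face_frames(vertices, faces)
    face_id = torch.cdist(centers, origin).argmin(dim=1)            # nearest face centroid
    offset = centers - origin[face_id]
    local = torch.einsum("nij,nj->ni", basis[face_id].transpose(1, 2), offset)
    return face_id, local

def repose_gaussians(face_id, local, vertices, faces):
    """Recover global Gaussian centers after a FLAME pose/expression deformation."""
    origin, basis = build_face_frames(vertices, faces)
    return origin[face_id] + torch.einsum("nij,nj->ni", basis[face_id], local)
```
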
The authors evaluate SuperHead on public benchmarks such as NeRSemble and INSTA. Quantitative metrics (PSNR, SSIM) show substantial improvements over baseline methods, including standard 3DGS, 2D and video-based SR pipelines, and recent 3D-GAN inversion approaches. Qualitatively, the avatars exhibit sharper eye-blink dynamics, more realistic lip movements, and reduced flickering. Importantly, rendering speed stays within real-time limits (30–60 fps), making the method suitable for AR/VR and telepresence applications.
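
As a point of reference for the reported numbers, per-frame PSNR and SSIM between rendered frames and ground truth can be computed with scikit-image as in the snippet below; the exact evaluation protocol (crops, resolution, frame selection) is not detailed in the summary, so treat this purely as an illustrative recipe.

```python
# Illustrative per-frame PSNR/SSIM computation (not the paper's exact protocol).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(rendered: np.ndarray, reference: np.ndarray):
    """Both inputs are HxWx3 uint8 frames; returns (psnr_db, ssim)."""
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=255)
    ssim = structural_similarity(reference, rendered, channel_axis=-1, data_range=255)
    return psnr, ssim
```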

Limitations noted include potential bias in the pre-trained 3D-GAN (trained on limited demographic data), reduced robustness under extreme lighting or background changes, and dependence on the FLAME parametric model for rigging. Future work is suggested to expand the 3D-GAN training corpus for broader diversity, integrate illumination-aware inversion, and explore compatibility with alternative parametric head models.

In summary, SuperHead introduces a novel dynamics-aware 3D inversion technique that combines the high-resolution generative power of 3D-GANs with the efficiency of Gaussian splatting and the controllability of FLAME, delivering high-quality, temporally consistent, and identity-preserving talking-head avatars from low-quality inputs.

