You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image
We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
💡 Research Summary
The paper introduces NVB‑Face, a novel one‑stage framework that generates high‑quality novel‑view face images directly from a single degraded (blind) face photograph. Traditional pipelines for novel‑view synthesis on faces require a high‑resolution input; when the input is low‑resolution, blurry, noisy, or compressed, a two‑stage approach is usually adopted: first restore the image, then synthesize new views. This separation creates a critical dependency on the restoration quality, leading to error amplification, identity drift, and inconsistent viewpoints, while also incurring extra computational cost.
NVB‑Face eliminates the intermediate restoration step by extracting a latent feature map from the low‑quality image using a time‑aware image encoder (Enc). Instead of the standard text‑conditioned Stable Diffusion (SD) model, the authors replace the CLIP text encoder with this image encoder and fine‑tune the cross‑attention layers together with the rest of the diffusion U‑Net via LoRA, preserving the generative power of SD while adapting it to the restoration‑plus‑view‑synthesis task.
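The conditioning swap described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `lora_linear` stands in for a frozen SD projection plus a trainable low‑rank (LoRA) update, and the cross‑attention draws its keys and values from image‑encoder tokens rather than CLIP text tokens. All names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def lora_linear(x, w_base, a, b):
    """Frozen base weight plus trainable low-rank (LoRA) update: x @ (W + A @ B)."""
    return x @ w_base + (x @ a) @ b

def image_conditioned_cross_attention(latents, image_tokens, params):
    """Cross-attention whose keys/values come from image-encoder tokens
    instead of CLIP text tokens (shapes: latents (N, d), tokens (M, d))."""
    q = lora_linear(latents, *params["q"])
    k = lora_linear(image_tokens, *params["k"])
    v = lora_linear(image_tokens, *params["v"])
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(0)
d, rank = 64, 4

def make(dim_in, dim_out):
    # base weight is "frozen"; B starts at zero so the LoRA update is initially a no-op
    return (rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in),
            rng.standard_normal((dim_in, rank)) / np.sqrt(dim_in),
            np.zeros((rank, dim_out)))

params = {"q": make(d, d), "k": make(d, d), "v": make(d, d)}
latents = rng.standard_normal((16, d))       # U-Net latent tokens
image_tokens = rng.standard_normal((8, d))   # tokens from the image encoder
out = image_conditioned_cross_attention(latents, image_tokens, params)
print(out.shape)  # (16, 64)
```

Initializing the second LoRA factor to zero is the standard trick that lets fine‑tuning start exactly at the pretrained model's behavior.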
The core novelty lies in a Transformer‑based 3D Feature Construction module (Trans). Trans receives the single‑view feature F_ref, a predicted camera pose C_in (obtained by a lightweight Camera Predictor), and the diffusion time‑step embedding, and builds a 3D latent feature volume V_out that encodes multi‑view information in a spatially coherent manner. By projecting V_out with any target camera parameters C_i, the model directly yields new‑view latent features F_out^i, which are fed back into the SD model through cross‑attention to synthesize the final high‑resolution image for that viewpoint.
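The projection step can be pictured with a toy example. The summary does not specify the projection operator, so the sketch below assumes a simple rotate‑sample‑and‑average scheme over a voxel grid with nearest‑neighbor lookup; the volume `V_out` and the yaw‑only camera are illustrative stand‑ins for the paper's latent volume and camera parameters.

```python
import numpy as np

def yaw_matrix(theta):
    """Rotation about the vertical (y) axis by a yaw angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project_volume(volume, yaw):
    """Rotate voxel-center coordinates to the target view, sample the volume
    (nearest neighbor), and average over depth to get a per-view feature map."""
    D, H, W, C = volume.shape
    R = yaw_matrix(yaw)
    zs, ys, xs = (np.linspace(-1, 1, n) for n in (D, H, W))
    Z, Y, X = np.meshgrid(zs, ys, xs, indexing="ij")      # each (D, H, W)
    pts = np.stack([X, Y, Z], axis=-1) @ R.T              # rotated coordinates

    def to_idx(coord, n):
        # map [-1, 1] back to integer grid indices, clipped at the borders
        return np.clip(np.round((coord + 1) / 2 * (n - 1)).astype(int), 0, n - 1)

    zi = to_idx(pts[..., 2], D)
    yi = to_idx(pts[..., 1], H)
    xi = to_idx(pts[..., 0], W)
    return volume[zi, yi, xi].mean(axis=0)                # (H, W, C)

rng = np.random.default_rng(0)
V_out = rng.standard_normal((4, 8, 8, 16))   # toy latent volume (depth, H, W, C)
front = project_volume(V_out, 0.0)           # frontal view: mean over depth
side = project_volume(V_out, np.pi / 2)      # 90-degree yaw view
print(front.shape, side.shape)
```

With zero yaw the rotation is the identity, so the frontal projection reduces to a plain average over the depth axis; any other yaw resamples the volume from a new viewpoint before pooling.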
Training proceeds in two phases. Phase 1 focuses solely on blind face restoration: the encoder, cross‑attention layers, and the SD backbone are trained on large high‑quality face datasets (e.g., FFHQ) and multi‑view datasets, using the standard diffusion loss (noise prediction) together with perceptual and adversarial terms. Phase 2 freezes all previously trained components and updates only Trans and the Camera Predictor, employing a combination of 3D reconstruction loss, view‑consistency loss, and camera pose supervision to ensure accurate pose‑conditioned feature generation.
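Phase 1 centers on the standard DDPM noise‑prediction objective. The sketch below illustrates that loss with toy shapes; it is a generic diffusion recipe, not the paper's training code, and omits the perceptual and adversarial terms.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar_t):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

def diffusion_loss(eps_pred, eps):
    """Standard noise-prediction objective: MSE between predicted and true noise."""
    return np.mean((eps_pred - eps) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 64))    # clean latents of high-quality faces (toy)
eps = rng.standard_normal((4, 64))   # sampled Gaussian noise
x_t = add_noise(x0, eps, alpha_bar_t=0.5)

# a perfect noise predictor would drive the loss to zero
print(diffusion_loss(eps, eps))  # 0.0
```

In Phase 2 the same loss machinery stays frozen; only the view‑construction modules receive gradients from the reconstruction, consistency, and pose terms.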
Extensive experiments compare NVB‑Face against representative two‑stage pipelines such as CodeFormer + PanoHead‑PTI and diffusion‑based methods that first restore then render. Quantitative metrics (ID‑Score, LPIPS, FID, PSNR) show improvements of 12–18% over baselines, with particularly strong gains in identity preservation and multi‑view consistency. Qualitative results demonstrate sharper facial details (e.g., eye reflections, skin texture) and stable appearance across a wide range of yaw/pitch angles. Ablation studies confirm the importance of image‑conditioned cross‑attention, the 3D latent grid, and the dedicated camera predictor.
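Of the metrics above, PSNR is straightforward to reproduce; the snippet below is a minimal reference implementation (ID‑Score, LPIPS, and FID depend on learned networks and are not shown). The test images here are random stand‑ins.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((ref - test) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.random((32, 32, 3))  # ground-truth view, values in [0, 1]
noisy = np.clip(ref + 0.01 * rng.standard_normal(ref.shape), 0, 1)
print(psnr(ref, noisy))
```

A low‑amplitude perturbation like the one above lands around 40 dB, which is the regime where restored views become visually close to the reference.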
The authors acknowledge limitations: (1) camera pose prediction can be noisy for severely degraded inputs, potentially corrupting the 3‑D volume; (2) extremely low resolutions (≤16 × 16) hinder reliable feature extraction; (3) the current design is specialized for faces, and extending to general objects or complex backgrounds would require additional research. Future directions include multi‑scale 3‑D latent representations, more robust pose estimation, and broader domain generalization.
In summary, NVB‑Face validates the claim that “you only need one stage” for novel‑view synthesis from blind face images. By integrating blind restoration and view transformation within a single diffusion‑based pipeline, it removes error propagation, reduces inference time, and opens practical applications for low‑quality real‑world imagery such as surveillance footage or mobile captures.