Projected Representation Conditioning for High-fidelity Novel View Synthesis
We propose a novel framework for diffusion-based novel view synthesis that leverages external representations as conditioning signals, harnessing their geometric and semantic correspondence properties to improve geometric consistency in generated viewpoints. We first provide a detailed analysis of the correspondence capabilities that emerge in the spatial attention of external visual representations. Building on these insights, we propose ReNoV (representation-guided novel view synthesis), which injects external representations into the diffusion process through dedicated representation projection modules. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
💡 Research Summary
The paper introduces ReNoV (Representation‑guided Novel View synthesis), a diffusion‑based framework that leverages external visual representations as conditioning signals to achieve high‑fidelity novel view synthesis with strong geometric consistency and inpainting quality. The authors begin by analyzing several state‑of‑the‑art visual foundation models—VGGT, DepthAnything‑V3, and DINOv2—focusing on their ability to encode cross‑view geometric correspondence and semantic awareness across network layers. Using layer‑wise PCK (Percentage of Correct Keypoints), semantic similarity, and LDS (Local‑vs‑Distant Similarity) metrics, they demonstrate that deeper layers of VGGT and DepthAnything‑V3 capture reliable geometric structure even in repetitive scenes, whereas DINOv2 excels at semantic similarity but lacks precise geometric alignment.
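The PCK analysis described above can be illustrated with a small sketch. The function below is a hypothetical, simplified version (not the paper's code): it matches each ground-truth keypoint in view A to its nearest neighbour in view B by cosine similarity over dense features, and counts a match as correct if it lands within a threshold of `alpha` times the image size of the ground-truth location. Running it layer by layer on a model's features gives the layer-wise curves the authors report.

```python
import numpy as np

def pck_from_features(feats_a, feats_b, kps_a, kps_b, alpha=0.1, img_size=64):
    """Percentage of Correct Keypoints via nearest-neighbour feature matching.

    feats_a, feats_b: (H, W, C) dense feature maps for the two views.
    kps_a, kps_b:     (N, 2) ground-truth corresponding (row, col) keypoints.
    A match counts as correct if the matched location in view B lies within
    alpha * img_size pixels of the ground-truth keypoint in view B.
    """
    h, w, c = feats_b.shape
    flat_b = feats_b.reshape(-1, c)
    # L2-normalise so the nearest neighbour is the highest cosine similarity
    flat_b = flat_b / (np.linalg.norm(flat_b, axis=1, keepdims=True) + 1e-8)

    correct = 0
    for (ra, ca), (rb, cb) in zip(kps_a, kps_b):
        q = feats_a[ra, ca]
        q = q / (np.linalg.norm(q) + 1e-8)
        idx = np.argmax(flat_b @ q)          # best-matching location in view B
        mr, mc = divmod(idx, w)
        if np.hypot(mr - rb, mc - cb) <= alpha * img_size:
            correct += 1
    return correct / len(kps_a)
```

Features from layers with strong geometric correspondence yield high PCK even under viewpoint change; purely semantic features (the DINOv2 case above) match the right object but often the wrong pixel, lowering the score.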
Motivated by these findings, the authors propose a two‑stage projection conditioning pipeline. First, a Representation Projection Module takes the external features extracted from each reference image, lifts them into a 3D point cloud using depth maps and estimated camera poses (provided by the same external models), and re‑projects the point cloud onto the target viewpoint. Pixels left empty by occlusions or unmapped regions are filled with learnable mask tokens, effectively providing an inpainting prior. Second, the warped and completed features are injected into the cross‑attention layers of a diffusion U‑Net, allowing the denoising network to attend precisely to geometrically aligned regions for reconstruction while leveraging broader semantic context for inpainting.
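The geometric core of the projection step can be sketched as follows. This is a minimal, hypothetical implementation (the function name, z-buffer resolution of collisions, and shared intrinsics `K` are assumptions, not details from the paper): each reference pixel is unprojected to a 3D point via its depth, moved into the target camera frame, and splatted back onto the target image plane; every target pixel that receives no point keeps the learnable mask token.

```python
import numpy as np

def warp_features_to_target(feats, depth, K, T_ref2tgt, mask_token):
    """Warp per-pixel reference features into the target view (sketch).

    feats:      (H, W, C) external-representation features, reference view
    depth:      (H, W)    per-pixel depth for the reference view
    K:          (3, 3)    camera intrinsics (assumed shared by both views)
    T_ref2tgt:  (4, 4)    rigid transform from reference to target camera
    mask_token: (C,)      learnable token standing in for unobserved pixels
    """
    h, w, c = feats.shape
    # Unproject every reference pixel: X = depth * K^-1 [u, v, 1]^T
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).astype(float)
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Move the point cloud into the target camera frame
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_tgt = (T_ref2tgt @ pts_h.T).T[:, :3]
    # Project onto the target image plane
    proj = (K @ pts_tgt.T).T
    uv = proj[:, :2] / np.maximum(proj[:, 2:3], 1e-8)
    # Start from mask tokens everywhere; overwrite pixels that receive a point
    out = np.tile(mask_token, (h, w, 1)).astype(float)
    zbuf = np.full((h, w), np.inf)
    for (u, v), z, f in zip(uv, pts_tgt[:, 2], feats.reshape(-1, c)):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < w and 0 <= vi < h and 0 < z < zbuf[vi, ui]:
            zbuf[vi, ui] = z          # z-buffer keeps the nearest surface
            out[vi, ui] = f
    return out
```

The remaining mask-token pixels mark exactly the regions the diffusion U-Net must inpaint, which is how the module doubles as an inpainting prior.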
Extensive experiments are conducted on RealEstate10K and the DTU benchmark. On RealEstate10K, ReNoV outperforms recent diffusion‑based NVS methods such as CAT3D, ViewCrafter, and Zero123 across PSNR, SSIM, and LPIPS, especially when only one to three reference images are available. The method also shows superior extrapolation capability, maintaining consistency under large camera pose changes. In a zero‑shot setting on DTU, where camera poses are unknown, ReNoV combined with VGGT‑derived pose and depth estimates delivers more coherent 3D geometry and visual quality than feed‑forward baselines.
Ablation studies confirm the necessity of each component: removing the external representation, using only 2D feature fusion, or omitting the projection module each leads to a significant drop in performance. Notably, conditioning on DINOv2 features results in poorer reconstruction due to weaker geometric correspondence, underscoring the importance of selecting representations with strong multi‑view geometry.
In summary, the paper makes three key contributions: (1) a systematic quantitative and qualitative analysis of visual foundation model features for multi‑view correspondence, (2) the design of a projection‑based conditioning mechanism that bridges 2D diffusion models with 3D geometry derived from external representations, and (3) a demonstration that this approach enables high‑quality, geometry‑consistent novel view synthesis from sparse, unposed image collections, advancing the state of the art in diffusion‑driven view synthesis. Future work may explore lighter projection modules and real‑time interactive applications.