Splatent: Splatting Diffusion Latents for Novel View Synthesis
Radiance field representations have recently been explored in the latent space of the VAEs commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this either by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from the input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state of the art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.
💡 Research Summary
The paper “Splatent: Splatting Diffusion Latents for Novel View Synthesis” addresses a critical bottleneck in modern 3D reconstruction pipelines that utilize the latent space of Variational Autoencoders (VAEs) from diffusion models. While leveraging VAE latents for radiance field representations offers significant advantages in rendering efficiency and seamless integration with generative diffusion pipelines, it suffers from a fundamental flaw: the VAE latent space is not inherently multi-view consistent. This lack of consistency manifests as blurred textures and the loss of fine-grained details during the 3D reconstruction process.
Existing methodologies have attempted to mitigate this issue through two primary routes: fine-tuning the VAE to enforce consistency, which often compromises the original reconstruction quality, or employing pre-trained diffusion models to hallucinate the missing details, which risks generating inaccurate or non-existent features.
To overcome these limitations, the authors introduce Splatent, a novel diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) within the VAE latent space. The core innovation of Splatent lies in its departure from the traditional 3D-centric reconstruction paradigm. Instead of attempting to reconstruct fine-grained details directly within the 3D space—a task that is computationally expensive and prone to errors—Splatent focuses on recovering these details in the 2D domain using multi-view attention mechanisms applied to the input views.
By utilizing multi-view attention, Splatent can effectively leverage information across different perspectives to refine the latent representations. This approach allows the framework to preserve the high-fidelity reconstruction capabilities of the pre-trained VAE while simultaneously achieving faithful detail recovery. Experimental evaluations across multiple benchmarks demonstrate that Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. Furthermore, the authors show that integrating Splatent into existing feed-forward frameworks consistently improves detail preservation, offering a robust solution for high-quality 3D reconstruction from sparse input views. This work paves the way for more reliable and detailed 3D generative modeling by bridging the gap between efficient 3DGS-based representations and powerful 2D diffusion-based refinement.
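The paper summary does not specify the implementation of the multi-view attention step, but its role can be illustrated with a minimal sketch: tokens from the splatted (blurry) novel-view latent act as queries, while tokens from the encoded input views act as keys and values, so fine detail is pulled from the 2D inputs rather than re-synthesized in 3D. All function names, shapes, and the single-head residual design below are assumptions for illustration, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_view_attention(novel_tokens, view_tokens):
    """Refine splatted novel-view latent tokens by attending to input views.

    novel_tokens: (N, d) tokens from the 3DGS-rendered latent (queries).
    view_tokens:  (V, N, d) tokens from V VAE-encoded input views (keys/values).
    Returns refined tokens of shape (N, d).
    """
    d = novel_tokens.shape[-1]
    # Flatten all input views into one shared key/value token set,
    # so each novel-view token can borrow detail from any view.
    kv = view_tokens.reshape(-1, d)                         # (V*N, d)
    scores = novel_tokens @ kv.T / np.sqrt(d)               # (N, V*N)
    attn = softmax(scores, axis=-1)
    # Residual update: keep the coarse splatted structure, add recovered detail.
    return novel_tokens + attn @ kv

# Toy usage with random latents standing in for real VAE features.
rng = np.random.default_rng(0)
novel = rng.standard_normal((16, 8))      # 16 tokens, 8-dim latent
views = rng.standard_normal((3, 16, 8))   # 3 input views
refined = multi_view_attention(novel, views)
```

In a full pipeline this block would be one layer inside a diffusion denoiser; the key design point it captures is that refinement happens in 2D latent space conditioned on the input views, leaving the pretrained VAE untouched.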