SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors
Synthesizing novel views of large-scale scenes from unconstrained in-the-wild images is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively under sparse input conditions, resulting in noticeable artifacts. To address this, we propose SparseGS-W, a novel framework based on 3D Gaussian Splatting that enables the reconstruction of complex outdoor scenes and handles occlusions and appearance changes with as few as five training images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. Specifically, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively improve the quality of rendered novel views during the Gaussian optimization process. Furthermore, we propose an Occlusion Handling module, which flexibly removes occlusions utilizing the inherent high-quality inpainting capability of constrained diffusion priors. Both modules are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments on the PhotoTourism and Tanks and Temples datasets demonstrate that SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used no-reference metrics such as FID, ClipIQA, and MUSIQ.
💡 Research Summary
SparseGS‑W tackles the long‑standing problem of few‑shot novel view synthesis (NVS) for unconstrained, in‑the‑wild photo collections. While recent breakthroughs such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive rendering quality, they rely on dense multi‑view inputs (often > 1,000 images) and assume static scenes. Real‑world tourist photo sets, however, contain only a handful of images captured at different times, under varying illumination, and frequently include transient occluders (people, vehicles, foliage). Existing NeRF‑ or 3DGS‑based extensions (e.g., GS‑W, WildGaussians, CoherentGS) still suffer from severe artifacts when the number of training views drops to five or ten.
Core Idea
SparseGS‑W fuses two complementary priors to compensate for the lack of multi‑view information:
- Geometric Prior – A pre‑trained multi‑view stereo network (DUSt3R) is used to obtain a dense point cloud and accurate camera poses from the sparse input set. This provides a reliable initial 3D scaffold even when only a few images are available.
- Constrained Diffusion Prior – A Stable Diffusion model is fine‑tuned on the available training views, yielding a “constrained” diffusion network (ε*θ) that restricts generation to a clean sub‑space aligned with the scene’s content.
Constrained Novel‑View Enhancement (CNVE)
For each rendered novel view Iₙ (produced by the current 3D Gaussian field), DDIM inversion maps the image to a latent x_T in Gaussian noise space. Two reverse diffusion branches are then run in parallel:
- Reconstruction branch – uses the original diffusion εθ to denoise x_T back to the original rendered image, preserving the exact geometry.
- Enhancement branch – uses the constrained diffusion ε*θ to generate a high‑quality version x₀ of the view.
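The inversion step that produces x_T can be sketched in plain NumPy. This is an illustrative sketch, not the paper's implementation: the noise predictor `eps_fn` stands in for εθ, and the cumulative schedule values ᾱ (`abars`) are placeholders.

```python
import numpy as np

def ddim_invert_step(x_t, eps, abar_t, abar_next):
    """One deterministic DDIM inversion step: move x_t toward higher noise."""
    # Predict the clean latent implied by the current noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # Re-noise it at the next (noisier) schedule point.
    return np.sqrt(abar_next) * x0_pred + np.sqrt(1.0 - abar_next) * eps

def ddim_invert(x0, eps_fn, abars):
    """Map a clean latent x0 to x_T by iterating inversion steps."""
    x = x0
    for t in range(len(abars) - 1):
        eps = eps_fn(x, t)  # predicted noise at step t (network call in practice)
        x = ddim_invert_step(x, eps, abars[t], abars[t + 1])
    return x
```

Because the update is deterministic, running the reverse (denoising) process from the resulting x_T with the same predictor approximately reproduces the input render, which is what the reconstruction branch relies on.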
To avoid structural drift, self‑attention query (Q) and key (K) tensors from the reconstruction branch are injected into the enhancement branch (Q_e←Q_r, K_e←K_r). This attention injection forces the enhanced image to retain the spatial layout of the original render while benefiting from the diffusion model’s denoising power. After T denoising steps, AdaIN is applied to blend the appearance of a user‑provided reference image, producing a pseudo‑ground‑truth I_pgtₙ. The loss L = λ₁‖Iₙ−I_pgtₙ‖₁ + λ₂·(1−SSIM(Iₙ, I_pgtₙ)) supervises the 3D Gaussian parameters (position, covariance, SH color coefficients, opacity).
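The attention-injection mechanism can be illustrated with a minimal single-head self-attention in NumPy. This is a schematic sketch, not the actual U-Net attention layers: the enhancement branch keeps its own values V but borrows Q and K from the reconstruction branch, so the attention map (and hence the spatial layout) is that of the original render.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Scaled dot-product attention over token rows (single head)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def enhancement_attention(q_rec, k_rec, v_enh):
    """Attention injection (Q_e←Q_r, K_e←K_r): the attention weights come
    from the reconstruction branch; only the values are the enhancement
    branch's own features."""
    return self_attention(q_rec, k_rec, v_enh)
```

Since the mixing weights are computed entirely from the reconstruction branch's Q and K, the enhanced features are recombined according to the render's geometry rather than a layout the generative model might invent.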
Occlusion Handling (OH)
Transient objects are addressed by first generating binary masks M_i with the EVF‑SAM segmentation model, driven by a textual prompt (e.g., “remove people”). Inpainting is then cast as a masked diffusion problem: inside the mask, the latent follows the diffusion branch (so the occluded region is regenerated), while outside the mask, the noised latent of the original image is kept. The fused latent x*ₜ = M_i⊙x′ₜ + (1−M_i)⊙x_gtₜ is processed through the same dual‑branch diffusion with attention injection, yielding an occlusion‑free, high‑fidelity image after AdaIN. This image becomes the pseudo‑ground‑truth for the occlusion‑handling loss.
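The per-step latent fusion is a simple mask-weighted blend. A minimal sketch (variable names are illustrative; in practice this runs on the diffusion latents at every timestep t):

```python
import numpy as np

def fuse_latents(mask, x_gen_t, x_img_t):
    """x*_t = M ⊙ x'_t + (1 − M) ⊙ x_gt_t: the diffusion model generates
    inside the occluder mask; outside it, the noised latent of the
    original image is kept, preserving the unoccluded content."""
    return mask * x_gen_t + (1.0 - mask) * x_img_t
```

Feeding the fused latent back into the next denoising step constrains generation to the masked region while the rest of the image stays anchored to the observed view.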
Progressive Sampling and Training Strategy (PSTS)
To prevent over‑fitting to the few training views, SparseGS‑W augments the view set by spherical linear interpolation (SLERP) between existing camera poses and by adding small Gaussian perturbations to both position and orientation. The resulting synthetic poses C′ are rendered, enhanced by CNVE, and optionally cleaned by OH, providing a large pool of pseudo‑GT images that drive the progressive refinement of the Gaussian field. Early training focuses on coarse geometry; later stages gradually increase the number of Gaussians and the order of the SH coefficients, achieving fine‑grained detail without sacrificing real‑time rendering speed.
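The pose-augmentation step can be sketched as quaternion SLERP plus a Gaussian jitter on position. This is a generic sketch under assumed conventions (unit quaternions for orientation; the jitter scale `sigma` is a made-up value, not taken from the paper):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.clip(np.dot(q0, q1), -1.0, 1.0)
    if dot < 0.0:            # take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: linear interpolation is stable
        q = (1.0 - t) * q0 + t * q1
        return q / np.linalg.norm(q)
    omega = np.arccos(dot)
    return (np.sin((1.0 - t) * omega) * q0 + np.sin(t * omega) * q1) / np.sin(omega)

def perturb_position(position, rng, sigma=0.01):
    """Small Gaussian jitter on a camera position (sigma is illustrative)."""
    return position + rng.normal(scale=sigma, size=position.shape)
```

Interpolating orientations on the quaternion sphere (rather than linearly on rotation matrices) keeps every synthetic pose a valid rotation, which matters when the interpolated views are rendered and used as supervision.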
Experimental Validation
The method is evaluated on two large‑scale benchmarks: PhotoTourism (tourist landmarks) and Tanks & Temples (heritage structures). Using only 5–10 images per scene, SparseGS‑W outperforms the current state‑of‑the‑art (GS‑W, WildGaussians, CoherentGS, etc.) across both full‑reference metrics (PSNR, SSIM, LPIPS) and no‑reference perceptual scores (FID, ClipIQA, MUSIQ). Qualitatively, the reconstructed scenes exhibit consistent lighting, sharp edges, and clean backgrounds even when the input set contains moving people or vehicles.
Key Contributions
- First few‑shot NVS framework for wild photo collections – demonstrates that high‑quality 3DGS can be achieved with as few as five images.
- Plug‑and‑play Constrained Novel‑View Enhancement and Occlusion Handling modules – leverage constrained diffusion priors and attention injection to improve view quality and remove transient occluders without extra appearance‑extraction networks.
- AdaIN‑based appearance control – enables users to impose the color/illumination style of any reference image on the reconstructed scene.
- Progressive sampling and training strategy – efficiently generates synthetic views for supervision, ensuring stable convergence despite extreme data sparsity.
Implications and Future Work
SparseGS‑W opens the door to practical AR/VR content creation from personal photo albums, rapid cultural‑heritage digitization where only a handful of pictures are available, and on‑board scene reconstruction for robots or drones with limited sensing bandwidth. Remaining challenges include handling truly dynamic elements (e.g., water, fire), reducing the computational cost of diffusion fine‑tuning, and scaling to city‑scale point clouds. Future research may explore lightweight diffusion priors, dynamic scene extensions, and memory‑efficient Gaussian representations.
In summary, SparseGS‑W combines geometric reconstruction, constrained generative diffusion, and progressive training to deliver state‑of‑the‑art few‑shot novel view synthesis for unconstrained outdoor scenes, achieving both quantitative superiority and visually compelling results.