WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

3D Gaussian Splatting (3DGS) has become a powerful representation for image-based object reconstruction, yet its performance drops sharply in sparse-view settings. Prior works address this limitation by employing diffusion models to repair corrupted renders, subsequently using them as pseudo ground truths for later optimization. While effective, such approaches incur heavy computation from the diffusion fine-tuning and repair steps. We present WaveletGaussian, a framework for more efficient sparse-view 3D Gaussian object reconstruction. Our key idea is to shift diffusion into the wavelet domain: diffusion is applied only to the low-resolution LL subband, while high-frequency subbands are refined with a lightweight network. We further propose an efficient online random masking strategy to curate training pairs for diffusion fine-tuning, replacing the commonly used, but inefficient, leave-one-out strategy. Experiments across two benchmark datasets, Mip-NeRF 360 and OmniObject3D, show WaveletGaussian achieves competitive rendering quality while substantially reducing training time.


💡 Research Summary

WaveletGaussian tackles the well‑known degradation of 3D Gaussian Splatting (3DGS) when only a few input views are available. Existing sparse‑view solutions typically rely on large pre‑trained denoising diffusion models (DDMs) that are fine‑tuned per scene and used to inpaint heavily corrupted novel‑view renders, with the repaired images then treated as pseudo‑ground‑truth for a second round of 3DGS optimization. While effective, this pipeline is computationally heavy: diffusion must run on full‑resolution RGB images, and each scene requires its own fine‑tuning, leading to overall training times of up to an hour.

The core contribution of WaveletGaussian is two‑fold. First, it moves the diffusion process from the RGB domain into the wavelet domain. By applying a discrete Haar wavelet transform to every render, the image is split into a low‑frequency (LL) sub‑band and three high‑frequency sub‑bands (LH, HL, HH). The LL component retains the coarse structure and color information at half the original resolution along each axis (one quarter of the pixels), while the high‑frequency bands encode fine textures and edges. Diffusion is then performed only on the LL sub‑band using a pre‑trained ControlNet‑style diffusion model (denoted D). Because the LL map contains only a quarter of the pixels, the diffusion step becomes dramatically cheaper while still correcting global color and illumination errors. The high‑frequency bands are repaired by a lightweight U‑Net‑like network (denoted U) that learns to restore missing details without the heavy cost of diffusion. Training D and U separately disentangles low‑ and high‑frequency learning, allowing each network to specialize and avoid interference.
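The single‑level Haar split described above can be sketched in plain NumPy. This is an illustrative stand‑alone implementation, not the paper's code; function names are ours, and we assume an even‑sized single‑channel image:

```python
import numpy as np

def haar_dwt2(img: np.ndarray):
    """Single-level 2D Haar transform of an H x W image (H, W even).

    Returns the four sub-bands (LL, LH, HL, HH), each H/2 x W/2.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass in both directions (coarse image)
    lh = (a - b + c - d) / 2.0  # detail along one axis
    hl = (a + b - c - d) / 2.0  # detail along the other axis
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: reconstructs the original image exactly."""
    h, w = ll.shape
    img = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    img[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    img[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    img[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    img[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return img
```

Because the transform is orthogonal, the inverse recovers the render exactly, which is what lets the repaired sub‑bands be recombined into a full‑RGB pseudo‑reference later in the pipeline.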

Second, WaveletGaussian replaces the commonly used leave‑one‑out (LOO) data‑generation strategy with an online random masking (ORM) scheme. LOO requires training N separate 3DGS models, each missing a different view, to synthesize corrupted‑clean image pairs for diffusion fine‑tuning—an approach that scales linearly with the number of views. ORM instead trains a single auxiliary 3DGS model (G_d) on all views while applying a randomly generated binary mask M to each ground‑truth render. The mask consists of several rectangular regions that drift sinusoidally over training iterations, producing a diverse set of corruption patterns from a single model. These masked renders are paired with the original clean renders, both transformed into the wavelet domain. Only the LL sub‑band pairs are used to fine‑tune D, while the high‑frequency sub‑bands from the coarse model G_c are paired with clean high‑frequency references to train U. Consequently, the entire dataset for diffusion fine‑tuning is built with one auxiliary model, cutting the data‑generation cost by a factor of 2–3.

The overall pipeline consists of four stages:

  1. Coarse Training – A 3DGS model G_c is briefly trained on all sparse views, yielding a rough geometry but noticeably corrupted renders.
  2. Dataset Creation – Using ORM, corrupted‑clean pairs are generated in the wavelet domain. The LL pairs feed diffusion fine‑tuning; the high‑frequency pairs feed U‑Net training.
  3. Diffusion Fine‑Tuning – D becomes an LL‑domain inpainting model, while U learns to restore high‑frequency details. Both operate at half resolution, dramatically reducing GPU memory and compute compared to full‑resolution diffusion.
  4. Fine Training – The coarse model G_c is refined into G_f. During each iteration, D (frozen) repairs the LL component of a novel‑view render, U (frozen) repairs the high‑frequency components, and the inverse wavelet transform reconstructs a full‑RGB pseudo‑reference. These pseudo‑references are then used as supervision for the final 3DGS optimization.
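The repair step inside stage 4 can be summarized in a short sketch. The Haar helpers mirror the decomposition described earlier, while the two frozen networks are replaced by identity stand‑ins (hypothetical placeholders for D and U, not the actual models):

```python
import numpy as np

# --- single-level Haar transform helpers (orthogonal, exactly invertible) ---
def dwt(img):
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def idwt(ll, lh, hl, hh):
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

# Identity stand-ins for the frozen networks; in the real pipeline these
# would be the fine-tuned diffusion model D and the lightweight net U.
repair_ll = lambda ll: ll
repair_high = lambda bands: bands

def pseudo_reference(render):
    """One repair step of fine training: decompose, repair, recompose."""
    ll, lh, hl, hh = dwt(render)
    ll = repair_ll(ll)                      # diffusion on the LL sub-band only
    lh, hl, hh = repair_high((lh, hl, hh))  # cheap high-frequency repair
    return idwt(ll, lh, hl, hh)             # inverse DWT -> pseudo-GT image
```

The resulting pseudo‑reference then supervises the final 3DGS optimization in place of a real ground‑truth view.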

Experiments on two benchmarks—Mip‑NeRF 360 and OmniObject3D—under a 4‑view setting demonstrate that WaveletGaussian matches or slightly exceeds the state‑of‑the‑art GaussianObject method in all visual quality metrics (PSNR, SSIM, LPIPS) while reducing total training time from 51–55 minutes to 33–35 minutes (≈30–40 % speed‑up). An ablation study isolates the contributions: (a) ORM alone cuts training time by ~8–10 minutes compared to LOO without harming quality; (b) using LL‑only diffusion further reduces time but slightly lowers PSNR; (c) adding the high‑frequency U‑Net restores the lost detail and yields the best overall performance.

In summary, WaveletGaussian introduces a novel “wavelet‑domain diffusion” paradigm for sparse‑view 3D reconstruction. By confining the expensive diffusion step to a low‑resolution frequency band and handling fine details with a lightweight network, it achieves a favorable trade‑off between computational efficiency and rendering fidelity. The online random masking strategy further streamlines dataset preparation, making the approach practical for real‑world scenarios where only a handful of views are available (e.g., robotics, AR/VR, rapid prototyping). Future directions include extending to multi‑level wavelet decompositions, handling non‑square or irregular view distributions, and jointly learning texture‑lighting disentanglement for even higher‑quality reconstructions.

