Distractor-free Generalizable 3D Gaussian Splatting
We present DGGS, a novel framework that addresses a previously unexplored challenge: $\textbf{Distractor-free Generalizable 3D Gaussian Splatting}$ (3DGS). It mitigates the 3D inconsistency and training instability caused by distractor data in the cross-scene generalizable training setting, while enabling feed-forward inference of 3DGS and distractor masks from reference views of unseen scenes. To achieve these objectives, DGGS introduces a scene-agnostic, reference-based mask prediction and refinement module during the training phase, effectively eliminating the impact of distractors on training stability. Moreover, we combat distractor-induced artifacts and holes at inference time through a novel two-stage inference framework for reference scoring and re-selection, complemented by a distractor pruning mechanism that removes the residual influence of distractor 3DGS primitives. Extensive feed-forward experiments on real and on our synthetic data demonstrate DGGS's reconstruction capability on novel distractor scenes. Moreover, our generalizable mask prediction even achieves accuracy superior to that of existing scene-specific training methods. Homepage: https://github.com/bbbbby-99/DGGS.
💡 Research Summary
The paper introduces Distractor‑free Generalizable 3D Gaussian Splatting (DGGS), a framework that tackles the previously unaddressed problem of transient objects (distractors) in generalizable 3D Gaussian Splatting (3DGS). Existing generalizable 3DGS methods assume static scenes; when real‑world captures contain vehicles, pedestrians, balloons, etc., training becomes unstable because the reference‑query consistency is broken, and inference produces ghosting artifacts and holes. DGGS solves both issues with two complementary modules.
During training, DGGS adds a Reference‑based Mask Prediction and Refinement pipeline. An initial robust mask (M_Rob) is generated using a simple heuristic based on reconstruction error, but this mask often misclassifies difficult static regions as distractors. To correct this, the method re‑projects the current 3DGS back onto each reference view, extracts the non‑distractor regions (which are reliably rendered), and warps them to the query view. This “reference filter” removes false positives from M_Rob. The remaining uncertain pixels are filled using pre‑trained segmentation (e.g., SAM) and a disparity‑error mask, yielding a final mask M. The refined mask is applied pixel‑wise to the query loss (M ⊙ ‖I_T – G(P_T)‖²) and an auxiliary loss supervises occluded areas, ensuring that only clean static regions contribute to gradient updates. This process runs automatically each iteration, requiring no scene‑specific supervision.
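The masked training objective above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names are invented, and the reference filter is reduced to a simple element-wise `max` that restores pixels the robust mask wrongly flagged as distractors whenever the warped reference views render them consistently. Here masks use the convention 1 = static (kept), 0 = distractor (excluded).

```python
import numpy as np

def masked_photometric_loss(query_img, rendered_img, mask):
    """Pixel-wise masked L2 loss, mirroring M ⊙ ||I_T - G(P_T)||^2:
    only pixels with mask == 1 (static) contribute to the average."""
    diff = (query_img - rendered_img) ** 2          # (H, W, 3)
    masked = mask[..., None] * diff                 # broadcast mask over channels
    return masked.sum() / max(mask.sum(), 1.0)      # mean over static pixels

def refine_mask(robust_mask, reference_static_warp):
    """Simplified 'reference filter': a pixel the robust mask marked as
    distractor (0) is restored to static (1) if the non-distractor regions
    of the re-projected reference views cover it (reference_static_warp == 1).
    The real pipeline additionally fills uncertain pixels with SAM segments
    and a disparity-error mask; that step is omitted here."""
    return np.maximum(robust_mask, reference_static_warp)

if __name__ == "__main__":
    query = np.ones((4, 4, 3))
    rendered = np.zeros((4, 4, 3))
    mask = np.zeros((4, 4))
    mask[:2] = 1.0                                  # top half is static
    print(masked_photometric_loss(query, rendered, mask))
```

The key point the sketch captures is that distractor pixels are zeroed out of the loss before any gradient is taken, so they never destabilize the shared, cross-scene network weights.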
For inference, DGGS employs a two‑stage strategy. First, a larger pool of candidate reference images is scored based on the predicted distractor proportion and view disparity derived from the masks. Low‑score references (i.e., those with few distractors and small baseline) are selected to reconstruct the scene. Second, after the 3DGS attributes are inferred, a Distractor Pruning network examines each Gaussian primitive and eliminates those associated with residual distractor signals, effectively removing ghost artifacts. The pruning leverages the Gaussian’s position, scale, and rotation, setting the weight of identified primitives to zero.
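The two-stage inference loop can be summarized with a small sketch. The scoring weight, the linear combination, and the dictionary-based candidate records below are illustrative assumptions; the paper's exact scoring formula and the learned pruning network are not reproduced, only their interfaces.

```python
import numpy as np

def score_reference(distractor_mask, view_disparity, w=0.5):
    """Hypothetical scoring: combine the predicted distractor proportion
    with a view-disparity term; lower scores indicate better references.
    The weight w is an assumption, not a value from the paper."""
    distractor_ratio = float(distractor_mask.mean())
    return w * distractor_ratio + (1.0 - w) * view_disparity

def select_references(candidates, k):
    """Stage 1: keep the k lowest-scoring candidates from a larger pool.
    Each candidate is a dict with at least a 'score' key."""
    return sorted(candidates, key=lambda c: c["score"])[:k]

def prune_distractor_gaussians(weights, distractor_flags):
    """Stage 2 (stand-in for the learned pruning network): zero the
    blending weight of every primitive flagged as a residual distractor,
    removing its contribution from rendering."""
    out = weights.copy()
    out[distractor_flags] = 0.0
    return out

if __name__ == "__main__":
    pool = [{"id": i, "score": s} for i, s in enumerate([0.9, 0.1, 0.5])]
    print([c["id"] for c in select_references(pool, 2)])
```

In the actual method the pruning decision is made per Gaussian from its position, scale, and rotation; the sketch only shows where that decision plugs in, namely by zeroing the primitive's weight.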
The authors evaluate DGGS on real outdoor/indoor datasets as well as synthetic scenes built from Re10K and ACID. Quantitatively, DGGS improves PSNR by 2–3 dB, SSIM by 0.02–0.04, and LPIPS by 5–7 % over baseline generalizable 3DGS. Mask precision/recall also surpasses existing scene‑specific distractor‑free methods, despite using no extra supervision. Qualitative results show a dramatic reduction of ghosting and holes, even in challenging indoor clutter. Ablation studies confirm that the reference‑based mask filter outperforms pure error‑threshold masks, and that the two‑stage inference yields better visual fidelity than a single‑stage pipeline.
In summary, DGGS provides a fully feed‑forward solution for distractor‑free generalizable 3D reconstruction with Gaussian splatting. By integrating reference‑driven mask prediction, refinement, and a two‑stage inference with distractor pruning, it achieves stable training, superior mask accuracy, and high‑quality reconstructions without scene‑specific preprocessing. This makes the approach highly suitable for real‑time mobile applications such as AR/VR, robotics navigation, and on‑device photogrammetry.