Split&Splat: Zero-Shot Panoptic Segmentation via Explicit Instance Modeling and 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) enables fast, high-quality scene reconstruction, but it lacks an object-consistent, semantically aware structure. We propose Split&Splat, a framework for panoptic scene reconstruction with 3DGS that explicitly models object instances. It first propagates instance masks across views using depth, producing view-consistent 2D masks. Each object is then reconstructed independently and merged back into the scene while its boundaries are refined. Finally, instance-level semantic descriptors are embedded in the reconstructed objects, supporting applications such as panoptic segmentation, object retrieval, and 3D editing. Unlike existing methods, Split&Splat first segments the scene and then reconstructs each object individually; this design naturally supports downstream tasks and allows Split&Splat to achieve state-of-the-art performance on the ScanNetv2 segmentation benchmark.
💡 Research Summary
Split&Splat introduces a two‑stage framework that endows fast 3D Gaussian Splatting (3DGS) with explicit instance awareness and semantic descriptors, enabling zero‑shot panoptic segmentation and a suite of downstream tasks. In the “Split” stage, multi‑view images are first processed by a state‑of‑the‑art 2D instance segmenter (e.g., SAM2) to obtain per‑view binary masks. Simultaneously, Structure‑from‑Motion (COLMAP) yields camera poses and a sparse point cloud, while a monocular depth estimator provides per‑view depth maps. The depth and pose information are used to project each mask onto the emerging dense point cloud, enforcing geometric consistency across views. Points are filtered with DBSCAN to remove outliers, and a voting scheme aggregates per‑view label scores into a final 3D label for each point. Re‑projecting this globally labeled point cloud back onto each view produces view‑consistent masks with globally unique instance IDs.
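The per-point voting step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the `votes` array layout (one label per point per view, `-1` for "not visible") and the function name are assumptions introduced here, and the projection and DBSCAN filtering that produce the votes are assumed to have already run.

```python
import numpy as np

def aggregate_point_labels(votes, num_labels):
    """Majority-vote a final 3D instance label for each point.

    votes: (num_points, num_views) int array; votes[p, k] is the instance
    label that view k assigns to point p, or -1 if p is not visible in
    that view. (Hypothetical layout, for illustration only.)
    """
    num_points = votes.shape[0]
    labels = np.full(num_points, -1, dtype=int)
    for p in range(num_points):
        valid = votes[p][votes[p] >= 0]
        if valid.size == 0:
            continue  # point unseen in every view: leave unlabeled
        counts = np.bincount(valid, minlength=num_labels)
        labels[p] = int(np.argmax(counts))  # most-voted label wins
    return labels

votes = np.array([
    [0, 0, 1, -1],    # two views say 0, one says 1
    [1, 1, 1, 0],     # mostly label 1
    [-1, -1, -1, -1], # never visible
])
print(aggregate_point_labels(votes, num_labels=2))  # prints [ 0  1 -1]
```

Re-projecting the resulting per-point labels into each view is what yields the globally consistent 2D masks the summary describes.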
The “Splat” stage takes these refined masks and reconstructs each object independently using 3DGS. For every instance, the corresponding masked multi‑view images are fed to the Gaussian splatting optimizer, which learns positions, colors, covariances, and opacities for a set of Gaussians representing that object alone. Because objects are isolated, inter‑object Gaussian overlap is minimized, leading to sharper boundaries and reduced memory consumption (only one descriptor per instance is needed). After reconstruction, the method renders each instance from every view with full opacity to obtain a rendered mask M_gs. A K‑means++‑style 2D sampling selects uniformly distributed points on the projected Gaussians; these points prompt the 2D segmenter again, producing a refined mask M_sam. The pipeline then computes the IoU of both M_sam and the propagated mask ˜M against M_gs and retains whichever agrees more closely (threshold τ_iou = 0.95). This dual‑mask verification markedly improves segmentation quality for small, occluded, or visually ambiguous objects.
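The dual-mask verification reduces to a pair of IoU comparisons against the rendered mask. The sketch below is one plausible reading of that step, assuming boolean 2D masks; the rejection behavior when neither candidate clears the threshold (returning `None`) is an assumption added here, not stated in the summary.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def verify_mask(m_gs, m_sam, m_prop, tau_iou=0.95):
    """Keep whichever of m_sam / m_prop agrees better with the
    rendered mask m_gs; reject both below the IoU threshold
    (rejection policy is an illustrative assumption)."""
    iou_sam = iou(m_gs, m_sam)
    iou_prop = iou(m_gs, m_prop)
    if iou_sam >= iou_prop:
        best, score = m_sam, iou_sam
    else:
        best, score = m_prop, iou_prop
    return best if score >= tau_iou else None
```

With boolean numpy masks, `verify_mask(m_gs, m_sam, m_prop)` returns the better-agreeing candidate, so a poor SAM prompt falls back to the propagated mask and vice versa.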
Finally, all per‑instance Gaussian sets are merged into a global scene. Axis‑aligned 3D bounding boxes are computed for each instance, and a collision matrix quantifies spatial overlap. Non‑overlapping instances are simply concatenated; overlapping ones undergo boundary refinement to preserve object edges while avoiding duplicate geometry.
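The collision test in the merge stage is a standard axis-aligned bounding-box intersection check. A minimal numpy sketch, assuming each instance is represented by the 3D centers of its Gaussians (function names are illustrative):

```python
import numpy as np

def aabb(points):
    """Axis-aligned bounding box (min corner, max corner) of an
    instance's Gaussian centers, given as an (N, 3) array."""
    return points.min(axis=0), points.max(axis=0)

def boxes_overlap(box_a, box_b):
    """True when two AABBs intersect on every axis."""
    (min_a, max_a), (min_b, max_b) = box_a, box_b
    return bool(np.all(max_a >= min_b) and np.all(max_b >= min_a))

def collision_matrix(instances):
    """Symmetric boolean matrix: [i, j] is True when the AABBs of
    instances i and j intersect (diagonal left False)."""
    boxes = [aabb(p) for p in instances]
    n = len(boxes)
    mat = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            mat[i, j] = mat[j, i] = boxes_overlap(boxes[i], boxes[j])
    return mat
```

Instances whose row in the matrix is all False can be concatenated directly; any True entry flags a pair that goes through the boundary-refinement step instead.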
Beyond geometry, Split&Splat attaches a semantic descriptor to each instance. The descriptor is extracted from a set of masked views using a visual encoder (e.g., CLIP), yielding a compact embedding that can be used for zero‑shot object retrieval, text‑guided editing, and panoptic labeling without any task‑specific fine‑tuning.
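Abstracting the visual encoder away, one simple way to pool per-view embeddings into a single instance descriptor and use it for retrieval is normalized averaging plus cosine similarity. The pooling strategy below is an assumption for illustration (the summary does not specify how the masked-view embeddings are combined):

```python
import numpy as np

def instance_descriptor(view_embeddings):
    """Pool per-view embeddings (e.g. from CLIP on masked crops) into
    one compact descriptor by L2-normalized averaging. The averaging
    choice is illustrative, not taken from the paper."""
    e = np.asarray(view_embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # normalize each view
    d = e.mean(axis=0)
    return d / np.linalg.norm(d)

def retrieve(query_embedding, descriptors):
    """Zero-shot retrieval: rank instance descriptors (rows of a
    matrix) by cosine similarity to a query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    sims = descriptors @ q
    return np.argsort(-sims)  # best match first
```

Because the descriptor lives in the encoder's embedding space, the same `retrieve` call works whether the query comes from an image crop or, with a text encoder such as CLIP's, from a text prompt.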
The authors evaluate on ScanNetv2, achieving a mean IoU of 71.3 % for panoptic segmentation—substantially outperforming prior 3DGS‑based methods that fuse 2D semantics directly into the Gaussian representation. Ablation studies confirm that (1) depth‑driven mask propagation dramatically reduces cross‑view inconsistencies, (2) independent instance reconstruction yields sharper boundaries and lower memory footprints, and (3) the instance‑level descriptor enables zero‑shot retrieval and editing with negligible overhead.
In summary, Split&Splat makes three core contributions: (i) a depth‑and‑SfM‑guided multi‑view mask consistency algorithm that produces globally coherent instance masks, (ii) a split‑then‑reconstruct pipeline that isolates objects for high‑fidelity 3DGS and efficient memory usage, and (iii) the integration of per‑instance semantic embeddings that transform 3DGS from a pure rendering tool into a versatile, semantically aware 3D scene representation. This work opens the door to real‑time, edit‑ready 3D environments for robotics, AR/VR, and interactive content creation.