SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In recent years, transformer-based models have exhibited considerable potential in point cloud instance segmentation. Despite the promising performance achieved by existing methods, they encounter challenges such as instance query initialization problems and excessive reliance on stacked layers, rendering them incompatible with large-scale 3D scenes. This paper introduces a novel method, named SGIFormer, for 3D instance segmentation, which is composed of the Semantic-guided Mix Query (SMQ) initialization and the Geometric-enhanced Interleaving Transformer (GIT) decoder. Specifically, the principle of our SMQ initialization scheme is to leverage the predicted voxel-wise semantic information to implicitly generate the scene-aware query, yielding adequate scene prior and compensating for the learnable query set. Subsequently, we feed the formed overall query into our GIT decoder to alternately refine instance query and global scene features for further capturing fine-grained information and reducing complex design intricacies simultaneously. To emphasize geometric property, we consider bias estimation as an auxiliary task and progressively integrate shifted point coordinates embedding to reinforce instance localization. SGIFormer attains state-of-the-art performance on ScanNet V2, ScanNet200 datasets, and the challenging high-fidelity ScanNet++ benchmark, striking a balance between accuracy and efficiency. The code, weights, and demo videos are publicly available at https://rayyoh.github.io/sgiformer.


💡 Research Summary

SGIFormer tackles two fundamental shortcomings of current transformer‑based 3D instance segmentation methods: unstable query initialization and over‑reliance on deep stacked decoder layers. The proposed framework consists of three main components: a symmetric U‑Net backbone built on Submanifold Sparse Convolution, a Semantic‑guided Mix Query (SMQ) initialization module, and a Geometric‑enhanced Interleaving Transformer (GIT) decoder.

The backbone first voxelizes the input point cloud and extracts voxel‑wise features F. SMQ adds a lightweight semantic branch that predicts voxel‑wise class logits S. By discarding background voxels and dynamically selecting a proportion α of the most confident foreground voxels (α scales with the total number of voxels), the method obtains a set of high‑quality voxel features f. These are linearly projected to generate query weights W, which are multiplied with f to produce scene‑aware queries Q_s. Q_s is concatenated with a small set of learnable queries Q_l, forming the full query set Q. This mixed initialization supplies strong scene priors without inflating the parameter count, leading to faster convergence and better coverage of small objects.

The GIT decoder departs from the conventional one‑way query‑to‑feature update. Each decoder stage first refines the queries via self‑attention, then uses the refined queries as keys/values to attend to the global voxel features, effectively interleaving query and feature updates. To embed geometric cues, the original point coordinates and a shifted version (C+Δ) are both embedded and added to the feature stream. The shift Δ is treated as an auxiliary regression target, with a bias loss encouraging accurate offset prediction. This progressive incorporation of coordinate information preserves fine‑grained geometry, which is often lost when point‑level features are aggressively pooled.

The overall loss combines (1) cross‑entropy for voxel‑wise semantics, (2) a Hungarian‑matched set loss for instance masks and categories, and (3) the bias regression loss, weighted by hyper‑parameters λ1‑λ3.

Extensive experiments on ScanNet V2, ScanNet200, and the high‑resolution ScanNet++ benchmark demonstrate that SGIFormer achieves state‑of‑the‑art performance (71.3 % AP₅₀ on ScanNet V2, 58.7 % on ScanNet200, 44.2 % on ScanNet++), surpassing previous methods such as Mask3D and SPFormer. An ablation study shows that SMQ alone improves AP₅₀ by 2.7 % over the baseline, GIT alone by 3.8 %, and the combination yields the full gain. Moreover, the interleaving design reduces the required decoder depth from six to four layers with less than 0.5 % performance loss, cutting FLOPs by roughly 30 % and memory usage by 25 %.

Limitations include dependence on the quality of voxel‑wise semantic predictions and a fixed α‑based voxel selection that may miss extremely sparse objects. Future work could explore multi‑scale semantic feedback, dynamic query regeneration, and hybrid point‑level attention to further enhance robustness.

In summary, SGIFormer introduces a novel query initialization strategy guided by semantic context and a geometry‑aware interleaving transformer that jointly refine queries and scene features. This synergy yields a model that is both more accurate and more efficient than existing transformer‑based 3D instance segmentation approaches, and its lightweight variant is suitable for real‑time applications.


Comments & Academic Discussion

Loading comments...

Leave a Comment