LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating associated coordinates information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first place on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at https://github.com/RayYoh/LaSSM.

💡 Research Summary

LaSSM (Local Aggregation and State Space Models) tackles two fundamental bottlenecks in query‑based 3D instance segmentation of point clouds: (1) the difficulty of initializing a compact yet comprehensive set of queries from highly sparse data, and (2) the quadratic computational cost of transformer‑style cross‑attention in the decoder.

The pipeline begins with a sparse 3D U‑Net backbone that extracts voxel‑wise features from the raw point cloud. These features are pooled into superpoints, which serve as a higher‑level representation of the scene. A lightweight semantic classifier predicts class probabilities for each superpoint. By taking the maximum non‑background confidence per superpoint, the method ranks superpoints and selects the top‑r · S of them (where S is the total number of superpoints and r is a dynamic ratio that adapts to scene complexity). This step ensures that queries are seeded in semantically promising regions while preserving a global spatial spread. To avoid duplicate or overly clustered queries, farthest‑point sampling (FPS) is applied again on the selected superpoints, yielding a fixed number q of query embeddings. The embeddings are generated by a simple MLP projection, and the corresponding 3D coordinates are stored as query positions Qc.

The decoder consists of L identical layers, each comprising three modules: (i) Local Aggregation, (ii) Spatial Dual‑Path State Space Model (SSM), and (iii) Center Regression. Local aggregation gathers k‑nearest‑neighbor superpoint features around each query coordinate, weighting them with a learnable matrix K to produce an aggregated query representation Qagg. This operation injects geometrically coherent context into each query without resorting to global attention, dramatically reducing redundant computation.

The core of the decoder is the spatial dual‑path SSM block. Queries and their coordinates are sorted according to a Hilbert space‑filling curve, converting the unordered set into two 1‑D sequences: one for content (Q) and one for positional information (Qc). Each sequence passes through a state‑space model, which is a discretized linear dynamical system (hₖ = Ahₖ₋₁ + Bxₖ, yₖ = Chₖ). By leveraging the convolutional kernel formulation of SSMs, the block captures long‑range dependencies across the entire query set with linear O(q) complexity, effectively replacing the expensive multi‑head attention. The two pathways are merged, producing refined query embeddings that now encode both semantic content and spatial relationships.

A lightweight center regression head then predicts an offset for each query’s coordinate, updating Qc for the next decoder layer. This iterative refinement aligns the query positions with the evolving content, enabling progressive improvement of instance masks and bounding boxes.

Because every decoder layer operates with linear complexity (local aggregation O(q·k) and SSM O(q·L)), LaSSM reduces FLOPs to roughly one‑third of the best transformer‑based baselines while maintaining or improving accuracy. Empirical results on the ScanNet++ V2 leaderboard show a 2.5 % absolute gain in mean AP and a 2.3 % gain in AP₅₀ over the previous state‑of‑the‑art, with only 1/3 of the FLOPs. Comparable gains are reported on ScanNet V2, ScanNet200, S3DIS, and ScanNet++ V1, together with a 0.64× reduction in GPU memory consumption.

Ablation studies confirm the contribution of each component: (a) removing the semantic‑spatial initializer and relying solely on FPS drops mAP by 3–4 %; (b) omitting local aggregation increases computation and reduces mAP by ~1.5 %; (c) replacing the dual‑path SSM with a plain MLP loses long‑range query interactions, decreasing AP₅₀ by ~2 %.

In summary, LaSSM introduces a principled query initialization that jointly exploits semantic confidence and spatial distribution, and a novel decoder that fuses local geometric aggregation with a coordinate‑aware state‑space model. This combination yields a highly efficient yet accurate solution for large‑scale 3D scene instance segmentation, opening avenues for real‑time robotics and AR/VR applications where computational resources are limited.

LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

💡 Research Summary

Comments & Academic Discussion

Leave a Comment