ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection
Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies, such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it; and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.
💡 Research Summary
The paper introduces ALIGN, a novel query‑initialization framework designed to improve query‑based 3D object detectors, especially under occlusion and crowded scenes. Existing methods either sample queries randomly across the whole scene or rely on BEV heatmaps to select salient regions. Random sampling lacks object awareness, causing many queries to fall on background and slowing convergence. Heatmap‑based sampling focuses on high‑response areas but often misses small, distant, or heavily occluded objects whose heatmap scores are low.
ALIGN addresses these shortcomings with three complementary modules.
- Occlusion‑aware Center Estimation (OCE): LiDAR points are projected onto multi‑view image segmentation masks, linking each point to a visible instance. For each detected 2D bounding box, the four LiDAR points closest to the box center are selected, and a homography matrix is estimated to map image coordinates back to 3D space, yielding an approximate surface point (p_surf). Since the surface point is biased toward the visible side, a class‑specific depth offset is applied along the LiDAR ray to obtain a more accurate 3D object center (P_OCE). This process leverages both geometry and semantics, allowing reliable center estimation even when only a small portion of the object is visible.
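The final depth-offset step of OCE can be sketched as follows. This is a minimal illustration, not the paper's implementation: the offset table `CLASS_DEPTH_OFFSET` and its values are hypothetical, since the summary does not give the class-specific offsets.

```python
import numpy as np

# Hypothetical class-specific depth offsets (meters) along the LiDAR ray;
# the paper's actual per-class values are not given in the summary.
CLASS_DEPTH_OFFSET = {"car": 0.9, "pedestrian": 0.2, "bicycle": 0.4}

def refine_center(p_surf, sensor_origin, class_name):
    """Shift the visible surface point p_surf along the LiDAR ray by a
    class-specific depth offset to approximate the object center P_OCE."""
    ray = p_surf - sensor_origin
    ray = ray / np.linalg.norm(ray)          # unit ray direction from the sensor
    delta = CLASS_DEPTH_OFFSET[class_name]   # assumed half-depth of the class
    return p_surf + delta * ray
```

Since LiDAR only sees the near surface of an object, pushing the point deeper along the viewing ray by roughly half the object's extent is a reasonable way to recover the center, which is what the class-specific offset encodes.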
- Adaptive Neighbor Sampling (ANS): To compensate for cases where OCE cannot produce reliable centers (e.g., severe occlusion or missing masks), ANS clusters the LiDAR point cloud using DBSCAN. Each cluster’s core point (p_cluster) becomes an anchor, and a set of N neighbor points is randomly sampled within a radius r around the core. Crucially, only points whose projections fall within a predefined offset of the corresponding segmentation mask are retained, filtering out irrelevant background points such as ground or walls. If the retained set is too small, the sampling is repeated up to three times. This module enriches the query set with semantically consistent points surrounding each object, improving spatial coverage.
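The sample-filter-retry loop of ANS can be sketched as below. This is an illustrative sketch only: the mask test is abstracted into a caller-supplied predicate `in_mask`, and the `min_keep` threshold is an assumption, as the summary does not specify what "too small" means.

```python
import numpy as np

def adaptive_neighbor_sampling(points, core, r, n_samples, in_mask,
                               min_keep=4, max_retries=3, rng=None):
    """Sample up to n_samples LiDAR points within radius r of a cluster core
    point, keeping only points whose image projection passes the mask test
    in_mask. Retries (up to max_retries) if too few points survive the
    filter. min_keep is an illustrative threshold, not from the paper."""
    rng = rng or np.random.default_rng(0)
    # Restrict to the spherical neighborhood around the cluster core.
    near = points[np.linalg.norm(points - core, axis=1) <= r]
    if len(near) == 0:
        return np.empty((0, 3))
    kept = np.empty((0, 3))
    for _ in range(max_retries):
        idx = rng.choice(len(near), size=min(n_samples, len(near)),
                         replace=False)
        cand = near[idx]
        # Keep only candidates consistent with the segmentation mask,
        # discarding background points such as ground or walls.
        kept = cand[[in_mask(p) for p in cand]]
        if len(kept) >= min_keep:
            break
    return kept
```

In practice the DBSCAN step that produces `core` would come from a clustering library (e.g. sklearn's `DBSCAN`), and `in_mask` would project a 3D point into the camera image and test it against the instance mask with the predefined offset.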
- Dynamic Query Balancing (DQB): After allocating queries to OCE centers and ANS cluster cores, the remaining query budget is split between additional neighbor queries (N_ANS) and background random queries (N_rand) based on a balancing factor ∇_bal.
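The budget split described above can be sketched as a simple allocation rule. This is an assumption-laden sketch: the summary cuts off before defining the balancing factor, so the code below assumes a scalar in [0, 1] that controls the foreground/background ratio; the paper's actual DQB is adaptive and may compute this factor per scene.

```python
def split_query_budget(total_queries, n_oce, n_cluster, bal=0.5):
    """Split the query budget left after OCE centers and ANS cluster cores
    between extra neighbor queries (N_ANS) and random background queries
    (N_rand). bal is an assumed scalar in [0, 1]; the paper's exact
    definition of the balancing factor is not given in the summary."""
    remaining = max(total_queries - n_oce - n_cluster, 0)
    n_ans = int(round(bal * remaining))   # foreground share
    n_rand = remaining - n_ans            # background share
    return n_ans, n_rand
```

For example, with 200 total queries, 40 OCE centers, 60 cluster cores, and `bal=0.7`, the 100 remaining queries split into 70 neighbor queries and 30 random background queries.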