GaussianOcc3D: A Gaussian-Based Adaptive Multi-modal 3D Occupancy Prediction
3D semantic occupancy prediction is a pivotal task in autonomous driving, providing a dense, fine-grained understanding of the surrounding environment, yet single-modality methods face a trade-off between camera semantics and LiDAR geometry. Existing multi-modal frameworks often struggle with modality heterogeneity, spatial misalignment, and a representation dilemma: voxels are computationally heavy, while BEV alternatives are lossy. We present GaussianOcc3D, a multi-modal framework bridging camera and LiDAR through a memory-efficient, continuous 3D Gaussian representation. We introduce four modules: (1) LiDAR Depth Feature Aggregation (LDFA), using depth-wise deformable sampling to lift sparse signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS) to mitigate domain noise; (3) Adaptive Camera-LiDAR Fusion (ACLF) with uncertainty-aware reweighting for sensor reliability; and (4) a Gauss-Mamba Head leveraging Selective State Space Models for global context with linear complexity. Evaluations on the Occ3D, SurroundOcc, and SemanticKITTI benchmarks demonstrate state-of-the-art performance, achieving mIoU scores of 49.4%, 28.9%, and 25.2%, respectively. GaussianOcc3D also exhibits superior robustness under challenging rainy and nighttime conditions.
💡 Research Summary
GaussianOcc3D tackles the challenging problem of dense 3D semantic occupancy prediction for autonomous driving by unifying camera and LiDAR information within a continuous, memory‑efficient Gaussian mixture representation. Traditional voxel‑based methods suffer from cubic memory and compute costs because they must process large volumes of empty space, while bird's‑eye‑view (BEV) approaches lose vertical detail. Recent advances in 3D Gaussian splatting have shown that anisotropic Gaussians can concentrate resources on object surfaces, but prior work has been either camera‑only or has not fully exploited LiDAR geometry.
The proposed framework first extracts dense image features with a 2‑D backbone and sparse geometric features with a 3‑D sparse encoder. A set of learnable Gaussian primitives (each defined by mean m, rotation r, scale s, opacity σ, and class logits c) is initialized as spatial queries. Four novel modules process and fuse the multi‑modal signals before the final occupancy map is rendered via a Gaussian‑to‑voxel splatting operation.
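The primitive parameterization described above can be sketched as a set of learnable tensors. This is a minimal illustration, not the paper's code: the uniform mean initialization, the scene range, and the initial opacity value are assumptions.

```python
import torch

def init_gaussian_queries(num_gaussians: int, num_classes: int,
                          scene_range=(-50.0, 50.0)):
    """Initialize learnable Gaussian primitives as spatial queries.

    Each primitive carries a mean m (3), a rotation quaternion r (4),
    a per-axis scale s (3), an opacity sigma (1), and class logits c.
    The initialization scheme here is an illustrative assumption.
    """
    lo, hi = scene_range
    mean = torch.empty(num_gaussians, 3).uniform_(lo, hi)   # m: scattered in scene
    rot = torch.zeros(num_gaussians, 4)
    rot[:, 0] = 1.0                                         # r: identity quaternion
    scale = torch.ones(num_gaussians, 3)                    # s: unit scale
    opacity = torch.full((num_gaussians, 1), 0.1)           # sigma: mostly transparent
    logits = torch.zeros(num_gaussians, num_classes)        # c: uninformative classes
    return torch.nn.ParameterDict({
        "mean": torch.nn.Parameter(mean),
        "rot": torch.nn.Parameter(rot),
        "scale": torch.nn.Parameter(scale),
        "opacity": torch.nn.Parameter(opacity),
        "logits": torch.nn.Parameter(logits),
    })
```

All five attribute tensors are registered as parameters, so the primitives are refined end-to-end by the four modules before splatting.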
- LiDAR Depth Feature Aggregation (LDFA) lifts sparse LiDAR points into the Gaussian space. Depth‑wise deformable sampling generates K learnable offsets around each Gaussian anchor; the offsets are projected onto depth planes and bilinearly interpolated to produce weighted feature aggregates. To mitigate depth sparsity, the depth dimension is partitioned into K chunks; stochastic depth and a cross‑depth attention mechanism randomly permute chunk order during training, encouraging the network to be robust to missing planes.
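The deformable-sampling step of LDFA can be sketched as follows. This is a simplified sketch under stated assumptions: a single LiDAR feature volume of shape (B, C, D, H, W), anchors already normalized to [-1, 1], and a fixed offset radius; the paper's exact depth-plane projection, chunking, and cross-depth attention are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDeformableSampling(nn.Module):
    """Sketch of LDFA-style deformable sampling around Gaussian anchors."""

    def __init__(self, channels: int, num_offsets: int = 4):
        super().__init__()
        self.num_offsets = num_offsets
        # Predict K learnable 3-D offsets and K aggregation weights per anchor.
        self.offset_head = nn.Linear(channels, num_offsets * 3)
        self.weight_head = nn.Linear(channels, num_offsets)

    def forward(self, volume, anchors, queries):
        # volume: (B, C, D, H, W); anchors: (B, N, 3) in [-1, 1]; queries: (B, N, C)
        B, N, _ = anchors.shape
        offsets = self.offset_head(queries).view(B, N, self.num_offsets, 3)
        weights = self.weight_head(queries).softmax(dim=-1)        # (B, N, K)
        # Bounded learnable offsets around each anchor (0.05 radius is an assumption).
        pts = anchors.unsqueeze(2) + 0.05 * offsets.tanh()         # (B, N, K, 3)
        # grid_sample over a 5-D volume: grid is (B, N, K, 1, 3), (x, y, z) order.
        grid = pts.view(B, N, self.num_offsets, 1, 3)
        feats = F.grid_sample(volume, grid, align_corners=True)    # (B, C, N, K, 1)
        feats = feats.squeeze(-1).permute(0, 2, 3, 1)              # (B, N, K, C)
        # Weighted aggregation of the K trilinearly interpolated samples.
        return (weights.unsqueeze(-1) * feats).sum(dim=2)          # (B, N, C)
```

Note that `grid_sample` on a 5-D volume performs trilinear interpolation; restricting the grid to individual depth planes would recover the bilinear, per-plane variant the summary describes.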
- Entropy‑Based Feature Smoothing (EBFS) addresses distributional misalignment between the dense visual features and the sparse geometric features. Both feature sets are converted to probability distributions using temperature‑scaled softmax, and bidirectional cross‑entropy (camera→LiDAR and LiDAR→camera) is computed. The resulting entropy maps are transformed into exponential decay weights that are added back to the original features via a residual connection with a learnable scaling factor. Randomly selecting a subset of layers for smoothing each iteration prevents over‑reliance on the module and acts as a regularizer.
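The EBFS computation can be sketched in a few lines. Assumed here: per-Gaussian features of shape (N, C), a temperature of 1.0, and a scalar residual factor standing in for the learnable scaling; the layer-subset randomization is omitted.

```python
import torch

def entropy_smoothing(cam_feat, lidar_feat, tau: float = 1.0, alpha: float = 0.1):
    """Sketch of EBFS: temperature-scaled softmax turns each feature vector
    into a distribution, bidirectional cross-entropy measures camera<->LiDAR
    disagreement, and an exponential decay of that entropy reweights the
    features through a residual connection (alpha is a placeholder for the
    learnable scaling factor)."""
    p_cam = (cam_feat / tau).softmax(dim=-1)
    p_lid = (lidar_feat / tau).softmax(dim=-1)
    eps = 1e-8
    # Bidirectional cross-entropy maps, one scalar per feature vector.
    ce_c2l = -(p_cam * (p_lid + eps).log()).sum(dim=-1, keepdim=True)  # camera -> LiDAR
    ce_l2c = -(p_lid * (p_cam + eps).log()).sum(dim=-1, keepdim=True)  # LiDAR -> camera
    # Exponential decay: high cross-modal disagreement -> small weight.
    w_cam = torch.exp(-ce_c2l)
    w_lid = torch.exp(-ce_l2c)
    # Residual re-injection of the entropy-weighted features.
    return cam_feat + alpha * w_cam * cam_feat, lidar_feat + alpha * w_lid * lidar_feat
```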
- Adaptive Camera‑LiDAR Fusion (ACLF) performs cross‑attention refinement for each modality, then learns a soft gating mask through an MLP to dynamically weight the LiDAR‑dominant and camera‑dominant streams. A consistency‑aware reweighting stage computes cosine similarity between the projected latent spaces; low similarity triggers a channel‑wise suppression gate, effectively filtering out hallucinated visual cues or noisy LiDAR returns (e.g., multipath reflections). This dynamic weighting enables the system to adapt to varying sensor reliability, such as low‑light conditions where LiDAR geometry is more trustworthy.
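The gating and consistency-reweighting stages of ACLF can be sketched as below. The MLP sizes, the sigmoid-based suppression gate, and the omission of the cross-attention refinement are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Sketch of ACLF's soft gating plus consistency-aware reweighting."""

    def __init__(self, channels: int):
        super().__init__()
        # MLP producing a scalar gate per Gaussian from both modalities.
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(),
            nn.Linear(channels, 1), nn.Sigmoid())
        # Channel-wise suppression driven by cross-modal similarity.
        self.suppress = nn.Sequential(nn.Linear(1, channels), nn.Sigmoid())

    def forward(self, cam_feat, lidar_feat):
        # cam_feat, lidar_feat: (N, C) per-Gaussian features.
        g = self.gate(torch.cat([cam_feat, lidar_feat], dim=-1))   # (N, 1)
        fused = g * cam_feat + (1.0 - g) * lidar_feat              # soft modality blend
        # Low cosine similarity between modalities -> stronger suppression,
        # filtering hallucinated visual cues or noisy LiDAR returns.
        sim = F.cosine_similarity(cam_feat, lidar_feat, dim=-1,
                                  eps=1e-8).unsqueeze(-1)          # (N, 1)
        return fused * self.suppress(sim)                          # (N, C)
```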
- Gauss‑Mamba Head captures global context among all Gaussians. By ordering the 3‑D primitives into a 1‑D sequence and adding positional encodings derived from their means, the model feeds the sequence into a Selective State‑Space Model (Mamba). This architecture provides linear‑time long‑range dependency modeling, avoiding the quadratic cost of traditional Transformers while still delivering the global reasoning needed for coherent occupancy maps.
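The serialization step that precedes the SSM can be sketched as follows. The lexicographic (x, y, z) ordering and the sinusoidal positional encoding are assumptions standing in for the paper's exact scheme, and the Mamba block itself is not reproduced here.

```python
import torch

def serialize_gaussians(means, feats):
    """Order 3-D Gaussian primitives into a 1-D sequence for an SSM head
    and add a positional encoding derived from the Gaussian means.

    means: (N, 3) Gaussian centers; feats: (N, C) per-Gaussian features.
    Returns the reordered, position-encoded features and the sort order.
    """
    # Coarse lexicographic key: sort primarily by x, then y, then z
    # (a simple stand-in for the paper's unspecified ordering).
    key = means[:, 0] * 1e6 + means[:, 1] * 1e3 + means[:, 2]
    order = key.argsort()
    means, feats = means[order], feats[order]
    # Sinusoidal encoding of each coordinate, trimmed to the feature width.
    C = feats.shape[1]
    freqs = 2.0 ** torch.arange(C // 6 + 1, dtype=torch.float32)
    pe = torch.cat(
        [f(means.unsqueeze(-1) * freqs).flatten(1) for f in (torch.sin, torch.cos)],
        dim=1)[:, :C]
    return feats + pe, order
```

The resulting (N, C) sequence would then be consumed by a selective state-space layer, giving linear-time mixing over all N primitives.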
Training optimizes a combined loss: cross‑entropy for class discrimination and Lovász‑softmax for direct IoU optimization, weighted by λ coefficients. Experiments on three large‑scale benchmarks—Occ3D, SurroundOcc, and SemanticKITTI—show state‑of‑the‑art performance: 49.4% mIoU on Occ3D, 28.9% on SurroundOcc, and 25.2% on SemanticKITTI, each surpassing prior methods by several percentage points. Notably, under adverse weather (rain) and nighttime scenarios, GaussianOcc3D maintains a 5–7 percentage‑point advantage, demonstrating its robustness. Ablation studies confirm that each module contributes meaningfully (LDFA +3.2%p, EBFS +1.8%p, ACLF +2.5%p, Gauss‑Mamba +1.4%p).
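The combined objective can be sketched as below, with the Lovász-softmax term following the standard Lovász-extension formulation (Berman et al.); the λ values shown are placeholder assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss for one class."""
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if len(gt_sorted) > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def occupancy_loss(logits, labels, lam_ce: float = 1.0, lam_lov: float = 1.0):
    """Combined loss sketch: cross-entropy plus Lovász-softmax, weighted by
    lambda coefficients. logits: (N, C); labels: (N,) integer class ids."""
    ce = F.cross_entropy(logits, labels)
    probs = logits.softmax(dim=-1)
    per_class = []
    for c in range(logits.shape[1]):
        fg = (labels == c).float()
        if fg.sum() == 0:          # skip classes absent from this batch
            continue
        errors = (fg - probs[:, c]).abs()
        errors_sorted, perm = errors.sort(descending=True)
        per_class.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    lov = torch.stack(per_class).mean() if per_class else logits.sum() * 0.0
    return lam_ce * ce + lam_lov * lov
```

Cross-entropy drives per-voxel class discrimination, while the Lovász term directly optimizes a surrogate of the IoU that the benchmarks report.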
The paper also discusses limitations: the number of Gaussians influences initialization cost, calibration errors can still affect performance, and integration with downstream SLAM or planning pipelines remains to be explored. Future work may focus on adaptive Gaussian sampling, online sensor calibration, and end‑to‑end system‑level evaluations.
In summary, GaussianOcc3D introduces a principled, efficient, and adaptable multi‑modal occupancy prediction pipeline that leverages continuous Gaussian representations, entropy‑guided smoothing, uncertainty‑aware fusion, and linear‑complexity global modeling, setting a new benchmark for 3D scene understanding in autonomous driving.