TFusionOcc: Student's t-Distribution Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive fine-grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision-making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, the intermediate representations used by existing methods for 3D semantic occupancy prediction rely heavily on 3D voxel volumes or a set of 3D Gaussians, hindering the model’s ability to efficiently and effectively capture fine-grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object-centric multi-sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi-stage multi-sensor fusion, Student’s t-distribution, and the T-Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieved state-of-the-art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes-C dataset to demonstrate the robustness of the proposed method in different camera and lidar corruption scenarios. The code will be available at: https://github.com/DanielMing123/TFusionOcc
💡 Research Summary
TFusionOcc presents a novel object‑centric multi‑sensor fusion framework for 3D semantic occupancy prediction that significantly advances the state of the art in both accuracy and robustness. The authors identify two major shortcomings of existing approaches: voxel‑based methods suffer from massive redundant computation over empty space, while recent Gaussian‑based object‑centric models lack sufficient geometric expressiveness and are vulnerable to outliers. To overcome these issues, TFusionOcc introduces three key technical innovations.
First, it replaces the Gaussian kernel with a Student’s t‑distribution and a T‑Mixture Model (TMM). The heavy‑tailed nature of the t‑distribution provides intrinsic robustness to noisy measurements and sensor corruptions, while the mixture formulation enables the representation of complex, multimodal occupancy patterns without a prohibitive increase in parameters.
Second, the framework expands the primitive set beyond simple 3D Gaussians. It defines a general T‑Primitive, a superquadric primitive, and a deformable superquadric (superquadric with an inverse warp). Superquadrics can smoothly approximate a wide variety of shapes—from elongated vehicles to irregular pedestrians—while the inverse warp adds non‑linear deformation capability, allowing the primitives to capture fine‑grained geometric details that are essential for accurate occupancy maps.
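The geometric flexibility of superquadrics comes from their inside-outside function. The minimal sketch below implements the standard axis-aligned form (rotation, translation, and the inverse warp that makes the primitive deformable are omitted for brevity; the function name is illustrative):

```python
def superquadric_F(p, a, eps1, eps2):
    """Inside-outside function of an axis-aligned superquadric with
    semi-axes a = (a1, a2, a3) and shape exponents eps1 (vertical)
    and eps2 (horizontal).  F < 1 inside, F == 1 on the surface,
    F > 1 outside.  eps1 = eps2 = 1 recovers an ellipsoid; exponents
    near 0 approach a box."""
    x, y, z = (abs(p[i]) / a[i] for i in range(3))
    return (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1) \
        + z ** (2.0 / eps1)
```

For example, the near-corner point (0.9, 0.9, 0.9) of the unit cube lies outside the unit sphere (F ≈ 2.43) but inside a box-like superquadric with eps1 = eps2 = 0.1, which is why a single superquadric can fit vehicle-like shapes that would require several Gaussians.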
Third, TFusionOcc adopts a multi‑stage fusion strategy, termed MGCAFusion, which consists of early‑stage, middle‑stage, and late‑stage fusion modules. Early‑stage fusion merges dense depth maps derived from camera images with sparse depth maps projected from lidar points, producing a fused depth representation that is subsequently encoded by a lightweight MLP and added to the visual feature stream. Middle‑stage fusion performs an outer product between depth‑aware visual features and the fused depth map, effectively "lifting" 2D image information into a 3D volumetric space. Late‑stage fusion aligns the cylindrical voxel volumes from the camera and lidar branches via a skeleton‑merge module, yielding a unified voxel representation that provides the anchor points for initializing the T‑primitives.
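The middle-stage outer product can be sketched in a few lines. This is a generic lift-splat-style illustration under assumed toy shapes, not the paper's exact implementation (the depth distribution here is a plain softmax over depth bins):

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H, W = 16, 8, 4, 6  # channels, depth bins, feature height/width (toy sizes)

img_feats = rng.standard_normal((C, H, W)).astype(np.float32)
depth_logits = rng.standard_normal((D, H, W)).astype(np.float32)

# Per-pixel categorical depth distribution: softmax over the D bins.
depth_probs = np.exp(depth_logits)
depth_probs /= depth_probs.sum(axis=0, keepdims=True)

# Outer product: each pixel's feature vector is scattered along its ray,
# weighted by the likelihood of each depth bin -> (D, C, H, W) frustum volume.
frustum = depth_probs[:, None, :, :] * img_feats[None, :, :, :]
print(frustum.shape)  # (8, 16, 4, 6)
```

Because the depth weights sum to one along each ray, summing the frustum over the depth axis recovers the original image features, i.e. the lift redistributes rather than duplicates feature mass.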
The initialized primitives, together with the fused visual and lidar features, are processed by a stack of N transformer blocks. This transformer refines the primitive parameters (position, scale, shape, and deformation) in a data‑driven manner, ensuring that the final set of primitives faithfully encodes both geometry and semantics. A 3D‑to‑3D splatting operation then rasterizes the refined primitives into a dense occupancy grid, where each voxel is assigned one of C semantic classes (or "empty").
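The splatting step can be approximated with a simplified sketch: each primitive is reduced here to a diagonal-scale t-kernel carrying per-class evidence (the full deformable superquadrics and the `splat_t_primitives` helper are assumptions for illustration, not the paper's rasterizer):

```python
import numpy as np

def splat_t_primitives(means, scales, logits, centers, nu=3.0, empty_thresh=0.05):
    """Rasterize P heavy-tailed primitives into V voxels.
    means (P,3), scales (P,3): diagonal t-kernels per primitive;
    logits (P,C): per-primitive class evidence; centers (V,3): voxel
    centers.  Returns one label per voxel; label C means "empty"."""
    d = (centers[:, None, :] - means[None, :, :]) / scales[None, :, :]  # (V,P,3)
    m2 = (d ** 2).sum(-1)                                               # (V,P)
    w = (1.0 + m2 / nu) ** (-(nu + 3.0) / 2.0)  # t-kernel weight per voxel/primitive
    sem = w @ logits                             # (V,C) accumulated class evidence
    labels = sem.argmax(-1)
    labels[w.max(-1) < empty_thresh] = logits.shape[1]  # no primitive nearby -> empty
    return labels

# Two one-hot primitives: class 0 at the origin, class 1 at (10, 0, 0).
means = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
scales = np.ones((2, 3))
logits = np.array([[1.0, 0.0], [0.0, 1.0]])
centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [100.0, 100.0, 100.0]])
print(splat_t_primitives(means, scales, logits, centers))  # [0 1 2]
```

The distant voxel falls below every primitive's weight threshold and is labeled empty, which is exactly how an object-centric representation avoids spending computation on free space.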
The architecture is evaluated on the nuScenes dataset and its corrupted variant nuScenes‑C, which introduces realistic camera and lidar degradations (e.g., rain, fog, sensor dropout). TFusionOcc achieves the highest mean Intersection‑over‑Union (mIoU) and overall IoU among all published methods, while maintaining a competitive inference speed of approximately 12 FPS on a single RTX‑3090 GPU. Ablation studies demonstrate that replacing the t‑distribution with a Gaussian kernel reduces mIoU by roughly 2.3 %, and substituting the deformable superquadric with a simple Gaussian primitive degrades boundary fidelity, confirming the importance of both the heavy‑tailed kernel and the flexible primitive family. Moreover, the multi‑stage fusion contributes an additional 1.8 %–2.5 % performance gain over a single‑stage baseline, highlighting the benefit of progressively integrating complementary sensor cues.
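The mIoU figures reported above follow the usual definition, which is easy to reproduce; the `mean_iou` helper below is a generic sketch, not the benchmark's official evaluation code:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union across classes, averaged only over
    classes that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both volumes
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
print(mean_iou(pred, gt, num_classes=2))  # (1/2 + 2/3) / 2 = 0.5833...
```

On occupancy benchmarks the same computation is applied over flattened voxel grids, typically with the empty class excluded from the average.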
Despite its strengths, TFusionOcc currently relies on six surround‑view cameras and a single lidar sensor. Extending the framework to incorporate radar, ultrasonic, or inertial measurements, and validating real‑time deployment on embedded automotive hardware, remain open research directions. Additionally, the optimization of T‑primitive parameters can be sensitive to initialization; future work may explore more stable training schemes or learned priors.
In summary, TFusionOcc delivers a compelling combination of geometric flexibility, computational efficiency, and outlier robustness for 3D semantic occupancy prediction. By unifying a heavy‑tailed probabilistic kernel, expressive deformable primitives, and a carefully staged multi‑sensor fusion pipeline, the method sets a new benchmark for autonomous‑driving perception and paves the way for more reliable, fine‑grained environmental modeling in real‑world conditions.