Mixed Precision PointPillars for Efficient 3D Object Detection with TensorRT
LIDAR 3D object detection is one of the key tasks for autonomous vehicles, and ensuring that it runs in real time is crucial. Model quantization can accelerate inference, but directly applying it often degrades performance due to LIDAR’s wide numerical distributions and extreme outliers. To address the wide numerical distribution, we propose a mixed precision framework designed for PointPillars. Our framework first searches for sensitive layers with post-training quantization (PTQ) by quantizing one layer at a time to 8-bit integer (INT8) and evaluating each resulting model’s average precision (AP). The top-k most sensitive layers are assigned to floating point (FP). Combinations of these layers are greedily searched to produce candidate mixed precision models, which are finalized with either PTQ or quantization-aware training (QAT). Furthermore, to handle outliers, we observe that using a very small calibration set reduces the likelihood of encountering outliers, thereby improving PTQ performance. Our PTQ pipeline provides mixed precision models without any training, while our QAT pipeline achieves performance competitive with FP models. Deployed with TensorRT, our mixed precision models reduce latency by up to 2.538× compared to FP32 models.
💡 Research Summary
The paper addresses the challenge of deploying LIDAR‑based 3D object detection models, specifically PointPillars, on edge devices where real‑time performance is mandatory. While post‑training quantization (PTQ) to 8‑bit integer (INT8) can dramatically reduce memory bandwidth and arithmetic cost, directly applying INT8 to PointPillars leads to severe drops in average precision (AP) because LIDAR point clouds exhibit very wide numerical ranges and contain extreme outliers.
To mitigate these issues, the authors propose a mixed‑precision framework that automatically identifies the most “sensitive” layers and keeps them in higher precision (FP16) while quantizing the remaining layers to INT8. The sensitivity search is performed entirely with PTQ: each layer is individually replaced by an INT8 version, the model is calibrated with a tiny set of calibration frames (four LIDAR sweeps), and the resulting AP40 on the validation set is recorded. Layers that cause the largest AP degradation are deemed sensitive.
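As a sketch, the per-layer sensitivity search reduces to comparing each single-layer-INT8 model’s AP40 against the full-precision baseline and ranking layers by the drop. The layer names and AP values below are hypothetical stand-ins, not the paper’s actual measurements:

```python
def rank_sensitive_layers(baseline_ap40, ap40_with_layer_int8):
    """Rank layers by AP40 drop when only that layer is quantized to INT8.

    baseline_ap40: AP40 of the full-precision model.
    ap40_with_layer_int8: dict mapping layer name -> AP40 of the PTQ model
        in which only that layer runs in INT8 (calibrated on a tiny set).
    Returns layer names ordered from most to least sensitive.
    """
    drops = {name: baseline_ap40 - ap for name, ap in ap40_with_layer_int8.items()}
    return sorted(drops, key=drops.get, reverse=True)

# Hypothetical per-layer results: a bigger AP drop means a more sensitive layer.
single_layer_ap40 = {
    "voxel_encoder.linear": 40.1,
    "backbone.block0.3": 79.8,
    "bbox_head.conv_reg": 55.2,
}
ranked = rank_sensitive_layers(86.35, single_layer_ap40)
```

In a real pipeline each AP40 entry would come from calibrating and evaluating one PTQ variant of the network; the ranking itself is this simple.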
Because exhaustively testing all possible subsets of sensitive layers would be combinatorially expensive, the authors adopt a greedy strategy. After ranking layers by sensitivity, they generate k candidate mixed‑precision models by progressively promoting the top‑1, top‑2, …, top‑k layers to FP16 while leaving the rest in INT8. Only these k candidates are evaluated, drastically reducing the search cost. The selected candidates are then finalized either by a second PTQ pass (re‑calibrating scales) or by quantization‑aware training (QAT) to further close the accuracy gap.
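A minimal sketch of the greedy candidate generation, assuming the layers are already ranked from most to least sensitive (layer names are hypothetical):

```python
def mixed_precision_candidates(ranked_layers, k):
    """Build k candidate precision assignments: for i = 1..k, the top-i most
    sensitive layers are promoted to FP16 and every other layer stays INT8."""
    candidates = []
    for i in range(1, k + 1):
        fp16_set = set(ranked_layers[:i])
        candidates.append({
            name: "fp16" if name in fp16_set else "int8"
            for name in ranked_layers
        })
    return candidates

ranked = ["voxel_encoder.linear", "bbox_head.conv_reg", "backbone.block0.3"]
cands = mixed_precision_candidates(ranked, k=2)
```

Only these k assignments need to be calibrated and evaluated, instead of all 2^n subsets of n sensitive layers.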
A second key insight concerns the calibration data size. In min‑max PTQ, the scaling factor is determined by the maximum absolute activation observed during calibration. LIDAR data’s sparsity means that a few outlier points can inflate this maximum, leading to large quantization steps and high rounding error. By deliberately limiting calibration to only four frames, the probability of encountering such outliers drops dramatically, resulting in a much smaller scaling factor and a 23.8 % improvement in mean AP compared to using 4 096 frames.
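The effect can be illustrated with synthetic activations: under symmetric min‑max PTQ the INT8 scale is max|x| / 127, so a single outlier captured by a large calibration set inflates the scale for every value in the tensor (the data below is synthetic, not the paper’s measurements):

```python
import numpy as np

def minmax_int8_scale(x):
    """Symmetric min-max PTQ scale for INT8: quantization step = max|x| / 127."""
    return np.abs(x).max() / 127.0

rng = np.random.default_rng(0)
typical = rng.normal(0.0, 1.0, size=100_000)       # dense, well-behaved activations
outliers = np.array([80.0, -95.0])                 # rare extreme LIDAR points

small_calib = typical[:4_000]                      # small set: outliers never sampled
large_calib = np.concatenate([typical, outliers])  # large set: outliers included

s_small = minmax_int8_scale(small_calib)
s_large = minmax_int8_scale(large_calib)
# s_large is about 95/127, dozens of times coarser than s_small, so every
# in-range activation is rounded with correspondingly larger error.
```

This mirrors the paper’s observation: a tiny calibration set is unlikely to contain an outlier, yielding a much smaller scale and lower rounding error for the bulk of the distribution.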
Experiments are conducted on the KITTI 3D detection benchmark using the MMDetection3D framework. The FP32 baseline achieves 86.35 % AP40 for the car class, while a naïve INT8 PTQ model falls to 27.29 % AP40, confirming the difficulty of direct quantization. Mixed‑precision PTQ models, where the three most sensitive layers (voxel encoder linear, bbox head conv regression, and backbone block 0.3) are kept in FP16, recover AP40 to around 70 % without any additional training. Mixed‑precision QAT models close the gap further, reaching 75.06 % AP40 while still using INT8 for the majority of the network.
Latency measurements on TensorRT‑enabled hardware (NVIDIA Jetson Orin and RTX 4070Ti) show that INT8 layers are consistently faster than FP16 or FP32. For example, backbone block 0.0 runs in 0.376 ms as INT8 versus 1.382 ms as FP32. Overall, the mixed‑precision models achieve up to 2.538× lower inference latency compared with the FP32 baseline, while preserving most of the detection accuracy.
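As a back-of-the-envelope check, the end-to-end speedup follows from summed per-layer latencies. In the sketch below, only the backbone block 0.0 numbers (1.382 ms FP32 vs 0.376 ms INT8) come from the measurements above; the other entries are placeholders:

```python
def speedup(baseline_ms, optimized_ms):
    """Overall speedup from per-layer latency lists (illustrative model:
    assumes total latency is the sum of the layer latencies)."""
    return sum(baseline_ms) / sum(optimized_ms)

# backbone block 0.0 latencies are from the text; the rest are placeholders.
fp32_layers = [1.382, 0.900, 0.600]
mixed_layers = [0.376, 0.500, 0.400]
x = speedup(fp32_layers, mixed_layers)   # ~2.26x for these placeholder numbers
```

In practice the speedup depends on which layers stay FP16 and on TensorRT’s kernel selection, which is why the measured figure (up to 2.538×) must come from profiling the deployed engine.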
In summary, the contributions are threefold: (1) an automated PTQ‑based layer‑sensitivity analysis, (2) a greedy mixed‑precision search that optimizes accuracy rather than latency, and (3) the observation that extremely small calibration sets mitigate outlier‑induced scaling errors. The approach requires no architectural changes, works with standard TensorRT symmetric quantization, and demonstrates that high‑accuracy, low‑latency 3D object detection is feasible on resource‑constrained platforms.