TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception
Labeling LiDAR point clouds is notoriously time- and energy-consuming, which has spurred recent unsupervised 3D representation learning methods that alleviate the labeling burden in LiDAR perception via pretrained weights. Almost all existing works focus on a single frame of LiDAR point cloud and neglect the temporal LiDAR sequence, which naturally captures object motion (and hence semantics). Instead, we propose TREND, namely Temporal REndering with Neural fielD, to learn 3D representations by forecasting future observations in an unsupervised manner. Unlike existing work that follows conventional contrastive learning or masked autoencoding paradigms, TREND integrates forecasting into 3D pre-training through a Recurrent Embedding scheme that generates 3D embeddings across time and a Temporal Neural Field that represents the 3D scene, through which we compute the loss using differentiable rendering. To the best of our knowledge, TREND is the first work on temporal forecasting for unsupervised 3D representation learning. We evaluate TREND on downstream 3D object detection tasks on popular datasets, including NuScenes, Once and Waymo. Experimental results show that TREND brings up to 90% more improvement than previous SOTA unsupervised 3D pre-training methods and generally improves different downstream models across datasets, demonstrating that temporal forecasting indeed benefits LiDAR perception.
💡 Research Summary
The paper introduces TREND (Temporal REndering with Neural fielD), a novel unsupervised pre-training framework for LiDAR-based 3D perception that leverages the temporal dimension of point-cloud sequences. Existing unsupervised methods, whether masked autoencoders or contrastive learning, operate on single frames and ignore motion cues that naturally encode semantic information. TREND addresses this gap by formulating the pre-training task as future point-cloud forecasting.
The pipeline consists of three main components. First, a standard 3D encoder (e.g., a sparse 3D CNN) processes the current LiDAR sweep $P_{t_0}$ into a dense feature volume $\hat{P}_{t_0}$. Second, a Recurrent Embedding scheme propagates this representation forward in time. Ego-vehicle actions between consecutive timestamps $(\Delta x, \Delta y, \Delta\theta)$ are sinusoidally encoded for translation and represented as
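The summary cuts off before specifying how the rotation component is represented, so the following is only a minimal sketch of the translation part: a NeRF-style sinusoidal encoding applied to each ego-action scalar. The function names, the number of frequency bands, and the choice to encode $\Delta\theta$ the same way as the translations are all assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def sinusoidal_encode(value: float, num_freqs: int = 4) -> np.ndarray:
    """Encode one scalar with sin/cos at geometric frequencies 2^k,
    in the style of NeRF positional encodings (assumed, not from the paper)."""
    freqs = 2.0 ** np.arange(num_freqs)          # [1, 2, 4, 8]
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # length 2*num_freqs

def encode_ego_action(dx: float, dy: float, dtheta: float,
                      num_freqs: int = 4) -> np.ndarray:
    """Hypothetical sketch: embed the ego action (dx, dy, dtheta) between
    consecutive sweeps into one fixed-length vector by concatenating the
    per-scalar sinusoidal encodings."""
    return np.concatenate([
        sinusoidal_encode(dx, num_freqs),
        sinusoidal_encode(dy, num_freqs),
        sinusoidal_encode(dtheta, num_freqs),
    ])

vec = encode_ego_action(1.5, -0.2, 0.05)
print(vec.shape)  # (24,) -- 3 scalars x 2 * 4 frequency bands
```

An encoding like this gives the recurrent embedding a smooth, bounded conditioning signal for the ego motion, which is one common reason to prefer sinusoidal features over raw displacements.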