ForecastOcc: Vision-based Semantic Occupancy Forecasting

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.


💡 Research Summary

ForecastOcc introduces the first end‑to‑end vision‑only framework that jointly predicts future 3D occupancy and semantic categories directly from past camera images. Existing occupancy‑forecasting approaches either predict only a coarse static‑dynamic split or, in the case of semantic occupancy forecasting, rely on a separate occupancy‑estimation network whose errors propagate to the forecasting stage and increase computational cost. ForecastOcc eliminates this two‑stage pipeline by learning spatio‑temporal features directly from raw images.

The architecture builds on the BEVDet4D (referred to as BevOcc) backbone. A multi‑camera image encoder (EfficientNet‑B3 with a feature‑pyramid neck) extracts four‑scale 2D feature maps and fuses them into a unified 256‑channel representation at 1/16 resolution. These features are enriched with three learned embeddings (scale, camera view, and temporal position), producing context‑aware feature tensors.
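The embedding enrichment described above amounts to broadcasting three learned vectors over the feature grid. The following is a minimal NumPy sketch; the tensor layout `(T, V, C, H, W)` and the function name are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def enrich_features(feats, scale_emb, cam_emb, time_emb):
    """Add learned scale, camera-view, and temporal embeddings to 2D features.

    feats:     (T, V, C, H, W)  fused features per past frame T and camera V
    scale_emb: (C,)             embedding for the fused 1/16 scale
    cam_emb:   (V, C)           one embedding per camera view
    time_emb:  (T, C)           one embedding per temporal position
    """
    out = feats + scale_emb[None, None, :, None, None]  # same scale for all
    out = out + cam_emb[None, :, :, None, None]         # per-camera offset
    out = out + time_emb[:, None, :, None, None]        # per-frame offset
    return out
```

Because the embeddings are purely additive, each feature cell carries information about which scale, camera, and time step it came from without changing the tensor shape.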

The core forecasting module synthesizes future‑aware 2D features for each prediction horizon (1 s, 2 s, 3 s). A future‑state query is initialized from the current‑frame features and iteratively updated through a stack of L future‑interaction layers. Each layer contains (i) multi‑head cross‑attention between the query and the enriched features of a specific past frame, (ii) multi‑head self‑attention among the query tokens, (iii) a two‑layer feed‑forward network, and (iv) a shared “Future State Synthesizer” – a three‑layer MLP that refines the query. This design enables the model to progressively accumulate temporal cues and generate latent representations aligned with future observations.
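The four sub-steps of one future‑interaction layer can be sketched as follows. This is a simplified single‑head NumPy version under stated assumptions: query/key/value projections are omitted, the synthesizer MLP is collapsed to one linear map, and all weight matrices are hypothetical placeholders rather than the paper's parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (Nq, d), (Nk, d), (Nk, d) -> (Nq, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def future_interaction_layer(query, past_feats, W_ffn1, W_ffn2, W_syn):
    """One layer of the stack: the future-state query absorbs cues from
    one past frame's tokens, then refines itself."""
    # (i) cross-attention between the query and a specific past frame
    query = query + attention(query, past_feats, past_feats)
    # (ii) self-attention among the query tokens
    query = query + attention(query, query, query)
    # (iii) feed-forward network (two layers, ReLU)
    query = query + np.maximum(query @ W_ffn1, 0.0) @ W_ffn2
    # (iv) "Future State Synthesizer" refinement (MLP reduced to one step)
    return query @ W_syn
```

Stacking L such layers, each attending to a different past frame, lets the query progressively accumulate temporal cues, as described above.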

During training, a Future State Alignment loss aligns the synthesized future features with those obtained by encoding the actual future images (the image encoder is frozen). The loss combines a Huber term (to penalize magnitude differences) and a cosine‑similarity term (to enforce directional consistency).
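A minimal NumPy sketch of such a combined loss is given below. The weighting `lam`, the Huber threshold `delta`, and the flattening strategy are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def huber(diff, delta=1.0):
    # Quadratic near zero, linear in the tails; penalizes magnitude error.
    a = np.abs(diff)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta)).mean()

def alignment_loss(pred, target, lam=1.0, delta=1.0, eps=1e-8):
    """Future State Alignment loss (sketch): Huber term for magnitude,
    cosine term for directional consistency of the feature vectors."""
    h = huber(pred - target, delta)
    p = pred.reshape(pred.shape[0], -1)
    t = target.reshape(target.shape[0], -1)
    cos = (p * t).sum(-1) / (
        np.linalg.norm(p, axis=-1) * np.linalg.norm(t, axis=-1) + eps)
    return h + lam * (1.0 - cos).mean()
```

When the synthesized features exactly match the frozen encoder's features of the true future image, both terms vanish, which is the behavior the training objective rewards.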

The synthesized 2D features are then lifted into a 3D voxel grid (64 × 16 × 200 × 200) using a Lift‑Splat‑Shoot style view transformer that incorporates depth distributions and camera intrinsics/extrinsics. A 3D ResNet encoder with a feature‑pyramid neck refines the voxel features, after which a semantic occupancy head predicts voxel‑wise logits for C_cls semantic classes (including a free‑space class) across all horizons.
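The core of the Lift‑Splat‑Shoot style lifting is an outer product between a per‑pixel depth distribution and the 2D features, producing a frustum of depth‑weighted features that is then splatted into the voxel grid using the camera geometry. A minimal NumPy sketch of the lift step (splatting omitted; shapes are illustrative):

```python
import numpy as np

def lift_features(feat2d, depth_logits):
    """Lift 2D features into a per-pixel frustum (LSS-style lift step).

    feat2d:       (C, H, W)  synthesized 2D image features
    depth_logits: (D, H, W)  per-pixel logits over D discrete depth bins
    returns:      (D, C, H, W) depth-weighted features, ready to be splatted
                  into the 3D voxel grid via intrinsics/extrinsics
    """
    # Softmax over the depth bins gives a per-pixel depth distribution.
    d = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob = d / d.sum(axis=0, keepdims=True)
    # Outer product: each depth bin receives the feature scaled by its prob.
    return depth_prob[:, None] * feat2d[None]
```

Because the depth distribution sums to one per pixel, summing the frustum over depth recovers the original 2D feature; the distribution only decides *where* along the ray the feature mass lands in 3D.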

Experiments are conducted on two benchmarks: multi‑view forecasting on the Occ3D‑nuScenes dataset (six synchronized cameras) and monocular forecasting on SemanticKITTI. Baselines are created by plugging two 2D forecasting modules (a ConvLSTM and a Transformer) into the same pipeline, without the proposed cross‑attention and alignment mechanisms. ForecastOcc consistently outperforms these baselines, achieving a mean IoU improvement of roughly 7–12 percentage points. Ablation studies confirm that each component (temporal cross‑attention, view transformer, and alignment loss) contributes significantly to performance.

Key limitations include dependence on fixed camera calibration, a relatively coarse voxel resolution (≈0.5 m) that hampers detection of small distant objects, and the need for future images during training, which may restrict fully online deployment. Future work could address dynamic calibration, adaptive or higher‑resolution voxel representations, and multimodal fusion with LiDAR or radar to improve robustness under adverse conditions.

In summary, ForecastOcc demonstrates that semantic occupancy forecasting can be performed directly from images without intermediate map generation, delivering richer, future‑aware 3D scene understanding essential for autonomous driving and other anticipatory robotics applications.

