RAP: 3D Rasterization Augmented End-to-End Planning
Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real feature-space alignment that bridges the sim-to-real gap. Together, these components form Rasterization Augmented Planning (RAP), a scalable data augmentation pipeline for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking first on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results show that lightweight rasterization with feature alignment suffices to scale E2E training, offering a practical alternative to photorealistic rendering. Project page: https://alan-lanfeng.github.io/RAP/.
💡 Research Summary
The paper tackles a fundamental weakness of imitation‑learning (IL) based end‑to‑end (E2E) autonomous driving: policies are trained only on expert demonstrations, so when deployed in a closed loop they encounter out‑of‑distribution states for which they have never seen recovery data, leading to catastrophic failures. Recent works have tried to close this gap by synthesizing photorealistic “digital twins” of logged drives using neural rendering (NeRF, 3D Gaussian Splatting) or game‑engine simulators. While these methods can generate visually indistinguishable images, they are computationally expensive and therefore limited to evaluation or small‑scale training.
The authors argue that photorealism is unnecessary for training robust planners. Driving decisions depend primarily on geometry, semantics, and multi‑agent dynamics, not on textures or lighting. Guided by this insight, they propose RAP (Rasterization Augmented Planning), a scalable data‑augmentation pipeline that replaces costly rendering with a lightweight 3‑D rasterization of annotated primitives (lane polylines, oriented cuboids for vehicles, pedestrians, traffic lights, etc.).
1. 3‑D Rasterization
- From each logged frame the system reconstructs a scene using only the available annotations.
- A pinhole camera model (intrinsics K, extrinsics T) projects every 3‑D primitive onto the image plane.
- Depth‑aware alpha compositing and Sutherland‑Hodgman clipping produce a clean RGB canvas while preserving occlusion ordering.
- No textures, shading, or illumination are added; the output retains only the geometric and semantic cues that matter for planning.
- Experiments show that features extracted by a frozen DINOv3 encoder from rasterized images occupy a subspace similar to that of features from real camera images (shown via PCA visualization), confirming that the abstraction preserves perceptually meaningful structure.
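The projection-and-compositing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's renderer: it projects annotated polygon primitives through the pinhole model K[R|t] and draws them back-to-front (painter's algorithm) to respect occlusion ordering; true Sutherland-Hodgman frustum clipping and exact polygon fill are noted in comments but replaced by simpler stand-ins to keep the sketch short.

```python
import numpy as np

def project_points(pts_world, K, T_wc):
    """Project Nx3 world-frame points through a pinhole camera.

    K is the 3x3 intrinsics matrix, T_wc the 4x4 world-to-camera
    extrinsics. Returns (Nx2 pixel coords, N camera-frame depths)."""
    pts_world = np.asarray(pts_world, dtype=float)
    pts_h = np.hstack([pts_world, np.ones((len(pts_world), 1))])
    pts_cam = (T_wc @ pts_h.T).T[:, :3]           # world -> camera frame
    depth = pts_cam[:, 2]
    uvw = (K @ pts_cam.T).T
    px = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)  # perspective divide
    return px, depth

def rasterize_primitives(primitives, K, T_wc, hw=(256, 512)):
    """Paint flat-colored primitives back-to-front so nearer objects
    occlude farther ones. Each primitive is a (vertices, color) pair:
    an Nx3 world-frame polygon and an RGB tuple. A full rasterizer
    would clip against the view frustum (Sutherland-Hodgman) and fill
    exact polygons; this sketch drops polygons behind the camera and
    fills projected bounding boxes instead."""
    h, w = hw
    canvas = np.zeros((h, w, 3), dtype=np.uint8)

    def mean_depth(prim):
        return project_points(prim[0], K, T_wc)[1].mean()

    # farthest primitives first, so closer ones overwrite them
    for verts, color in sorted(primitives, key=mean_depth, reverse=True):
        px, depth = project_points(verts, K, T_wc)
        if (depth <= 0).any():
            continue  # proper handling would clip, not drop
        x0 = max(int(np.floor(px[:, 0].min())), 0)
        x1 = min(int(np.ceil(px[:, 0].max())), w)
        y0 = max(int(np.floor(px[:, 1].min())), 0)
        y1 = min(int(np.ceil(px[:, 1].max())), h)
        canvas[y0:y1, x0:x1] = color
    return canvas
```

Because no shading or textures are involved, the whole pass is a handful of matrix multiplies per primitive, which is what makes this kind of augmentation cheap to scale.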
2. Data Augmentation Strategies
Recovery‑oriented perturbations – The expert trajectory τ* is deliberately disturbed with lateral/longitudinal offsets and Gaussian noise, yielding a counterfactual trajectory τ̃. The perturbed scene is rasterized, creating training samples where the ego vehicle drifts off the expert path and must learn to recover.
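The perturbation scheme can be sketched as follows. The offset magnitudes, the linear decay that merges the perturbed path back onto the expert one, and the function name are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def perturb_trajectory(traj, lateral_std=0.5, longitudinal_std=0.3,
                       noise_std=0.05, seed=None):
    """Counterfactual perturbation of an expert trajectory (Nx2 waypoints).

    Samples one lateral/longitudinal offset plus per-waypoint Gaussian
    jitter, then decays the disturbance to zero at the final waypoint so
    the perturbed path re-merges with the expert path. The std values
    are illustrative placeholders, not the paper's settings."""
    rng = np.random.default_rng(seed)
    traj = np.asarray(traj, dtype=float)
    deltas = np.gradient(traj, axis=0)               # per-point motion direction
    heading = np.arctan2(deltas[:, 1], deltas[:, 0])
    lat = np.stack([-np.sin(heading), np.cos(heading)], axis=1)
    lon = np.stack([np.cos(heading), np.sin(heading)], axis=1)
    decay = np.linspace(1.0, 0.0, len(traj))[:, None]
    offset = (rng.normal(0.0, lateral_std) * lat
              + rng.normal(0.0, longitudinal_std) * lon) * decay
    jitter = rng.normal(0.0, noise_std, traj.shape) * decay
    return traj + offset + jitter
```

Rasterizing the scene from the perturbed pose yields a training sample whose input shows the ego off the expert path while the supervision still points back toward it.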
Cross‑agent view synthesis – For each scenario, the ego’s camera parameters are kept fixed while the ego’s trajectory is replaced by that of another agent in the log. This generates realistic viewpoints from other traffic participants without requiring additional sensors, dramatically increasing viewpoint diversity and interaction complexity.
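Keeping the camera rig fixed while swapping in another agent's pose amounts to a simple composition of transforms. The sketch below assumes planar SE(2) agent poses and hypothetical function names:

```python
import numpy as np

def se2_to_mat(x, y, yaw):
    """4x4 homogeneous transform for a planar (x, y, yaw) pose."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:2, 3] = [x, y]
    return T

def cross_agent_extrinsics(agent_pose, T_ego_cam):
    """World-to-camera matrix obtained by mounting the ego camera rig
    (T_ego_cam: camera pose in the ego body frame) on another agent's
    pose, leaving the rig itself unchanged. The result can be fed
    directly to the rasterizer as the new extrinsics."""
    T_world_agent = se2_to_mat(*agent_pose)
    T_world_cam = T_world_agent @ T_ego_cam
    return np.linalg.inv(T_world_cam)
```

Since only annotations are rendered, no real sensor ever needs to have been mounted on the other agent: any logged trajectory becomes a valid camera path.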
Together, these augmentations produce over 500k synthetic samples, far exceeding the size of the original dataset.
3. Raster‑to‑Real (R2R) Feature‑Space Alignment
To bridge the sim‑to‑real gap, the authors align rasterized and real inputs in feature space rather than pixel space.
- Spatial‑level alignment: for a paired real image x_r and its raster counterpart x_s, a visual encoder ϕ extracts spatial feature maps F_r and F_s. F_s serves as a frozen target; F_r is updated by minimizing a mean‑squared‑error loss over all spatial locations, forcing the real features to match the clean, geometry‑focused raster features.
- Global‑level alignment: an average‑pooled global descriptor g is passed through a domain classifier with a Gradient Reversal Layer, encouraging domain‑invariant representations. This mitigates systematic differences, such as background color or lighting, that are absent from raster images.
The combination yields a model that can ingest real camera images at test time while having been trained extensively on cheap rasterized data.
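The two loss terms can be sketched in NumPy as follows, assuming a linear 2‑way domain classifier (`W`, `b`) as a stand‑in for the paper's unspecified classifier head. The gradient reversal itself is a training‑time autograd trick (identity in the forward pass, negated gradient in the backward pass), so only the forward loss values are computed here:

```python
import numpy as np

def r2r_losses(feat_real, feat_raster, W, b):
    """Dual-level Raster-to-Real alignment on (B, C, H, W) feature maps.

    feat_raster plays the role of the frozen target branch. W (CxK) and
    b (K,) parameterize a hypothetical linear 2-way domain classifier;
    in a real framework its input would pass through a Gradient
    Reversal Layer so that minimizing the adversarial loss makes the
    encoder's features domain-invariant."""
    # spatial-level: MSE between real and frozen raster feature maps
    spatial = np.mean((feat_real - feat_raster) ** 2)

    # global-level: average-pool to per-image descriptors, then
    # cross-entropy of the domain classifier (real = 0, raster = 1)
    g = np.concatenate([feat_real, feat_raster]).mean(axis=(2, 3))  # (2B, C)
    logits = g @ W + b                                              # (2B, 2)
    logits = logits - logits.max(axis=1, keepdims=True)             # stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = np.array([0] * len(feat_real) + [1] * len(feat_raster))
    adv = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return spatial, adv
```

At convergence the spatial term pulls real features onto the raster feature manifold, while the reversed-gradient adversarial term removes whatever global domain signal (e.g., color statistics) remains.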
4. Experiments and Results
RAP is evaluated on four prominent benchmarks: NAVSIM v1, NAVSIM v2, Waymo Open Dataset Vision‑based E2E Driving (WOD‑E2E), and Bench2Drive. The authors report a suite of metrics (navigation completion, distance to lane center, time‑to‑collision, comfort, etc.).
- NAVSIM v1: RAP‑DINO achieves 99.1 % navigation completion, 98.9 % distance accuracy, and 96.7 % time‑to‑collision, surpassing prior state‑of‑the‑art methods such as DiffusionDrive and AutoVLA.
- NAVSIM v2: RAP‑DINO leads across all reported metrics, showing superior closed‑loop robustness and long‑tail generalization.
- Waymo & Bench2Drive: Similar gains are observed, confirming that the approach scales across different sensor setups and dataset domains.
Ablation studies demonstrate: (i) rasterization is over 10× faster than neural rendering, (ii) removing R2R alignment degrades performance, and (iii) each augmentation (recovery perturbation, cross‑agent view synthesis) contributes positively when evaluated in isolation.
5. Significance and Limitations
RAP validates the hypothesis that high‑fidelity visual rendering is not required for training robust E2E planners; semantic‑geometric fidelity suffices when coupled with feature‑space alignment. This dramatically reduces computational cost, enabling large‑scale data augmentation that was previously infeasible. The method opens a practical path toward scalable, closed‑loop training of vision‑based planners without the heavy infrastructure of photorealistic simulators.
However, because rasterization discards texture and lighting, scenarios where visual cues (e.g., traffic‑light color changes, weather‑induced visibility loss, subtle road‑mark variations) are critical may still pose challenges. Future work could integrate lightweight texture cues or multi‑modal sensors to address these edge cases.
In summary, RAP introduces a fast, annotation‑driven rasterization pipeline, two novel augmentation techniques, and a dual‑level feature alignment module that together achieve state‑of‑the‑art closed‑loop performance on multiple autonomous‑driving benchmarks, offering a cost‑effective alternative to photorealistic rendering for end‑to‑end planning.