What Matters for Scalable and Robust Learning in End-to-End Driving Planners?


Authors: David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, Bernt Schiele

David Holtz 1,2   Niklas Hanselmann 1   Simon Doll 1   Marius Cordts 1   Bernt Schiele 2
1 Mercedes-Benz AG   2 Max Planck Institute for Informatics, SIC

Abstract

End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also on intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves a 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/

1. Introduction

End-to-end autonomous driving (E2E-AD) has recently achieved great progress, driven by the possibility to optimize the entire stack in a planning-oriented manner [24].
Compared to classical approaches with rule-based components, this enables human-like behavior in complex scenarios and promises performance gains that scale with data. While E2E-AD exists in various flavors, popular approaches [12, 24, 29, 52] often implement modular but fully-differentiable transformer-based architectures with latent intermediate representations, such as bird's eye view (BEV) feature grids. Popularized on open-loop benchmarks such as nuScenes [3], these works typically do not evaluate in a closed-loop setting. Unfortunately, approaches optimized for open-loop performance [24, 29, 52] often fail to generalize in closed-loop driving scenarios [9, 28], resulting in a divergence in the directions of architectural advances between works that solely focus on either setting.

Figure 1. Architectural Patterns. (a) High-resolution BEV features facilitate perception tasks, but promote overfitting the planner. (b) Closed-loop methods prefer path over trajectory representations due to robust steering. (c) Point estimates interpolate between trajectory modes that diffusion-based sampling can recover.

In this paper, we take a step towards consolidating these advances in E2E-AD with a focus on closed-loop driving. As depicted in Fig.
1, we systematically extend the design space proposed in ParaDrive [52] by re-examining three architectural patterns: (1) the use of high-resolution perceptual representations as input to the planning module, more common in the open-loop setting [24, 34], (2) disentanglement of trajectories into lateral and longitudinal components, mainly used in closed-loop driving [25, 43], and (3) the use of generative planners, previously underexplored for end-to-end closed-loop driving [8, 36, 53]. By evaluating these patterns jointly, we find that only one configuration admits robust scaling of performance. In particular, we observe that high-resolution perceptual representations, shown to enable state-of-the-art (SotA) performance in open-loop [24, 29], can be susceptible to causal confusion [10], and introduce a spatial bottleneck to mitigate this. Furthermore, we show that disentangled trajectory representations and generative planning via diffusion [36, 53], previously studied only in isolation, provide complementary benefits in modeling multi-modal behavior, and see the strongest scaling properties when using both in conjunction. Building on these insights, we develop BevAD. It achieves SotA closed-loop driving performance on the challenging Bench2Drive benchmark [28] based on the CARLA simulator [13], without any bells and whistles, and using camera sensors only. Additionally, we demonstrate that BevAD strongly benefits from data scaling, with difficult skills emerging as the dataset size increases. In summary, our core contributions are threefold, we: (1) Show that high-resolution perceptual representations can hinder learning robust planning and introduce a spatial bottleneck layer for mitigation. (2) Analyze how a disentangled planning representation and diffusion-based planning provide complementary benefits for closed-loop driving.
(3) Integrate these insights and build BevAD, a lightweight and highly scalable E2E-AD architecture that achieves SotA closed-loop driving on Bench2Drive.

2. Related Work

End-to-end Autonomous Driving. Traditional autonomous driving (AD) stacks integrate standalone modules through compact predefined interfaces [1]. Although offering interpretability, this approach is prone to error accumulation, information loss between modules, and conflicting optimization objectives [24]. In contrast, the E2E-AD paradigm mitigates these limitations by optimizing the entire perception-to-planning pipeline, from raw sensor data to a planning trajectory or physical control commands [2, 24]. UniAD [24] established a foundational framework for modular E2E-AD by introducing query-based mechanisms to jointly optimize perception, prediction, and planning. A diverse landscape of stack-level designs has emerged, primarily categorized along three key dimensions: (1) Task Selection. This encompasses tasks such as 3D object detection [47], multi-object tracking [12, 24], online mapping [24, 29, 47, 52], motion prediction [24, 29], occupancy prediction [24, 49], and, increasingly, language-based tasks like commentary and visual question answering [15, 43]; (2) Network Topology. Information flow is governed by sequential [23], parallel [6, 52], and hybrid [24, 29, 47] module placements, or by unified transformers [48]; (3) Intermediate Representation. These include sparse (e.g., instance-level) [12, 47] and dense (e.g., BEV-centric [52] or occupancy-based [49]) representations.
Despite this extensive architectural landscape, our study uncovers a misconception about a subtle yet pivotal detail within many existing designs: high-resolution intermediate representations, such as an increased number of BEV features, have demonstrated improvements in open-loop perception [34] and are posited to improve planning performance [28]. Conversely, we propose spatially compressing the perceptual representation before planner input and show substantially enhanced closed-loop driving robustness. This finding corroborates observations that other successful closed-loop driving methods often employ architectural simplifications [43, 56], such as relying solely on single front-facing cameras. We systematically shed light on the importance of the perceptual representation design and observe that 360-degree perception works if the intermediate representation is otherwise limited in bandwidth.

Imitation Learning for Planning. Learning to plan with an autonomous vehicle can be broadly categorized into two paradigms: (1) imitation learning (IL) via behavior cloning and (2) reinforcement learning. Research on IL surged after pioneering work [2, 7] and with the availability of large autonomous driving datasets [3, 31] and simulators for closed-loop testing, such as CARLA [13]. Conditioning the driving policy on a signal that represents the driver's intention, in addition to environment observations, has become a pivotal framework [7], allowing control over the policy at test time with navigation commands or target points. However, behavior cloning is prone to introducing undesired side effects such as covariate shift [42] and causal confusion [10], which are hard to detect with open-loop evaluation schemes, even with hardened metrics [9]. Data augmentation [25, 43] and world models [21, 42] help mitigate these effects, though they do not solve them entirely.
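The command-conditioned imitation framework described above can be made concrete with a toy sketch. This is not the paper's implementation: the per-command output heads mirror the branched design of conditional imitation learning [7], and all dimensions, the three-command set, and the names `policy` and `bc_step` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_COMMANDS = 3          # e.g. follow-lane, turn-left, turn-right (assumed set)
OBS_DIM, N_WAYPOINTS = 8, 4

# One linear head per navigation command: observation -> (x, y) waypoints.
heads = rng.normal(scale=0.1, size=(N_COMMANDS, OBS_DIM, N_WAYPOINTS * 2))

def policy(obs, command):
    """Command-conditioned policy: the command selects the output head."""
    return (obs @ heads[command]).reshape(N_WAYPOINTS, 2)

def bc_step(obs, command, expert_plan, lr=1e-2):
    """One behavior-cloning step: mean-L1 imitation loss on expert waypoints."""
    residual = policy(obs, command) - expert_plan
    loss = np.abs(residual).mean()
    # Gradient of the mean-L1 loss w.r.t. the selected head only.
    grad = obs[:, None] * np.sign(residual).reshape(-1)[None, :] / residual.size
    heads[command] -= lr * grad
    return loss

obs = rng.normal(size=OBS_DIM)
expert = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]])
losses = [bc_step(obs, command=1, expert_plan=expert) for _ in range(200)]
assert losses[-1] < losses[0]  # the cloned sample is fitted over time
```

At test time, switching the `command` argument steers the same policy toward different maneuvers, which is the control mechanism the paragraph above refers to.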
Moreover, scaling up training data for IL has been shown to improve planning, following a power-law relationship in open-loop metrics, though these gains saturated in closed-loop driving [40, 54]. Many current E2E-AD solutions operate as point estimators, directly regressing the final plan [6, 47, 56] in waypoint or action spaces, or by selecting from predefined anchors [5]. Both approaches can further benefit from trajectory post-processing [24, 35]. An emerging alternative is generative models, particularly diffusion-based planners, which learn the distribution of future trajectories conditioned on the scene and a given command. At test time, they sample from this conditional distribution. Such diffusion-based planning can fit complex human driving distributions better than point estimators [30, 53], further enabling test-time guidance with driving style preferences [53]. Truncated schedules were proposed to mitigate the computational overhead from iterative denoising [36]. Despite demonstrating strong performance on open-loop benchmarks [36], diffusion-based planners have seen limited adoption in closed-loop methods. Moreover, they have not been investigated in prior data scaling studies [40, 54]. This work systematically compares point estimators and diffusion-based planners in closed-loop driving, revealing superior data scalability for diffusion-based approaches.

Open-loop E2E-AD methods often generate temporal waypoint trajectories [24, 29, 36, 47, 52], thereby entangling lateral and longitudinal control. This representation can lead to sparse and ambiguous supervision, particularly for dynamic, multi-modal intersection scenarios [25]. In contrast, SotA closed-loop methods favor disentangled outputs: a future path independent of time and a target speed for longitudinal control [25, 43, 48, 56].
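The conversion from an entangled temporal trajectory to the disentangled path-plus-speed output can be sketched as follows. This is an illustrative numpy sketch, not the paper's code: linear interpolation stands in for a spline fit, the speed uses a first-order difference quotient, and `disentangle` and its parameters are assumed names.

```python
import numpy as np

def disentangle(waypoints, dt=0.5, step=1.0, n_points=5):
    """Split a temporal trajectory into a spatial path and a target speed.

    waypoints: (N, 2) ego-frame positions sampled every `dt` seconds.
    Returns path points at fixed `step`-meter arc-length intervals (the
    time-independent path) and a scalar target speed.
    """
    seg_len = np.linalg.norm(np.diff(waypoints, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    # Resample at fixed distances instead of fixed time intervals.
    s = np.clip(np.arange(1, n_points + 1) * step, 0.0, arc[-1])
    path = np.stack([np.interp(s, arc, waypoints[:, i]) for i in range(2)], axis=1)
    speed = seg_len[0] / dt  # target speed from the first interval, in m/s
    return path, speed

# Straight drive at 4 m/s: 2 m between waypoints sampled every 0.5 s.
traj = np.array([[0.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [0.0, 8.0]])
path, speed = disentangle(traj)
assert np.allclose(path[:, 1], [1.0, 2.0, 3.0, 4.0, 5.0])  # 1 m spacing
assert np.isclose(speed, 4.0)
```

Sampling at fixed distances keeps the path supervision well-defined even when the expert slows down or stops, which is precisely the ambiguity the temporal waypoint representation suffers from.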
Both generative modeling and disentangled planning representations are patterns for learning multi-modal futures, yet their interaction remains underexplored. Our study reveals strong synergies for scalable and robust end-to-end learning.

3. Revisiting Common Architectural Patterns

We briefly summarize common architectural patterns that were previously studied in isolation and occur predominantly in either the open-loop or the closed-loop setting.
1. High-Resolution Perceptual Representations are employed to improve performance in perception tasks [34], but are primarily studied in open-loop. While [28] provides some evidence for benefits in closed-loop driving, leading closed-loop methods in CARLA inherently employ lower-capacity representations due to their reduced sensor configuration. We aim to systematically re-examine the impact of the BEV size on learning precise representations for closed-loop driving.
2. Disentangled Planning Representations are employed by the top-three closed-loop methods in CARLA as a measure to reduce the ambiguity of multi-modal futures.
3. Generative Planners emerged as a principled measure to model multi-modality in driving, but their study is mainly driven by open-loop methods.
In the remainder of this section, we first introduce our analysis framework (Sec. 3.1), along with the experiment setup (Sec. 3.2). Subsequently, we analyze the impact of the perceptual representation (Sec. 3.3), the patterns for modeling multi-modal futures jointly (Sec. 3.4), and the implications for scaling (Sec. 3.5).

3.1. Analysis Framework

Our analysis framework, as shown in Fig.
2, is built upon ParaDrive [52] for the following key reasons: (1) ParaDrive provides a systematic, module-level architecture for E2E-AD stacks, offering a well-defined foundation; (2) its planner operates independently from auxiliary tasks, facilitating a focused analysis of the perception-planner interface; (3) ParaDrive's design, based on a real-world sensor configuration, has demonstrated strong performance on open-loop nuScenes tasks, unlike CARLA-specific methods. Our framework prioritizes training efficiency to scale beyond nuScenes [3] to larger imitation learning datasets for CARLA [43, 56]. This is achieved through a streamlined pipeline (Fig. 2a), optimizing the BEV backbone and removing non-essential auxiliary tasks, while maintaining a realistic sensor setup. Key aspects of the components are stated below; further details are found in the supplementary.

BEV Backbone. The BEV backbone processes images of N_Cam = 6 cameras to produce BEV features F_Bev with dimensions H x W, comprising RADIO [19] with a low-rank adapter [22] as its image backbone and a BEV encoder based on BEVFormer [34]. Significantly improved runtime is achieved by replacing the recurrent BEV feature generation with cached features streamed from short episode snippets during training [51]. Furthermore, we introduce a novel camera augmentation technique, applying a random transformation ^BEV T_car to all camera extrinsics to recover from compounding errors during closed-loop inference.

Auxiliary Tasks. We implement a DETR-style object decoder [4] with deformable cross-attention [55] to supervise the BEV features during training [34]. We pruned other auxiliary tasks as used in [24, 52], as they did not demonstrably improve closed-loop performance during initial tests, but introduced significant runtime overhead.

Planning Head. We adopt a Transformer [50] decoder architecture as depicted in Fig.
2b: Self-attention among planning queries Q_Plan enables mutual alignment, while cross-attention to scene tokens F_Scene allows extraction of global scene features, following [24, 52]. Subsequently, we employ a coarse-to-fine strategy using an optional deformable attention layer, refining Q_Plan by sampling local, high-resolution BEV features F_Bev. Inspired by diffusion transformers [41], each multi-head attention block is enclosed by adaLN-Zero transformations, incorporating conditioning from high-level driving commands, the ego-state, and optionally the diffusion timestep. If the planner is a point estimator, the planning queries Q_Plan are implemented as learnable embeddings. For diffusion-based planners, Q_Plan is generated by adding Gaussian noise to the ground truth according to a diffusion schedule such as DDIM [46] and embedding the result into the transformer's input space.

Figure 2. Analysis Framework. (a) We build our analysis framework on ParaDrive [52]. (b) We introduce a scene tokenizer to reduce the spatial resolution of the BEV features. The design of our planning head is based on a diffusion transformer [41]. Crucially, the choice of the planning queries determines whether the planner is modeled as a point estimator or by diffusion.

By reinterpreting the planning queries as path and velocity
tokens instead of trajectory tokens and adjusting supervision accordingly, we can modify the planning representation. This flexible design enables analysis across different formulations without altering the planner's architecture.

Controller. Following [25, 28, 43], we employ two PID controllers to convert planning outputs into steering and acceleration commands. The disentangled planning representation facilitates PID controller design by allowing separate processing of path and speed [25]. To achieve the same for the trajectory representation, we fit a piecewise cubic Hermite polynomial to the temporal waypoints and interpolate at fixed distances, while speed is derived using a second-order difference quotient. This allows consistent PID controller parameters across representations, minimizing the controller's critical impact on closed-loop driving.

3.2. Experiment Setup

Data Collection. There are currently two popular data sources for expert demonstrations in CARLA [13] for training imitation learning models: (1) Bench2Drive [28] provides a dataset of expert demonstrations collected by the privileged, RL-based Think2Drive [33] expert, along with sensor data and object annotations. (2) The official CARLA leaderboard 2.0 benchmark provides specifications of long routes in CARLA with scenarios alongside, from which a dataset of expert demonstrations can be collected with the privileged, rule-based expert PDM-lite [44, 56]. Simlingo [43] and TF++ [56] split the long routes into shorter segments, each containing one scenario, and uniformly upsample routes with rare scenarios. Due to various known label bugs in the Bench2Drive dataset, we adopt the second approach. We re-collect training data for our six-camera sensor setup using the same route specifications as Simlingo and use these routes for training, unless stated otherwise.

Training.
We conduct all experiments on 8x A100 80GB GPUs with a total batch size of 128 in mixed precision (bfloat16) to balance efficiency, memory usage, and stability. Schedule-free [11] AdamW [38] (learning rate: 2e-4; weight decay: 0.01) is used for optimization. Our training consists of two stages: a warm-up stage over four epochs to initialize the BEV backbone with perception supervision, followed by a second stage that adds planning supervision. For faster convergence, we freeze the BEV backbone for all second-stage experiments except for studies on data scale.

Benchmark and Metrics. We perform closed-loop evaluations on the challenging Bench2Drive benchmark [28] in CARLA [13]. Bench2Drive comprises 220 short test routes, each featuring a single scenario, enabling analysis of specific driving skills. We report the official metrics driving score (DS) and success rate (SR).

3.3. High-Resolution Perceptual Representations

Established BEV-based end-to-end architectures connect perception and planning through H x W high-resolution latent BEV features [24, 52]. We introduce a tokenizer (Fig. 2b) that applies masking and patchifying to compress BEV features F_Bev into scene tokens F_Scene, thereby channeling spatial information through a bottleneck.

Masking. We propose using a key padding mask in the global cross-attention of the planner to exclude BEV cells, such that planning queries Q_Plan cannot attend to them. Our initial experiments tested various masking strategies, such as removing distant parts to the left and right of the ego vehicle, and sophisticated masks based on the map segmentation outputs. No significant differences were observed, so we use the simplest form, masking out 20% of the left- and right-most BEV cells. Although tailored to CARLA maps, this approach helps to determine whether restricting the attention space facilitates learning a robust representation.

Patchifying. Inspired by Vision Transformers [14], we propose pixel unshuffling to combine patches of p x p BEV features F_Bev (pixels) into spatial scene tokens F_Scene, which our planner can globally attend to. We prevent the channel dimension of the scene tokens from growing by p^2 by projecting the output of pixel unshuffling to a lower-dimensional space, thereby enforcing a bottleneck. We explore p in {1, 2, 4, 5}. This analysis aims to understand how forced compression and sequence length in cross-attention impact learning a robust representation.

Results. We employ a 100 x 100 BEV space, consistent with UniAD-tiny [24, 28], and a disentangled point-estimator planner [25], aligning with SotA on Bench2Drive [43, 56]. Tab. 1 presents open- and closed-loop driving metrics for the tokenizer design space. We observe significant improvements in closed-loop driving performance as the scene token count is reduced via masking and patchifying. Specifically, restricting the planner's attention to masked BEV features enhances closed-loop driving, even when the mask is applied solely at test time. Furthermore, summarizing p x p BEV feature patches into scene tokens reduces the planner's token count by a factor of p^2, yielding substantial closed-loop performance gains. Despite the reduced BEV resolution, the L1 trajectory error marginally improves for p <= 4. However, this compression strategy collapses for p >= 5, resulting in a significant drop in both closed-loop and open-loop performance.

Discussion. Transformer-based models are known to struggle with identifying relevant information in long (text) sequences, even with modest token counts [32]. We relate this challenge to our setting, where high-resolution BEV inputs create long attention contexts. We hypothesize that the planner overfits to spurious correlations in training data by deriving actions from memorized visual landmarks. Fig.
3 visualizes qualitative examples of planning query mean cross-attention activations. In the absence of masking and patching, it reveals numerous punctual, high-activation patterns in distant, often occluded or irrelevant BEV regions, strongly indicating causal confusion. These learned shortcuts are not measurable by open-loop metrics like L1 due to averaging, but lead to catastrophic failures in distinct situations at test time. By reducing the token count through masking and patchifying, our approach mitigates this causal confusion, significantly enhancing closed-loop driving by learning a more robust representation for test time.

Table 1. Impact of Tokenizer Design. Masking: restricting planning queries' attention benefits closed-loop driving. Patchifying: aggregating p x p BEV features into scene tokens significantly enhances closed-loop driving for p <= 4. Notably, these closed-loop improvements are not reflected in the open-loop L1 trajectory error metric. Legend: † indicates test-time masking.

Mask  Patch Size  Scene Tokens  DS ↑   SR ↑   L1 (m) ↓
✗     1           100 x 100     66.86  36.36  1.45
✓†    1           100 x 60      71.79  41.37  1.45
✓     1           100 x 60      72.40  41.97  1.53
✓     2           50 x 30       74.98  48.18  1.43
✓     4           25 x 15       82.62  57.43  1.43
✓     5           20 x 12       66.44  40.91  1.73

Our finding contrasts with prior studies suggesting that higher BEV resolutions enhance downstream tasks such as 3D object detection [34]. This discrepancy stems from a fundamental difference between local detection and global planning tasks. DeformableDETR-style detection heads leverage object locality by decoding queries to specific reference points [34, 55]. While increased BEV resolution enhances localization precision, it does not expand a single query's receptive field. In contrast, planning requires understanding critical scene elements that may not be localized near the immediate trajectory, thus necessitating global cross-attention [24, 52].
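To make the token-count arithmetic of Sec. 3.3 concrete, the tokenizer's masking and patchifying steps can be sketched in numpy. This is a simplified illustration under assumptions: cropping stands in for the key-padding mask, a fixed random matrix stands in for the learned projection, and `tokenize_bev`, `keep_frac`, and `d_token` are hypothetical names.

```python
import numpy as np

def tokenize_bev(f_bev, p=4, keep_frac=0.6, d_token=64, rng=None):
    """Compress BEV features (H, W, C) into a reduced set of scene tokens.

    Masking keeps only the central `keep_frac` of columns (the paper instead
    masks the 20% left- and right-most cells via a key-padding mask);
    patchifying groups p x p cells via pixel unshuffle and projects them to
    `d_token` channels to enforce the bottleneck.
    """
    H, W, C = f_bev.shape
    # 1) Drop the left- and right-most BEV columns.
    w_keep = int(W * keep_frac)
    lo = (W - w_keep) // 2
    masked = f_bev[:, lo:lo + w_keep]
    # 2) Pixel unshuffle: (H, W', C) -> (H/p, W'/p, C * p^2).
    h, w = H // p, w_keep // p
    patches = masked[:h * p, :w * p].reshape(h, p, w, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(h, w, C * p * p)
    # 3) Linear projection enforcing the channel bottleneck.
    rng = rng or np.random.default_rng(0)
    proj = rng.normal(scale=(C * p * p) ** -0.5, size=(C * p * p, d_token))
    return (patches @ proj).reshape(h * w, d_token)

tokens = tokenize_bev(np.ones((100, 100, 32)), p=4, keep_frac=0.6)
assert tokens.shape == (25 * 15, 64)  # 100x100 BEV cells -> 375 scene tokens
```

With p = 4 and masking, the planner cross-attends over 375 tokens instead of 10,000 BEV cells, the configuration corresponding to the best-performing row of Tab. 1.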
In this global context, increasing BEV resolution expands the attention context size, contributing to the observed performance degradation.

3.4. Modeling Multi-Modal Behavior

The problem of inherent multi-modality in driving behavior is well known in research [25, 53]. Leading closed-loop methods in CARLA address this with a disentangled output representation that separates the spatial path from the speed profile instead of entangling them in a trajectory of temporal waypoints [43, 48, 56]. Points on the path are obtained by sampling at fixed distances instead of fixed time intervals, which was shown to be less ambiguous, providing better supervision [25]. Meanwhile, diffusion models [20] can natively address the multi-modality in entangled temporal trajectories with generative modeling [36, 53]. At first glance, both patterns appear to solve a similar problem. To discern the individual contributions and potential synergies of trajectory representation and (non-)generative modeling, we systematically evaluate all four combinations. As we observe that the DS and SR tend to obscure the distinct symptoms of driving failures, we additionally introduce static and dynamic infraction rates IR_s and IR_d for this experiment. In a nutshell, IR_s and IR_d capture the prevalence of failures due to wrong path planning and inappropriate acceleration, respectively; details can be found in the supplementary.

Figure 3. Qualitative visualization of the planning queries' cross-attention to BEV features. Fig. 3a (p = 1, without masking; v = 8.2 m/s): the planner attends to distant BEV cells; despite strong attention on the traffic light, the autonomous vehicle runs the red light. Fig. 3b (p = 1, with masking; v = 0 m/s): there are numerous attention spikes to random BEV cells, but barely any attention to the oncoming traffic. Fig.
3c (p = 4, with masking; v = 0 m/s): the attention map significantly simplifies and exhibits fewer attention outliers.

Table 2. Comparison of modeling and planning representation. Legend: PE: point estimator, DI: diffusion, T: trajectory (entangled), P+S: path and speed (disentangled) representation.

Model  Repr.  DS ↑        SR ↑        IR_s ↓  IR_d ↓
PE     T      77.2 ± 0.6  51.7 ± 0.7  0.185   0.505
PE     P+S    82.6 ± 1.0  57.4 ± 0.6  0.055   0.447
DI     T      80.7 ± 3.0  56.2 ± 4.7  0.147   0.391
DI     P+S    81.8 ± 1.5  59.4 ± 1.0  0.094   0.423

Results. As shown in Tab. 2, the disentangled representation significantly reduces static infractions, regardless of the modeling approach. Particularly for point estimators, this reflects strongly in the overall closed-loop scores, matching prior studies [25]. We conclude that the disentangled representation is favorable for learning robust steering. Generative modeling with diffusion reduces dynamic infractions, regardless of trajectory representation. As a result, the entangled diffusion-based variant achieves a similar overall SR to the disentangled point estimator, though their failure modes are quite different. Further, we observe complementary benefits from employing both diffusion-based modeling and the disentangled representation, indicated by the highest overall SR. The lower driving score stems from -1.6% route completion, since the diffusion model is less willing to make an infraction for the sake of route progress.

3.5. Diminishing Returns when Scaling Non-Generative Planners

The promise of scaling performance with data is one of the main advantages of E2E-AD. Since generative planning can capture the full distribution of behavior, we hypothesize that it shows stronger benefits from scaling the dataset size. In the following, we hence examine the scaling behavior of diffusion- compared to point-estimator-based planning.

Scaling Data.
To scale training data beyond current datasets [28, 43], we build a route generator that exhaustively plans short, semantically plausible, single-scenario routes in all CARLA towns. While it is capable of building > 10^6 unique route-scenario combinations (see supplementary), we only consider 8,000 uniformly sampled scenarios as additional training data in our scaling experiments. For conducting data scaling experiments, we leverage established protocols from [40, 54]: we create five training splits from the joint set of Simlingo's [43] and our routes, each approximately doubling in size. These splits are cumulative (each being a subset of the larger ones [40]), maintaining the same scenario distribution across all splits. We train all models on each data scaling split until convergence.

Results. As reported in Fig. 4, both variants improve monotonically in SR as we double the training dataset. In the low-data regime, the point estimator slightly outperforms the diffusion-based planner. After an inflection point (about 8,000 training scenes), the growth rate decelerates, matching prior studies on closed-loop scaling laws for point-estimate planners [54]. On the other hand, diffusion-based planning maintains its linear rate of improvement until our largest data scaling split, and thereby substantially outperforms the point-estimator counterpart. Interestingly, we cannot observe any saturation for the diffusion-based planner, unlike [40, 54] reported for closed-loop tests with point-estimator regressors. This opens up opportunities for further improvements in the presence of larger datasets.

Figure 4. Scaling Properties.
Diffusion demonstrates superior performance over point estimators when scaled with sufficient training data, despite initially underperforming with limited data.

Emerging Skills. The scaling gains can also be broken down in terms of the multi-ability evaluation protocol from Bench2Drive [28], where we observe that difficult skills emerge at larger training split sizes. For example, this is the case for the Give Way and Merging skills, as required for the yielding scenario depicted in Fig. 5. Detailed results can be found in the supplementary.

4. BevAD

Integrating the above insights, BevAD emerges as a lightweight and highly scalable E2E-AD architecture from our analysis framework in Fig. 2. It applies synergies of architectural patterns, previously studied in isolation, for dealing with multi-modality in driving, and combats overfitting with effective BEV compression. We compare BevAD to previous state of the art in CARLA [13], providing a quantitative demonstration of BevAD's results (Sec. 4.1) along with qualitative results (Sec. 4.2) and real-world experiments on NAVSIM [9] (Sec. 4.3).

4.1. Comparison to State of the Art

A comprehensive comparison of BevAD to other methods on Bench2Drive with respect to training data, sensor configuration, supervision signals, and performance can be found in Tab. 3. For a fair comparison, our model is denoted as BevAD-S when trained solely on Simlingo routes [43], and BevAD-M when trained with additional routes from our scaling study. We consider UniAD [24] and VAD [29] as baselines since their module-level architecture is most similar to BevAD. We report the overall closed-loop driving score and success rate on the 220 test routes of Bench2Drive [28].
The significant improvements of +34.8 DS and +38.9 SR of BevAD-S compared to UniAD highlight the effectiveness of our tokenization as well as the complementary benefits of the disentangled output representation and the diffusion-based policy. By uniformly scaling up training scenarios, BevAD-M outperforms all prior methods in terms of DS and SR, as well as the concurrent BridgeDrive [37] in terms of DS. We refer to the supplementary for the more fine-grained multi-ability evaluation [28] and an analysis of driving skill evolution.

4.2. Qualitative Results

In the challenging YieldToEmergencyVehicle scenario, BevAD demonstrates the ability to yield to a rapidly approaching emergency vehicle from behind. Fig. 5 illustrates that BevAD acquires this skill after scaling up training data. Prior methods failed in such scenarios, either due to the lack of 360-degree camera perception [43] or insufficient training data [48]. This underscores BevAD's effective utilization of its surround-view BEV perception and its scalability. Additional qualitative closed-loop demonstrations are provided in the supplementary material.

Figure 5. Yield to Emergency Vehicle. By increasing the training dataset size, BevAD-M learns to yield to emergency vehicles (red) on highways by safely merging into slower traffic. This capability is absent at smaller data scales (BevAD-S) and in prior leading closed-loop methods [43, 48]. (Panels: (a) BevAD-S, shown for all t; (b) BevAD-M at t1 = 10 s and t2 = 14 s.)

Failure Cases. We analyze common failure modes of BevAD-M: (1) Red Light Infractions occur in 19% of unsuccessful closed-loop runs. For example, the driving model runs a red light in the PedestrianCrossing scenario after pedestrians have crossed, suggesting causal confusion. (2) Route Deviations occur when BevAD ignores lane change commands, causing incorrect exits on multi-lane roads.
We attribute this to weak conditioning signals from navigation commands, which are often insufficient for timely lane changes. Strengthening conditioning with target points can mitigate this issue by guiding the model towards the correct lane center, similar to [25], though it increases reliance on precise map localization. (3) Miscellaneous Collisions result from delayed reactions in time-critical scenarios or occur in situations that involve strong interaction with other vehicles, such as merging into flows.

| Method | Expert | Sensors | Labels | Driving Score ↑ | Success Rate ↑ |
|---|---|---|---|---|---|
| VAD [29] | Think2Drive | 6x CAM | O, M | 42.35 | 15.00 |
| UniAD [24] | Think2Drive | 6x CAM | O, M | 45.81 | 16.36 |
| ThinkTwice [27] | Think2Drive | 6x CAM, LiDAR | O | 62.44 | 31.23 |
| DriveAdapter [26] | Think2Drive | 6x CAM, LiDAR | O | 64.22 | 33.08 |
| Hydra-NeXt [35] | Think2Drive | 2x CAM | - | 73.86 | 50.00 |
| Orion [15] | Think2Drive | 6x CAM | O, L | 77.74 | 54.62 |
| TF++ [56] | PDM-lite | 1x CAM, LiDAR | O, M, S, D | 84.21 | 67.27 |
| Simlingo [43] | PDM-lite | 1x CAM | L | 85.07 ± 0.95 | 67.27 ± 2.11 |
| Hip-AD [48] | Think2Drive | 6x CAM | O, M | 86.77 | 69.09 |
| BridgeDrive† [37] | PDM-lite | 1x CAM, LiDAR | O, M, S, D | 86.87 | 72.27 |
| BevAD-S (ours) | PDM-lite | 6x CAM | O | 80.63 ± 1.76 | 55.30 ± 2.63 |
| BevAD-M (ours) | PDM-lite | 6x CAM | O | 88.11 ± 0.98 | 72.73 ± 1.98 |

Table 3. Closed-loop Results on Bench2Drive. Despite its simpler design, BevAD outperforms the previous modular baselines UniAD and VAD by a large margin, reaching SOTA-level performance. We highlight that BevAD gains further substantial driving performance by uniformly scaling up training data. Legend: O: 3D Object Detection, M: Map, S: Semantic Segmentation, D: Depth, L: Language, †: concurrent work. If available, we report mean and standard deviation over three seeds to account for the randomness in CARLA.

4.3. Real-world Experiments

We evaluate our method's real-world applicability on the NAVSIM planning benchmark [9].
To match NAVSIM's expected planning representation, we adapt the diffusion planner to predict trajectories with associated yaw angles over a four-second horizon, and train BevAD end-to-end on the navtrain split for eight epochs. We summarize performance on the navtest split in Tab. 4, using the official NAVSIM metrics. BevAD outperforms the representative baselines UniAD [24] and ParaDrive [52] by 3.2 and 2.6 PDMS, respectively, primarily due to improvements in drivable area compliance (DAC) and ego progress (EP). Notably, BevAD achieves this performance with only object detection and planning supervision, in contrast to baselines that also leverage online-mapping and occupancy-prediction supervision. BevAD's lightweight design yields a 570 GPU-hour (A100-80GB) training compute budget, 10x less than ParaDrive [9].

Furthermore, we ablate the tokenizer design on real-world data: As shown in Tab. 4, removing masking degrades overall performance by 0.7 PDMS, while removing patchifying (p = 1) results in a 1.0 PDMS degradation. This demonstrates the effective generalization of our masking and tokenizing scheme to a real-world setting.

| Method | NC ↑ | DAC ↑ | TTC ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|
| UniAD [24] | 97.8 | 91.9 | 92.9 | 78.8 | 83.4 |
| ParaDrive [52] | 97.9 | 92.4 | 93.0 | 79.3 | 84.0 |
| BevAD (ours) | 98.1 | 95.3 | 94.5 | 80.5 | 86.6 |
| BevAD w/o mask | 97.8 | 94.9 | 93.7 | 80.3 | 85.9 (-0.7) |
| BevAD w/o patchifying | 97.9 | 94.4 | 94.4 | 79.7 | 85.6 (-1.0) |

Table 4. Real-world Results on NAVSIM. Performance comparison of BevAD against baseline methods on the real-world NAVSIM benchmark (navtest), including key ablations. Legend: NC: no at-fault collision, DAC: drivable area compliance, TTC: time-to-collision, EP: ego progress, PDMS: PDM score.

5. Conclusion and Limitations

We presented BevAD, a lightweight and highly scalable E2E-AD model that achieves SotA closed-loop driving on Bench2Drive.
BevAD emerges from our systematic analysis of common architectural patterns, previously studied in isolation. We show that high-resolution BEV features can lead to overfitting, which we mitigate by forcing the planner to learn through a bottleneck. Additionally, planning with diffusion complements disentangled planning output representations, particularly excelling when scaled with data.

We acknowledge several limitations. First, while compressing the BEV along its spatial dimension significantly improved closed-loop driving, our approach may not directly extend to high-speed highway scenarios, which require long-range perception. A principled, context-adaptive BEV masking strategy remains future work. Second, our analysis of failure cases suggests potential causal confusions. Mitigating these, perhaps by incorporating world knowledge from VLMs or with reinforcement learning, requires further investigation.

Acknowledgments. This work is a result of the joint research project STADT:up (19A22006O). The project is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE), based on a decision of the German Bundestag. The authors are solely responsible for the content of this publication.

References

[1] Sagar Behere and Martin Torngren. A functional architecture for autonomous driving. In First International Workshop on Automotive Software Architecture (WASA), 2015.
[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars, 2016.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), 2020.
[5] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. VADv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv:2402.13243, 2024.
[6] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12878–12895, 2023.
[7] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2018.
[8] Alexander Cui, Sergio Casas, Abbas Sadat, Renjie Liao, and Raquel Urtasun. LookOut: Diverse multi-future prediction and planning for self-driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[9] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In 38th International Conference on Neural Information Processing Systems, 2024.
[10] Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In 33rd International Conference on Neural Information Processing Systems, 2019.
[11] Aaron Defazio, Xingyu (Alice) Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The road less scheduled. In 38th International Conference on Neural Information Processing Systems, 2024.
[12] Simon Doll, Niklas Hanselmann, Lukas Schneider, Richard Schulz, Marius Cordts, Markus Enzweiler, and Hendrik P.A. Lensch. DualAD: Disentangling the dynamic and static world for end-to-end driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[13] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In 1st Annual Conference on Robot Learning, 2017.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[15] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[16] Simon Gerstenecker, Andreas Geiger, and Katrin Renz. PlanT 2.0: Exposing biases and structural flaws in closed-loop driving, 2025.
[17] Tengda Han, Dilara Gokay, Joseph Heyward, Chuhan Zhang, Daniel Zoran, Viorica Pătrăucean, João Carreira, Dima Damen, and Andrew Zisserman. Learning from streaming video with orthogonal gradients. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] Greg Heinrich, Mike Ranzinger, Hongxu Danny Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In 34th International Conference on Neural Information Processing Systems, 2020.
[21] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. In 36th International Conference on Neural Information Processing Systems, 2022.
[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[23] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision (ECCV), 2022.
[24] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[25] Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[26] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[27] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[28] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In 38th International Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[29] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[30] Chiyu Max Jiang, Yijing Bai, Andre Cornman, Christopher Davis, Xiukun Huang, Hong Jeon, Sakshum Kulshrestha, John Lambert, Shuangyu Li, Xuanyu Zhou, Carlos Fuertes, Chang Yuan, Mingxing Tan, Yin Zhou, and Dragomir Anguelov. SceneDiffuser: Efficient and controllable driving simulation initialization and rollout. In 38th International Conference on Neural Information Processing Systems, 2024.
[31] Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, and Holger Caesar. Towards learning-based planning: The nuPlan benchmark for real-world autonomous driving. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
[32] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[33] Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2Drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in CARLA-v2). In European Conference on Computer Vision (ECCV), 2024.
[34] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In European Conference on Computer Vision (ECCV), 2022.
[35] Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M Alvarez. Hydra-NeXt: Robust closed-loop driving with open-loop training. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[36] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, and Xinggang Wang. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[37] Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, and Hao Yang. BridgeDrive: Diffusion bridge policy for closed-loop trajectory planning in autonomous driving, 2025.
[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
[39] Calvin Luo. Understanding diffusion models: A unified perspective, 2022.
[40] Alexander Naumann, Xunjiang Gu, Tolga Dimlioglu, Mariusz Bojarski, Alperen Degirmenci, Alexander Popov, Devansh Bisla, Marco Pavone, Urs Müller, and Boris Ivanovic. Data scaling laws for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025.
[41] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[42] Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, Stan Birchfield, and Nikolai Smolyanskiy. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models. In IEEE International Conference on Robotics and Automation (ICRA), 2025.
[43] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[44] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision (ECCV), 2025.
[45] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[46] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[47] Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. SparseDrive: End-to-end autonomous driving via sparse scene representation. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
[48] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. HiP-AD: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[49] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, and Hongyang Li. Scene as occupancy. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In 31st International Conference on Neural Information Processing Systems, 2017.
[51] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[52] Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-Drive: Parallelized architecture for real-time autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[53] Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Diffusion-based planning for autonomous driving with flexible guidance. In International Conference on Learning Representations, 2025.
[54] Yupeng Zheng, Pengxuan Yang, Zhongpu Xia, Qichao Zhang, Yuhang Zheng, Songen Gu, Bu Jin, Teng Zhang, Ben Lu, Chao Han, Xianpeng Lang, and Dongbin Zhao. Data scaling laws for imitation learning-based end-to-end autonomous driving, 2025.
[55] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.
[56] Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, and Kashyap Chitta. Hidden biases of end-to-end driving datasets, 2024.

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?
Supplementary Material

This supplementary material details our data scaling procedure (Sec. A), presents implementation details for our analysis framework and BevAD (Sec. B), and includes additional quantitative and qualitative results (Sec. C).

A. Dataset

A.1. Bench2Drive

Bench2Drive [28] offers expert demonstrations from the Think2Drive expert [33], comprising over 13,000 episodes for training. However, the adoption of the dataset within the community is limited due to missing sensor modalities and language annotations [37, 43, 56]. Furthermore, we identify significant 3D label flaws, as depicted in Fig. 6: (1) unlabeled parked vehicles in all CARLA towns except Town12 and Town13, (2) incomplete annotations for multi-lightbox traffic signals, and (3) misplaced bounding boxes for numerous traffic signs and pedestrians. These flaws create contradictory object detection supervision and enable planner shortcut learning, e.g., by distinguishing static cars from dynamic cars. Critically, the lack of public route files or open-source expert code hinders dataset extension and issue resolution. We therefore follow [37, 43, 56] and utilize the CARLA [13] Leaderboard 2.0 training routes and the rule-based PDM-lite expert [44, 56] for data collection.

A.2. Route Generator

Apart from the comprehensive Bench2Drive dataset, current CARLA imitation learning is limited by the diversity of expert demonstrations.
SotA methods [37, 43, 56] utilize short, single-scenario route segments from CARLA Leaderboard 2.0 for training. However, this approach suffers from significantly imbalanced scenario distributions (e.g., the scenario InterurbanAdvancedActorFlow appears only six times, while others appear 40 times). While these methods employ upsampling of rare routes and extensive camera and weather augmentations, this mitigation is limited. Fixed geographic contexts and similar actor behaviors can lead to overfitting to spurious correlations [16].

To overcome these limitations, we build a novel route generator that creates unique route-scenario combinations within CARLA towns, enabling extensive and diverse data collection. Our approach maintains the established concept of short, single-scenario training routes, which facilitates fine-grained control over the scenario distribution; for simplicity, we adopt a uniform distribution. The route generation algorithm proceeds in three steps: (1) Trigger Point Selection, (2) Route Planning and (3) Scenario Generation.

Figure 6. Label Flaws. Bench2Drive exhibits incomplete and erroneous 3D bounding box annotations in various scenes.

Trigger Point Selection. We employ a coarse-to-fine search strategy for trigger point selection. Initially, scenarios are mapped to one of three coarse location classes: Intersection, No-Intersection, or Highway Ramp. Subsequently, trigger points meeting these initial criteria are exhaustively sampled from all CARLA maps. This candidate list is then refined using scenario-specific criteria. For Intersections, relevant features include traffic lights, stop signs, turning options, bike lanes, and pedestrian crossings. For No-Intersections, we consider the number of adjacent lanes (with/against ego traffic flow) and the presence of parking or shoulder lanes. Highway Ramps are differentiated into On- and Off-Ramps.
For instance, the SignalizedJunctionLeftTurn scenario necessitates a signalized intersection with a left-turn option. The VehicleOpensDoorTwoWays scenario requires a right-side parking lane for the adversary and an adjacent lane with oncoming traffic. Conversely, the Accident scenario demands a right-side shoulder lane and a left-side lane with traffic flowing in the same direction. These examples are visualized in Fig. 7.

Route Planning. We plan routes through a trigger point using a bidirectional search on the lane graph to determine start and end points. This search is constrained to avoid additional intersections. Distances between the start point, trigger point, and end point are randomly sampled from scenario-dependent intervals. This strategy enhances variance and mitigates the learning of distance- and location-dependent shortcuts.

Scenario Generation. The final step involves configuring CARLA's pre-defined scenario behavior models at the trigger point. To enhance diversity and prevent spurious correlations, additional scenario parameters are sampled from meaningful intervals, for instance, by varying inter-vehicle distances within traffic flows.

Figure 7. Visualization of Trigger Point Selection Criteria. (a) The SignalizedJunctionLeftTurn scenario requires a signalized junction with a left-turn exit relative to the ego lane. (b) The VehicleOpensDoorTwoWays scenario requires a right-side parking lane for the adversary and an adjacent left-side lane for oncoming traffic. (c) The Accident scenario demands a right-side shoulder lane for the group of blocking vehicles and an adjacent left-side lane with the same traffic flow.

Our route generator produces over 100,000 unique route-scenario combinations across all CARLA maps.
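The three-step procedure above can be sketched as follows. This is a minimal illustration, not our actual generator: the criteria table, the point dictionaries, the sampling intervals, and all function names are hypothetical placeholders, and real trigger points would be queried from CARLA's map API.

```python
import random

# Hypothetical scenario criteria table (coarse location class + fine features).
CRITERIA = {
    "SignalizedJunctionLeftTurn": {"class": "Intersection",
                                   "traffic_light": True, "left_turn": True},
    "Accident": {"class": "No-Intersection",
                 "shoulder_lane": True, "same_dir_neighbor": True},
}

def select_trigger_points(map_points, scenario):
    """Coarse-to-fine filtering: keep only points matching all criteria."""
    wanted = CRITERIA[scenario]
    return [p for p in map_points if all(p.get(k) == v for k, v in wanted.items())]

def generate_route(trigger, scenario, rng):
    """Step 2: sample scenario-dependent distances around the trigger point
    to vary route geometry and avoid distance-dependent shortcuts."""
    d_before = rng.uniform(30.0, 80.0)  # placeholder interval (meters)
    d_after = rng.uniform(30.0, 80.0)
    return {"scenario": scenario, "trigger": trigger,
            "start_offset": -d_before, "end_offset": d_after}

rng = random.Random(0)
points = [  # two candidate intersections; only the first is signalized
    {"class": "Intersection", "traffic_light": True, "left_turn": True},
    {"class": "Intersection", "traffic_light": False, "left_turn": True},
]
triggers = select_trigger_points(points, "SignalizedJunctionLeftTurn")
route = generate_route(triggers[0], "SignalizedJunctionLeftTurn", rng)
```

Step 3 (scenario generation) would then attach CARLA's behavior model to `route`, sampling its remaining parameters from similar intervals.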
Due to the high cost of sensor data collection, we utilize a subset of 8,000 routes for our data scaling experiments. Nevertheless, the generator's extensive diversity remains a significant asset for applications beyond imitation learning, such as reinforcement learning.

B. Implementation Details

B.1. Model

Efficient Streaming Training. The training performance of previous BEV-based E2E-AD architectures is bottlenecked by recurrent BEV feature generation, required for fusing temporal information within the BEV encoder's temporal self-attention layer [34]. For instance, UniAD [24] processes four past frames ($t-4, \ldots, t-1$) at each training step $t$ with a frozen BEV backbone to eventually compute temporal self-attention between $F_{\mathrm{Bev}}^{(t-1)}$ and $F_{\mathrm{Bev}}^{(t)}$. In contrast, we adapt streaming training from object-centric modeling [51] to BEV-based architectures. Specifically, during training, we sample streams of $n$ subsequent frames, feeding them sequentially into the network. A memory component caches the current BEV features $F_{\mathrm{Bev}}^{(t)}$ at each step $t$, enabling subsequent steps ($t' = t + 1$) to retrieve $F_{\mathrm{Bev}}^{(t'-1)} = F_{\mathrm{Bev}}^{(t)}$ from the cache instead of recurrently recomputing it. Short sequences (e.g., 2 s, $n = 20$ frames) are streamed to periodically reset the memory and mitigate non-orthogonal gradients [17] arising from strongly correlated frames. Our approach avoids four additional BEV backbone forward passes (compared to UniAD stage-1) and significantly reduces CPU load by requiring only one set of multi-view images per step, rather than five.

| Method | BEV | FPS (train) ↑ |
|---|---|---|
| UniAD-tiny [24, 28] | frozen (stage-2) | 2.5 |
| BevAD (ours) | frozen | 89.0 |
| BevAD (ours) | end-to-end | 55.0 |

Table 5. Training Efficiency. BevAD achieves substantial training speed-up relative to UniAD-tiny. Measured on 8x A100-80GB.
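A minimal sketch of such a memory component, with plain numpy arrays standing in for the network's BEV tensors (the class name and interface are illustrative, not our actual implementation):

```python
import numpy as np

class BevMemory:
    """Caches the previous step's BEV features during streaming training,
    so temporal self-attention can reuse F_Bev^(t-1) instead of recomputing
    it with extra backbone forward passes. The cache is reset at every
    sequence boundary (here: every n_frames steps)."""

    def __init__(self, n_frames):
        self.n_frames = n_frames
        self.prev_bev = None
        self.step = 0

    def fetch_and_update(self, bev_t):
        if self.step % self.n_frames == 0:
            self.prev_bev = None  # sequence boundary: reset the memory
        prev = self.prev_bev      # F_Bev^(t-1), or None at a stream start
        self.prev_bev = bev_t     # cache the current features for step t+1
        self.step += 1
        return prev

memory = BevMemory(n_frames=20)       # e.g. 2 s streams at 10 Hz
bev_shape = (200 * 200, 256)          # placeholder flattened BEV grid
first_prev = memory.fetch_and_update(np.zeros(bev_shape))
second_prev = memory.fetch_and_update(np.ones(bev_shape))
# first_prev is None (fresh stream); second_prev is the cached frame-0 BEV.
```

In the real model, the cached tensor would be detached or carried with truncated backpropagation; the sketch only shows the caching and reset logic.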
Additionally, we employ RADIO [19] as an image backbone with a low-rank adapter (LoRA) [22] for parameter-efficient finetuning. RADIO provides rich, generic features, and LoRA finetuning significantly reduces VRAM requirements compared to backpropagating through a ResNet [18]. Pruning the additional heads for online-mapping, motion forecasting, and occupancy prediction further reduces BevAD's memory footprint.

Combined, these optimizations enable end-to-end training with a batch size of 16 on a single A100-80GB GPU, a significant improvement over UniAD, which required two-stage training with a frozen backbone due to VRAM constraints. Our optimizations yield significantly higher training sample throughput, even surpassing UniAD-tiny, as shown in Tab. 5. Specifically, BevAD processes 35x more training samples per second with a frozen BEV backbone, and achieves a 22x speed-up in end-to-end training. These optimizations represent a major contribution towards scalable imitation learning for robust closed-loop performance, a benefit we extend to the community through our open-source code release.

Camera Augmentation. Camera augmentations are integral to robust closed-loop imitation learning [25]. Perfect expert drivers, such as PDM-lite, maintain precise lane centering, leading to a training distribution dominated by ideal states. However, during closed-loop testing, accumulated steering errors can cause the vehicle to drift, resulting in covariate shift [42] and degraded planner performance. A common mitigation involves augmenting driver camera views with random shifts and rotations during training [25], as adopted by SotA methods on Bench2Drive [37, 43, 56]. This augmentation simulates out-of-distribution states not observed with perfect expert driving.
However, these shift and rotation augmentations are typically limited to single front-facing camera setups and are challenging to apply to multi-camera systems. Furthermore, their application to real-world data necessitates novel view synthesis, introducing pipeline overhead and potential artifacts. To address these limitations, we propose a novel BEV-based augmentation strategy. Instead of manipulating raw sensor data, we augment the BEV coordinate system directly. This is achieved by sampling a random transformation matrix ${}^{\mathrm{Bev}}T_{\mathrm{Car}}$, comprising a small yaw rotation $\gamma \sim \mathcal{U}[-22.5^\circ, 22.5^\circ]$ and a lateral offset $\Delta y \sim \mathcal{U}[-0.75\,\mathrm{m}, 0.75\,\mathrm{m}]$. This matrix is then applied to all camera transformation matrices ${}^{\mathrm{Car}}T_{\mathrm{Cam}_i}$:

$${}^{\mathrm{Bev}}T_{\mathrm{Cam}_i} = {}^{\mathrm{Bev}}T_{\mathrm{Car}} \cdot {}^{\mathrm{Car}}T_{\mathrm{Cam}_i}$$

By providing ${}^{\mathrm{Bev}}T_{\mathrm{Cam}_i}$ to the BEV encoder, it builds BEV features $F_{\mathrm{Bev}}$ in the augmented BEV coordinate system rather than the vehicle coordinate system. Additionally, we apply this transformation to all ground-truth labels, i.e., 3D bounding boxes and planning trajectories. Our augmentation prevents the detection head and planner from learning a bias towards axis-aligned objects or trajectories relative to the ego vehicle. For example, the disentangled planner must learn to predict the future path in the augmented BEV space, necessitating a robust understanding of the BEV feature grid. Unlike prior camera augmentations, our BEV-based scheme requires no augmented sensor data, as it only alters the generation of the latent BEV representation via a non-learnable, random transformation.

Planning Queries. The planning query's interpretation determines the output representation. For an entangled trajectory representation, we define $P_{\mathrm{Plan}} = P_{\mathrm{Traj}} \in \mathbb{R}^{N_t \times N_c}$, where $N_t$ is the number of trajectory points and $N_c$ is the feature dimension.
For a disentangled representation, P_Plan = [P_Path ∥ P_Speed], with P_Path ∈ R^{N_p × N_c} and P_Speed ∈ R^{N_t × N_c}, where N_p denotes the number of path points. In our experiments, we set N_t = 15 for a 3 s planning horizon, resulting in temporal waypoint and speed predictions spaced at 0.2 s intervals. For path planning, we use N_p = 30 with waypoints spaced 1 m apart.

Planning Head. The planning head (Fig. 2b) comprises N_layer = 8 transformer decoder layers with a feature dimension of N_c = 512. Inspired by [41], we integrate an adaLN-Zero transformation to condition the self-attention and global cross-attention layers with c, as depicted in Fig. 8. For the point estimator, c = emb(command) + emb(v_0), combining a learnable embedding of the high-level navigation command and a sinusoidal embedding of the current velocity. For the diffusion-based planner, c = emb(command) + emb(v_0) + emb(τ), additionally incorporating an embedding of the current diffusion timestep τ.

Figure 8. adaLN-Zero Attention. We employ an adaLN-Zero transformation to condition the transformer on c. For self-attention, we set K = V = Q_Plan, and for cross-attention, we set K = V = F_Scene.

B.2. Diffusion-based Planning

Preliminaries. Diffusion models are a class of generative models that learn to reverse a gradual noising process applied to data [20, 45]. They define a fixed forward Markov chain that progressively adds Gaussian noise to data, transforming it into a pure noise distribution. Specifically, the forward process admits sampling of noisy data x_τ for arbitrary (time)steps τ in the Markov chain in closed form, given clean data x_0, by sampling Gaussian noise ε ∼ N(0, I) [20]:

x_τ(x_0, ε) = √(ᾱ_τ) x_0 + √(1 − ᾱ_τ) ε    (1)

The sequence ᾱ_0, ..., ᾱ_T typically stems from a variance schedule with constant parameters [20, 46].
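As a minimal illustration, the closed-form forward sampling of Eq. (1) can be sketched as follows (the linear ᾱ schedule here is an assumption for demonstration; the actual schedule follows [20, 46]):

```python
import numpy as np

def q_sample(x0, tau, alpha_bar, eps):
    """Forward diffusion, Eq. (1): x_tau = sqrt(abar_tau) x0 + sqrt(1 - abar_tau) eps."""
    a = alpha_bar[tau]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

# Illustrative decreasing schedule abar_0 >= ... >= abar_T (an assumption).
alpha_bar = np.linspace(0.999, 0.01, 50)
rng = np.random.default_rng(0)
x0 = rng.normal(size=(15, 2))        # e.g. 15 normalized trajectory waypoints
eps = rng.normal(size=x0.shape)      # eps ~ N(0, I)

x_early = q_sample(x0, 0, alpha_bar, eps)     # nearly clean sample
x_late = q_sample(x0, 49, alpha_bar, eps)     # nearly pure noise
assert np.allclose(x_early, np.sqrt(0.999) * x0 + np.sqrt(0.001) * eps)
```

Because Eq. (1) is closed-form, training never has to simulate the full noising chain step by step.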
The core of diffusion models lies in learning a reverse Markov chain that iteratively denoises samples over T timesteps, starting from random noise p(x_T) = N(x_T; 0, I), to recover data samples from the original distribution [20]:

p_θ(x_{0:T}) := p(x_T) ∏_{τ=1}^{T} p_θ(x_{τ−1} | x_τ)    (2)

Instead of learning the Gaussian transitions p_θ(x_{τ−1} | x_τ) directly, it is common practice to learn a function approximator f_θ : (x_τ, τ) → x_0, parameterized by a neural network with learnable weights θ [39]. This allows learning the diffusion model by minimizing Eq. (3) for all timesteps τ [20]:

‖x_0 − f_θ(x_τ, τ)‖    (3)

For diffusion-based planning, the clean planning ground truth x_0 is defined based on the planning representation:
• Entangled: a series of normalized trajectory waypoints, x_0 = {(x_i, y_i)}_{i=1}^{N_t}.
• Disentangled: a tuple comprising normalized path waypoints and a normalized speed sequence, i.e., x_0 = ({(x_i, y_i)}_{i=1}^{N_p}, {v_i}_{i=1}^{N_t}).

Training. During training, we sample a timestep τ from a uniform distribution and Gaussian noise ε ∼ N(0, I). We obtain the noisy sample x_τ using Eq. (1) and following the DDIM variance schedule [46]. The planning head serves as the conditional function approximator f_θ : (x̃_τ, τ, z) → x_0 for the reverse process [36], tasked with predicting x_0 from x̃_τ, the diffusion timestep τ, and the conditioning context z = (F_Scene, c). Note that x_τ is embedded into a latent space via an MLP to obtain x̃_τ before being fed to f_θ.

Inference. For inference, we start with a random sample x_T ∼ N(0, I) and iteratively denoise it using the trained function approximator f_θ(x̃_τ, τ, z) and the DDIM sampling algorithm [46]. This process progressively denoises x_T to yield the final clean prediction x_0.
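A minimal sketch of this denoising loop (deterministic DDIM updates with the x_0-parameterization; the conditioning z, the latent embedding, and the learned planning head are abstracted into a placeholder predictor, and the schedule is illustrative):

```python
import numpy as np

def ddim_denoise(f_theta, shape, alpha_bar, rng):
    """Iterative denoising with deterministic DDIM-style updates (sketch).

    f_theta stands in for the conditional planning head: it predicts the
    clean sample x0 from (x_tau, tau). Conditioning z is omitted here.
    """
    T = len(alpha_bar)
    x = rng.normal(size=shape)                      # x_T ~ N(0, I)
    for tau in range(T - 1, -1, -1):
        a = alpha_bar[tau]
        x0_hat = f_theta(x, tau)                    # x0-prediction
        # Noise implied by Eq. (1), carried to the next (less noisy) step.
        eps_hat = (x - np.sqrt(a) * x0_hat) / np.sqrt(1.0 - a)
        if tau > 0:
            a_prev = alpha_bar[tau - 1]
            x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
        else:
            x = x0_hat                              # final clean prediction
    return x

# With an oracle predictor that always returns zeros, the loop yields zeros.
alpha_bar = np.linspace(0.999, 0.01, 50)
out = ddim_denoise(lambda x, t: np.zeros_like(x), (15, 2), alpha_bar,
                   np.random.default_rng(0))
assert np.allclose(out, 0.0)
```

The loop above visits every timestep; DDIM additionally allows skipping most of them, as discussed next.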
We specifically employ DDIM's accelerated generation, which conducts denoising with a subset of S < T denoising steps to enhance computational efficiency during sampling.

DiffusionDrive. We do not adopt DiffusionDrive's [36] truncated diffusion schedule in our experiments. This decision stems from our observation that DiffusionDrive's forward process adds noise to a fixed set of anchor trajectories, while the reverse process aims to predict ground truth trajectories, leading to an asymmetric reversal. This issue is also thoroughly discussed in concurrent work [37], which proposes a theoretically sound diffusion bridge formulation as a solution.

B.3. Controller

We adopt the disentangled PID controllers from Simlingo [43] for lateral and longitudinal control. All controller parameters are retained, with the exception of the steering proportional gain, which is slightly lowered to P_steer = 1.8 to reduce oversteering in our closed-loop agent.

B.4. Loss

The overall loss function for end-to-end training combines terms for 3D object detection and planning:

L = λ_det L_det + λ_plan L_plan    (4)

The detection loss L_det, adopted from [34], includes classification and regression components. The planning loss L_plan varies with the planner's modeling choice:
• For regression-based planning with a point estimator, L_plan is a smooth L1 loss applied to the trajectory error, or to the path and speed errors.
• For diffusion-based planning, L_plan is a smooth L1 loss on the x_0-prediction error, as defined in Eq. (3).

The loss coefficients are set to λ_det = 1 and λ_plan = 100 to balance the magnitudes of the detection and planning loss components.

B.5. Metrics

To better characterize closed-loop failure modes, we introduce auxiliary metrics: the static infraction rate (IR_s) and the dynamic infraction rate (IR_d).
IR_s = (N_layout-collision + N_outside-lane) / N_routes    (5)

This metric quantifies infractions related to lateral control errors, such as collisions with static layout elements or driving outside the lane.

IR_d = (N_actor-collision + N_red-light + N_stop-sign) / N_routes    (6)

This metric captures infractions arising from longitudinal control errors and interactions with dynamic elements, including collisions with actors, running red lights, or failing to stop at stop signs. Collectively, these metrics provide the expected number of infractions per route, offering a granular understanding of control deficiencies.

C. Results

This section presents an ablation study, quantitative results on the Bench2Drive benchmark, and qualitative demonstrations of BevAD's closed-loop driving.

C.1. Ablations

Our ablation studies investigate early design choices. For the following ablations of camera augmentation and BEV size, our baseline is a planning head with a tokenizer (p = 4, masking), disentangled representation, and a point-estimator regressor, as detailed in Sec. 3.3 and Tab. 1. Furthermore, we ablate the number of denoising steps in the diffusion planner using our strongest model, BevAD-M.

Camera Augmentation. To assess the effectiveness of our novel BEV-based augmentation, we compare the baseline against a variant trained without it. As shown in Tab. 6, the absence of camera augmentation significantly degrades closed-loop performance. This degradation primarily stems from a 5.7× increase in static infractions, leading to reduced route completion and increased secondary collisions. These results underscore the critical role of our augmentation scheme in promoting robust driving and enabling recovery from compounding steering errors. The results also highlight the insufficiency of common open-loop metrics, as the L1 trajectory error does not reflect the observed degradation in model robustness.

BEV Size.
Our tokenizer compresses high-resolution BEV features (F_Bev) into low-resolution scene tokens (F_Scene). An alternative is to directly learn a low-resolution BEV feature space. As shown in Tab. 7, despite exposing the same number of scene tokens to the planner, direct low-resolution BEV generation significantly degrades closed-loop performance. Specifically, the deformable BEV-image cross-attention of our BEV encoder (based on BEVFormer [34]) extracts image information more sparsely when generating low-resolution BEVs, potentially omitting fine details. Furthermore, it prevents leveraging deformable refinement layers for sampling local, high-resolution features around the future trajectory. As a result, we observe 7.4× more static infractions with the low-resolution BEV encoder compared to our compression approach.

Table 6. Ablation of camera augmentation. The absence of camera augmentation significantly degrades closed-loop performance, despite minimal impact on open-loop L1 trajectory deviation. This underscores the contribution of our BEV-based augmentation to robust driving and covariate shift mitigation.

Augmentation | DS ↑  | SR ↑  | IR_s ↓ | L1 (m) ↓
✓            | 82.62 | 57.43 | 0.055  | 1.43
✗            | 66.60 | 33.64 | 0.314  | 1.47

Table 7. BEV Resolution and Tokenization. High-resolution BEV generation, compressed via our tokenizer, yields superior closed-loop driving performance compared to direct low-resolution BEV generation.

BEV       | Scene Tokens | DS ↑  | SR ↑  | IR_s ↓
100 × 100 | 25 × 15      | 82.62 | 57.43 | 0.055
25 × 25   | 25 × 15      | 72.18 | 40.36 | 0.409

Table 8. Impact of Denoising Iterations. Reducing denoising steps significantly boosts inference FPS while preserving closed-loop driving performance, enabling real-time applications. FPS measured on a Quadro RTX 8000.

S  | DS ↑  | SR ↑  | FPS ↑
10 | 88.11 | 72.73 | 4.2
5  | 88.33 | 72.72 | 5.8
2  | 88.53 | 72.72 | 7.5
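For intuition, compressing a 100 × 100 BEV grid into 25 × 15 scene tokens can be sketched as follows. Patchify-by-average-pooling and a simple row mask are illustrative assumptions for this sketch only; the actual tokenizer design is described in Sec. 3.3:

```python
import numpy as np

def tokenize_bev(f_bev, p=4, keep_rows=15):
    """Compress a high-resolution BEV grid into scene tokens (sketch).

    Patchify with non-overlapping p x p average pooling, then keep only a
    subset of token rows, illustrating the masking down to 25 x 15 tokens.
    Both the pooling and the mask layout are assumptions of this sketch.
    """
    H, W, C = f_bev.shape
    tokens = f_bev.reshape(H // p, p, W // p, p, C).mean(axis=(1, 3))
    return tokens[:keep_rows]                 # (keep_rows, W // p, C)

f_bev = np.random.default_rng(0).normal(size=(100, 100, 32))
scene = tokenize_bev(f_bev)
assert scene.shape == (15, 25, 32)
```

The point of the ablation is that F_Bev is built at high resolution first, so fine detail exists before compression, whereas a natively low-resolution encoder never observes it.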
This finding underscores the necessity of compressing high-resolution representations rather than directly learning low-resolution ones.

Number of denoising steps. The iterative denoising of diffusion models critically impacts the inference latency in real-world deployments. The runtime of our diffusion-based planner increases linearly with the number of denoising steps S, while the point estimator planner has constant runtime, corresponding to S = 1. We thus evaluate how S affects closed-loop driving performance and inference FPS in Tab. 8. In contrast to DiffusionDrive [36], we achieve constant driving performance for S ∈ {2, 5, 10}, without applying their truncated diffusion framework. We attribute this to a bug in their diffusion schedule, which we detail in the appendix.

Figure 9. Emerging Skills. All skills remain underdeveloped with fewer than 4,000 training episodes. As the volume of training data increases, the skills progressively and uniformly emerge.

C.2. Multi-Ability Evaluation

To gain a nuanced understanding of system performance in closed-loop driving, we employ the fine-granular multi-ability evaluation protocol from Bench2Drive [28]. This protocol defines five advanced urban driving skills: Merging, Overtaking, Emergency Brake, Give Way, and Traffic Sign. Each of the 220 test routes is mapped to one or more skills necessary for successfully navigating the scenario.

Tab. 9 presents a comparison of BevAD's multi-ability scores against prior work. BevAD consistently achieves high scores (> 70%) across all skills, dominating the mean score. In contrast, prior state-of-the-art methods [43, 48] exhibit uneven performance, excelling in some skills while underperforming in others.
BevAD-M surpasses the previous best in the Merging and Give Way skills by +8.17 and +23.34, respectively. These skills demand comprehensive surround perception, underscoring BevAD's effective utilization of its multi-view camera system. However, BevAD lags in the Overtaking, Emergency Brake, and Traffic Sign skills compared to the best prior or concurrent methods for each skill.

Building on the discussion in Sec. 3.5, we present the progression of BevAD's closed-loop driving skills as the training dataset size increases, detailed in Fig. 9. With fewer than 4,000 training episodes, all skills exhibit a success rate below 50%. However, as the training data is doubled and quadrupled, the skills show consistent and uniform improvement. This enhancement is evidenced by high success rates in complex scenarios, such as overtaking amidst oncoming traffic, merging onto highways and into traffic flow at intersections, and yielding to emergency vehicles.

Method            | Merging      | Overtaking   | Emergency Brake | Give Way     | Traffic Sign | Mean
VAD [29]          | 8.11         | 24.44        | 18.64           | 20.00        | 19.15        | 18.07
UniAD-Base [24]   | 14.10        | 17.78        | 21.67           | 10.00        | 14.21        | 15.55
ThinkTwice [27]   | 27.38        | 18.42        | 35.82           | 50.00        | 54.23        | 37.17
DriveAdapter [26] | 28.82        | 26.38        | 48.76           | 50.00        | 56.43        | 42.08
Hydra-NeXt [35]   | 40.00        | 64.44        | 61.67           | 50.00        | 50.00        | 53.22
Orion [15]        | 25.00        | 71.11        | 78.33           | 30.00        | 69.15        | 54.72
TF++ [25]         | 58.75        | 57.77        | 83.33           | 40.00        | 82.11        | 64.39
Simlingo [43]     | 54.01 ± 2.63 | 57.04 ± 3.40 | 88.33 ± 3.34    | 53.33 ± 5.77 | 82.45 ± 4.73 | 67.03 ± 2.12
Hip-AD [48]       | 50.00        | 84.44        | 83.33           | 40.00        | 72.10        | 65.98
BridgeDrive† [37] | 63.50        | 58.89        | 88.34           | 50.00        | 88.95        | 69.93
BevAD-S (ours)    | 55.83 ± 0.72 | 53.33 ± 6.67 | 63.33 ± 1.93    | 46.67 ± 5.77 | 60.88 ± 3.08 | 56.01 ± 3.78
BevAD-M (ours)    | 71.67 ± 2.60 | 74.07 ± 1.29 | 75.56 ± 4.41    | 76.67 ± 5.77 | 75.44 ± 1.61 | 74.68 ± 1.24

Table 9. Multi-Ability Evaluation (ability scores in %, higher is better). BevAD consistently achieves high results across all skills, dominating the mean score.
Notably, it significantly surpasses prior work in the Merging and Give Way skills. However, it exhibits comparatively lower performance in the Overtaking, Emergency Brake, and Traffic Sign skills. Legend: †: concurrent work.

C.3. Qualitative Results

We provide qualitative closed-loop driving examples for each multi-ability skill of BevAD-M in Fig. 10, 11, 12, 13, 14 and 15. The examples are best viewed when zoomed in and visualize planned ego trajectories that are generated by rolling out the predicted speed profile along the predicted path. This depiction aids in understanding the dynamic planning component, but is solely for visualization. It does not serve as controller input, nor does it affect the lateral accuracy of the predicted path.

Figure 10. Merging. BevAD merges from a parallel parking space into traffic. It yields to the rear-end flow of vehicles, identifies a safe gap, and accelerates for seamless merging.

Figure 11. Overtaking (1). BevAD executes an overtaking maneuver when the route is blocked by a construction vehicle. It waits for clear oncoming traffic, then steers into the opposing lane, accelerates to quickly pass the obstacle, and subsequently decelerates when returning to its original lane.

Figure 12. Overtaking (2). BevAD approaches a group of cyclists. It executes a safe left lane change to overtake them. After the maneuver, BevAD returns to its original lane, maintaining a safe distance to the cyclists.

Figure 13. Emergency Brake. BevAD brakes at a green-light intersection due to a pedestrian crossing its left-turn path. Driving resumes upon pedestrian clearance.

Figure 14. Give Way. BevAD yields to an oncoming vehicle encroaching on the ego lane. It performs a controlled rightward deviation from the lane center, staying within road limits, and re-centers once the oncoming traffic has passed.

Figure 15. Traffic Sign. BevAD stops at a stop sign at an intersection with cross-traffic.
It waits for a safe gap, then executes a right turn and merges into the traffic flow.
