What Matters for Scalable and Robust Learning in End-to-End Driving Planners?


Authors: David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, Bernt Schiele

David Holtz 1,2   Niklas Hanselmann 1   Simon Doll 1   Marius Cordts 1   Bernt Schiele 2
1 Mercedes-Benz AG   2 Max Planck Institute for Informatics, SIC

Abstract

End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also on intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves a 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/

1. Introduction

End-to-end autonomous driving (E2E-AD) has recently achieved great progress, driven by the possibility to optimize the entire stack in a planning-oriented manner [24].
Compared to classical approaches with rule-based components, this enables human-like behavior in complex scenarios and promises performance gains that scale with data. While E2E-AD exists in various flavors, popular approaches [12, 24, 29, 52] often implement modular but fully-differentiable transformer-based architectures with latent intermediate representations, such as bird's eye view (BEV) feature grids. Popularized on open-loop benchmarks such as nuScenes [3], these works typically do not evaluate in a closed-loop setting. Unfortunately, approaches optimized for open-loop performance [24, 29, 52] often fail to generalize in closed-loop driving scenarios [9, 28], resulting in a divergence in the directions of architectural advances between works that solely focus on either setting.

Figure 1. Architectural Patterns. (a) High-resolution BEV features facilitate perception tasks, but promote overfitting the planner. (b) Closed-loop methods prefer path over trajectory representations due to robust steering. (c) Point estimates interpolate between trajectory modes that diffusion-based sampling can recover.

In this paper, we take a step towards consolidating these advances in E2E-AD with a focus on closed-loop driving. As depicted in Fig.
1, we systematically extend the design space proposed in ParaDrive [52] by re-examining three architectural patterns: (1) the use of high-resolution perceptual representations as input to the planning module, more common in the open-loop setting [24, 34], (2) disentanglement of trajectories into lateral and longitudinal components, mainly used in closed-loop driving [25, 43], and (3) the use of generative planners, previously underexplored for end-to-end closed-loop driving [8, 36, 53]. By evaluating these patterns jointly, we find that only one configuration admits robust scaling of performance. In particular, we observe that high-resolution perceptual representations, shown to enable state-of-the-art (SotA) performance in open-loop [24, 29], can be susceptible to causal confusion [10], and introduce a spatial bottleneck to mitigate this. Furthermore, we show that disentangled trajectory representations and generative planning via diffusion [36, 53], previously studied only in isolation, provide complementary benefits in modeling multi-modal behavior, and see the strongest scaling properties when using both in conjunction. Building on these insights, we develop BevAD. It achieves SotA closed-loop driving performance on the challenging Bench2Drive benchmark [28] based on the CARLA simulator [13], without any bells and whistles, and using camera sensors only. Additionally, we demonstrate that BevAD strongly benefits from data scaling, with difficult skills emerging as the dataset size increases. In summary, our core contributions are threefold, we: (1) Show that high-resolution perceptual representations can hinder learning robust planning and introduce a spatial bottleneck layer for mitigation. (2) Analyze how a disentangled planning representation and diffusion-based planning provide complementary benefits for closed-loop driving.
(3) Integrate these insights and build BevAD, a lightweight and highly scalable E2E-AD architecture that achieves SotA closed-loop driving on Bench2Drive.

2. Related Work

End-to-end Autonomous Driving. Traditional autonomous driving (AD) stacks integrate standalone modules through compact predefined interfaces [1]. Although offering interpretability, this approach is prone to error accumulation, information loss between modules, and conflicting optimization objectives [24]. In contrast, the E2E-AD paradigm mitigates these limitations by optimizing the entire perception-to-planning pipeline, from raw sensor data to a planning trajectory or physical control commands [2, 24]. UniAD [24] established a foundational framework for modular E2E-AD by introducing query-based mechanisms to jointly optimize perception, prediction, and planning. A diverse landscape of stack-level designs has emerged, primarily categorized along three key dimensions: (1) Task Selection. This encompasses tasks such as 3D object detection [47], multi-object tracking [12, 24], online mapping [24, 29, 47, 52], motion prediction [24, 29], occupancy prediction [24, 49], and, increasingly, language-based tasks like commentary and visual question answering [15, 43]; (2) Network Topology. Information flow is governed by sequential [23], parallel [6, 52], and hybrid [24, 29, 47] module placements, or by unified transformers [48]; (3) Intermediate Representation. These include sparse (e.g., instance-level) [12, 47] and dense (e.g., BEV-centric [52] or occupancy-based [49]) representations.
Despite this extensive architectural landscape, our study uncovers a misconception about a subtle yet pivotal detail within many existing designs: high-resolution intermediate representations, such as an increased number of BEV features, have demonstrated improvements in open-loop perception [34] and are posited to improve planning performance [28]. Conversely, we propose spatially compressing the perceptual representation before planner input and show substantially enhanced closed-loop driving robustness. This finding corroborates observations that other successful closed-loop driving methods often employ architectural simplifications [43, 56], such as relying solely on single front-facing cameras. We systematically shed light on the importance of the perceptual representation design and observe that 360-degree perception works if the intermediate representation is otherwise limited in bandwidth.

Imitation Learning for Planning. Learning to plan with an autonomous vehicle can be broadly categorized into two paradigms: (1) imitation learning (IL) via behavior cloning and (2) reinforcement learning. Research on IL surged after pioneering work [2, 7] and with the availability of large autonomous driving datasets [3, 31] and simulators for closed-loop testing, such as CARLA [13]. Conditioning the driving policy on a signal that represents the driver's intention, in addition to environment observations, has become a pivotal framework [7], allowing control over the policy at test time with navigation commands or target points. However, behavior cloning is prone to introducing undesired side effects such as covariate shift [42] and causal confusion [10], which are hard to detect with open-loop evaluation schemes, even with hardened metrics [9]. Data augmentation [25, 43] and world models [21, 42] help mitigate these effects, though they do not solve them entirely.
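The command-conditioned imitation framework described above can be made concrete with a toy sketch. This is not the paper's implementation: the per-command output heads mirror the branched design of conditional imitation learning [7], and all dimensions, the three-command set, and the names `policy` and `bc_step` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_COMMANDS = 3          # e.g. follow-lane, turn-left, turn-right (assumed set)
OBS_DIM, N_WAYPOINTS = 8, 4

# One linear head per navigation command: observation -> (x, y) waypoints.
heads = rng.normal(scale=0.1, size=(N_COMMANDS, OBS_DIM, N_WAYPOINTS * 2))

def policy(obs, command):
    """Command-conditioned policy: the command selects the output head."""
    return (obs @ heads[command]).reshape(N_WAYPOINTS, 2)

def bc_step(obs, command, expert_plan, lr=1e-2):
    """One behavior-cloning step: mean-L1 imitation loss on expert waypoints."""
    residual = policy(obs, command) - expert_plan
    loss = np.abs(residual).mean()
    # Gradient of the mean-L1 loss w.r.t. the selected head only.
    grad = obs[:, None] * np.sign(residual).reshape(-1)[None, :] / residual.size
    heads[command] -= lr * grad
    return loss

obs = rng.normal(size=OBS_DIM)
expert = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]])
losses = [bc_step(obs, command=1, expert_plan=expert) for _ in range(200)]
assert losses[-1] < losses[0]  # the cloned sample is fitted over time
```

At test time, switching the `command` argument steers the same policy toward different maneuvers, which is the control mechanism the paragraph above refers to.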
Moreover, scaling up training data for IL has been shown to improve planning, following a power-law relationship in open-loop metrics, though these gains saturated in closed-loop driving [40, 54]. Many current E2E-AD solutions operate as point estimators, directly regressing the final plan [6, 47, 56] in waypoint or action spaces, or by selecting from predefined anchors [5]. Both approaches can further benefit from trajectory post-processing [24, 35]. An emerging alternative is generative models, particularly diffusion-based planners, which learn the distribution of future trajectories conditioned on the scene and a given command. At test time, they sample from this conditional distribution. Such diffusion-based planning can fit complex human driving distributions better than point estimators [30, 53], further enabling test-time guidance with driving style preferences [53]. Truncated schedules were proposed to mitigate the computational overhead from iterative denoising [36]. Despite demonstrating strong performance on open-loop benchmarks [36], diffusion-based planners have seen limited adoption in closed-loop methods. Moreover, they have not been investigated in prior data scaling studies [40, 54]. This work systematically compares point estimators and diffusion-based planners in closed-loop driving, revealing superior data scalability for diffusion-based approaches.

Open-loop E2E-AD methods often generate temporal waypoint trajectories [24, 29, 36, 47, 52], thereby entangling lateral and longitudinal control. This representation can lead to sparse and ambiguous supervision, particularly for dynamic, multi-modal intersection scenarios [25]. In contrast, SotA closed-loop methods favor disentangled outputs: a future path independent of time and a target speed for longitudinal control [25, 43, 48, 56].
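The conversion from an entangled temporal trajectory to the disentangled path-plus-speed output can be sketched as follows. This is an illustrative numpy sketch, not the paper's code: linear interpolation stands in for a spline fit, the speed uses a first-order difference quotient, and `disentangle` and its parameters are assumed names.

```python
import numpy as np

def disentangle(waypoints, dt=0.5, step=1.0, n_points=5):
    """Split a temporal trajectory into a spatial path and a target speed.

    waypoints: (N, 2) ego-frame positions sampled every `dt` seconds.
    Returns path points at fixed `step`-meter arc-length intervals (the
    time-independent path) and a scalar target speed.
    """
    seg_len = np.linalg.norm(np.diff(waypoints, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    # Resample at fixed distances instead of fixed time intervals.
    s = np.clip(np.arange(1, n_points + 1) * step, 0.0, arc[-1])
    path = np.stack([np.interp(s, arc, waypoints[:, i]) for i in range(2)], axis=1)
    speed = seg_len[0] / dt  # target speed from the first interval, in m/s
    return path, speed

# Straight drive at 4 m/s: 2 m between waypoints sampled every 0.5 s.
traj = np.array([[0.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [0.0, 8.0]])
path, speed = disentangle(traj)
assert np.allclose(path[:, 1], [1.0, 2.0, 3.0, 4.0, 5.0])  # 1 m spacing
assert np.isclose(speed, 4.0)
```

Sampling at fixed distances keeps the path supervision well-defined even when the expert slows down or stops, which is precisely the ambiguity the temporal waypoint representation suffers from.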
Both generative modeling and disentangled planning representations are patterns for learning multi-modal futures, yet their interaction remains underexplored. Our study reveals strong synergies for scalable and robust end-to-end learning.

3. Revisiting Common Architectural Patterns

We briefly summarize common architectural patterns that were previously studied in isolation and occur predominantly in either the open-loop or the closed-loop setting.
1. High-Resolution Perceptual Representations are employed to improve performance in perception tasks [34], but are primarily studied in open-loop. While [28] provides some evidence for benefits in closed-loop driving, leading closed-loop methods in CARLA inherently employ lower-capacity representations due to their reduced sensor configuration. We aim to systematically re-examine the impact of the BEV size on learning precise representations for closed-loop driving.
2. Disentangled Planning Representations are employed by the top-three closed-loop methods in CARLA as a measure to reduce the ambiguity of multi-modal futures.
3. Generative Planners emerged as a principled measure to model multi-modality in driving, but their study is mainly driven by open-loop methods.
In the remainder of this section, we first introduce our analysis framework (Sec. 3.1), along with the experiment setup (Sec. 3.2). Subsequently, we analyze the impact of the perceptual representation (Sec. 3.3), the patterns for modeling multi-modal futures jointly (Sec. 3.4), and the implications for scaling (Sec. 3.5).

3.1. Analysis Framework

Our analysis framework, as shown in Fig.
2, is built upon ParaDrive [52] for the following key reasons: (1) ParaDrive provides a systematic, module-level architecture for E2E-AD stacks, offering a well-defined foundation; (2) its planner operates independently from auxiliary tasks, facilitating a focused analysis of the perception-planner interface; (3) ParaDrive's design, based on a real-world sensor configuration, has demonstrated strong performance on open-loop nuScenes tasks, unlike CARLA-specific methods. Our framework prioritizes training efficiency to scale beyond nuScenes [3] to larger imitation learning datasets for CARLA [43, 56]. This is achieved through a streamlined pipeline (Fig. 2a), optimizing the BEV backbone and removing non-essential auxiliary tasks, while maintaining a realistic sensor setup. Key aspects of the components are stated below; further details are found in the supplementary.

BEV Backbone. The BEV backbone processes images of N_Cam = 6 cameras to produce BEV features F_Bev with dimensions H x W, comprising RADIO [19] with a low-rank adapter [22] as its image backbone and a BEV encoder based on BEVFormer [34]. Significantly improved runtime is achieved by replacing the recurrent BEV feature generation with cached features streamed from short episode snippets during training [51]. Furthermore, we introduce a novel camera augmentation technique, applying a random transformation ^BEV T_car to all camera extrinsics to recover from compounding errors during closed-loop inference.

Auxiliary Tasks. We implement a DETR-style object decoder [4] with deformable cross-attention [55] to supervise the BEV features during training [34]. We pruned other auxiliary tasks as used in [24, 52], as they did not demonstrably improve closed-loop performance during initial tests, but introduced significant runtime overhead.

Planning Head. We adopt a Transformer [50] decoder architecture as depicted in Fig.
2b: Self-attention among planning queries Q_Plan enables mutual alignment, while cross-attention to scene tokens F_Scene allows extraction of global scene features, following [24, 52]. Subsequently, we employ a coarse-to-fine strategy using an optional deformable attention layer, refining Q_Plan by sampling local, high-resolution BEV features F_Bev. Inspired by diffusion transformers [41], each multi-head attention block is enclosed by adaLN-Zero transformations, incorporating conditioning from high-level driving commands, the ego-state, and optionally the diffusion timestep. If the planner is a point estimator, the planning queries Q_Plan are implemented as learnable embeddings. For diffusion-based planners, Q_Plan is generated by adding Gaussian noise to the ground truth according to a diffusion schedule such as DDIM [46] and embedding the result into the transformer's input space.

Figure 2. Analysis Framework. (a) We build our analysis framework on ParaDrive [52]. (b) We introduce a scene tokenizer to reduce the spatial resolution of the BEV features. The design of our planning head is based on a diffusion transformer [41]. Crucially, the choice of the planning queries determines whether the planner is modeled as a point estimator or by diffusion.

By reinterpreting the planning queries as path and velocity
tokens instead of trajectory tokens and adjusting supervision accordingly, we can modify the planning representation. This flexible design enables analysis across different formulations without altering the planner's architecture.

Controller. Following [25, 28, 43], we employ two PID controllers to convert planning outputs into steering and acceleration commands. The disentangled planning representation facilitates PID controller design by allowing separate processing of path and speed [25]. To achieve the same for the trajectory representation, we fit a piecewise cubic Hermite polynomial to the temporal waypoints and interpolate at fixed distances, while speed is derived using a second-order difference quotient. This allows consistent PID controller parameters across representations, minimizing the controller's critical impact on closed-loop driving.

3.2. Experiment Setup

Data Collection. There are currently two popular data sources for expert demonstrations in CARLA [13] for training imitation learning models: (1) Bench2Drive [28] provides a dataset of expert demonstrations collected by the privileged, RL-based Think2Drive [33] expert, along with sensor data and object annotations. (2) The official CARLA leaderboard 2.0 benchmark provides specifications of long routes in CARLA with scenarios alongside, from which a dataset of expert demonstrations can be collected with the privileged, rule-based expert PDM-lite [44, 56]. Simlingo [43] and TF++ [56] split the long routes into shorter segments, each containing one scenario, and uniformly upsample routes with rare scenarios. Due to various known label bugs in the Bench2Drive dataset, we adopt the second approach. We re-collect training data for our six-camera sensor setup using the same route specifications as Simlingo and use these routes for training, unless stated otherwise.

Training.
We conduct all experiments on 8x A100 80GB GPUs with a total batch size of 128 in mixed precision (bfloat16) to balance efficiency, memory usage, and stability. Schedule-free [11] AdamW [38] (learning rate: 2e-4; weight decay: 0.01) is used for optimization. Our training consists of two stages: a warm-up stage over four epochs to initialize the BEV backbone with perception supervision, followed by a second stage that adds planning supervision. For faster convergence, we freeze the BEV backbone for all second-stage experiments except for studies on data scale.

Benchmark and Metrics. We perform closed-loop evaluations on the challenging Bench2Drive benchmark [28] in CARLA [13]. Bench2Drive comprises 220 short test routes, each featuring a single scenario, enabling analysis of specific driving skills. We report the official metrics driving score (DS) and success rate (SR).

3.3. High-Resolution Perceptual Representations

Established BEV-based end-to-end architectures connect perception and planning through H x W high-resolution latent BEV features [24, 52]. We introduce a tokenizer (Fig. 2b) that applies masking and patchifying to compress BEV features F_Bev into scene tokens F_Scene, thereby channeling spatial information through a bottleneck.

Masking. We propose using a key padding mask in the global cross-attention of the planner to exclude BEV cells, such that planning queries Q_Plan cannot attend to them. Our initial experiments tested various masking strategies, such as removing distant parts to the left and right of the ego vehicle, and sophisticated masks based on the map segmentation outputs. No significant differences were observed, so we use the simplest form, masking out 20% of the left- and right-most BEV cells. Although tailored to CARLA maps, this approach helps to determine whether restricting the attention space facilitates learning a robust representation.

Patchifying. Inspired by Vision Transformers [14], we propose pixel unshuffling to combine patches of p x p BEV features F_Bev (pixels) into spatial scene tokens F_Scene, which our planner can globally attend to. We prevent the channel dimension of the scene tokens from growing by p^2 by projecting the output of pixel unshuffling to a lower-dimensional space, thereby enforcing a bottleneck. We explore p in {1, 2, 4, 5}. This analysis aims to understand how forced compression and sequence length in cross-attention impact learning a robust representation.

Results. We employ a 100 x 100 BEV space, consistent with UniAD-tiny [24, 28], and a disentangled point-estimator planner [25], aligning with SotA on Bench2Drive [43, 56]. Tab. 1 presents open- and closed-loop driving metrics for the tokenizer design space. We observe significant improvements in closed-loop driving performance as the scene token count is reduced via masking and patchifying. Specifically, restricting the planner's attention to masked BEV features enhances closed-loop driving, even when the mask is applied solely at test time. Furthermore, summarizing p x p BEV feature patches into scene tokens reduces the planner's token count by a factor of p^2, yielding substantial closed-loop performance gains. Despite the reduced BEV resolution, the L1 trajectory error marginally improves for p <= 4. However, this compression strategy collapses for p >= 5, resulting in a significant drop in both closed-loop and open-loop performance.

Discussion. Transformer-based models are known to struggle with identifying relevant information in long (text) sequences, even with modest token counts [32]. We relate this challenge to our setting, where high-resolution BEV inputs create long attention contexts. We hypothesize that the planner overfits to spurious correlations in training data by deriving actions from memorized visual landmarks. Fig.
3 visualizes qualitative examples of planning query mean cross-attention activations. In the absence of masking and patching, it reveals numerous punctual, high-activation patterns in distant, often occluded or irrelevant BEV regions, strongly indicating causal confusion. These learned shortcuts are not measurable by open-loop metrics like L1 due to averaging, but lead to catastrophic failures in distinct situations at test time. By reducing the token count through masking and patchifying, our approach mitigates this causal confusion, significantly enhancing closed-loop driving by learning a more robust representation for test time.

Table 1. Impact of Tokenizer Design. Masking: restricting planning queries' attention benefits closed-loop driving. Patchifying: aggregating p x p BEV features into scene tokens significantly enhances closed-loop driving for p <= 4. Notably, these closed-loop improvements are not reflected in the open-loop L1 trajectory error metric. Legend: † indicates test-time masking.

Mask  Patch Size  Scene Tokens  DS ↑   SR ↑   L1 (m) ↓
✗     1           100 x 100     66.86  36.36  1.45
✓†    1           100 x 60      71.79  41.37  1.45
✓     1           100 x 60      72.40  41.97  1.53
✓     2           50 x 30       74.98  48.18  1.43
✓     4           25 x 15       82.62  57.43  1.43
✓     5           20 x 12       66.44  40.91  1.73

Our finding contrasts with prior studies suggesting that higher BEV resolutions enhance downstream tasks such as 3D object detection [34]. This discrepancy stems from a fundamental difference between local detection and global planning tasks. DeformableDETR-style detection heads leverage object locality by decoding queries to specific reference points [34, 55]. While increased BEV resolution enhances localization precision, it does not expand a single query's receptive field. In contrast, planning requires understanding critical scene elements that may not be localized near the immediate trajectory, thus necessitating global cross-attention [24, 52].
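To make the token-count arithmetic of Sec. 3.3 concrete, the tokenizer's masking and patchifying steps can be sketched in numpy. This is a simplified illustration under assumptions: cropping stands in for the key-padding mask, a fixed random matrix stands in for the learned projection, and `tokenize_bev`, `keep_frac`, and `d_token` are hypothetical names.

```python
import numpy as np

def tokenize_bev(f_bev, p=4, keep_frac=0.6, d_token=64, rng=None):
    """Compress BEV features (H, W, C) into a reduced set of scene tokens.

    Masking keeps only the central `keep_frac` of columns (the paper instead
    masks the 20% left- and right-most cells via a key-padding mask);
    patchifying groups p x p cells via pixel unshuffle and projects them to
    `d_token` channels to enforce the bottleneck.
    """
    H, W, C = f_bev.shape
    # 1) Drop the left- and right-most BEV columns.
    w_keep = int(W * keep_frac)
    lo = (W - w_keep) // 2
    masked = f_bev[:, lo:lo + w_keep]
    # 2) Pixel unshuffle: (H, W', C) -> (H/p, W'/p, C * p^2).
    h, w = H // p, w_keep // p
    patches = masked[:h * p, :w * p].reshape(h, p, w, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(h, w, C * p * p)
    # 3) Linear projection enforcing the channel bottleneck.
    rng = rng or np.random.default_rng(0)
    proj = rng.normal(scale=(C * p * p) ** -0.5, size=(C * p * p, d_token))
    return (patches @ proj).reshape(h * w, d_token)

tokens = tokenize_bev(np.ones((100, 100, 32)), p=4, keep_frac=0.6)
assert tokens.shape == (25 * 15, 64)  # 100x100 BEV cells -> 375 scene tokens
```

With p = 4 and masking, the planner cross-attends over 375 tokens instead of 10,000 BEV cells, the configuration corresponding to the best-performing row of Tab. 1.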
In this global context, increasing BEV resolution expands the attention context size, contributing to the observed performance degradation.

3.4. Modeling Multi-Modal Behavior

The problem of inherent multi-modality in driving behavior is well known in research [25, 53]. Leading closed-loop methods in CARLA address this with a disentangled output representation that separates the spatial path from the speed profile instead of entangling them in a trajectory of temporal waypoints [43, 48, 56]. Points on the path are obtained by sampling at fixed distances instead of fixed time intervals, which was shown to be less ambiguous, providing better supervision [25]. Meanwhile, diffusion models [20] can natively address the multi-modality in entangled temporal trajectories with generative modeling [36, 53]. At first glance, both patterns appear to solve a similar problem. To discern the individual contributions and potential synergies of trajectory representation and (non-)generative modeling, we systematically evaluate all four combinations. As we observe that the DS and SR tend to obscure the distinct symptoms of driving failures, we additionally introduce static and dynamic infraction rates IR_s and IR_d for this experiment. In a nutshell, IR_s and IR_d capture the prevalence of failures due to wrong path planning and inappropriate acceleration, respectively; details can be found in the supplementary.

Figure 3. Qualitative visualization of the planning queries' cross-attention to BEV features. Fig. 3a (p = 1, without masking; v = 8.2 m/s): the planner attends to distant BEV cells; despite strong attention on the traffic light, the autonomous vehicle runs the red light. Fig. 3b (p = 1, with masking; v = 0 m/s): there are numerous attention spikes to random BEV cells, but barely any attention to the oncoming traffic. Fig.
3c (p = 4, with masking; v = 0 m/s): the attention map significantly simplifies and exhibits fewer attention outliers.

Table 2. Comparison of modeling and planning representation. Legend: PE: point estimator, DI: diffusion, T: trajectory (entangled), P+S: path and speed (disentangled) representation.

Model  Repr.  DS ↑        SR ↑        IR_s ↓  IR_d ↓
PE     T      77.2 ± 0.6  51.7 ± 0.7  0.185   0.505
PE     P+S    82.6 ± 1.0  57.4 ± 0.6  0.055   0.447
DI     T      80.7 ± 3.0  56.2 ± 4.7  0.147   0.391
DI     P+S    81.8 ± 1.5  59.4 ± 1.0  0.094   0.423

Results. As shown in Tab. 2, the disentangled representation significantly reduces static infractions, regardless of the modeling approach. Particularly for point estimators, this reflects strongly in the overall closed-loop scores, matching prior studies [25]. We conclude that the disentangled representation is favorable for learning robust steering. Generative modeling with diffusion reduces dynamic infractions, regardless of trajectory representation. As a result, the entangled diffusion-based variant achieves a similar overall SR to the disentangled point estimator, though their failure modes are quite different. Further, we observe complementary benefits from employing both diffusion-based modeling and the disentangled representation, indicated by the highest overall SR. The lower driving score stems from -1.6% route completion, since the diffusion model is less willing to make an infraction for the sake of route progress.

3.5. Diminishing Returns when Scaling Non-Generative Planners

The promise of scaling performance with data is one of the main advantages of E2E-AD. Since generative planning can capture the full distribution of behavior, we hypothesize that it shows stronger benefits from scaling the dataset size. In the following, we hence examine the scaling behavior of diffusion- compared to point-estimator-based planning.

Scaling Data.
To scale training data beyond current datasets [28, 43], we build a route generator that exhaustively plans short, semantically plausible, single-scenario routes in all CARLA towns. While it is capable of building > 10^6 unique route-scenario combinations (see supplementary), we only consider 8,000 uniformly sampled scenarios as additional training data in our scaling experiments. For conducting data scaling experiments, we leverage established protocols from [40, 54]: we create five training splits from the joint set of Simlingo's [43] and our routes, each approximately doubling in size. These splits are cumulative (each being a subset of the larger ones [40]), maintaining the same scenario distribution across all splits. We train all models on each data scaling split until convergence.

Results. As reported in Fig. 4, both variants improve monotonically in SR as we double the training dataset. In the low-data regime, the point estimator slightly outperforms the diffusion-based planner. After an inflection point (about 8,000 training scenes), the growth rate decelerates, matching prior studies on closed-loop scaling laws for point-estimate planners [54]. On the other hand, diffusion-based planning maintains its linear rate of improvement until our largest data scaling split, and thereby substantially outperforms the point-estimator counterpart. Interestingly, we cannot observe any saturation for the diffusion-based planner, unlike [40, 54] reported for closed-loop tests with point-estimator regressors. This opens up opportunities for further improvements in the presence of larger datasets.

Figure 4. Scaling Properties.
Diffusion demonstrates superior performance over point estimators when scaled with sufficient training data, despite initially underperforming with limited data.

Emerging Skills. The scaling gains can also be broken down in terms of the multi-ability evaluation protocol from Bench2Drive [28], where we observe that difficult skills emerge at larger training split sizes. For example, this is the case for the Give Way and Merging skills, as required for the yielding scenario depicted in Fig. 5. Detailed results can be found in the supplementary.

4. BevAD

Integrating the above insights, BevAD emerges as a lightweight and highly scalable E2E-AD architecture from our analysis framework in Fig. 2. It applies synergies of architectural patterns, previously studied in isolation, for dealing with multi-modality in driving, and combats overfitting with effective BEV compression. We compare BevAD to previous state of the art in CARLA [13], providing a quantitative demonstration of BevAD's results (Sec. 4.1) along with qualitative results (Sec. 4.2) and real-world experiments on NAVSIM [9] (Sec. 4.3).

4.1. Comparison to State of the Art

A comprehensive comparison of BevAD to other methods on Bench2Drive with respect to training data, sensor configuration, supervision signals, and performance can be found in Tab. 3. For a fair comparison, our model is denoted as BevAD-S when trained solely on Simlingo routes [43], and BevAD-M when trained with additional routes from our scaling study. We consider UniAD [24] and VAD [29] as baselines since their module-level architecture is most similar to BevAD. We report the overall closed-loop driving score and success rate on the 220 test routes of Bench2Drive [28].
The significant improvements of +34.8 DS and +38.9 SR of BevAD-S compared to UniAD highlight the effectiveness of our tokenization as well as the complementary benefits of the disentangled output representation and the diffusion-based policy. By uniformly scaling up training scenarios, BevAD-M outperforms all prior methods in terms of DS and SR, as well as the concurrent BridgeDrive [37] in terms of DS. We refer to the supplementary for the more fine-grained multi-ability evaluation [28] and an analysis of driving skill evolution.

4.2. Qualitative Results

In the challenging YieldToEmergencyVehicle scenario, BevAD demonstrates the ability to yield to a rapidly approaching emergency vehicle from behind. Fig. 5 illustrates that BevAD acquires this skill after scaling up training data. Prior methods failed in such scenarios, either due to the lack of 360-degree camera perception [43] or insufficient training data [48]. This underscores BevAD's effective utilization of its surround-view BEV perception and its scalability. Additional qualitative closed-loop demonstrations are provided in the supplementary material.

Figure 5. Yield to Emergency Vehicle. By increasing the training dataset size, BevAD-M learns to yield to emergency vehicles (red) on highways by safely merging into slower traffic. This capability is absent at smaller data scales (BevAD-S) and in prior leading closed-loop methods [43, 48]. (Panels: (a) BevAD-S, shown for all t; (b) BevAD-M at t1 = 10 s and t2 = 14 s.)

Failure Cases. We analyze common failure modes of BevAD-M: (1) Red Light Infractions occur in 19% of unsuccessful closed-loop runs. For example, the driving model runs a red light in the PedestrianCrossing scenario after pedestrians have crossed, suggesting causal confusion. (2) Route Deviations occur when BevAD ignores lane change commands, causing incorrect exits on multi-lane roads.
We attribute this to weak conditioning signals from navigation commands, which are often insufficient for timely lane changes. Strengthening conditioning with target points can mitigate this issue by guiding the model towards the correct lane center, similar to [25], though it increases reliance on precise map localization. (3) Miscellaneous Collisions result from delayed reactions in time-critical scenarios or occur in situations that involve strong interaction with other vehicles, such as merging into flows.

| Method | Expert | Sensors | Labels | Driving Score ↑ | Success Rate ↑ |
|---|---|---|---|---|---|
| VAD [29] | Think2Drive | 6x CAM | O, M | 42.35 | 15.00 |
| UniAD [24] | Think2Drive | 6x CAM | O, M | 45.81 | 16.36 |
| ThinkTwice [27] | Think2Drive | 6x CAM, LiDAR | O | 62.44 | 31.23 |
| DriveAdapter [26] | Think2Drive | 6x CAM, LiDAR | O | 64.22 | 33.08 |
| Hydra-NeXt [35] | Think2Drive | 2x CAM | - | 73.86 | 50.00 |
| Orion [15] | Think2Drive | 6x CAM | O, L | 77.74 | 54.62 |
| TF++ [56] | PDM-lite | 1x CAM, LiDAR | O, M, S, D | 84.21 | 67.27 |
| Simlingo [43] | PDM-lite | 1x CAM | L | 85.07 ± 0.95 | 67.27 ± 2.11 |
| Hip-AD [48] | Think2Drive | 6x CAM | O, M | 86.77 | 69.09 |
| BridgeDrive† [37] | PDM-lite | 1x CAM, LiDAR | O, M, S, D | 86.87 | 72.27 |
| BevAD-S (ours) | PDM-lite | 6x CAM | O | 80.63 ± 1.76 | 55.30 ± 2.63 |
| BevAD-M (ours) | PDM-lite | 6x CAM | O | 88.11 ± 0.98 | 72.73 ± 1.98 |

Table 3. Closed-loop Results on Bench2Drive. Despite its simpler design, BevAD outperforms the previous modular baselines UniAD and VAD by a large margin, reaching SOTA-level performance. We highlight that BevAD gains further substantial driving performance by uniformly scaling up training data. Legend: O: 3D Object Detection, M: Map, S: Semantic Segmentation, D: Depth, L: Language, †: concurrent work. If available, we report mean and standard deviation over three seeds to account for the randomness in CARLA.

4.3. Real-world Experiments

We evaluate our method's real-world applicability on the NAVSIM planning benchmark [9].
To match NAVSIM's expected planning representation, we adapt the diffusion planner to predict trajectories with associated yaw angles over a four-second horizon, and train BevAD end-to-end on the navtrain split for eight epochs. We summarize performance on the navtest split in Tab. 4, using the official NAVSIM metrics. BevAD outperforms the representative baselines UniAD [24] and ParaDrive [52] by 3.2 and 2.6 PDMS, respectively, primarily due to improvements in drivable area compliance (DAC) and ego progress (EP). Notably, BevAD achieves this performance with only object detection and planning supervision, in contrast to baselines that also leverage online-mapping and occupancy-prediction supervision. BevAD's lightweight design yields a 570 GPU-hour (A100-80GB) training compute budget, 10x less than ParaDrive [9].

Furthermore, we ablate the tokenizer design on real-world data: As shown in Tab. 4, removing masking degrades overall performance by 0.7 PDMS, while removing patchifying (p = 1) results in a 1.0 PDMS degradation. This demonstrates the effective generalization of our masking and tokenizing scheme to a real-world setting.

| Method | NC ↑ | DAC ↑ | TTC ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|
| UniAD [24] | 97.8 | 91.9 | 92.9 | 78.8 | 83.4 |
| ParaDrive [52] | 97.9 | 92.4 | 93.0 | 79.3 | 84.0 |
| BevAD (ours) | 98.1 | 95.3 | 94.5 | 80.5 | 86.6 |
| BevAD w/o mask | 97.8 | 94.9 | 93.7 | 80.3 | 85.9 (-0.7) |
| BevAD w/o patchifying | 97.9 | 94.4 | 94.4 | 79.7 | 85.6 (-1.0) |

Table 4. Real-world Results on NAVSIM. Performance comparison of BevAD against baseline methods on the real-world NAVSIM benchmark (navtest), including key ablations. Legend: NC: no at-fault collision, DAC: drivable area compliance, TTC: time-to-collision, EP: ego progress, PDMS: PDM score.

5. Conclusion and Limitations

We presented BevAD, a lightweight and highly scalable E2E-AD model that achieves SotA closed-loop driving on Bench2Drive.
BevAD emerges from our systematic analysis of common architectural patterns, previously studied in isolation. We show that high-resolution BEV features can lead to overfitting, which we mitigate by forcing the planner to learn through a bottleneck. Additionally, planning with diffusion complements disentangled planning output representations, particularly excelling when scaled with data.

We acknowledge several limitations. First, while compressing the BEV along its spatial dimension significantly improved closed-loop driving, our approach may not directly extend to high-speed highway scenarios, which require long-range perception. A principled, context-adaptive BEV masking strategy remains future work. Second, our analysis of failure cases suggests potential causal confusions. Mitigating these, perhaps by incorporating world knowledge from VLMs or with reinforcement learning, requires further investigation.

Acknowledgments. This work is a result of the joint research project STADT:up (19A22006O). The project is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE), based on a decision of the German Bundestag. The authors are solely responsible for the content of this publication.

References

[1] Sagar Behere and Martin Torngren. A functional architecture for autonomous driving. In First International Workshop on Automotive Software Architecture (WASA), 2015.
[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars, 2016.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), 2020.
[5] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. VADv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv:2402.13243, 2024.
[6] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12878–12895, 2023.
[7] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2018.
[8] Alexander Cui, Sergio Casas, Abbas Sadat, Renjie Liao, and Raquel Urtasun. LookOut: Diverse multi-future prediction and planning for self-driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[9] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In 38th International Conference on Neural Information Processing Systems, 2024.
[10] Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In 33rd International Conference on Neural Information Processing Systems, 2019.
[11] Aaron Defazio, Xingyu (Alice) Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The road less scheduled. In 38th International Conference on Neural Information Processing Systems, 2024.
[12] Simon Doll, Niklas Hanselmann, Lukas Schneider, Richard Schulz, Marius Cordts, Markus Enzweiler, and Hendrik P.A. Lensch. DualAD: Disentangling the dynamic and static world for end-to-end driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[13] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In 1st Annual Conference on Robot Learning, 2017.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[15] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[16] Simon Gerstenecker, Andreas Geiger, and Katrin Renz. PlanT 2.0: Exposing biases and structural flaws in closed-loop driving, 2025.
[17] Tengda Han, Dilara Gokay, Joseph Heyward, Chuhan Zhang, Daniel Zoran, Viorica Pătrăucean, João Carreira, Dima Damen, and Andrew Zisserman. Learning from streaming video with orthogonal gradients. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] Greg Heinrich, Mike Ranzinger, Hongxu Danny Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In 34th International Conference on Neural Information Processing Systems, 2020.
[21] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. In 36th International Conference on Neural Information Processing Systems, 2022.
[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[23] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision (ECCV), 2022.
[24] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[25] Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[26] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[27] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[28] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In 38th International Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[29] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[30] Chiyu Max Jiang, Yijing Bai, Andre Cornman, Christopher Davis, Xiukun Huang, Hong Jeon, Sakshum Kulshrestha, John Lambert, Shuangyu Li, Xuanyu Zhou, Carlos Fuertes, Chang Yuan, Mingxing Tan, Yin Zhou, and Dragomir Anguelov. SceneDiffuser: Efficient and controllable driving simulation initialization and rollout. In 38th International Conference on Neural Information Processing Systems, 2024.
[31] Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, and Holger Caesar. Towards learning-based planning: The nuPlan benchmark for real-world autonomous driving. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
[32] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[33] Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2Drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in CARLA-v2). In European Conference on Computer Vision (ECCV), 2024.
[34] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In European Conference on Computer Vision (ECCV), 2022.
[35] Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M Alvarez. Hydra-NeXt: Robust closed-loop driving with open-loop training. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[36] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, and Xinggang Wang. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[37] Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, and Hao Yang. BridgeDrive: Diffusion bridge policy for closed-loop trajectory planning in autonomous driving, 2025.
[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
[39] Calvin Luo. Understanding diffusion models: A unified perspective, 2022.
[40] Alexander Naumann, Xunjiang Gu, Tolga Dimlioglu, Mariusz Bojarski, Alperen Degirmenci, Alexander Popov, Devansh Bisla, Marco Pavone, Urs Müller, and Boris Ivanovic. Data scaling laws for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025.
[41] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[42] Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, Stan Birchfield, and Nikolai Smolyanskiy. Mitigating covariate shift in imitation learning for autonomous vehicles using latent space generative world models. In IEEE International Conference on Robotics and Automation (ICRA), 2025.
[43] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[44] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision (ECCV), 2025.
[45] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[46] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[47] Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. SparseDrive: End-to-end autonomous driving via sparse scene representation. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
[48] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. HiP-AD: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[49] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, and Hongyang Li. Scene as occupancy. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In 31st International Conference on Neural Information Processing Systems, 2017.
[51] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[52] Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-Drive: Parallelized architecture for real-time autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[53] Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Diffusion-based planning for autonomous driving with flexible guidance. In International Conference on Learning Representations, 2025.
[54] Yupeng Zheng, Pengxuan Yang, Zhongpu Xia, Qichao Zhang, Yuhang Zheng, Songen Gu, Bu Jin, Teng Zhang, Ben Lu, Chao Han, Xianpeng Lang, and Dongbin Zhao. Data scaling laws for imitation learning-based end-to-end autonomous driving, 2025.
[55] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.
[56] Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, and Kashyap Chitta. Hidden biases of end-to-end driving datasets, 2024.

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?
Supplementary Material

This supplementary material details our data scaling procedure (Sec. A), presents implementation details for our analysis framework and BevAD (Sec. B), and includes additional quantitative and qualitative results (Sec. C).

A. Dataset

A.1. Bench2Drive

Bench2Drive [28] offers expert demonstrations from the Think2Drive expert [33], comprising over 13,000 episodes for training. However, the adoption of the dataset within the community is limited due to missing sensor modalities and language annotations [37, 43, 56]. Furthermore, we identify significant 3D label flaws, as depicted in Fig. 6: (1) unlabeled parked vehicles in all CARLA towns except Town12 and Town13, (2) incomplete annotations for multi-lightbox traffic signals, and (3) misplaced bounding boxes for numerous traffic signs and pedestrians. These flaws create contradictory object detection supervision and enable planner shortcut learning, e.g., by distinguishing static cars from dynamic cars. Critically, the lack of public route files or open-source expert code hinders dataset extension and issue resolution. We therefore follow [37, 43, 56] and utilize the CARLA [13] Leaderboard 2.0 training routes and the rule-based PDM-lite expert [44, 56] for data collection.

A.2. Route Generator

Apart from the comprehensive Bench2Drive dataset, current CARLA imitation learning is limited by the diversity of expert demonstrations.
SotA methods [37, 43, 56] utilize short, single-scenario route segments from CARLA Leaderboard 2.0 for training. However, this approach suffers from significantly imbalanced scenario distributions (e.g., the scenario InterurbanAdvancedActorFlow appears only six times, while others appear 40 times). While these methods employ upsampling of rare routes and extensive camera and weather augmentations, this mitigation is limited. Fixed geographic contexts and similar actor behaviors can lead to overfitting to spurious correlations [16].

To overcome these limitations, we build a novel route generator that creates unique route-scenario combinations within CARLA towns, enabling extensive and diverse data collection. Our approach maintains the established concept of short, single-scenario training routes, which facilitates fine-grained control over the scenario distribution; for simplicity, we adopt a uniform distribution. The route generation algorithm proceeds in three steps: (1) Trigger Point Selection, (2) Route Planning and (3) Scenario Generation.

Figure 6. Label Flaws. Bench2Drive exhibits incomplete and erroneous 3D bounding box annotations in various scenes.

Trigger Point Selection. We employ a coarse-to-fine search strategy for trigger point selection. Initially, scenarios are mapped to one of three coarse location classes: Intersection, No-Intersection, or Highway Ramp. Subsequently, trigger points meeting these initial criteria are exhaustively sampled from all CARLA maps. This candidate list is then refined using scenario-specific criteria. For Intersections, relevant features include traffic lights, stop signs, turning options, bike lanes, and pedestrian crossings. For No-Intersections, we consider the number of adjacent lanes (with/against ego traffic flow) and the presence of parking or shoulder lanes. Highway Ramps are differentiated into On- and Off-Ramps.
For instance, the SignalizedJunctionLeftTurn scenario necessitates a signalized intersection with a left-turn option. The VehicleOpensDoorTwoWays scenario requires a right-side parking lane for the adversary and an adjacent lane with oncoming traffic. Conversely, the Accident scenario demands a right-side shoulder lane and a left-side lane with traffic flowing in the same direction. These examples are visualized in Fig. 7.

Route Planning. We plan routes through a trigger point using a bidirectional search on the lane graph to determine start and end points. This search is constrained to avoid additional intersections. Distances between the start point, trigger point, and end point are randomly sampled from scenario-dependent intervals. This strategy enhances variance and mitigates the learning of distance- and location-dependent shortcuts.

Scenario Generation. The final step involves configuring CARLA's pre-defined scenario behavior models at the trigger point. To enhance diversity and prevent spurious correlations, additional scenario parameters are sampled from meaningful intervals, for instance, by varying inter-vehicle distances within traffic flows.

Figure 7. Visualization of Trigger Point Selection Criteria. (a) The SignalizedJunctionLeftTurn scenario requires a signalized junction with a left-turn exit relative to the ego lane. (b) The VehicleOpensDoorTwoWays scenario requires a right-side parking lane for the adversary and an adjacent left-side lane for oncoming traffic. (c) The Accident scenario demands a right-side shoulder lane for the group of blocking vehicles and an adjacent left-side lane with the same traffic flow.

Our route generator produces over 100,000 unique route-scenario combinations across all CARLA maps.
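The three-step procedure above can be sketched as follows. This is a minimal illustration, not our actual generator: the criteria table, the point dictionaries, the sampling intervals, and all function names are hypothetical placeholders, and real trigger points would be queried from CARLA's map API.

```python
import random

# Hypothetical scenario criteria table (coarse location class + fine features).
CRITERIA = {
    "SignalizedJunctionLeftTurn": {"class": "Intersection",
                                   "traffic_light": True, "left_turn": True},
    "Accident": {"class": "No-Intersection",
                 "shoulder_lane": True, "same_dir_neighbor": True},
}

def select_trigger_points(map_points, scenario):
    """Coarse-to-fine filtering: keep only points matching all criteria."""
    wanted = CRITERIA[scenario]
    return [p for p in map_points if all(p.get(k) == v for k, v in wanted.items())]

def generate_route(trigger, scenario, rng):
    """Step 2: sample scenario-dependent distances around the trigger point
    to vary route geometry and avoid distance-dependent shortcuts."""
    d_before = rng.uniform(30.0, 80.0)  # placeholder interval (meters)
    d_after = rng.uniform(30.0, 80.0)
    return {"scenario": scenario, "trigger": trigger,
            "start_offset": -d_before, "end_offset": d_after}

rng = random.Random(0)
points = [  # two candidate intersections; only the first is signalized
    {"class": "Intersection", "traffic_light": True, "left_turn": True},
    {"class": "Intersection", "traffic_light": False, "left_turn": True},
]
triggers = select_trigger_points(points, "SignalizedJunctionLeftTurn")
route = generate_route(triggers[0], "SignalizedJunctionLeftTurn", rng)
```

Step 3 (scenario generation) would then attach CARLA's behavior model to `route`, sampling its remaining parameters from similar intervals.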
Due to the high cost of sensor data collection, we utilize a subset of 8,000 routes for our data scaling experiments. Nevertheless, the generator's extensive diversity remains a significant asset for applications beyond imitation learning, such as reinforcement learning.

B. Implementation Details

B.1. Model

Efficient Streaming Training. The training performance of previous BEV-based E2E-AD architectures is bottlenecked by recurrent BEV feature generation, required for fusing temporal information within the BEV encoder's temporal self-attention layer [34]. For instance, UniAD [24] processes four past frames ($t-4, \ldots, t-1$) at each training step $t$ with a frozen BEV backbone to eventually compute temporal self-attention between $F_{\mathrm{Bev}}^{(t-1)}$ and $F_{\mathrm{Bev}}^{(t)}$. In contrast, we adapt streaming training from object-centric modeling [51] to BEV-based architectures. Specifically, during training, we sample streams of $n$ subsequent frames, feeding them sequentially into the network. A memory component caches the current BEV features $F_{\mathrm{Bev}}^{(t)}$ at each step $t$, enabling subsequent steps ($t' = t + 1$) to retrieve $F_{\mathrm{Bev}}^{(t'-1)} = F_{\mathrm{Bev}}^{(t)}$ from the cache instead of recurrently recomputing it. Short sequences (e.g., 2 s, $n = 20$ frames) are streamed to periodically reset the memory and mitigate non-orthogonal gradients [17] arising from strongly correlated frames. Our approach avoids four additional BEV backbone forward passes (compared to UniAD stage-1) and significantly reduces CPU load by requiring only one set of multi-view images per step, rather than five.

| Method | BEV | FPS (train) ↑ |
|---|---|---|
| UniAD-tiny [24, 28] | frozen (stage-2) | 2.5 |
| BevAD (ours) | frozen | 89.0 |
| BevAD (ours) | end-to-end | 55.0 |

Table 5. Training Efficiency. BevAD achieves substantial training speed-up relative to UniAD-tiny. Measured on 8x A100-80GB.
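A minimal sketch of such a memory component, with plain numpy arrays standing in for the network's BEV tensors (the class name and interface are illustrative, not our actual implementation):

```python
import numpy as np

class BevMemory:
    """Caches the previous step's BEV features during streaming training,
    so temporal self-attention can reuse F_Bev^(t-1) instead of recomputing
    it with extra backbone forward passes. The cache is reset at every
    sequence boundary (here: every n_frames steps)."""

    def __init__(self, n_frames):
        self.n_frames = n_frames
        self.prev_bev = None
        self.step = 0

    def fetch_and_update(self, bev_t):
        if self.step % self.n_frames == 0:
            self.prev_bev = None  # sequence boundary: reset the memory
        prev = self.prev_bev      # F_Bev^(t-1), or None at a stream start
        self.prev_bev = bev_t     # cache the current features for step t+1
        self.step += 1
        return prev

memory = BevMemory(n_frames=20)       # e.g. 2 s streams at 10 Hz
bev_shape = (200 * 200, 256)          # placeholder flattened BEV grid
first_prev = memory.fetch_and_update(np.zeros(bev_shape))
second_prev = memory.fetch_and_update(np.ones(bev_shape))
# first_prev is None (fresh stream); second_prev is the cached frame-0 BEV.
```

In the real model, the cached tensor would be detached or carried with truncated backpropagation; the sketch only shows the caching and reset logic.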
Additionally, we employ RADIO [19] as an image backbone with a low-rank adapter (LoRA) [22] for parameter-efficient finetuning. RADIO provides rich, generic features, and LoRA finetuning significantly reduces VRAM requirements compared to backpropagating through a ResNet [18]. Pruning the additional heads for online-mapping, motion forecasting, and occupancy prediction further reduces BevAD's memory footprint.

Combined, these optimizations enable end-to-end training with a batch size of 16 on a single A100-80GB GPU, a significant improvement over UniAD, which required two-stage training with a frozen backbone due to VRAM constraints. Our optimizations yield significantly higher training sample throughput, even surpassing UniAD-tiny, as shown in Tab. 5. Specifically, BevAD processes 35x more training samples per second with a frozen BEV backbone, and achieves a 22x speed-up in end-to-end training. These optimizations represent a major contribution towards scalable imitation learning for robust closed-loop performance, a benefit we extend to the community through our open-source code release.

Camera Augmentation. Camera augmentations are integral to robust closed-loop imitation learning [25]. Perfect expert drivers, such as PDM-lite, maintain precise lane centering, leading to a training distribution dominated by ideal states. However, during closed-loop testing, accumulated steering errors can cause the vehicle to drift, resulting in covariate shift [42] and degraded planner performance. A common mitigation involves augmenting driver camera views with random shifts and rotations during training [25], as adopted by SotA methods on Bench2Drive [37, 43, 56]. This augmentation simulates out-of-distribution states not observed with perfect expert driving.
However, these shift and rotation augmentations are typically limited to single front-facing camera setups and are challenging to apply to multi-camera systems. Furthermore, their application to real-world data necessitates novel view synthesis, introducing pipeline overhead and potential artifacts. To address these limitations, we propose a novel BEV-based augmentation strategy. Instead of manipulating raw sensor data, we augment the BEV coordinate system directly. This is achieved by sampling a random transformation matrix ${}^{\mathrm{Bev}}T_{\mathrm{Car}}$, comprising a small yaw rotation $\gamma \sim \mathcal{U}[-22.5^\circ, 22.5^\circ]$ and a lateral offset $\Delta y \sim \mathcal{U}[-0.75\,\mathrm{m}, 0.75\,\mathrm{m}]$. This matrix is then applied to all camera transformation matrices ${}^{\mathrm{Car}}T_{\mathrm{Cam}_i}$:

$${}^{\mathrm{Bev}}T_{\mathrm{Cam}_i} = {}^{\mathrm{Bev}}T_{\mathrm{Car}} \cdot {}^{\mathrm{Car}}T_{\mathrm{Cam}_i}$$

By providing ${}^{\mathrm{Bev}}T_{\mathrm{Cam}_i}$ to the BEV encoder, it builds BEV features $F_{\mathrm{Bev}}$ in the augmented BEV coordinate system rather than the vehicle coordinate system. Additionally, we apply this transformation to all ground-truth labels, i.e., 3D bounding boxes and planning trajectories. Our augmentation prevents the detection head and planner from learning a bias towards axis-aligned objects or trajectories relative to the ego vehicle. For example, the disentangled planner must learn to predict the future path in the augmented BEV space, necessitating a robust understanding of the BEV feature grid. Unlike prior camera augmentations, our BEV-based scheme requires no augmented sensor data, as it only alters the generation of the latent BEV representation via a non-learnable, random transformation.

Planning Queries. The planning query's interpretation determines the output representation. For an entangled trajectory representation, we define $P_{\mathrm{Plan}} = P_{\mathrm{Traj}} \in \mathbb{R}^{N_t \times N_c}$, where $N_t$ is the number of trajectory points and $N_c$ is the feature dimension.
For a disentangled representation, P_Plan = [P_Path ∥ P_Speed], with P_Path ∈ R^{N_p × N_c} and P_Speed ∈ R^{N_t × N_c}, where N_p denotes the number of path points. In our experiments, we set N_t = 15 for a 3 s planning horizon, resulting in temporal waypoint and speed predictions spaced at 0.2 s intervals. For path planning, we use N_p = 30 with waypoints spaced 1 m apart.

Planning Head. The planning head (Fig. 2b) comprises N_layer = 8 transformer decoder layers with a feature dimension of N_c = 512. Inspired by [41], we integrate an adaLN-Zero transformation to condition the self-attention and global cross-attention layers with c, as depicted in Fig. 8. For the point estimator, c = emb(command) + emb(v_0), combining a learnable embedding of the high-level navigation command and a sinusoidal embedding of the current velocity. For the diffusion-based planner, c = emb(command) + emb(v_0) + emb(τ), additionally incorporating an embedding of the current diffusion timestep τ.

Figure 8. adaLN-Zero Attention. We employ an adaLN-Zero transformation to condition the transformer on c. For self-attention, we set K = V = Q_Plan, and for cross-attention, we set K = V = F_Scene.

B.2. Diffusion-based Planning

Preliminaries. Diffusion models are a class of generative models that learn to reverse a gradual noising process applied to data [20, 45]. They define a fixed forward Markov chain that progressively adds Gaussian noise to data, transforming it into a pure noise distribution. Specifically, the forward process admits sampling of noisy data x_τ for arbitrary (time)steps τ in the Markov chain in closed form, given clean data x_0, by sampling Gaussian noise ε ∼ N(0, I) [20]:

x_τ(x_0, ε) = √(ᾱ_τ) x_0 + √(1 − ᾱ_τ) ε    (1)

The sequence ᾱ_0, ..., ᾱ_T typically stems from a variance schedule with constant parameters [20, 46].
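As a minimal illustration, the closed-form forward sampling of Eq. (1) can be sketched as follows (the linear ᾱ schedule here is an assumption for demonstration; the actual schedule follows [20, 46]):

```python
import numpy as np

def q_sample(x0, tau, alpha_bar, eps):
    """Forward diffusion, Eq. (1): x_tau = sqrt(abar_tau) x0 + sqrt(1 - abar_tau) eps."""
    a = alpha_bar[tau]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

# Illustrative decreasing schedule abar_0 >= ... >= abar_T (an assumption).
alpha_bar = np.linspace(0.999, 0.01, 50)
rng = np.random.default_rng(0)
x0 = rng.normal(size=(15, 2))        # e.g. 15 normalized trajectory waypoints
eps = rng.normal(size=x0.shape)      # eps ~ N(0, I)

x_early = q_sample(x0, 0, alpha_bar, eps)     # nearly clean sample
x_late = q_sample(x0, 49, alpha_bar, eps)     # nearly pure noise
assert np.allclose(x_early, np.sqrt(0.999) * x0 + np.sqrt(0.001) * eps)
```

Because Eq. (1) is closed-form, training never has to simulate the full noising chain step by step.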
The core of diffusion models lies in learning a reverse Markov chain that iteratively denoises samples over T timesteps, starting from random noise p(x_T) = N(x_T; 0, I), to recover data samples from the original distribution [20]:

p_θ(x_{0:T}) := p(x_T) ∏_{τ=1}^{T} p_θ(x_{τ−1} | x_τ)    (2)

Instead of learning the Gaussian transitions p_θ(x_{τ−1} | x_τ) directly, it is common practice to learn a function approximator f_θ : (x_τ, τ) → x_0, parameterized by a neural network with learnable weights θ [39]. This allows learning the diffusion model by minimizing Eq. (3) for all timesteps τ [20]:

‖x_0 − f_θ(x_τ, τ)‖    (3)

For diffusion-based planning, the clean planning ground truth x_0 is defined based on the planning representation:
• Entangled: a series of normalized trajectory waypoints, x_0 = {(x_i, y_i)}_{i=1}^{N_t}.
• Disentangled: a tuple comprising normalized path waypoints and a normalized speed sequence, i.e., x_0 = ({(x_i, y_i)}_{i=1}^{N_p}, {v_i}_{i=1}^{N_t}).

Training. During training, we sample a timestep τ from a uniform distribution and Gaussian noise ε ∼ N(0, I). We obtain the noisy sample x_τ using Eq. (1) and following the DDIM variance schedule [46]. The planning head serves as the conditional function approximator f_θ : (x̃_τ, τ, z) → x_0 for the reverse process [36], tasked with predicting x_0 from x̃_τ, the diffusion timestep τ, and the conditioning context z = (F_Scene, c). Note that x_τ is embedded into a latent space via an MLP to obtain x̃_τ before being fed to f_θ.

Inference. For inference, we start with a random sample x_T ∼ N(0, I) and iteratively denoise it using the trained function approximator f_θ(x̃_τ, τ, z) and the DDIM sampling algorithm [46]. This process progressively denoises x_T to yield the final clean prediction x_0.
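A minimal sketch of this denoising loop (deterministic DDIM updates with the x_0-parameterization; the conditioning z, the latent embedding, and the learned planning head are abstracted into a placeholder predictor, and the schedule is illustrative):

```python
import numpy as np

def ddim_denoise(f_theta, shape, alpha_bar, rng):
    """Iterative denoising with deterministic DDIM-style updates (sketch).

    f_theta stands in for the conditional planning head: it predicts the
    clean sample x0 from (x_tau, tau). Conditioning z is omitted here.
    """
    T = len(alpha_bar)
    x = rng.normal(size=shape)                      # x_T ~ N(0, I)
    for tau in range(T - 1, -1, -1):
        a = alpha_bar[tau]
        x0_hat = f_theta(x, tau)                    # x0-prediction
        # Noise implied by Eq. (1), carried to the next (less noisy) step.
        eps_hat = (x - np.sqrt(a) * x0_hat) / np.sqrt(1.0 - a)
        if tau > 0:
            a_prev = alpha_bar[tau - 1]
            x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
        else:
            x = x0_hat                              # final clean prediction
    return x

# With an oracle predictor that always returns zeros, the loop yields zeros.
alpha_bar = np.linspace(0.999, 0.01, 50)
out = ddim_denoise(lambda x, t: np.zeros_like(x), (15, 2), alpha_bar,
                   np.random.default_rng(0))
assert np.allclose(out, 0.0)
```

The loop above visits every timestep; DDIM additionally allows skipping most of them, as discussed next.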
We specifically employ DDIM's accelerated generation, which conducts denoising with a subset of S < T denoising steps to enhance computational efficiency during sampling.

DiffusionDrive. We do not adopt DiffusionDrive's [36] truncated diffusion schedule in our experiments. This decision stems from our observation that DiffusionDrive's forward process adds noise to a fixed set of anchor trajectories, while the reverse process aims to predict ground truth trajectories, leading to an asymmetric reversal. This issue is also thoroughly discussed in concurrent work [37], which proposes a theoretically sound diffusion bridge formulation as a solution.

B.3. Controller

We adopt the disentangled PID controllers from Simlingo [43] for lateral and longitudinal control. All controller parameters are retained, with the exception of the steering proportional gain, which is slightly lowered to P_steer = 1.8 to reduce oversteering in our closed-loop agent.

B.4. Loss

The overall loss function for end-to-end training combines terms for 3D object detection and planning:

L = λ_det L_det + λ_plan L_plan    (4)

The detection loss L_det, adopted from [34], includes classification and regression components. The planning loss L_plan varies with the planner's modeling choice:
• For regression-based planning with a point estimator, L_plan is a smooth L1 loss applied to the trajectory error, or to the path and speed errors.
• For diffusion-based planning, L_plan is a smooth L1 loss on the x_0-prediction error, as defined in Eq. (3).

The loss coefficients are set to λ_det = 1 and λ_plan = 100 to balance the magnitudes of the detection and planning loss components.

B.5. Metrics

To better characterize closed-loop failure modes, we introduce auxiliary metrics: the static infraction rate (IR_s) and the dynamic infraction rate (IR_d).
IR_s = (N_layout-collision + N_outside-lane) / N_routes    (5)

This metric quantifies infractions related to lateral control errors, such as collisions with static layout elements or driving outside the lane.

IR_d = (N_actor-collision + N_red-light + N_stop-sign) / N_routes    (6)

This metric captures infractions arising from longitudinal control errors and interactions with dynamic elements, including collisions with actors, running red lights, or failing to stop at stop signs. Collectively, these metrics provide the expected number of infractions per route, offering a granular understanding of control deficiencies.

C. Results

This section presents an ablation study, quantitative results on the Bench2Drive benchmark, and qualitative demonstrations of BevAD's closed-loop driving.

C.1. Ablations

Our ablation studies investigate early design choices. For the following ablations of camera augmentation and BEV size, our baseline is a planning head with a tokenizer (p = 4, masking), disentangled representation, and a point-estimator regressor, as detailed in Sec. 3.3 and Tab. 1. Furthermore, we ablate the number of denoising steps in the diffusion planner using our strongest model, BevAD-M.

Camera Augmentation. To assess the effectiveness of our novel BEV-based augmentation, we compare the baseline against a variant trained without it. As shown in Tab. 6, the absence of camera augmentation significantly degrades closed-loop performance. This degradation primarily stems from a 5.7× increase in static infractions, leading to reduced route completion and increased secondary collisions. These results underscore the critical role of our augmentation scheme in promoting robust driving and enabling recovery from compounding steering errors. The results also highlight the insufficiency of common open-loop metrics, as the L1 trajectory error does not reflect the observed degradation in model robustness.

BEV Size.
Our tokenizer compresses high-resolution BEV features (F_Bev) into low-resolution scene tokens (F_Scene). An alternative is to directly learn a low-resolution BEV feature space. As shown in Tab. 7, despite exposing the same number of scene tokens to the planner, direct low-resolution BEV generation significantly degrades closed-loop performance. Specifically, the deformable BEV-image cross-attention of our BEV encoder (based on BEVFormer [34]) extracts image information more sparsely when generating low-resolution BEVs, potentially omitting fine details. Furthermore, it prevents leveraging deformable refinement layers for sampling local, high-resolution features around the future trajectory. As a result, we observe 7.4× more static infractions with the low-resolution BEV encoder compared to our compression approach.

Table 6. Ablation of camera augmentation. The absence of camera augmentation significantly degrades closed-loop performance, despite minimal impact on open-loop L1 trajectory deviation. This underscores the contribution of our BEV-based augmentation to robust driving and covariate shift mitigation.

Augmentation | DS ↑  | SR ↑  | IR_s ↓ | L1 (m) ↓
✓            | 82.62 | 57.43 | 0.055  | 1.43
✗            | 66.60 | 33.64 | 0.314  | 1.47

Table 7. BEV Resolution and Tokenization. High-resolution BEV generation, compressed via our tokenizer, yields superior closed-loop driving performance compared to direct low-resolution BEV generation.

BEV       | Scene Tokens | DS ↑  | SR ↑  | IR_s ↓
100 × 100 | 25 × 15      | 82.62 | 57.43 | 0.055
25 × 25   | 25 × 15      | 72.18 | 40.36 | 0.409

Table 8. Impact of Denoising Iterations. Reducing denoising steps significantly boosts inference FPS while preserving closed-loop driving performance, enabling real-time applications. FPS measured on a Quadro RTX 8000.

S  | DS ↑  | SR ↑  | FPS ↑
10 | 88.11 | 72.73 | 4.2
5  | 88.33 | 72.72 | 5.8
2  | 88.53 | 72.72 | 7.5
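For intuition, compressing a 100 × 100 BEV grid into 25 × 15 scene tokens can be sketched as follows. Patchify-by-average-pooling and a simple row mask are illustrative assumptions for this sketch only; the actual tokenizer design is described in Sec. 3.3:

```python
import numpy as np

def tokenize_bev(f_bev, p=4, keep_rows=15):
    """Compress a high-resolution BEV grid into scene tokens (sketch).

    Patchify with non-overlapping p x p average pooling, then keep only a
    subset of token rows, illustrating the masking down to 25 x 15 tokens.
    Both the pooling and the mask layout are assumptions of this sketch.
    """
    H, W, C = f_bev.shape
    tokens = f_bev.reshape(H // p, p, W // p, p, C).mean(axis=(1, 3))
    return tokens[:keep_rows]                 # (keep_rows, W // p, C)

f_bev = np.random.default_rng(0).normal(size=(100, 100, 32))
scene = tokenize_bev(f_bev)
assert scene.shape == (15, 25, 32)
```

The point of the ablation is that F_Bev is built at high resolution first, so fine detail exists before compression, whereas a natively low-resolution encoder never observes it.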
This finding underscores the necessity of compressing high-resolution representations rather than directly learning low-resolution ones.

Number of denoising steps. The iterative denoising of diffusion models critically impacts the inference latency in real-world deployments. The runtime of our diffusion-based planner increases linearly with the number of denoising steps S, while the point estimator planner has constant runtime, corresponding to S = 1. We thus evaluate how S affects closed-loop driving performance and inference FPS in Tab. 8. In contrast to DiffusionDrive [36], we achieve constant driving performance for S ∈ {2, 5, 10}, without applying their truncated diffusion framework. We attribute this to a bug in their diffusion schedule, which we detail in the appendix.

Figure 9. Emerging Skills. All skills remain underdeveloped with fewer than 4,000 training episodes. As the volume of training data increases, the skills progressively and uniformly emerge.

C.2. Multi-Ability Evaluation

To gain a nuanced understanding of system performance in closed-loop driving, we employ the fine-granular multi-ability evaluation protocol from Bench2Drive [28]. This protocol defines five advanced urban driving skills: Merging, Overtaking, Emergency Brake, Give Way, and Traffic Sign. Each of the 220 test routes is mapped to one or more skills necessary for successfully navigating the scenario.

Tab. 9 presents a comparison of BevAD's multi-ability scores against prior work. BevAD consistently achieves high scores (> 70%) across all skills, dominating the mean score. In contrast, prior state-of-the-art methods [43, 48] exhibit uneven performance, excelling in some skills while underperforming in others.
BevAD-M surpasses the previous best in the Merging and Give Way skills by +8.17 and +23.34, respectively. These skills demand comprehensive surround perception, underscoring BevAD's effective utilization of its multi-view camera system. However, BevAD lags in the Overtaking, Emergency Brake, and Traffic Sign skills compared to the best prior or concurrent methods for each skill.

Building on the discussion in Sec. 3.5, we present the progression of BevAD's closed-loop driving skills as the training dataset size increases, detailed in Fig. 9. With fewer than 4,000 training episodes, all skills exhibit a success rate below 50%. However, as the training data is doubled and quadrupled, the skills show consistent and uniform improvement. This enhancement is evidenced by high success rates in complex scenarios, such as overtaking amidst oncoming traffic, merging onto highways and into traffic flow at intersections, and yielding to emergency vehicles.

Method            | Merging      | Overtaking   | Emergency Brake | Give Way     | Traffic Sign | Mean
VAD [29]          | 8.11         | 24.44        | 18.64           | 20.00        | 19.15        | 18.07
UniAD-Base [24]   | 14.10        | 17.78        | 21.67           | 10.00        | 14.21        | 15.55
ThinkTwice [27]   | 27.38        | 18.42        | 35.82           | 50.00        | 54.23        | 37.17
DriveAdapter [26] | 28.82        | 26.38        | 48.76           | 50.00        | 56.43        | 42.08
Hydra-NeXt [35]   | 40.00        | 64.44        | 61.67           | 50.00        | 50.00        | 53.22
Orion [15]        | 25.00        | 71.11        | 78.33           | 30.00        | 69.15        | 54.72
TF++ [25]         | 58.75        | 57.77        | 83.33           | 40.00        | 82.11        | 64.39
Simlingo [43]     | 54.01 ± 2.63 | 57.04 ± 3.40 | 88.33 ± 3.34    | 53.33 ± 5.77 | 82.45 ± 4.73 | 67.03 ± 2.12
Hip-AD [48]       | 50.00        | 84.44        | 83.33           | 40.00        | 72.10        | 65.98
BridgeDrive† [37] | 63.50        | 58.89        | 88.34           | 50.00        | 88.95        | 69.93
BevAD-S (ours)    | 55.83 ± 0.72 | 53.33 ± 6.67 | 63.33 ± 1.93    | 46.67 ± 5.77 | 60.88 ± 3.08 | 56.01 ± 3.78
BevAD-M (ours)    | 71.67 ± 2.60 | 74.07 ± 1.29 | 75.56 ± 4.41    | 76.67 ± 5.77 | 75.44 ± 1.61 | 74.68 ± 1.24

Table 9. Multi-Ability Evaluation (ability scores in %, higher is better). BevAD consistently achieves high results across all skills, dominating the mean score.
Notably, it significantly surpasses prior work in the Merging and Give Way skills. However, it exhibits comparatively lower performance in the Overtaking, Emergency Brake, and Traffic Sign skills. Legend: †: concurrent work.

C.3. Qualitative Results

We provide qualitative closed-loop driving examples for each multi-ability skill of BevAD-M in Fig. 10, 11, 12, 13, 14 and 15. The examples are best viewed when zoomed in and visualize planned ego trajectories that are generated by rolling out the predicted speed profile along the predicted path. This depiction aids in understanding the dynamic planning component, but is solely for visualization. It does not serve as controller input, nor does it affect the lateral accuracy of the predicted path.

Figure 10. Merging. BevAD merges from a parallel parking space into traffic. It yields to the rear-end flow of vehicles, identifies a safe gap, and accelerates for seamless merging.

Figure 11. Overtaking (1). BevAD executes an overtaking maneuver when the route is blocked by a construction vehicle. It waits for clear oncoming traffic, then steers into the opposing lane, accelerates to quickly pass the obstacle, and subsequently decelerates when returning to its original lane.

Figure 12. Overtaking (2). BevAD approaches a group of cyclists. It executes a safe left lane change to overtake them. After the maneuver, BevAD returns to its original lane, maintaining a safe distance to the cyclists.

Figure 13. Emergency Brake. BevAD brakes at a green-light intersection due to a pedestrian crossing its left-turn path. Driving resumes upon pedestrian clearance.

Figure 14. Give Way. BevAD yields to an oncoming vehicle encroaching on the ego lane. It performs a controlled rightward deviation from the lane center, staying within road limits, and re-centers once the oncoming traffic has passed.

Figure 15. Traffic Sign. BevAD stops at a stop sign at an intersection with cross-traffic.
It waits for a safe gap, then executes a right turn and merges into the traffic flow.
