RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

Anurag Ghosh 1, Srinivasa Narasimhan 1, Manmohan Chandraker 2,3, and Francesco Pittaluga 2
1 Carnegie Mellon University  2 NEC Labs America  3 UC San Diego

Abstract. We present LAD, a real-time language-action planner with an interruptible architecture that produces a motion plan in a single forward pass (∼20 Hz) or generates textual reasoning alongside a motion plan (∼10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ∼3× lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.

Keywords: Closed-Loop Planning · Language Models · Real-Time

1 Introduction

Consider the driving scenarios in Fig. 1. In the top row, an autonomous vehicle navigates a left turn at a pickup zone, a maneuver requiring awareness of lane blockages and intersection geometry. In the bottom row, the same vehicle attempts a right turn through dense, ambiguous traffic where pedestrians, oncoming vehicles, and unclear right-of-way create genuine uncertainty. Both scenarios appear in current closed-loop planning benchmarks, yet differ profoundly in character. The first demands safe maneuvering to navigate tight spaces; the second requires situational understanding to interpret intent and ambiguity.
Improvements to rule-based systems likely cannot address the true semantic long-tail, including situations requiring nuanced understanding of social norms, ambiguous occlusions, or negotiable rights-of-way [17]. For this, we introduce LAD (Language-Based Autonomous Driving), a multimodal large language model (MLLM) planner built for real-time closed-loop deployment. A persistent concern with language-based planners has been their latency [7, 22, 41]: prior systems operate at 2-3 Hz, far too slow for reactive closed-loop planning, leading some approaches to employ language models as offline advisors [7, 35]. LAD's interruptible inference architecture addresses this by producing a valid plan in a single forward pass (∼20 Hz) and optionally generating textual reasoning when compute budget permits (∼10 Hz), remaining compatible with safety mechanisms that require immediate re-planning (Section 3.1).

Fig. 1: Autonomous driving requires rule-following and semantic understanding. (Top row) A left turn at a pickup/dropoff zone: the ego vehicle (red, with planned trajectory) must navigate around vehicles blocking the lane. (Bottom row) A right turn through a crowded intersection: dense traffic from multiple directions, pedestrian crossings, and ambiguous right-of-way require reasoning beyond simple trajectory optimization. Text overlays show LAD's real-time situational understanding. The key insight is that many scenarios labeled "hard" require only better lane-changing, which our rule-based planner RAD can handle. Semantic difficulty demands language-grounded reasoning, e.g. negotiation of ambiguous traffic, which LAD handles by generating both motion plans and interpretable explanations at ∼10 Hz, enabling real-time deployment.
Beyond real-time feasibility, our ablations show that language supervision provides complementary training signal for closed-loop planning, yielding strong performance on long-tailed benchmarks including nuPlan Test14-Hard and InterPlan.

We also design RAD (Rule-Based Autonomous Driving), a structured planner that extends PDM-Closed [10] with dynamic topology replanning and goal-directed optimization. RAD reveals that many scenarios labeled "hard" in current benchmarks are geometric in character, resolvable with capabilities like lane changes, while truly difficult cases require the semantic reasoning that language enables. Our hybrid planner, RAD-LAD, combines strict rule-following with language-based reasoning for better performance in long-tailed scenarios.

Thus, we view autonomous driving planning as consisting of two complementary challenges: geometric feasibility and semantic reasoning. RAD addresses the geometric component by expanding the planner's search space through dynamic topology and maneuver priors. LAD addresses the semantic component by enabling language-grounded reasoning over ambiguous traffic interactions. The resulting hybrid system provides a practical pathway for integrating structured planning and foundation models in real-time autonomy.

In summary, our key contributions are:

1. LAD: The first real-time language-action planner to achieve state-of-the-art performance on closed-loop, long-tailed autonomous driving benchmarks. LAD demonstrates not only that language-based supervision improves planning performance, but also that inference-time, language-based reasoning is deployable at ∼10 Hz.
2. RAD: A flexible rule-based planner that achieves strong performance on long-tail benchmarks by extending the capabilities of existing rule-based planners such as PDM-Closed [10], revealing that much of the benchmark difficulty is geometric rather than semantic.

3. RAD-LAD: An integrated hybrid rule-and-language-based planner that combines the best of both worlds, the interpretable language-based reasoning and planning of LAD and the physics-based safety guardrails of RAD, to achieve competitive closed-loop planning performance.

2 Related Work

2.1 Language-Based Planning

Language-based planning offers a potential solution for the semantic reasoning gap left by rule-based and conventional learning-based planners. Approaches like DriveVLM [41], DriveGPT4 [45], and EMMA [22] have demonstrated strong scene understanding and reasoning capabilities. However, these systems are fundamentally limited by latency and open-loop design. Most operate at speeds (e.g., 2-3 Hz) insufficient for reactive closed-loop planning or rely on offline processing [40, 43]. Advisory frameworks [7, 35] attempt to mitigate this by decoupling reasoning from planning, but this prevents true language-guided improvisation.

A common limitation across these methods is that plan generation is tightly coupled to full autoregressive text generation, making latency proportional to reasoning depth. LAD addresses this with an interruptible inference architecture: a dedicated plan token always produces a valid trajectory in a single forward pass, while optional reasoning tokens precede it to improve planning quality when compute budget permits. A phased training curriculum (inspired by BLIP-2 [26], LLaVA [29], Pi [2]) enables 10 Hz planning with reasoning, solving the latency bottleneck that hindered the deployment of prior vision-language-action models.
2.2 Rule-Based and Learned Planning

Evaluation in autonomous driving has shifted from open-loop metrics to closed-loop simulation [5], revealing that many state-of-the-art learning-based planners struggle to match the reliability of rule-based systems like PDM-Closed [10]. While PDM-Closed excels on standard benchmarks, its fixed topology means that "long-tail" benchmarks [9, 18] partially reflect its design constraints. RAD addresses these with dynamic replanning and goal-directed optimization (Section 3), significantly outperforming both PDM-Closed and recent hybrid extensions [21, 38].

Fig. 2: LAD Architecture. We encode scene context, adapt it into the language model's manifold, and insert it as pseudo-tokens within the prompt. The decoder produces natural-language reasoning and a motion plan from the hidden state at <|plan|>.

Pure imitation-based learned planners (e.g., PlanTF [9], DiffusionPlanner [50]) offer promise for generalization but often exhibit poor adherence to safety constraints in closed-loop settings. Reinforcement learning offers a complementary direction: CaRL [23] trains an action-based RL policy that bypasses the trajectory-to-control interface entirely. As output representation and training paradigm are tightly coupled in this setting, we treat this as an orthogonal axis and focus on improving planning through language supervision (see Supplemental Material for discussion).
2.3 Hybrid Planning

The complementary strengths of rule-based and learned planners have motivated hybrid approaches that combine both. PLUTO [8] augments a learned trajectory predictor with a rule-based scorer, and STR2 [38] extends this with larger-scale mixture-of-experts architectures. Similarly, DiffusionPlanner [50] and FlowPlanner [39] generate trajectories via generative models and refine them with rule-based scoring. RAD-LAD follows this hybrid paradigm but integrates a learned language-based planner, enabling interpretable reasoning alongside rule compliance.

3 Method

3.1 LAD: Language-Based Autonomous Driving

LAD is an anytime multimodal language model planner: it produces a valid motion plan in a single forward pass and optionally generates textual reasoning when computational budget permits, enabling real-time closed-loop deployment. Here, "multimodal" refers to the fusion of structured scene entities (vectorized map and agent representations) with language.

Architecture. We transform a pretrained decoder-only language model into a motion planner by introducing three modifications: (1) scene encodings are injected as pseudo-tokens, (2) a planning head is attached to a designated output position, and (3) inference can be interrupted to meet latency constraints.

Scene encoding and adaptation. Let M = {m_i}_{i=1}^{N_m} denote the set of map elements (lanes, crosswalks), and A = {a_j}_{j=1}^{N_a} denote dynamic and static agents. Both are encoded via PlanTF [9],

  z_i = φ^map_PlanTF(m_i) ∈ ℝ^{d_ptf},  i = 1, ..., N_m,   (1)
  u_j = φ^agent_PlanTF(a_j) ∈ ℝ^{d_ptf},  j = 1, ..., N_a.   (2)

Lightweight MLP adapters project these embeddings into the model's token space:

  z̃_i = f_map(z_i) ∈ ℝ^{d_ℓ},  ũ_j = f_agent(u_j) ∈ ℝ^{d_ℓ}.   (3)
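The adapter pattern of Eq. (3) can be sketched in a few lines. This is a minimal illustration, not the released implementation: the embedding dimensions and the two-layer ReLU shape of the adapter are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp_adapter(d_in, d_out, d_hidden=256):
    """Lightweight MLP that projects PlanTF embeddings into the language
    model's token space (Eq. 3). Depth and width are illustrative."""
    W1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
    W2 = rng.normal(0.0, 0.02, (d_hidden, d_out))
    def adapter(z):
        h = np.maximum(z @ W1, 0.0)   # ReLU hidden layer
        return h @ W2                  # pseudo-token in R^{d_l}
    return adapter

d_ptf, d_l = 128, 1024                 # assumed encoder / LM dimensions
f_map = make_mlp_adapter(d_ptf, d_l)
f_agent = make_mlp_adapter(d_ptf, d_l)

z = rng.normal(size=(10, d_ptf))       # 10 map-element embeddings
u = rng.normal(size=(4, d_ptf))        # 4 agent embeddings
pseudo_tokens = np.concatenate([f_map(z), f_agent(u)], axis=0)
print(pseudo_tokens.shape)             # (14, 1024)
```

The resulting rows are what the prompt interleaves with ordinary text tokens.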
While we instantiate LAD with PlanTF for structured inputs, the same adapter pattern extends to vision encoders for camera or lidar modalities.

Multimodal prompting. Following object-centric tokenization [40], we inject adapted embeddings as pseudo-tokens delimited by special tokens and interleave them with a natural-language task prompt. The decoder-only language model attends over this heterogeneous context as it would over a purely textual input.

Planning decoder. Rather than generating waypoints autoregressively, we formulate planning as classification over a discrete trajectory vocabulary following prior works [27, 36, 43], V = {v_k}_{k=1}^K, where K is the vocabulary size and each prototype v_k ∈ ℝ^{T×2} is a trajectory of T waypoints. A small MLP head g: ℝ^{d_ℓ} → ℝ^K at the <|plan|> token produces logits s = g(h_plan). The classifier is trained with the imitation soft cross-entropy loss [27], where targets are derived from the proximity of each prototype to the ground-truth trajectory v*,

  y_k = exp(−‖v_k − v*‖₂) / Σ_{j=1}^K exp(−‖v_j − v*‖₂),   L_plan = log Σ_{k=1}^K e^{s_k} − Σ_{k=1}^K y_k s_k.   (4)

The classification head follows BERT or ViT class-token heads [12, 13] and GPT's task-specific output heads [34], adapted to a decoder-only context.

Textual supervision. When reasoning (or answer) text is available, we train the model with teacher forcing over response tokens,

  L_language = − Σ_{t∈Ω} log p_θ(w*_t | w*_{<t}, c),

where Ω indexes response-token positions and c denotes the multimodal prompt context. At inference, we can insert <|plan|> directly after the scene tokens and perform a single prefill pass to obtain h_plan, yielding a motion plan with one forward pass. When computational budget permits, we allow the model to generate reasoning tokens before inserting <|plan|>, trading latency for interpretability. The architecture remains identical in both cases; only the prompt structure changes.
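The soft-target loss of Eq. (4) is straightforward to compute; below is a minimal numpy sketch in which the vocabulary and logits are toy stand-ins, not trained quantities.

```python
import numpy as np

def soft_imitation_loss(logits, vocab, gt_traj):
    """Soft cross-entropy over a trajectory vocabulary (Eq. 4): targets are
    a softmax over negative L2 distances to the ground-truth trajectory."""
    d = np.linalg.norm((vocab - gt_traj).reshape(len(vocab), -1), axis=1)
    y = np.exp(-d) / np.exp(-d).sum()                  # soft targets y_k
    lse = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    return lse - (y * logits).sum()                    # log-sum-exp(s) - sum_k y_k s_k

K, T = 8, 6
rng = np.random.default_rng(0)
vocab = rng.normal(size=(K, T, 2))                     # K prototype trajectories
gt = vocab[3] + 0.01 * rng.normal(size=(T, 2))         # ground truth near prototype 3
loss_uniform = soft_imitation_loss(np.zeros(K), vocab, gt)
confident = np.zeros(K)
confident[3] = 5.0                                     # mass on the right prototype
loss_correct = soft_imitation_loss(confident, vocab, gt)
print(loss_correct < loss_uniform)                     # loss falls as logits align with y
```

Because the targets y are a softened one-hot distribution, nearby prototypes also receive credit, which is what makes the classification formulation tolerant of a finite vocabulary.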
More broadly, because the plan token yields a valid action whether or not reasoning tokens precede it, this architecture is compatible with deployment settings where a safety mechanism requires an immediate action. This design enables LAD to operate in real-time without reasoning or with short justifications, while remaining compatible with standard inference optimizations including KV-caching [31], operator fusion [37], and efficient scheduling [25]. Detailed latency analysis is provided in Section 5.4.

Multimodal training. Training a multimodal planner requires balancing the preservation of language representations with the acquisition of new capabilities. We adopt the following training curriculum to achieve this balance.

Stage A: Alignment. The language model remains frozen while only lightweight adapters and multimodal projection layers are trained. This establishes stable grounding of scene encodings within the model's existing representational space, following alignment strategies in prior multimodal work [26, 29].

Stage B: LoRA finetuning. LoRA modules [19] and the task-specific planning head are introduced while the backbone remains frozen. The zero-initialized updates of LoRA provide controlled capacity expansion for semantic grounding and trajectory prediction without destabilizing pretrained representations. To maintain the linguistic abilities of LAD, we employ a mixed training strategy that co-trains on a small proportion of the Interaction QA dataset alongside the planning objective, following prior work [2, 52] and mitigating catastrophic forgetting of pretrained representations [33].

3.2 RAD: Rule-Based Autonomous Driving

State-of-the-art rule-based planners like PDM-Closed [10] perform remarkably well on standard scenarios (i.e., Val14 [5]) but score badly on long-tailed scenarios (Test14-Hard [9] and InterPlan [18]).
These failures stem from specific design choices in the baseline (e.g., fixed topology, no deadlock handling) rather than intrinsic scenario complexity. RAD addresses PDM-Closed's key limitations, such as static topology and no lane changes, with dynamic topology replanning, lane-change capability, and goal-directed optimization.

Revisiting Rule-Based Planning in PDM. The PDM-Closed [10] planner selects the optimal trajectory π* by maximizing a scoring function. To align with the official nuPlan evaluation metrics, this is formulated as a cost function J_PDM composed of multiplicative penalties (safety constraints) scaling a weighted sum of driving-quality objectives,

  J_PDM(π) = C_col C_ra C_mp (w_ttc C_ttc + w_dr C_dr + w_sp C_sp + w_ep C_ep + w_cf C_cf).   (7)

The terms of J_PDM include multiplicative penalties for collision C_col, violating the drivable area C_ra, and not making minimum progress C_mp, and weighted costs for time-to-collision C_ttc, speed compliance C_sp, progress along the expert's route C_ep, direction compliance C_dr, and comfort C_cf.

While robust on standard scenarios, PDM-Closed [10] has a fixed topology: (1) it does not support lane changes; (2) it enforces strict penalties hindering necessary evasive maneuvers; and (3) it performs topological planning only once at the start, causing drift. RAD addresses these limitations via the following modifications.

Dynamic Topological Replanning. PDM-Closed [10] generates 15 candidates per timestep but anchors them to proposal paths (Γ_static) fixed at initialization, i.e. there is no topological replanning after initialization. If the ego deviates or paths become blocked, proposals are never updated. RAD performs full topological replanning at every timestep. We define proposal-path extraction as a time-dependent function of the current ego state s_t and the map M,

  Γ_t = GraphSearch(s_t, M).   (8)
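The gate-times-quality structure of Eq. (7) can be sketched as follows; the keys and weights are illustrative placeholders, not the official nuPlan weights.

```python
def j_pdm(pen, cost, w):
    """PDM-style score (Eq. 7): multiplicative safety penalties gate a
    weighted sum of driving-quality terms. Keys/weights are illustrative."""
    gate = pen["col"] * pen["ra"] * pen["mp"]
    return gate * sum(w[k] * cost[k] for k in ("ttc", "dr", "sp", "ep", "cf"))

w = {"ttc": 5.0, "dr": 2.0, "sp": 2.0, "ep": 5.0, "cf": 1.0}
quality = {"ttc": 1.0, "dr": 1.0, "sp": 0.9, "ep": 0.8, "cf": 1.0}
safe  = j_pdm({"col": 1, "ra": 1, "mp": 1}, quality, w)   # all penalties pass
crash = j_pdm({"col": 0, "ra": 1, "mp": 1}, quality, w)   # collision detected
print(crash)   # 0.0 — a colliding proposal is zeroed out regardless of quality
```

The multiplicative gate is why a proposal violating any hard constraint can never be preferred over a constraint-satisfying one, no matter how comfortable or fast it is.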
Fig. 4: PDM-Closed's [10] static proposal paths become blocked by obstacles with no recovery. RAD topologically replans at every timestep and augments the route with adjacent-lane centerlines.

Consequently, the set of available trajectory proposals Π_t is dynamically updated to reflect the instantaneous topology,

  Π_t = ⋃_{γ ∈ Γ_t} ⋃_{o ∈ O} IDM(s_t, γ, o, v_0).   (9)

Here, O denotes discrete lateral offsets and v_0 is the IDM reference velocity. This ensures the optimization horizon always extends from the vehicle's actual current pose, allowing for robust recovery if the vehicle is forced off the nominal path.

Lane-Changing via Topology Augmentation. To enable lane changes, RAD augments the road topology with adjacent-lane centerlines. While PDM-Closed [10] considers a single route-based centerline Γ_ego, RAD expands this to include spatially adjacent centerlines Γ_adj, even those with opposing traffic flow,

  Γ_RAD = {Γ_ego} ∪ {Γ_left, Γ_right} ∪ Γ_opp.   (10)

The proposal set is then expanded to sample trajectories relative to all centerlines in this augmented set,

  Π_aug = ⋃_{γ ∈ Γ_RAD} ⋃_{o ∈ O} IDM(s_t, γ, o, v_0),   (11)

where O denotes the set of discrete lateral offsets. This allows the planner to sample from a richer family of trajectories, including feasible lane-change proposals.

Goal-Directed Optimization. RAD modifies the objective to encourage decisive progress toward the mission goal. Instead of relying solely on path-integrated distance, RAD computes a Euclidean distance-to-goal cost. Let p^π_T be the position of the ego vehicle at the end of planning horizon T for proposal π, and g be the global goal coordinates,

  J_goal(π) = ‖p^π_T − g‖₂.   (12)
The total cost function J_RAD linearly combines the baseline PDM cost with this goal-seeking term,

  J_RAD(π) = J_PDM(π) + w_goal J_goal(π).   (13)

This optimization encourages advancement in open regions and helps escape local minima induced by complex road geometries.

Trajectory Proposal Augmentation via Vocabulary. To diversify candidate trajectories beyond geometric centerlines, RAD incorporates all K proposals from a precomputed trajectory vocabulary. We construct a vocabulary V = {v_k}_{k=1}^K by clustering ego trajectories from the nuPlan training set, following prior works [27, 36, 43]. The final proposal set Π_RAD is the union of topologically augmented IDM [42] proposals and the data-driven vocabulary,

  Π_RAD = Π_aug ∪ {T_ego(v) | v ∈ V}.   (14)

This injects data-driven maneuver priors (e.g., swerves, bypasses) into the rule-based system.

Fig. 5: RAD combines goal-directed optimization with trajectory proposal augmentation, adding feasible alternative trajectories and favoring trajectories that make progress toward the goal.

Context-Aware Rule Relaxation. To handle deadlock situations, RAD introduces a soft rule relaxation mechanism. We define a relaxation indicator I_relax ∈ {0, 1}, active when blockage is detected. When active, RAD contextually downweights penalties for driving-direction and driving-area violations, allowing the optimizer to consider safe, short-term deviations from nominal traffic rules to circumvent obstacles.

3.3 Hybrid Planner Integration

While LAD and RAD are state-of-the-art methods in their respective classes, that is, among learned and rule-based planners, creating a hybrid planner requires careful co-design accounting for the downstream LQR controller and rules scorer.
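At a high level, this co-design amounts to letting the learned plan compete inside the rule-based selection loop. The sketch below is a toy stand-in: the trajectories, the lane-center scorer, and all names are hypothetical, chosen only to make the selection mechanics concrete.

```python
def hybrid_select(rule_proposals, learned_plan, lateral_offsets, scorer):
    """The learned plan plus laterally offset copies join the rule-based
    proposal set; a rules scorer picks the final trajectory."""
    candidates = list(rule_proposals)
    for o in lateral_offsets:
        candidates.append([(x, y + o) for (x, y) in learned_plan])
    return max(candidates, key=scorer)

# toy rules scorer: prefer trajectories ending nearest the lane center y = 0
scorer = lambda traj: -abs(traj[-1][1])
rule_props = [[(float(t), 2.0) for t in range(6)]]   # rule proposal stuck at y = 2
lad_plan = [(float(t), 0.5) for t in range(6)]       # learned plan near the center
best = hybrid_select(rule_props, lad_plan, [-0.5, 0.0, 0.5], scorer)
print(best[-1])   # (5.0, 0.0): an offset copy of the learned plan wins
```

Because every candidate, learned or rule-based, passes through the same scorer, the hybrid inherits the rule system's safety gating while still benefiting from the learned planner's proposals.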
Trajectory Refinement Head. Adding a trajectory refinement module provides LAD the ability to refine its trajectory for better rules alignment in hybrid integration. The classifier's argmax selects a coarse prototype v̂ ∈ ℝ^{T×2}, which is concatenated with the plan token embedding and passed through an MLP head r: ℝ^{d_ℓ+2T} → ℝ^{2T} to predict per-waypoint offsets,

  ṽ = v̂ + r([h_plan ⊕ vec(v̂)]).

Crucially, the final layer of r is zero-initialised [49] so that the trajectory refinement head acts as the identity at the start of training and does not interfere with the classifier's learning signal. The refinement loss is

  L_refine = (1/T) Σ_{t=1}^T ‖ṽ_t − v*_t‖₂.   (15)

Fig. 6: Scenario Understanding and Reasoning. (a) Cross-modal grounding: LAD can interpret pedestrians, lane geometry, and turn-safety from scene context. (b) Relational reasoning: LAD can infer spatial relations between agents and anticipate conflicts.
(c) Conditional reasoning: LAD can evaluate counterfactuals, apply traffic laws, and identify when rules may be safely relaxed.

Rules-Based Refinement. Following prior work [21, 38, 50] we adopt a cost-based refinement approach, expanding the vocabulary of the rule-based planner by adding the learning-based planner's plan as an additional proposal, and further applying offsets to the model outputs to augment the rule-based planner with additional candidate trajectories. All candidate trajectories are then scored using a rules scorer, which can be either PDM-Closed's [10] or RAD's.

4 Experimental Details

Datasets. We train and evaluate our methods on the nuPlan simulator and dataset [5], a closed-loop simulator grounded in real-world driving logs. We do not focus on NavSim [11], which does not perform true closed-loop evaluation, or CARLA [14], which lacks realistic logged driving data. We train on 1 million nuPlan scenarios, maintaining parity with prior work [8, 9, 39, 50]. To provide diverse reasoning supervision, we construct DrivingQA and PlanningQA (see Section E for more details), synthesizing QA pairs from InterDrive [6] behavior annotations using the Qwen2.5-32B [1] model and grounding trajectories with textual reasoning. This synthetic dataset generates instructions grounded in valid trajectories, bridging the gap between raw behavioral data and semantic reasoning.

Benchmarks. We focus our evaluation on two challenging benchmarks:

– nuPlan Test14-Hard [9], a subset of 14 scenario types specifically filtered to include difficult cases where the PDM-Closed baseline fails.

Table 1: Test14-Hard (Reactive) and InterPlan results. RAD rivals hybrid methods without learned components, while LAD sets a new state of the art among learned planners while offering textual reasoning ability.
Type     Planner                  Test14-Hard  InterPlan
Expert   Log Replay               85.96        —
Rule     IDM [42]                 62.26        31
         PDM-Closed [10]          75.19        42
         RAD (Ours)               80.53        72
Learned  PlanTF [9]               61.61        32
         PLUTO [8]                59.74        —
         DiffusionPlanner [50]    69.22        25
         FlowPlanner [39]         70.42        —
         LAD (Ours)               70.77        40
Hybrid   PLUTO [8]                76.88        49
         STR2-CKS-800m [38]       78.58        45
         STR2-CPKS-800m [38]      82.02        45
         DiffusionPlanner [50]    82.00        —
         FlowPlanner [39]         80.25        —
         RAD-LAD (Ours)           81.36        74

– InterPlan [18], a synthetic benchmark designed to test multi-agent interaction and deadlock resolution in driving.

Baselines. We compare against three categories:

– Rule-Based: We evaluate our planners against PDM-Closed [10] (the nuPlan challenge winner) and IDM.

– Learning-Based: We compare against PlanTF [9] (a pure Transformer approach), PLUTO [8] (hybrid scoring), STR2 [38] (a vision-centric raster-map-based approach), DiffusionPlanner [50], and FlowPlanner [39]. These represent the current state of the art in imitation-learning-based planners.

– Multimodal Language-Based Planners: No closed-loop language-based planners have been proposed for the nuPlan benchmark, so we instead benchmark latency against DriveVLM [41] and DriveGPT4 [44] to contextualize LAD's real-time performance.

Implementation Details. RAD: We implement all changes to PDM-Closed, such as topological replanning, within the tuplan_garage framework. LAD: We use Qwen3-0.6B [46] as the language backbone. To encode the map and agents and to decode agent predictions, we employ PlanTF [9]'s architecture. To achieve real-time performance with LAD, we implement a custom inference backend on a fork of nano-vllm [16] with KV-caching [31] and operator fusion [37]. All inference experiments are conducted on a single NVIDIA RTX A6000 to verify deployment feasibility.
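One implementation detail worth verifying numerically is the zero-initialisation of the trajectory refinement head (Section 3.3), which makes the head an exact identity before training. A minimal numpy sketch, with illustrative dimensions and an assumed ReLU MLP:

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, T = 32, 6                              # illustrative dimensions

W1 = rng.normal(0.0, 0.02, (d_l + 2 * T, 64))
W2 = np.zeros((64, 2 * T))                  # zero-init final layer [49]

def refine(h_plan, v_hat):
    """MLP head r predicting per-waypoint offsets on top of the selected
    prototype; zero-init makes it the identity at the start of training."""
    x = np.concatenate([h_plan, v_hat.ravel()])
    offsets = np.maximum(x @ W1, 0.0) @ W2  # offsets are exactly zero initially
    return v_hat + offsets.reshape(T, 2)

h_plan = rng.normal(size=d_l)
v_hat = rng.normal(size=(T, 2))
print(np.allclose(refine(h_plan, v_hat), v_hat))   # True: identity before training
```

Once gradients flow, W2 moves away from zero and the head learns residual corrections without ever having perturbed the classifier's early training signal.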
5 Experimental Results

We evaluate RAD, LAD, and their hybrid combination across nuPlan Test14 splits and the InterPlan benchmark.

5.1 Performance on nuPlan Long-Tail Scenarios

Table 1 summarizes performance across Test14-Hard and InterPlan. We first discuss Test14-Hard results. RAD achieves 80.53 in the reactive setting, outperforming PDM-Closed by 5.34 points while rivaling hybrid methods. This gap arises not from sophisticated reasoning but from RAD's ability to perform lane changes and replan dynamically; PDM-Closed does not include these capabilities. Critically, as shown in Table 9 of the supplementary material, RAD also outperforms PDM-Closed on the standard nuPlan split (Val14), demonstrating that RAD is not overfit to long-tail scenarios.

This finding has important implications for how we interpret benchmark difficulty. The Test14-Hard split filters for PDM-Closed failures [9], implying that many "hard" cases can be resolved once a planner can change lanes or recover from off-route situations. RAD nearly matches the strongest prior hybrid methods without any learned components. Among learning-based approaches, LAD achieves the strongest reactive performance, surpassing all existing methods [9, 39, 50].

5.2 Generalization to Synthetic Long-Tail Scenarios

To evaluate generalization to more diverse long-tail conditions, we turn to InterPlan [18], a benchmark specifically designed to stress-test planners on more realistic long-tail scenarios. InterPlan augments nuPlan scenarios with additional agents, obstacles, and alternative navigation goals, creating situations that require multi-agent coordination and deadlock resolution.

The InterPlan column of Table 1 reveals several insights. First, RAD dramatically outperforms all rule-based planners, nearly doubling the prior rule-based state of the art [10].
This confirms that the architectural improvements in RAD (lane changes, goal-directed optimization, rule relaxation) provide broad benefits for long-tailed situations.

Second, LAD improves performance when compared to other learned closed-loop planners. This demonstrates that real-time language-based reasoning can meaningfully improve planning in interactive scenarios.

Our hybrid planner, RAD-LAD, reaches 74, outperforming several other hybrid approaches. The complementary nature of rule- and learning-based planning is evident: rules handle robust maneuvering and deadlock resolution, while LAD contributes contextual reasoning for situations where rules alone are insufficient.

5.3 Ablations

Table 2 ablates the incremental contribution of each LAD component on Test14-Hard (Reactive), starting from the PlanTF [9] baseline. For RAD ablations, please see Section C.

Table 2: LAD component ablations on Test14-Hard (Reactive). Each component contributes incrementally, with language supervision through DrivingQA and PlanningQA providing complementary gains on top of input and architectural improvements. Textual reasoning acts as a useful inductive bias for trajectory prediction.

Component            Test14-Hard (R)
PlanTF               61.61
+ 128 Objects        59.73
+ Static Objects     68.49
+ Plan Token         68.90
+ LLM/DrivingQA      69.75
+ PlanningQA (LAD)   70.77

The original PlanTF baseline considers only 32 dynamic agents; increasing this to 128 does not improve performance, likely because the additional distant agents introduce noise without providing useful planning signal. Adding static objects (e.g., barriers, cones) confirms that static scene context is critical for safe maneuvering, making PlanTF [9] competitive with DiffusionPlanner [50], PLUTO [8], and FlowPlanner [39].
This is consistent with recent findings that, for planning and control, architecture and input quality [9] dominate over the choice of training objective [30].

Introducing the plan token (Section 3) validates that single-step classification is effective for extracting waypoints from the hidden state. Incorporating DrivingQA with our multimodal large language model further improves performance by 0.85 points, indicating that diverse language supervision provides useful inductive bias for trajectory prediction even when the QA content is not directly conditioned on the planning output. Finally, adding PlanningQA with ego behavior text (scenario type, meta-action) that is temporally aligned with the ground-truth trajectory yields the final LAD model at 70.77, demonstrating that action-aligned textual supervision provides complementary learning signal for trajectory prediction.

5.4 Real-Time Performance

A persistent concern with closed-loop language-based planners is latency: prior literature has widely regarded these planners as too slow for closed-loop deployment [7, 22, 24, 41]. Some prior works sidestep this failure mode entirely by using language models only as offline advisors [7, 35]. We show this trade-off may not be necessary, and Table 3 shows that this limitation is not fundamental. In contrast, LAD operates at 20 Hz (43 ms) without reasoning and maintains approximately 10 Hz operation (102 ms) with 10 output tokens, which is sufficient for real-time justifications. Note that without reasoning LAD runs faster than DiffusionPlanner [50] and FlowPlanner [39], which produce no textual output at all. Our current implementation leaves significant room for further acceleration through orthogonal optimizations such as improved quantization strategies [47], suggesting that the gains reported here represent a lower bound.
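The interruptible decoding pattern behind these latency numbers (Section 3.1) can be sketched as a budgeted loop: reasoning tokens are generated only while time remains, and the plan token is always forced at the end. The model below is a toy stand-in with hypothetical method names, since the real backend wraps an LM serving stack.

```python
import time

PLAN_TOKEN = "<|plan|>"

def interruptible_plan(model, prompt_tokens, budget_ms, max_reason_tokens=10):
    """Budgeted decoding: a valid plan is available after one prefill;
    reasoning tokens are emitted only while the latency budget permits."""
    start = time.monotonic()
    state = model.prefill(prompt_tokens)      # single forward (prefill) pass
    reasoning = []
    while len(reasoning) < max_reason_tokens:
        if (time.monotonic() - start) * 1000.0 >= budget_ms:
            break                             # budget exhausted: stop reasoning
        reasoning.append(model.decode_step(state))
    state = model.insert(state, PLAN_TOKEN)   # force the plan token
    return model.plan_head(state), reasoning

class ToyModel:
    """Stand-in for the planner LM; real components are not reproduced here."""
    def prefill(self, toks): return list(toks)
    def decode_step(self, state):
        state.append("tok"); return "tok"
    def insert(self, state, tok):
        state.append(tok); return state
    def plan_head(self, state):
        return [(float(t), 0.0) for t in range(6)]  # dummy 6-waypoint plan

plan, reasoning = interruptible_plan(ToyModel(), ["scene"], budget_ms=0.0)
print(len(plan), len(reasoning))  # 6 0 — a valid plan with zero reasoning tokens
```

With a zero budget the loop is skipped entirely and a plan still comes out of the prefill, which is exactly the property that keeps such a planner compatible with a safety monitor demanding an immediate action.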
5.5 Qualitative Analysis

Beyond quantitative metrics, LAD's multimodal phased training (Section 3.1) enables it to reason conversationally about driving scenarios. Figure 6 illustrates three capabilities essential for a language-based planner.

Table 3: LAD achieves real-time language-based planning. Without reasoning, LAD runs at 43 ms (~20 Hz), comparable to recent closed-loop planners. With reasoning enabled (10 tokens), LAD operates at ~10 Hz, demonstrating that language-based planning need not sacrifice latency for textual reasoning.

Model                    Reasoning   Latency (ms)   Hardware
DriveVLM [41]            Yes         410            Orin X–2
DriveGPT4-V2-8B [44]     No          2500           —
DriveGPT4-V2-1.5B [44]   No          345            —
DriveGPT4-V2-0.5B [44]   No          124            —
PlanTF [9]               No          12             A6000
DiffusionPlanner [50]    No          50             A6000
FlowPlanner [39]         No          83             A6000
LAD                      No          43             A6000
LAD (10 tokens max)      Yes         102            A6000
LAD (40 tokens max)      Yes         222            A6000

In example (a), LAD demonstrates cross-modal situational grounding: it interprets lane topology, pedestrian motion, and turn geometry directly from the scene context, enabling it to judge turn safety and explain its reasoning. Example (b) highlights relational reasoning, where LAD identifies spatial relationships between agents (e.g., which vehicle is behind or approaching) and uses these relations to anticipate potential conflicts. Example (c) shows conditional and rule-aware reasoning: LAD evaluates counterfactuals, applies traffic laws such as pedestrian right-of-way, and understands when rules can be safely relaxed.

These capabilities translate directly into improved closed-loop behavior. In Appendix B.3, we present additional visual comparisons showing LAD navigating complex intersections, roundabouts, and blocked-lane scenarios while articulating its high-level intent.
In each case, the generated reasoning remains consistent with the scene layout and executed trajectory, and LAD outperforms PlanTF in scenarios requiring adaptive decision-making. We also provide video demonstrations for the interested reader.

6 Conclusion

We presented two complementary approaches for autonomous driving planning that address different aspects of real-world complexity. RAD demonstrates that carefully designed rule-based planners remain highly competitive when equipped with richer topology exploration and goal-directed optimization, substantially improving geometric maneuvering capabilities while retaining the reliability and interpretability of structured planning.

In parallel, LAD introduces the first real-time language-action planner for closed-loop driving. Through interruptible inference, LAD produces valid trajectories in a single forward pass while optionally generating language-based reasoning when compute permits, enabling semantic understanding of ambiguous traffic without sacrificing the responsiveness required for safety-critical systems.

Together, these methods illustrate a practical path toward combining structured planning and foundation models in autonomous driving. Our results with RAD-LAD suggest that rules and language-grounded learning offer complementary capabilities, yielding systems that are both robust in routine driving and adaptable to the long tail of real-world scenarios.

References

1. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint (2025)
2. Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y., et al.: π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)
3.
Black, K., Galliker, M.Y., Levine, S.: Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339 (2025)
4. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
5. Caesar, H., Kabzan, J., Tan, K.S., Fong, W.K., Wolff, E., Lang, A., Fletcher, L., Beijbom, O., Omari, S.: nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810 (2021)
6. Chang, W.J., Zhan, W., Tomizuka, M., Chandraker, M., Pittaluga, F.: LangTraj: Diffusion model and dataset for language-conditioned trajectory simulation. In: ICCV (2025)
7. Chen, Y., Ding, Z.h., Wang, Z., Wang, Y., Zhang, L., Liu, S.: Asynchronous large language model enhanced planner for autonomous driving. In: ECCV (2024)
8. Cheng, J., Chen, Y., Chen, Q.: PLUTO: Pushing the limit of imitation learning-based planning for autonomous driving. arXiv preprint arXiv:2404.14327 (2024)
9. Cheng, J., Chen, Y., Mei, X., Yang, B., Li, B., Liu, M.: Rethinking imitation-based planners for autonomous driving. In: ICRA (2024)
10. Dauner, D., Hallgarten, M., Geiger, A., Chitta, K.: Parting with misconceptions about learning-based vehicle motion planning. In: CoRL (2023)
11. Dauner, D., Hallgarten, M., Li, T., Weng, X., Huang, Z., Yang, Z., Li, H., Gilitschenski, I., Ivanovic, B., Pavone, M., et al.: NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. NeurIPS (2024)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
13.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
14. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. In: CoRL (2017)
15. Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: PaLM-E: An embodied multimodal language model. In: ICML (2023)
16. GeeeekExplorer: nano-vLLM: A lightweight vLLM implementation built from scratch. https://github.com/GeeeekExplorer/nano-vllm (2025)
17. Ghosh, A., Zheng, S., Tamburo, R., Vuong, K., Alvarez-Padilla, J.R., Zhu, H., Cardei, M., Dunn, N., Mertz, C., Narasimhan, S.G.: ROADWork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. In: ICCV (2025)
18. Hallgarten, M., Stoll, M., Zell, A.: Can vehicle motion planning generalize to realistic long-tail scenarios? In: IROS (2024)
19. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
20. Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: CVPR (2023)
21. Huang, Z., Liu, H., Lv, C.: GameFormer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. In: ICCV (2023)
22. Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Covington, P., Sapp, B., et al.: EMMA: End-to-end multimodal model for autonomous driving. Transactions on Machine Learning Research (2024)
23.
Jaeger, B., Dauner, D., Beißwenger, J., Gerstenecker, S., Chitta, K., Geiger, A.: CaRL: Learning scalable planning policies with simple rewards. In: CoRL (2025)
24. Jiang, S., Huang, Z., Qian, K., Luo, Z., Zhu, T., Zhong, Y., Tang, Y., Kong, M., Wang, Y., Jiao, S., et al.: A survey on vision-language-action models for autonomous driving. arXiv preprint arXiv:2506.24044 (2025)
25. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: SOSP (2023)
26. Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
27. Li, Z., Li, K., Wang, S., Lan, S., Yu, Z., Ji, Y., Li, Z., Zhu, Z., Kautz, J., Wu, Z., et al.: Hydra-MDP: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978 (2024)
28. Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., Alvarez, J.M.: Is ego status all you need for open-loop end-to-end autonomous driving? In: CVPR (2024)
29. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS (2023)
30. Pan, C., Anantharaman, G., Huang, N.C., Jin, C., Pfrommer, D., Yuan, C., Permenter, F., Qu, G., Boffi, N., Shi, G., et al.: Much ado about noising: Dispelling the myths of generative robotic control. arXiv preprint arXiv:2512.01809 (2025)
31. Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., Dean, J.: Efficiently scaling transformer inference. MLSys (2023)
32. Prakash, A., Chitta, K., Geiger, A.: Multi-modal fusion transformer for end-to-end autonomous driving. In: CVPR (2021)
33.
Que, H., Liu, J., Zhang, G., Zhang, C., Qu, X., Ma, Y., Duan, F., Bai, Z., Wang, J., Zhang, Y., et al.: D-CPT law: Domain-specific continual pre-training scaling law for large language models. NeurIPS (2024)
34. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. OpenAI Technical Report (2018)
35. Sharan, S., Pittaluga, F., Chandraker, M., et al.: LLM-Assist: Enhancing closed-loop planning with language-based reasoning. arXiv preprint arXiv:2401.00125 (2023)
36. Shi, S., Jiang, L., Dai, D., Schiele, B.: Motion transformer with global intention localization and local movement refinement. NeurIPS (2022)
37. Spector, B., Juravsky, J., Sul, S., Dugan, O., Lim, D., Fu, D., Arora, S., Ré, C.: "Look ma, no bubbles! Designing a low-latency megakernel for Llama-1B". Blog post, Hazy Research, Stanford (May 2025), https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles
38. Sun, Q., Wang, H., Zhan, J., Nie, F., Wen, X., Xu, L., Zhan, K., Jia, P., Lang, X., Zhao, H.: Generalizing motion planners with mixture of experts for autonomous driving. In: ICRA (2025)
39. Tan, T., Zheng, Y., Liang, R., Wang, Z., Zheng, K., Zheng, J., Li, J., Zhan, X., Liu, J.: Flow matching-based autonomous driving planning with advanced interactive behavior modeling. In: NeurIPS (2025)
40. Tian, T., Li, B., Weng, X., Chen, Y., Schmerling, E., Wang, Y., Ivanovic, B., Pavone, M.: Tokenize the world into object-level knowledge to address long-tail events in autonomous driving. In: CoRL (2024)
41. Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: DriveVLM: The convergence of autonomous driving and large vision-language models. In: CoRL (2024)
42.
Treiber, M., Hennecke, A., Helbing, D.: Congested traffic states in empirical observations and microscopic simulations. Physical Review E (2000)
43. Wu, W., Feng, X., Gao, Z., Kan, Y.: SMART: Scalable multi-agent real-time motion generation via next-token prediction. NeurIPS (2024)
44. Xu, Z., Bai, Y., Zhang, Y., Li, Z., Xia, F., Wong, K.Y.K., Wang, J., Zhao, H.: DriveGPT4-V2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. In: CVPR (2025)
45. Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.Y.K., Li, Z., Zhao, H.: DriveGPT4: Interpretable end-to-end autonomous driving via large language model. Robotics and Automation Letters (2024)
46. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
47. Zandieh, A., Daliri, M., Hadian, M., Mirrokni, V.: TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874 (2025)
48. Zhai, J.T., Feng, Z., Du, J., Mao, Y., Liu, J.J., Tan, Z., Zhang, Y., Ye, X., Wang, J.: Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes. arXiv preprint arXiv:2305.10430 (2023)
49. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV (2023)
50. Zheng, Y., Liang, R., Zheng, K., Zheng, J., Mao, L., Li, J., Gu, W., Ai, R., Li, S.E., Zhan, X., et al.: Diffusion-based planning for autonomous driving with flexible guidance. In: ICLR (2025)
51. Zhou, Z., Cai, T., Zhao, S.Z., Zhang, Y., Huang, Z., Zhou, B., Ma, J.: AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. NeurIPS (2025)
52.
Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: CoRL (2023)

A Design Rationale and Extended Discussion

Architecture & Design Philosophy

Q: How does LAD's interruptible architecture enable reasoning before planning?
A: Prior language-based planners couple action generation to full autoregressive text generation, creating a latency-quality tradeoff. LAD's interruptible architecture avoids this: reasoning tokens precede the plan token, allowing the model to "think before acting" within a single forward pass. Our ablations show that adding reasoning tokens improves planning quality, indicating that the plan token attends to its textual reasoning chain. This design naturally extends to richer test-time reasoning strategies as future work, requiring no architectural changes.

Q: How is the reasoning budget controlled at inference?
A: In the current evaluation, the token count is a fixed hyperparameter. The key property is that the plan token always produces a valid trajectory regardless of how many reasoning tokens precede it. This makes the system directly compatible with external safety monitors that may demand immediate re-planning at any moment: the plan is never "incomplete."

Q: Why is a 0.6B LLM backbone sufficient, and what are the scaling prospects?
A: Qwen3-0.6B was chosen deliberately to satisfy the real-time latency constraint while still demonstrating a core hypothesis: that language supervision improves planning. The ablation in Table 2 confirms that language supervision consistently improves planning quality even at this scale.
Importantly, the LAD architecture (plan token, interruptible inference, adapters) is model-size agnostic; scaling to larger backbones requires no architectural changes, making the study of scaling behavior a natural direction for future work.

Context & Positioning

Q: How does LAD relate to concurrent VLA/VLM planners?
A: These methods [22, 41, 44, 51] target a different operating regime: they process camera images, use larger backbones, and are primarily evaluated on vision-centric benchmarks or synthetic simulators. LAD addresses a complementary setting, real-time language-action planning on nuPlan, the standard closed-loop benchmark. Accordingly, LAD is compared against the best-performing methods on the nuPlan benchmark [8, 9, 39, 50].

Q: Why does LAD use vectorized inputs rather than camera or LiDAR inputs?
A: End-to-end VLA/VLM planners [22, 41, 44] study a different problem, jointly learning perception and planning from raw sensor data, and typically rely on open-loop evaluation, which does not reliably predict closed-loop performance [28, 48]. Our work addresses a complementary question: whether language supervision improves the planning component itself. We evaluate in nuPlan's realistic closed-loop simulator, the standard setting for state-of-the-art planners [8, 9, 39, 50].

Fig. 7: LAD demonstrates strong spatio-temporal understanding when navigating a complex left turn. In each case, it correctly identifies nearby pedestrians, crosswalks, and turning geometry, and produces a safe, smooth trajectory while articulating its high-level intent (e.g., "turning left," "near pedestrian on crosswalk"). LAD's textual explanations remain consistent with the scene layout and the executed motion plan.
A key advantage of this setting is scalable language supervision: closed-loop simulators with logged data enable straightforward construction of trajectory-aligned QA pairs, as we demonstrate with DrivingQA and PlanningQA, leveraging behavior annotations [6] as one source of grounding.

More broadly, we believe deployed planners will likely consume varying input configurations depending on available sensors and infrastructure, from vision and LiDAR alone to full-stack inputs including HD maps and tracked agents. Recent work in robotics [2, 15, 52] suggests that training across diverse modality combinations yields complementary gains. Language supervision from log-replay simulators like nuPlan offers a particularly portable training signal, as it can be synthesized regardless of the underlying sensor modality. LAD's modality-agnostic adapter pattern supports this direction: augmenting the encoder with, e.g., a vision backbone is feasible with few changes.

Empirical Insights

Q: What is the key insight behind RAD's design?
A: RAD contributes both a concrete algorithmic improvement and the empirical finding it enables: these extensions close most of the gap between PDM-Closed and human performance on Test14-Hard, demonstrating that existing long-tail benchmarks predominantly capture capabilities absent from the baseline planner rather than intrinsic scenario difficulty. The deliberate simplicity of RAD is what makes this insight useful: because the fix is simple, the large performance gain can be attributed directly to the new capabilities rather than to added model capacity or data.

Q: PDM-Closed already scores 15 proposals at every timestep. Why does it still need dynamic replanning?
A: PDM-Closed is the strongest existing rule-based planner in the literature, but it was designed with a fixed topology.
Despite scoring 15 candidates, PDM-Closed anchors all of them to proposal paths fixed at initialization, and there is no topological replanning after the first timestep. If the ego drifts or paths become blocked, the underlying topology is never updated. The official implementation reflects this design choice:

    # L98 abstract_pdm_closed_planner.py
    # TODO: Find additional conditions to trigger re-planning
    create_new_proposals = self._iteration == 0

As a result, PDM-Closed does not support lane changes or off-route recovery. RAD addresses exactly this: it performs full topological replanning at every timestep, regenerating proposal paths from the current ego state and augmenting them with adjacent-lane centerlines.

Q: How do RAD and LAD complement each other?
A: The two planners address different axes of difficulty. Many Test14-Hard scenarios are geometric in character, which RAD already handles well, so the hybrid gain on this split is modest. On InterPlan, which tests multi-agent interaction, the improvement from adding LAD is more pronounced, indicating that the language component contributes most in semantically complex scenarios. Beyond numeric scores, LAD provides interpretable reasoning for every decision, a critical capability for deployment that pure rule-based systems cannot offer.

Q: Do the improvements generalize beyond long-tail benchmarks?
A: Yes. As shown in Table 9, RAD marginally improves over PDM-Closed on Val14, confirming that lane-change capability and dynamic replanning do not harm general driving. Among imitation-based learned planners, LAD outperforms PlanTF on Val14 as well. Notably, RAD's trajectory vocabulary is drawn from the full nuPlan training set, so its strong Test14-Hard performance does not reflect overfitting to rare maneuvers.
The human-gap analysis in Table 11 reinforces this: both planners remain close to human performance on Val14, yet PDM-Closed's gap on Test14-Hard is roughly twice RAD's (10.77 vs. 5.43 points). This indicates that most of the "difficulty" captured by Test14-Hard reflects capabilities outside PDM-Closed's design scope, while RAD's improvements generalize across both standard and long-tailed conditions.

Reproducibility

Q: What is the latency measurement methodology?
A: Prior work has noted that LLM-based planners face "significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment" [7]. All closed-loop planners in Table 3 (PlanTF, DiffusionPlanner, FlowPlanner, and LAD) are measured on the same A6000 hardware under identical evaluation conditions. Our results show that this limitation is not fundamental: LAD is fast enough for closed-loop deployment (10 Hz with reasoning, 20 Hz without), and among closed-loop planners evaluated on this hardware, LAD is competitive while being the only method that also produces interpretable reasoning. DriveVLM's reported Orin latency is included only for reference to concurrent language-based planners.

Fig. 8: LAD handles diverse road configurations by accurately understanding lane topology, control rules, and surrounding agents. Its generated reasoning reflects this situational awareness (e.g., "on stopline stop sign," "crossing an intersection," "following lane without lead"), and its trajectory choices align with these high-level descriptions.

Q: Will code, models, and data be released?
A: Yes. We will release code and model weights.
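The interruptible-inference contract discussed in Appendix A (reasoning tokens may be cut short, but a valid plan is always produced) can be sketched as follows; the model interface and stub below are hypothetical illustrations, not LAD's actual decoder or planning head.

```python
def interruptible_plan(model, context, token_budget, interrupted):
    """Emit reasoning tokens while compute permits, then decode the plan.
    The plan head yields a valid trajectory however much (or little)
    reasoning precedes it, so an interrupt never leaves the system
    without a plan."""
    reasoning = []
    for _ in range(token_budget):
        if interrupted():                  # e.g., a safety monitor demands a plan now
            break
        token = model.next_token(context + reasoning)
        if token == "<|plan|>":            # model decided it has reasoned enough
            break
        reasoning.append(token)
    return reasoning, model.plan(context + reasoning)

# Stub model illustrating the contract: the plan is always produced.
class StubModel:
    def next_token(self, tokens):
        return "reasoning_token"
    def plan(self, tokens):
        return [(0.0, 0.0), (1.0, 0.0)]    # always a valid trajectory

# Interrupted immediately: zero reasoning tokens, but still a plan.
reasoning, plan = interruptible_plan(StubModel(), [], 3, interrupted=lambda: True)
assert reasoning == [] and plan == [(0.0, 0.0), (1.0, 0.0)]
```

Under this contract, the reasoning budget (10 or 40 tokens in Table 3) only trades latency for explanation length, never plan validity.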
B LAD: Language Based Autonomous Driving

B.1 Structured Reasoning for Inference

To satisfy strict latency constraints while retaining interpretability, LAD uses structured reasoning templates that restrict generation to task-relevant tokens. Instead of open-ended text generation for the planning reasoning, we use templates with designated fill-in fields: Ego is {scenario_type} and is {meta_action}<|plan|>. The model generates only the tokens within curly braces (e.g., meta-actions like following_lane, turning_left). Note that the DrivingQA dataset is open-ended and is part of the data mix during training.

Training Augmentation. During training, we randomly truncate reasoning mid-generation before the <|plan|> token. This augmentation ensures the model learns to produce valid plans regardless of how much reasoning context is available. We observed no drop in planning performance from this truncation strategy.

Multi-Turn Capability. While LAD is trained exclusively on single-turn QA pairs from the Interaction-QA dataset, it retains the multi-turn conversational ability of its base language model (Qwen3-0.6B). The examples in Figure 6 demonstrate this emergent capability at inference time.

Fig. 9: LAD produces safe and consistent trajectories through this complex roundabout while simultaneously articulating its high-level intent (e.g., "ego vehicle is traversing pickup-dropoff and is turning left"). In contrast, PlanTF [9] frequently hesitates or commits to suboptimal maneuvers. LAD's textual reasoning aligns with its chosen motion plan, providing interpretable justification for each action.

B.2 Latency Measurement Methodology

Our latency measurements reported in Table 3 represent end-to-end inference time, measured from receiving vectorized inputs (map elements, agent states) to extracting the final waypoint trajectory.
This includes the PlanTF encoder forward pass, MLP adapter projection, language model prefill (and optional autoregressive decoding for reasoning tokens), planning head forward pass, and trajectory vocabulary lookup. Measurements are averaged over 1000 warm-start inference calls. The reported times do not include data loading or raw sensor pre-processing.

B.3 More Visual Comparisons

We present a few scenarios where LAD shows good spatio-temporal understanding of the map and surrounding agents while navigating (see Figure 7 and Figure 8). We also present visual comparisons between LAD and PlanTF [9] in Figure 9 and Figure 10, and show further visual results in our associated supplementary video.

Fig. 10: LAD correctly infers from the scene that the ego lane is blocked by static objects. It then selects the farther unblocked lane and executes a safe maneuver through the intersection, identifying relevant dynamic objects ("long vehicle"). In contrast, PlanTF [9] continues to follow the blocked lane or hesitates, failing to account for the static obstacles.

C RAD Ablation Studies

In this section, we provide ablation studies of the various algorithmic novelties relative to PDM-Closed. The effectiveness of these components is validated on nuPlan Test14 and InterPlan subsets that do not overlap with the benchmarking subsets.

C.1 Ablation Results

Dynamic Topology Replanning. We validate the impact of replanning in Table 4. Enabling dynamic replanning provides a noticeable performance boost, improving the reactive score on the Test14-Random-Clean split.

Table 4: RAD Ablation: Dynamic Replanning.

Replan   Test14-Sub (R)
×        92.90
✓        93.92

Lane-Change Capability. As demonstrated in Table 5, explicitly modeling adjacent centerlines improves the planner's ability to navigate complex traffic.

Table 5: RAD Ablation: Adjacent Centerlines.

Adj. Centerlines   Test14-Sub (R)
×                  94.40
✓                  94.51

Aggressive Goal-Directed Optimization. Table 6 shows the impact of the goal-directed cost term. We evaluate this component on the InterPlan benchmark, which contains challenging scenarios requiring assertive navigation to avoid deadlocks. In these settings, reliance on standard progress metrics can lead to passivity; the goal-directed term is crucial for driving the vehicle through complex interactions.

Table 6: RAD Ablation: Goal-Directed Optimization.

Goal Opt.   InterPlan-Sub
×           84.31
✓           90.22

Trajectory Proposal Augmentation. Table 7 highlights the effectiveness of this augmentation.

Table 7: RAD Ablation: Vocabulary Augmentation.

Vocab Aug.   Test14-Sub (R)
×            92.75
✓            94.51

Context-Aware Rule Relaxation. This context-aware flexibility is particularly critical for the challenging negotiation scenarios found in the InterPlan dataset, as seen in Table 8. We utilize InterPlan for this ablation because its high density of dynamic agents and potential blockages necessitates deviations from strict lane-following rules, capabilities that are less critical in standard open-road driving but essential for solving these corner cases.

Table 8: RAD Ablation: Rule Relaxation.

Rule Relaxation   InterPlan-Sub
×                 84.31
✓                 92.99

Table 9: Val14 (Reactive) results. RAD marginally improves over PDM-Closed, confirming that dynamic replanning and lane-change capability do not degrade nominal driving quality. LAD outperforms PlanTF among learned planners, demonstrating that language supervision preserves normal driving performance.
Type      Planner                 Val14 (R)
Expert    Log Replay              93.68
Rule      IDM [42]                79.31
          PDM-Closed [10]         92.12
          RAD (Ours)              92.31
Learned   PlanTF [9]              77.07
          PLUTO [8]               80.01
          DiffusionPlanner [50]   82.80
          FlowPlanner [39]        83.31
          LAD (Ours)              78.40
Hybrid    PLUTO [8]               87.00
          STR2-CKS-800m [38]      92.12
          DiffusionPlanner [50]   92.90
          FlowPlanner [39]        92.38
          RAD-LAD (Ours)          92.35

D Additional Results

D.1 Results on Regular Driving Situations

Table 9 reports performance on nuPlan Val14 [5], the standard split representative of normal driving situations. RAD marginally improves over PDM-Closed, confirming that dynamic replanning and lane-change capability do not degrade nominal driving quality. Among imitation-based learned planners, LAD outperforms PlanTF, demonstrating that language supervision does not harm normal driving performance.

D.2 Impact of Downstream Controller

Prior work [9] suggested the existence of a hidden imitation gap arising from the discrepancy between a trajectory-based planner and its downstream controller. The expert trajectory serves as the ground truth during training of the imitation-based planner, but the predicted trajectory is then processed by a downstream controller (LQR, iterative LQR, or another algorithm) and the underlying system dynamics, neither of which is considered during training. Thus, during rollout in closed-loop evaluation, this discrepancy may degrade planning performance because predictions do not account for what the downstream controller can actually actuate or track.

Table 10: Impact of Downstream Controller. Replacing the default LQR controller with iLQR improves LAD's score by 4.07, consistent with the hidden imitation gap arising from not accounting for the downstream controller [9]. Thus, controller-awareness is an important consideration for trajectory-based planners.

Planner   Controller   Val14 (R)
LAD       LQR          78.40
LAD       iLQR         82.47
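To make the hidden imitation gap concrete, consider a toy 1-D rollout in which a simple proportional controller (a deliberately simplified stand-in for nuPlan's LQR tracker, purely for illustration) tracks the planner's waypoints: the executed motion lags the plan, and imitation training on expert trajectories alone never sees this lag.

```python
def rollout(waypoints, gain=0.5, dt=0.1):
    """Track planned 1-D waypoints with a proportional controller.
    The executed path is whatever the controller and dynamics realize,
    not the planned waypoints themselves."""
    x, v, executed = 0.0, 0.0, []
    for target in waypoints:
        v += gain * (target - x)   # controller command from tracking error
        x += v * dt                # vehicle dynamics integrate the command
        executed.append(x)
    return executed

plan = [1.0, 2.0, 3.0]
executed = rollout(plan)
# Every executed position trails its planned waypoint: the tracking
# discrepancy that a controller-unaware planner never accounts for.
assert all(e < p for e, p in zip(executed, plan))
```

A more accurate tracker (e.g., iLQR in Table 10) shrinks this discrepancy, which is consistent with the 4.07-point improvement reported above.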
In Table 10, we confirm that this hidden imitation gap impedes LAD's performance in normal driving situations. Changing the downstream controller from LQR to a more robust iterative LQR (iLQR) controller improves performance by 4.07 points, closing the gap between LAD and other state-of-the-art imitation methods [39, 50].

However, we note that these iLQR results depart from the standard nuPlan evaluation protocol¹, which couples a fixed LQR controller with the simulation. As our results show, this coupling is suboptimal: the LQR controller degrades the closed-loop performance of trajectory-based planners. Prior work [23] corroborates this finding, showing that even the human expert trajectory loses ~3 points under the default LQR controller compared to a more accurate iLQR controller. This suggests that nuPlan scores partially reflect controller quality rather than planning quality alone.

We concur with prior work [23] that simulators and benchmarks for autonomous driving should decouple control from simulation. The choice of controller is inherently tied to the planner's output representation, an observation that has long motivated end-to-end approaches to autonomous driving [4, 20, 32]. Bridging this gap is an orthogonal research direction: our work focuses on improving planning through language supervision. Addressing the control interface, whether through alternative output representations [23], learning the controller [9], smoother action execution strategies [3], or benchmark redesign, is a complementary but distinct line of work.

D.3 Does RAD overfit to long-tailed situations?

Table 11 compares the performance gap between planners and human expert driving on Val14 versus Test14-Hard. While both RAD and PDM-Closed achieve near-human performance on Val14, PDM-Closed exhibits a notably larger gap on the long-tailed split.
¹ https://nuplan-devkit.readthedocs.io/en/latest/competition.html

Table 11: Human-gap analysis across splits. Both RAD and PDM-Closed achieve near-human performance on Val14, but RAD maintains a significantly smaller gap on Test14-Hard, suggesting that RAD's lane-change and replanning capabilities generalize to long-tail scenarios without overfitting.

                   Val14               Test14-Hard
  Method           Score   Human Gap   Score   Human Gap
  PDM-Closed       93.20   0.80        75.19   10.77
  RAD              92.31   1.69        80.53   5.43
  Human (Expert)   94.00   –           85.96   –

Crucially, RAD's improvements stem from general-purpose driving capabilities, i.e., lane changes, dynamic topology replanning, and goal-directed optimization, rather than heuristics tailored to specific failure modes.

The asymmetry in human gaps is revealing. Both methods are near-human on Val14, so baseline planning quality is comparable. On Test14-Hard, however, PDM-Closed's gap widens to roughly 2× RAD's gap. Since RAD differs from PDM-Closed only in structural capabilities, this excess gap indicates that many Test14-Hard scenarios are "hard" not because of inherent complexity, but because PDM-Closed lacked certain structural capabilities.

Finally, RAD's Val14 score marginally improves over PDM-Closed, ruling out the overfitting hypothesis. If RAD were specialized to long-tail scenarios at the expense of normal driving, we would expect a regression on Val14. Instead, the consistent or improved performance across both splits confirms that RAD's gains are attributable to improvements in general planning capabilities.

E Datasets

E.1 DrivingQA Dataset

Training multimodal language models for autonomous driving requires grounded question-answering data that captures the nuanced dynamics of multi-agent interactions.
Existing driving QA datasets often focus on object recognition or simple scene descriptions, lacking structured annotations for ego-centric planning decisions and inter-agent relationships. To address this gap, we introduce DrivingQA, a synthetically generated instruction-tuning dataset built on top of nuPlan [5]. DrivingQA contains 1.2 million question-answer pairs spanning 3.3 million driving scenarios taken from 8,457 nuPlan scenes, with explicit annotations for ego vehicle plans, agent-of-interest (AOI) behaviors, and multi-agent interactions.

Data Sources. We leverage heuristic and human-annotated behavior annotations from InterDrive [6], which provide structured per-agent behavior labels (e.g., lane position, turn intent, speed state, intersection behavior) for nuPlan scenes (a scene is a 20-second driving log). We additionally employ GPT-4o-rephrased descriptions that introduce natural-language diversity for the same underlying behaviors. We associate each scene with one or more nuPlan scenario types (from a taxonomy of types such as starting_left_turn), along with precise temporal annotations indicating when each scenario occurs within the 20-second scene (min_time, max_time in seconds).

Generation Pipeline. For each scene with annotated interactions, we construct structured prompts for three subject categories: the ego vehicle, the agent of interest (the primary interacting agent), and other agents present in the scene. Each prompt includes the subject's behavior description, interaction context, and scenario type metadata. We employ Qwen2.5-32B [1] to generate 2 to 4 QA pairs per ego/agent-of-interest subject and 1 to 3 pairs for every other agent.
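The generation loop above can be sketched as follows. The helper names and prompt assembly are hypothetical (the model call itself is omitted), but the output schema {question, answer, answer_source, tags} mirrors Fig. 11, and answers tagged unknown are dropped before training as described in this section:

```python
import json

VALID_SOURCES = {"stated", "deduced"}  # "unknown" answers are filtered out

def build_prompt(subject, reasoning_text, scenario_types, n_min, n_max):
    """Assemble a per-subject user prompt (assumed format, see Fig. 11)."""
    return (
        f"Task: Produce {n_min} to {n_max} QA items for the {subject} subject.\n"
        f"Scenario type aggregates: {json.dumps(scenario_types)}\n"
        f'Input data (JSON): {json.dumps({"reasoning_text": reasoning_text})}'
    )

def parse_and_filter(raw_output):
    """Parse the model's JSON array and keep only grounded QA pairs."""
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    return [qa for qa in items
            if isinstance(qa, dict)
            and {"question", "answer", "answer_source", "tags"} <= qa.keys()
            and qa["answer_source"] in VALID_SOURCES]

# Simulated model output: one grounded pair, one "unknown" pair.
raw = json.dumps([
    {"question": "What maneuver is Agent <|ego|> planning?",
     "answer": "Go straight through the intersection.",
     "answer_source": "stated", "tags": ["planning"]},
    {"question": "Is Agent <|agent|> yielding?",
     "answer": "Unknown.", "answer_source": "unknown", "tags": ["interaction"]},
])
kept = parse_and_filter(raw)  # only the "stated" pair survives filtering
```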
To ensure entity-agnostic generalization, all agents are referenced using special tokens: <|ego|> for the ego vehicle, <|agent_of_interest|> for the primary interacting agent, and <|agent|> for all other agents. During training, we replace these tokens with the numerical IDs assigned to each agent. Each answer is tagged with a provenance label, either stated (directly from annotations), deduced (inferred from context), or unknown (insufficient evidence), which enables us to perform filtering before training.

Question-Answer Diversity. The system prompt encourages lexical and structural diversity, varying interrogatives (what/which/how/why/does/can), paraphrasing answers, and including negative/counterfactual questions (e.g., "Is Agent <|ego|> turning left?" when the ground truth is going straight). Questions span multiple categories including planning (ego maneuver intent), behavior (agent motion states), interaction (yielding, priority, relative positioning), and spatial (lane occupancy, intersection traversal).

E.2 PlanningQA Dataset

While DrivingQA (Section E.1) provides broad scene understanding through open-ended question-answer pairs, it lacks temporal grounding to the ego vehicle's immediate action. PlanningQA addresses this gap by pairing each training trajectory with a short textual description of the ego's behavior, temporally aligned with the ground-truth waypoints. This provides the model with action-conditioned language supervision: text that describes what the ego is doing when the plan is executed, rather than general scene-level information.

Scenario Type. Each nuPlan scenario carries a scenario type label drawn from nuPlan's official taxonomy of 75 types [5].
These types are algorithmically mined from driving logs via atomic event primitives (e.g., intersection entry, high lateral acceleration, dense traffic) and cover both frequent maneuvers (e.g., following_lane_with_lead, starting_left_turn, changing_lane) and rare long-tail events (e.g., near_miss, waiting_for_pedestrian_to_cross, traversing_pickup_dropoff). Because nuPlan natively provides temporal bounds (min_time, max_time) for each scenario within a scene, we can associate the correct scenario type, ensuring alignment between the label and the ground-truth trajectory.

System Directive
You generate high-quality instruction-tuning QA data for self-driving scenarios. Assume your role is of a Planner agent reasoning about the scene and the agents in it.
Entity References: In all questions and answers, use these exact references:
  – ego vehicle: ‘Agent <|ego|>’
  – primary interacting agent: ‘Agent <|agent_of_interest|>’
  – any other agent: ‘Agent <|agent|>’
Never use raw IDs. IDs belong only in metadata.
Scene Context: A scene is a 20-second driving segment. Scenarios are triggered by ego behavior at specific time instants. Each scenario has a type from the NuPlan taxonomy.
Grounding Rules: Only use facts in the provided JSON. If a detail is not stated and cannot be deduced, answer ‘Unknown’ with answer_source ‘unknown’. Do not refer to the input text in your answers.
Diversity: Vary interrogatives (what/which/how/why/does/can), paraphrase answers, include negative/counterfactual questions. Avoid repeating the same template.
Output: Valid JSON array with schema: {question, answer, answer_source, tags}.

User Prompt (per-subject)
Task: Produce 2 to 4 QA items for the Ego subject.
Additional Instructions: Ask about current state, scenario type, plan, interactions with agent of interest.
Scene token: {scene_token}
Scenario type aggregates: (context only, do NOT ask about min_time/max_time)
[{"scenario_type": "on_intersection", "agent_track_token": null, "min_time": 8.15, "max_time": 9.65}, {"scenario_type": "traversing_intersection", ...}]
Subject: ego, Type: VEHICLE
Has interaction: true
Input data (JSON): {"reasoning_text": "Is in middle lane, crossing an intersection, going straight.", "ego": {"other_agent_token": "<|nuplan_token|>"}, "has_interaction": true}

Output Schema
[{"question": "What maneuver is Agent <|ego|> planning?", "answer": "Go straight through the intersection.", "answer_source": "stated", "tags": ["planning", "intersection"]}, ...]

Fig. 11: DrivingQA generation prompt. Template used to synthesize trajectory-grounded QA pairs from driving logs, for LAD's scalable language supervision pipeline.

Meta-Action. In addition to the scenario type, we annotate each planning timestep with a meta-action label describing the ego vehicle's high-level behavioral intent. These labels are derived from the heuristic single-agent behavior annotations provided by InterDrive [6], which assigns structured per-agent labels (e.g., lane position, turn intent, speed state, intersection behavior) based on calibrated geometric and kinematic heuristics applied to the nuPlan driving logs. We extract and map the ego-relevant subset of these annotations to a compact set of meta-actions such as following_lane, turning_left, turning_right, stationary, and lane_change.

Template Format. During training, PlanningQA supervision is provided as a short structured sentence preceding the <|plan|> token.
Because the template is short and fixed-format, it adds minimal latency during inference while providing action-aligned textual supervision that is complementary to the open-ended DrivingQA pairs.
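As a concrete illustration, the supervision string and the entity-token binding might look as follows. The exact sentence format and both helper names are assumptions for illustration, not the paper's implementation; only the meta-action vocabulary and the special tokens are taken from the text above:

```python
# Hypothetical sketch of PlanningQA supervision assembly and entity binding.
META_ACTIONS = {"following_lane", "turning_left", "turning_right",
                "stationary", "lane_change"}

def planning_supervision(meta_action, scenario_type):
    """Short fixed-format sentence emitted just before the <|plan|> token
    (assumed wording)."""
    assert meta_action in META_ACTIONS
    return f"Ego is {meta_action.replace('_', ' ')} ({scenario_type}). <|plan|>"

def bind_agent_ids(text, id_map):
    """Replace entity-agnostic tokens with per-scene numerical IDs at
    training time, as described in Section E.1."""
    for token, agent_id in id_map.items():
        text = text.replace(token, str(agent_id))
    return text

s = planning_supervision("turning_left", "starting_left_turn")
bound = bind_agent_ids("Agent <|ego|> yields to Agent <|agent_of_interest|>.",
                       {"<|ego|>": 0, "<|agent_of_interest|>": 17})
```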
