Dual-Stage LLM Framework for Scenario-Centric Semantic Interpretation in Driving Assistance

Jean Douglas Carvalho, Hugo Taciro Kenji, Ahmad Mohammad Saber, Glaucia Melo, Max Mauro Dias Santos, and Deepa Kundur

Abstract—Advanced Driver Assistance Systems (ADAS) increasingly rely on learning-based perception, yet safety-relevant failures often arise without component malfunction, driven instead by partial observability and semantic ambiguity in how risk is interpreted and communicated. This paper presents a scenario-centric framework for reproducible auditing of LLM-based risk reasoning in urban driving contexts. Deterministic, temporally bounded scenario windows are constructed from multimodal driving data and evaluated under fixed prompt constraints and a closed numeric risk schema, ensuring structured and comparable outputs across models. Experiments on a curated near-people scenario set compare two text-only models and one multimodal model under identical inputs and prompts. Results reveal systematic inter-model divergence in severity assignment, high-risk escalation, evidence use, and causal attribution. Disagreement extends to the interpretation of vulnerable road user presence, indicating that variability often reflects intrinsic semantic indeterminacy rather than isolated model failure. These findings highlight the importance of scenario-centric auditing and explicit ambiguity management when integrating LLM-based reasoning into safety-aligned driver assistance systems.

Index Terms—Advanced Driver Assistance Systems, Large Language Models, Scenario-Based Evaluation, Semantic Risk Interpretation, Explainable, Multimodal Perception

I. INTRODUCTION

Advanced Driver Assistance Systems (ADAS) are increasingly deployed to enhance road safety by supporting drivers in perception, hazard anticipation, and decision-making.
Despite substantial progress in sensing technologies and learning-based perception, the reliability of ADAS remains fundamentally constrained in real-world urban environments, where occlusions, illumination variability, and dense interactions with vulnerable road users (VRUs) are common [1, 2, 3]. In such settings, safety risks frequently arise not from component failures but from limitations in how complex traffic situations are perceived, interpreted, and communicated to drivers.

From a safety engineering standpoint, these limitations are not fully addressed by traditional Functional Safety frameworks, such as ISO 26262, which primarily address hazards arising from system malfunctions [4]. Instead, many critical ADAS risks arise when systems operate as intended yet exhibit insufficient performance under ambiguous, uncertain, or unforeseen conditions. This class of fault-free hazardous behavior is explicitly addressed by the Safety of the Intended Functionality (SOTIF) framework standardized in ISO 21448, which highlights perception limitations, semantic ambiguity, and foreseeable misuse as primary safety concerns [5]. Consequently, understanding and managing semantic uncertainty in perception and interpretation has become a central challenge for safety-oriented ADAS validation.

Addressing this challenge increasingly requires moving beyond raw sensor streams toward structured, scenario-centric representations. Scenario abstraction enables the decomposition of complex driving interactions into discrete, comparable, and auditable units, thereby supporting systematic and reproducible evaluation at scale. Frameworks such as MetaScenario formalize this concept by organizing multimodal driving data into semantically indexed scenario representations, facilitating consistent comparison across heterogeneous conditions while preserving real-world complexity [6].
Importantly, such representations focus on scenario description and organization, remaining largely agnostic to how higher-level reasoning components interpret risk and intent within those scenarios.

Within this scenario-centric paradigm, recent advances in Large Language Models (LLMs), including multimodal variants, introduce new opportunities for high-level semantic interpretation of driving situations. Unlike conventional perception models that operate at the signal or object-detection level, LLMs can integrate heterogeneous contextual cues, such as object relations, ego-vehicle dynamics, infrastructure attributes, and environmental context, into structured, human-interpretable assessments of traffic situations [7, 8, 9]. In this work, LLMs are not treated as perception, control, or decision-making modules, but as interpretative and analytical components capable of reasoning over scenario-bounded contextual evidence [8, 10].

However, introducing LLMs into safety-critical driving contexts raises fundamental questions regarding reliability, consistency, and safety alignment. Different models may produce divergent semantic interpretations when exposed to identical traffic scenarios, particularly under partial or ambiguous perception, leading to variability in risk attribution and VRU assessment [11, 12]. Rather than representing isolated model errors, such divergences may reflect intrinsic semantic ambiguity in the available evidence. Without a structured and reproducible evaluation framework, these differences remain difficult to quantify, compare, or audit in a way that supports safety-oriented development.

Motivated by these challenges, this work investigates how LLM-based semantic reasoning behaves when exposed to identical real-world driving scenarios under deterministic, scenario-centric representations and constrained semantic outputs.
By grounding the analysis in reproducible, scalable scenario abstractions derived from multimodal driving data, the proposed framework enables systematic, large-scale comparison of inter-model differences in risk interpretation, evidence use, and sensitivity to vulnerable road users. In doing so, this study positions LLMs as interpretable cognitive probes for ADAS evaluation, providing a controlled methodology to expose semantic ambiguity and assess its implications for safety-aligned system design.

The remainder of this paper is organized as follows. Section II establishes the technical and conceptual foundations of the work, characterizing limitations in conventional ADAS pipelines and formalizing the problem of context-dependent perception. Section III describes the multimodal data platform and the preprocessing pipeline used to generate the normalized 1 Hz semantic substrate. Section IV details the system architecture, focusing on the deterministic scenario construction and the prompted LLM evaluation framework. Section V presents the experimental setup, including the collection protocol for safety-critical "near-people" scenarios and the language models evaluated. Section VI provides a quantitative and qualitative analysis of the experimental results, followed by the concluding remarks in Section VII.

II. BACKGROUND AND PROBLEM STATEMENT

This section outlines the technical foundations of the proposed framework and motivates scenario-centric auditing of semantic risk interpretation in ADAS.

ADAS rely fundamentally on accurate perception and timely interpretation of complex traffic scenes to support safe driving decisions [13, 14]. Modern perception pipelines predominantly employ camera- and radar-based sensing combined with computer vision and signal processing algorithms to detect objects, estimate motion, and infer driving context [15, 16].
Despite significant progress in deep learning–based perception, current ADAS architectures remain largely reactive and context-agnostic under rapidly changing environmental conditions such as low illumination, adverse weather, occlusions, or dense urban traffic [1, 2, 3].

Beyond raw perception performance, a core practical limitation is that many ADAS deployments provide limited transparency regarding how warnings are triggered and how risk is framed in ambiguous situations. This becomes particularly problematic in mixed traffic and partial observability conditions, where human drivers expect context-aware and interpretable feedback [17, 18]. As a result, the safety relevance of an ADAS output is not solely determined by whether objects are detected, but also by how risk is semantically interpreted and communicated under uncertainty.

A. Limitations of Conventional ADAS Pipelines

Most existing ADAS pipelines decouple perception, sensor control, and decision-making into independent modules [19]. Vision models such as YOLO or semantic segmentation networks operate on raw sensor inputs with limited awareness of sensor configuration or environmental semantics [1, 2, 20]. Sensor calibration, when present, is typically performed offline or through predefined rule-based logic, which constrains adaptability and generalization across diverse conditions.

From an engineering perspective, this modular decoupling yields two persistent gaps. First, downstream risk reasoning (whether implemented as rules, heuristics, or learned modules) may inherit perceptual uncertainty without a systematic mechanism to expose how ambiguity propagates into safety-relevant interpretations. Second, explainability often remains limited: the system may trigger warnings or interventions without conveying the underlying rationale in a form that supports user trust and post hoc analysis [17, 18].
Formally, conventional approaches optimize perception models for detection accuracy under fixed assumptions:

    \min_{\phi} \; \mathbb{E}\left[ \mathcal{L}\left( f_{\phi}(y_t), x_t \right) \right]    (1)

where f_ϕ denotes a perception model with parameters ϕ. Notably, this formulation treats sensor behavior as exogenous, omitting adaptive control over θ_s. While sensor adaptation and perception robustness are active research topics, they are orthogonal to the present work: this paper does not propose sensor adaptation or vehicle control mechanisms. Instead, it focuses on the subsequent layer, namely how higher-level semantic risk interpretation can diverge even when the input evidence is held fixed.

To address structural limitations in scenario handling, scenario-centric representations have been proposed in the literature, decomposing complex driving situations into discrete, comparable, and auditable units. Frameworks such as MetaScenario [6] formalize scenarios as first-class entities, enabling systematic description, indexing, and retrieval of driving situations across datasets and experiments. While such representations significantly improve reproducibility and comparability at the scenario level, they remain largely agnostic to how higher-level reasoning components interpret risk, intent, and causality within those scenarios. Consequently, the use of scenario-centric abstractions to explicitly expose semantic ambiguity and inter-model divergence in risk interpretation remains an open problem.

B. Perception as a Context-Dependent Estimation Problem

From a systems perspective, perception in ADAS can be modeled as a state estimation problem in which the vehicle infers a latent scene state x_t from noisy sensor observations y_t [21]:

    y_t = h(x_t, \theta_s) + \nu_t    (2)

where h(·) represents the sensor observation model, θ_s denotes sensor configuration parameters (e.g., camera exposure, ISO, shutter speed, radar range), and ν_t captures measurement noise.
In practice, θ_s is commonly fixed or adjusted via offline calibration and handcrafted heuristics, implicitly assuming quasi-stationary environmental conditions [20]. Real-world driving violates this assumption. Environmental factors such as illumination, weather, road geometry, and traffic density dynamically alter the effective observation process, thereby degrading detection confidence and segmentation fidelity when parameters remain static. This degradation propagates downstream, increasing the likelihood of missed detections, false alerts, or semantically ambiguous evidence. Importantly, even when perception outputs are available, their interpretation at the level of risk, intent, and hazard type may remain underdetermined, a property that becomes critical when assessing safety behavior in complex urban scenes [1, 2, 3].

In this work, these perception-related uncertainties are treated as upstream conditions that motivate controlled analysis. The goal is not to optimize h(·) or adapt θ_s, but to study how fixed, scenario-bounded evidence can nonetheless yield divergent high-level risk interpretations.

C. Emergence of Semantic Reasoning and Language Models

Recent advances in Large Language Models (LLMs), particularly multimodal variants, introduce a new capability: semantic reasoning over heterogeneous evidence streams [7, 8]. Unlike traditional perception models that operate primarily in feature space, LLM-based components can integrate visual cues, vehicle telemetry, and contextual knowledge to generate structured interpretations of scene semantics, risk factors, and intent [9, 10]. Prior work has explored LLMs for post hoc interpretation, human–machine interfaces, and offline analysis in autonomous driving contexts [7, 8].
Although the broader literature has also discussed leveraging high-level semantic representations for adaptive sensing or orchestration, such control-oriented roles are explicitly outside the scope of this paper. Here, LLMs are treated strictly as interpretative and analytical components: given identical, scenario-bounded contextual evidence, the objective is to characterize how different models frame risk, select evidence, and attribute causes in the face of semantic ambiguity.

This framing emphasizes a distinct validation challenge: if LLM-based interpretative modules are to support safety arguments, their behavior must be consistent enough to be audited, or at a minimum, their divergences must be measurable and explainable under controlled conditions.

D. Safety, SOTIF, and Cybersecurity Foundations for LLM-Enabled ADAS

The integration of LLM-based components into ADAS must be framed within established automotive safety and security standards. In safety-critical vehicular systems, nominal correctness is insufficient; behavior must be demonstrably safe, predictable, and robust under both fault conditions and performance limitations. Accordingly, the analysis of LLM-augmented ADAS architectures is commonly discussed through the combined lenses of Functional Safety (ISO 26262) [4], Safety of the Intended Functionality, SOTIF (ISO 21448) [5], and Cybersecurity (ISO 21434) [22].

Rather than asserting that these standards "fail" to address perception limitations, a more precise interpretation is that they acknowledge uncertainty and performance limitations (particularly under SOTIF), while leaving open the question of how semantic ambiguity should be exposed, compared, and audited in concrete engineering workflows. In this sense, the present work is complementary: it neither proposes a certified safety solution nor modifies the standards.
Instead, it provides a controlled analytical methodology that makes ambiguity and interpretative divergence explicit, thereby supporting more structured safety arguments when LLM-based interpretative modules are considered.

E. Problem Statement and Research Gap

The considerations above reveal a critical gap in current ADAS research and validation practices. While perception pipelines and emerging semantic reasoning components can produce rich interpretations of driving scenes, there is no established framework that systematically exposes, compares, and audits these interpretations under identical real-world conditions. In particular, without structured, scenario-centric representations, it becomes difficult to isolate how semantic ambiguity in upstream evidence propagates into higher-level risk reasoning, and whether observed disagreements reflect intrinsic ambiguity or model-specific bias.

This paper addresses this gap by adopting scenario-based semantic abstraction as the fundamental unit of analysis. Driving situations are represented as deterministic, temporally bounded scenarios that can be reproduced, queried, and evaluated consistently across experiments. This design enables systematic, scalable evaluation, allowing large collections of identical traffic scenarios to be analyzed under controlled conditions without confounding variation in input evidence.

Within this framework, identical scenarios are presented to different LLMs under fixed prompts and bounded contextual inputs. Rather than treating disagreement as an isolated model failure, divergences in semantic interpretation are explicitly captured as measurable analytical signals. The following sections operationalize this approach through a multimodal data platform, a deterministic scenario-construction pipeline, and an evaluation protocol designed to support reproducible, large-scale inter-model comparison.

III. DATASET DESCRIPTION AND PRE-PROCESSING PIPELINE

The effectiveness of the proposed framework relies on transforming raw, high-frequency sensor data into structured, scenario-centric representations that can be audited for semantic consistency. To bridge the gap between low-level signal processing and high-level cognitive reasoning, we have developed a data-centric architecture that normalizes and aligns heterogeneous input streams. This section details the multi-layered pipeline for consolidating these sources into a unified temporal substrate, ensuring that subsequent scenario-based evaluations remain reproducible across diverse driving environments and model configurations.

A. Multimodal Data Platform

The platform is built upon a proprietary multi-modal dataset collected from instrumented vehicles operating in real urban environments, available at carcara.com. The dataset integrates visual perception, vehicle telemetry, and external contextual information, and deliberately adopts a sensing configuration that excludes LiDAR sensors. This architectural choice reflects a trade-off that prioritizes reduced instrumentation cost, simplified deployment, and high replicability across heterogeneous vehicles and urban contexts, while remaining sufficient to support semantic scene understanding and risk-oriented interpretation tasks. Lowering the marginal cost and complexity of sensing is widely recognized as a key factor in enabling the collection of representative real-world driving data at scale, particularly when data acquisition spans multiple vehicles and long operational periods.

The primary contribution lies in the architectural organization of heterogeneous data into a scenario-centric semantic abstraction that is independent of the specific sensing configuration. In this sense, the proposed pipeline is not tied to a particular dataset or sensor suite; rather, it defines a generalizable representation strategy that can be applied to any multimodal driving dataset that provides synchronized perception, vehicle state, and contextual information.

Visual modalities operate at higher acquisition rates and provide the primary source of perceptual grounding. Vehicle-mounted cameras capture video at approximately 15 fps. These visual streams are processed to extract object detections, semantic scene segmentation, and lane geometry estimates. Although computed at video frequency, these outputs are not retained as raw frame-level data, but as compact semantic descriptors suitable for multimodal alignment.

Despite heterogeneity in sampling rates across modalities, all data streams are ultimately aligned to a common temporal resolution of 1 Hz. This temporal alignment constitutes a deliberate abstraction step that enables consistent multimodal fusion and subsequent scenario-based reasoning, while avoiding unnecessary dependence on high-frequency raw signals. Rather than treating the dataset as a collection of independent sensor logs, the platform organizes all available information into a temporally structured semantic substrate, explicitly designed to support scenario-level interpretation and reproducible evaluation. Figure 1 summarizes this abstraction process, illustrating how heterogeneous multimodal inputs are consolidated into a unified 1 Hz semantic representation.

B. Processing and Fusion

1) Ingestion and Offline Heavy Processing: Visual perception tasks include semantic scene segmentation, lane geometry extraction, and object detection, augmented by a tracking process that associates successive detections over time.
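The tracking association just described can be sketched as greedy nearest-neighbor matching over detection positions. This is only an illustrative sketch: the 2-D point representation, the distance threshold, and the fresh-identity policy are assumptions, not the platform's actual tracker.

```python
def associate(tracks, detections, max_dist=2.0):
    """Greedily associate detections to persistent track identities.

    tracks:     dict mapping track id -> last known (x, y) position;
                updated in place with the matched/created positions.
    detections: list of (x, y) positions at the current time step.
    Returns a list of track ids aligned with `detections`. A detection
    with no track closer than `max_dist` opens a new identity (the
    unmatched case, written as the empty symbol in the formalism).
    """
    assigned = []
    free = dict(tracks)  # tracks still available at this time step
    for det in detections:
        best_id, best_d = None, max_dist
        for tid, pos in free.items():
            d = ((det[0] - pos[0]) ** 2 + (det[1] - pos[1]) ** 2) ** 0.5
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None:
            best_id = max(tracks, default=0) + 1  # unmatched: new identity
        else:
            del free[best_id]                     # each track matches once
        tracks[best_id] = det                     # persist updated position
        assigned.append(best_id)
    return assigned
```

For example, a detection near an existing track inherits its identity, while a distant one spawns a new persistent entity; run offline per frame, this induces the temporal continuity the fusion stage later relies on.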
This tracking stage introduces temporal continuity into the perceptual representation by establishing persistent object identities across frames, thereby enabling the construction of dynamic entities whose evolution can subsequently be reasoned about at the semantic level. All visual processing stages are executed entirely offline, as they operate at video frequency, incur high computational cost, and are therefore intentionally decoupled from any real-time or online reasoning loop.

Fig. 1: Unified Multimodal Dataset: Data from heterogeneous vehicles are abstracted into a unified multimodal representation that jointly encodes visual perception, vehicle telemetry, and external contextual information within the multimodal data platform.

Let D_t = { d_t^1, ..., d_t^{N_t} } be the set of detections at time t. Tracking defines an association function a_t : D_t → I ∪ {∅}, which assigns each detection to a persistent identity i ∈ I (or ∅ for unmatched detections), thereby inducing temporal continuity across frames.

2) Internal Fusion: Following modality-specific processing, the platform performs multimodal fusion to integrate complementary information sources into a unified semantic representation of the driving scene. This fusion stage operates deterministically, relying on temporal alignment and spatial consistency across modalities rather than on learned or heuristic decision logic, and is designed to consolidate heterogeneous observations into a coherent semantic structure that can serve as a stable substrate for subsequent scenario abstraction and higher-level interpretation. The fusion of object detection and tracking outputs yields persistent semantic entities, each associated with spatial attributes, temporal extent, and categorical information, and explicitly linked to lane geometry estimates.
This association enables the derivation of spatial relationships such as lane affiliation, relative position with respect to the ego vehicle, and proximity to road boundaries, transforming isolated detections into context-aware scene elements that retain temporal continuity and spatial meaning.

3) External Contextual Enrichment: In addition to dynamic perceptual inputs, the scene representation is enriched with external contextual information obtained through external APIs. Static and semi-static road attributes, including road type, number of lanes, and sidewalk presence, are integrated from digital map services such as OpenStreetMap, while basic environmental descriptors, including weather conditions, are retrieved from external weather APIs and provide complementary situational context. Importantly, the multimodal fusion stage does not perform semantic reasoning, risk assessment, or interpretive judgment; its role is strictly limited to organizing and aligning heterogeneous data into a consistent semantic substrate, on which scenario abstraction and cognitive evaluation can subsequently be conducted in a controlled, reproducible, and safety-oriented manner.

C. Data Layer, Services and Platform Consumption

1) Normalized Data Layer: Following multimodal fusion and temporal alignment, all scene representations are consolidated into a single unified data layer organized at a fixed temporal resolution of 1 Hz. Each temporal unit within the unified layer corresponds to a semantic snapshot of the driving scene. These snapshots aggregate ego-vehicle dynamics, persistent object representations, road and infrastructure attributes, environmental context, and acquisition-level metadata into a coherent structure that can be queried and recomposed deterministically.
By concentrating all relevant modalities into a single temporally indexed representation, the platform enables complete scene reconstruction and interpretation from a single query, supporting both interactive exploration and automated analysis workflows. The explicit separation between raw data, processed perceptual outputs, and the normalized semantic layer ensures that computationally expensive perception stages are not repeated, while preserving flexibility for complex querying and scenario recomposition. In this organization, indexing is treated as a semantic and temporal construct rather than as a database-specific mechanism, enabling efficient access to context elements and consistent reuse of identical scene representations across multiple evaluation campaigns. This layered design supports scalability while preserving the methodological rigor required for reproducible scenario-based analysis.

2) Backend Data Services: Built directly on top of the normalized data layer, the platform backend provides online data services that expose multimodal scene representations in a controlled, efficient, and reproducible manner. Rather than coupling consumers to implementation-specific storage details, the service layer can be formalized as a deterministic query operator over the unified temporal representation. The service layer supports context-oriented querying, enabling retrieval of complete temporal scenes filtered by semantic, spatial, and dynamic criteria, including road characteristics, object configurations, environmental conditions, and ego-vehicle behavior. In addition to fine-grained filtering, the backend ensures consistent retrieval and organization of temporal contexts, enabling identical scenario definitions to be accessed, reused, and compared across different analytical processes.

Let U denote the normalized scene state at a fixed temporal resolution of 1 Hz, indexed by acquisition identifier a and discrete time t.
Let θ denote a structured scenario query encoding the semantic, spatial, behavioral, and temporal constraints that define a scenario of interest. The backend executes this query over the normalized representation and materializes the corresponding scenario context S, which is a structured collection of normalized scene states indexed by the acquisition-time pairs selected by the query. This operation can be expressed as

    S = B( U(a, t), θ ),    (3)

where B(·) denotes the deterministic backend query operator responsible for scenario construction. The set of valid (a, t) indices defining the temporal extent of S is implicitly induced by the query specification θ, ensuring that the scenario S materializes all database time instants whose normalized scene states satisfy the query constraints, such as specific vehicle types, illumination conditions, weather context, road environment, or traffic density, while preserving their original temporal ordering.

By operating on the normalized 1 Hz representation, the service layer leverages a temporally sparse yet semantically dense abstraction that is significantly lighter than the original high-frequency sensor streams. This design creates the computational margin required for efficient online querying, scenario reuse, and scalable consumption across multiple platform interfaces.

3) Consumption Interfaces: The backend data and services are accessed via complementary platform interfaces that support both interactive and automated workflows.

Fig. 2: Deterministic scenario construction and consumption workflow. Normalized scene states U(a, t), indexed at 1 Hz, are queried by backend data services to materialize scenario snapshots S(a, t) through the deterministic operator B(·) under a structured query specification θ. Scenario construction combines semantic filters spanning visual perception, vehicle telemetry, weather context, and map information. The resulting scenarios are exposed consistently to both the web-based authoring interface for selection and inspection and the local testing module for scenario evaluation.

The web-based interface provides a user-oriented environment for visual exploration of normalized scene representations, enabling synchronized inspection of perceptual outputs, contextual attributes, and temporal information for each acquisition. Within this interface, users can execute context-oriented queries to retrieve complete temporal scenes and curate selected temporal units into persistent scenario groupings, which explicitly reflect analytical interests and evaluation objectives while preserving the deterministic semantics of the underlying representation. In parallel, a local consumption interface supports external analytical modules that operate outside the web platform. This interface enables the systematic consumption of predefined scenario groupings, exposing structured multimodal inputs to automated workflows such as model-driven analyses or language-model-based evaluation pipelines.

Formally, the web interface materializes the scenario context as a canonical artifact S, which serves as the unified representation of a curated driving scenario. This artifact is subsequently consumed by a local processing pipeline through a dedicated consumption operator, without altering its semantic or temporal structure. The relationship between interface, scenario, and consumption can be expressed as

    I_web → S →^{C_local} O,    (4)

where I_web denotes the web-based authoring interface, C_local(·) denotes a deterministic local consumption operator, and O represents the structured outputs produced by automated analysis pipelines.

Once scenarios are selected and materialized through the web interface, they transition from data-access abstractions into executable experimental units.
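The deterministic query operator B(·) of Eq. (3) can be sketched as a pure filter over the 1 Hz snapshot store. In this sketch, snapshots are modeled as flat attribute dictionaries and θ as an equality constraint map; the field names are hypothetical, since the platform's actual schema is richer than shown here.

```python
def backend_query(U, theta):
    """Deterministic scenario construction S = B(U, theta) (cf. Eq. 3).

    U:     dict mapping (acquisition_id, t) -> normalized 1 Hz scene
           snapshot, each snapshot a flat dict of semantic attributes
           (illustrative field names, not the platform's real schema).
    theta: dict of attribute -> required value, encoding the query.
    Returns the scenario context S: all matching snapshots, keyed by
    (a, t) and kept in their original temporal ordering.
    """
    def satisfies(snapshot):
        return all(snapshot.get(k) == v for k, v in theta.items())

    # Sorting keys by (a, t) preserves temporal ordering, as required of B(.)
    return {key: U[key] for key in sorted(U) if satisfies(U[key])}
```

Because the operator has no hidden state or randomness, repeating the same query over the same normalized layer always materializes byte-identical scenario contexts, which is what makes scenario reuse across evaluation campaigns auditable.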
In this stage, each canonical scenario S is treated as an input to local processing workflows, where additional structural transformations are applied to support controlled analysis and evaluation. In particular, scenario execution operates over temporally extended contexts derived from anchor instants selected during interactive exploration, enabling the systematic reconstruction of dynamic scene evolution around moments of interest.

IV. SYSTEM ARCHITECTURE

The system architecture defines how user-selected normalized multimodal data are transformed into executable scenario instances and systematically consumed by local analysis workflows. Rather than operating directly on full acquisitions, the platform adopts a scenario-based abstraction in which curated scenario contexts S are treated as atomic experimental units. These units are subsequently expanded in time and enriched with controlled contextual structure to support reproducible testing, model comparison, and behavioral evaluation.

A. Temporal Scenario Extension

Although scenario definitions are selected through the web-based interface, temporal scenario extension is executed locally over the canonical scenario contexts provided by the platform consumption layer. Each scenario S contains one or more normalized scene snapshots selected during interactive exploration and treated as semantic anchors for subsequent construction. A selected snapshot corresponding to a moment of interest is denoted as a temporal anchor t_0. For each temporal anchor, the platform generates a scenario window by applying a configurable temporal expansion specified in seconds. In particular, users define how many seconds of context are included before and after the anchor instant by selecting the pre-event and post-event extents k and m, respectively.
The resulting temporal window is defined as

\( \Delta T = [\, t_0 - k,\; t_0 + m \,], \)

where k and m denote the pre-event and post-event temporal extents, respectively. This windowing strategy introduces controlled temporal context around the anchor moment, enabling the inclusion of dynamic evolution immediately preceding and following the event of interest. By explicitly defining temporal bounds during scenario generation, all scenarios follow consistent and reproducible structural rules, while allowing asymmetric temporal configurations when required by the analysis task.

The resulting scenario window is composed of a compact sequence of normalized temporal units centered on the user-defined anchor moment and constrained to the predefined temporal span. All multimodal information available in the normalized data layer within this interval is included, allowing each scenario to capture a coherent and self-contained representation of the traffic situation without introducing additional preprocessing or inference stages. The scenario generation procedure is fully deterministic, meaning that identical temporal configurations always produce identical scenario instances, which is essential for ensuring reproducibility across repeated analyses and experimental executions [23, 24]. As illustrated in Fig. 3, this controlled temporal expansion establishes each scenario as a stable and reusable analytical unit, forming the foundation for subsequent interpretation and evaluation stages.

Fig. 3: Scenario window generation from a single user-defined key moment (t_0), illustrating controlled temporal expansion before and after the anchor instant.

B. Scenario Interpretation via Prompted LLM Evaluation

In addition to the scenario-generation stage, the platform incorporates an explicit prompt-engineering layer that primarily operates through the local consumption interface.
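The deterministic window expansion described above amounts to a pure selection over timestamped units; a minimal sketch (function and field names such as `expand_window` and `"t"` are illustrative assumptions, not the platform's actual API):

```python
def expand_window(units, anchor, k, m):
    """Select all normalized temporal units whose timestamp falls
    inside the closed window [anchor - k, anchor + m]; identical
    inputs always yield identical windows (no inference, no
    preprocessing), which is what makes scenarios reproducible."""
    lo, hi = anchor - k, anchor + m
    return [u for u in units if lo <= u["t"] <= hi]

# Asymmetric configuration: k = 2 s before, m = 4 s after anchor t0 = 10
units = [{"t": t, "objects": []} for t in range(20)]
window = expand_window(units, anchor=10, k=2, m=4)
assert [u["t"] for u in window] == [8, 9, 10, 11, 12, 13, 14]
```

Because the selection depends only on the anchor and the (k, m) extents, repeated executions over the same normalized data reproduce the same scenario instance, as Eq. (4)'s deterministic consumption requires.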
This layer is responsible for transforming structured temporal scenarios into controlled analytical tasks suitable for language-model-based interpretation while maintaining strict alignment with the underlying normalized multimodal data. Prompt construction operates over fixed temporal scenario windows.

Based on this windowed representation, the platform supports different prompt configurations that define how each scenario is presented to the language model. These configurations include fixed prompt templates, designed to enforce standardized analytical tasks across all scenarios, and customizable prompt formulations, which allow alternative objectives or perspectives to be specified while preserving the same underlying scenario structure. In all cases, prompt content is systematically derived from the normalized data associated with the temporal window, including ego-vehicle dynamics, detected objects, road attributes, and environmental context. Decoupling scenario definition from prompting enables reuse across models and configurations.

C. Prompt Specification for Cognitive Risk

The platform adopts a structured prompt specification to constrain how language models represent and express traffic risk. Rather than producing free-form textual explanations, models must operate within a closed semantic space defined by numeric codes, ensuring consistency, comparability, and machine-readability across all evaluated scenarios and models. The risk representation space encodes multiple complementary dimensions of traffic safety, including overall severity, conflict types, ego-vehicle behavior, vulnerable road user presence, temporal dynamics, uncertainty, and evidence attribution. Each dimension is expressed exclusively through predefined numeric identifiers, preventing semantic drift and subjective reinterpretation.
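One way to realize the deterministic prompt construction described above is a fixed template filled only from the normalized window data; a minimal sketch (the template wording and field names are illustrative assumptions, not the platform's actual prompts):

```python
import json

# Hypothetical fixed template enforcing a standardized analytical task
FIXED_TEMPLATE = (
    "Assess traffic risk for the following scenario window.\n"
    "Ego dynamics per second: {ego}\n"
    "Detected objects per second: {objects}\n"
    "Road/environment context: {context}\n"
)

def build_prompt(window_units, template=FIXED_TEMPLATE):
    """Derive prompt content systematically from a fixed temporal
    window; the same window and template always yield the same
    prompt string (sort_keys makes serialization deterministic)."""
    return template.format(
        ego=json.dumps([u.get("ego", {}) for u in window_units], sort_keys=True),
        objects=json.dumps([u.get("objects", []) for u in window_units], sort_keys=True),
        context=json.dumps(window_units[0].get("context", {}) if window_units else {},
                           sort_keys=True),
    )
```

Swapping in a customizable template changes only the analytical framing, while the underlying scenario structure, and hence comparability across models, is preserved.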
Risk Encoding and Numeric Taxonomy

Overall Risk Severity (overall_risk_level): 0 Not identified; 1 Safe mode; 2 Low; 3 Moderate; 4 Elevated; 5 High; 6 Critical
Risk Occurrence Indicator (window_has_risk): 0 No risk identified (overall_risk_level = 0); 1 Risk identified (overall_risk_level ≥ 1)
Evidence Attribution Signals (evidence_signals): 1 Object presence; 2 Object distance; 3 Object–lane relation; 4 Speed; 5 Steering; 6 Braking; 7 Road/environment; 8 Image input
Risk Category Types (risk_types): 2 Pedestrian; 3 Cyclist; 4 Rear-end; 5 Lateral conflict; 6 Intersection; 7 Speed; 8 Visibility; 9 Infrastructure; 10 Traffic density

Once the semantic space is fixed, a global prompt constraint layer is applied to ensure deterministic behavior and strict epistemic discipline across all evaluated language models. This layer defines the role of the model, limits admissible evidence sources, and enforces a rigid output format, independently of any specific risk semantics. Concretely, it operates as a fixed prompt prefix that constrains all responses to a strictly structured JSON representation using only predefined numeric codes, thereby transforming model outputs into machine-readable and directly comparable analytical artifacts.

Prompt Constraints and JSON Output Enforcement

You are an expert in urban traffic risk assessment. Use ONLY the information explicitly provided in the input (structured CAN per second, YOLO detections including dist_m and lane_rel, basic context, and IF PRESENT any provided images).
IMPORTANT:
• Output MUST be valid JSON only. No markdown. No extra text.
• Use ONLY the numeric codes provided (do not output strings).
• Do NOT invent missing data.

This structured schema transforms model outputs into directly comparable analytical artifacts. D.
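Because the semantic space is closed, conformance of a model response can be checked mechanically; a minimal validator sketch built from the numeric taxonomy above (field names follow the listed identifiers, while the validation logic itself is an assumption about how the platform might enforce the schema):

```python
import json

# Closed numeric code spaces, taken from the taxonomy above
SCHEMA = {
    "overall_risk_level": set(range(0, 7)),   # 0-6
    "window_has_risk": {0, 1},
    "evidence_signals": set(range(1, 9)),     # 1-8
    "risk_types": set(range(2, 11)),          # 2-10
}

def validate_output(raw):
    """Parse a model response and reject anything outside the closed
    semantic space: non-JSON text, string labels, or unknown codes."""
    out = json.loads(raw)  # raises ValueError on markdown or extra text
    for field, codes in SCHEMA.items():
        value = out[field]
        values = value if isinstance(value, list) else [value]
        if not all(isinstance(v, int) and v in codes for v in values):
            raise ValueError(f"{field}: codes {values} outside schema")
    return out
```

Rejecting non-conforming outputs at this boundary is what turns free-form model responses into directly comparable analytical artifacts.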
Result Logging and Traceable Evaluation

The final stage of the platform is dedicated to the systematic recording, aggregation, and analysis of results from language-model-based workflows. This stage consolidates automatically captured execution metrics with structured evaluation outputs, enabling a comprehensive assessment of model behavior under controlled, reproducible, and scalable experimental conditions.

Fig. 4: Scenario-driven workflow of the platform, from web-based identification of key moments to local temporal expansion and structured prompt construction. Curated scenarios are transformed into bounded temporal windows and independently evaluated by multiple language models under identical conditions, thereby establishing a deterministic, reproducible path from interactive scenario selection to controlled model inference.

All results are persistently stored and explicitly linked to their corresponding scenarios, temporal windows, and prompt configurations, ensuring full traceability across evaluation runs and supporting longitudinal analysis. Formally, the outcome of an evaluation run is represented as

\( R = L(S_e, M_e, P_e), \)  (5)

where \( L(\cdot) \) denotes the integrated logging and evaluation operator, and \( S_e \), \( M_e \), and \( P_e \) correspond to the sets of scenarios, language models, and prompt specifications effectively selected for a given experimental execution e. These sets are not assumed to be fixed or exhaustive; instead, they are defined according to the objective of each evaluation run. This abstraction deliberately leaves the internal execution order and pairing strategy unspecified. Within this formulation, experimental designs may isolate individual factors, such as comparing different models under fixed scenarios and prompts or evaluating prompt sensitivity for a single model, or may jointly vary multiple dimensions or repeat identical configurations to assess consistency and robustness across executions.
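The logging operator L of Eq. (5) can be pictured as iterating over whichever scenario, model, and prompt sets a run selects; a minimal sketch using a full cross-product as one possible pairing strategy (the `infer` callable and the record fields are hypothetical):

```python
import itertools
import time

def run_evaluation(scenarios, models, prompts, infer):
    """Integrated logging and evaluation operator L: for each
    (scenario, model, prompt) combination selected for run e, record
    the structured output together with full provenance, so every
    result stays traceable to its inputs and configuration.
    `infer(scenario, model, prompt)` is a hypothetical inference call."""
    records = []
    for s, m, p in itertools.product(scenarios, models, prompts):
        output = infer(s, m, p)
        records.append({
            "scenario_id": s,
            "model": m,
            "prompt_id": p,
            "output": output,
            "timestamp": time.time(),
        })
    return records  # R: persistently storable, fully traceable results
```

Since the formulation leaves the pairing strategy unspecified, the cross-product here is only one instantiation; a repeated-configuration or single-factor design would simply enumerate the tuples differently while keeping the same provenance fields.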
By decoupling the conceptual definition of results from implementation-specific scheduling details, the formulation preserves flexibility while maintaining strict traceability and comparability across overlapping evaluations.

V. EXPERIMENTAL SETUP

To evaluate the proposed dual-stage framework in interpreting complex driving environments, we conduct a series of experiments focused on "near-people" urban scenarios. This section details the experimental configuration, beginning with the protocol for selecting scenarios and curating data from the multimodal platform. We then describe the specific LLM configurations utilized as cognitive probes, the design of the deterministic prompt templates, and the structured evaluation metrics used to quantify model performance. By standardizing these experimental parameters, we establish a controlled environment to assess how effectively the framework handles semantic ambiguity and risk communication in safety-critical contexts.

Fig. 5: Mosaic of sixteen representative scene anchors that serve as temporal seeds for subsequent scenario expansion, illustrating the diversity of near-people situations across urban layouts, traffic densities, lighting conditions, and pedestrian configurations.

Fig. 6: Online visualization of the evaluated test sets available in the platform. Each entry corresponds to a near-people key-moment collection evaluated under identical conditions by different language models, including two text-only configurations and one multimodal model.

A. Collection and Evaluation Protocol

To ensure a controlled and semantically meaningful evaluation of language model behavior, all experiments were conducted over the same curated collection of key moments specifically designed to emphasize safety-critical urban interactions.
Rather than operating over full-length acquisitions or arbitrarily sampled time windows, the evaluation focuses on temporally anchored scenario fragments centered around the presence of vulnerable road users in close proximity to the ego vehicle. In particular, a dedicated near-people key-moment collection was constructed by identifying temporal anchors corresponding to situations in which pedestrians are spatially close to the ego vehicle. For each anchor time instant, a fixed temporal window spanning three seconds before and three seconds after the anchor was extracted, resulting in a standardized seven-second scenario window. This anchoring strategy follows the temporal expansion formulation introduced in the scenario generation framework, ensuring that each scenario captures both the contextual lead-up and the immediate evolution of the interaction. A total of sixteen distinct scenario windows were selected following this criterion and are available at carcara.com/69460/scenario-selected. While larger-scale evaluations involving hundreds or thousands of scenarios are feasible within the platform, the present study deliberately adopts a small-scale experimental setting. This choice enables clearer interpretation of model behavior, facilitates detailed qualitative and quantitative analysis, and serves as an initial validation of the proposed evaluation pipeline before large-scale deployment. Formally, the evaluation set is defined as a collection of N = 16 temporally anchored scenario windows, each centered on a key moment \( t_0^{(i)} \) and symmetrically expanded by three seconds before and after the anchor,

\( \Delta T^{(i)} = \left[\, t_0^{(i)} - 3,\; t_0^{(i)} + 3 \,\right], \quad i = 1, \ldots, 16, \)  (6)

yielding fixed-duration seven-second scenarios with the interaction event centrally aligned.
All selected scenarios were evaluated under identical conditions by three different language models: a lightweight text-only model, a higher-capacity text-only model, and a multimodal model with visual grounding. Each model received the same structured scenario representation and prompt specification, ensuring that observed behavioral differences arise from intrinsic model characteristics rather than variations in input data or experimental setup.

B. Evaluated Language Models

Scenario evaluations in this work adopt an organized batch-based execution strategy as a deliberate methodological choice, rather than as a fixed or inherent constraint of the platform. This strategy is selected to enable controlled, repeatable, and scalable comparison of language model behavior across heterogeneous driving scenarios, while preserving strict isolation between individual analytical contexts.

Both text-only and multimodal models are considered, enabling a systematic investigation of how access to visual information influences semantic scene interpretation, risk perception, and the generation of assistance-oriented recommendations. In the multimodal setting (GPT-4o-vision), the model receives raw visual inputs (images) jointly with structured, normalized textual descriptors, enabling internal encoding of visual cues such as object appearance, spatial relationships, and scene layout, and their direct integration into the reasoning process. In contrast, the text-only models (GPT-4o-mini and DeepSeek-Chat) operate exclusively on symbolic, structured textual representations derived from perception modules, relying on external scene descriptions rather than direct visual evidence.

VI. RESULTS AND DISCUSSION

This section reports the empirical outcomes of a standardized, scenario-based evaluation that compares multiple language models under identical input and prompt constraints.
For every model, the same prompt specification and the same curated set of near-people temporal windows are provided, ensuring identical inputs and output constraints across runs. An online view of the evaluated test sets is available at carcara.com/6946/llm-tests.

A. Model-Level Risk Interpretation: Analysis Logic and Derived Metrics

Before presenting the quantitative comparisons, it is important to clarify the analytical logic used to interpret the structured outputs produced by the evaluated language models. Rather than treating the reported metrics as isolated statistics, the analysis follows a conditional reasoning scheme that maps numerical tendencies to qualitative behavioral profiles. This procedure enables a consistent interpretation of model behavior across identical scenario windows and prompt conditions.

Algorithm 1: Conditional Semantic Analysis of LLM Risk Outputs
Require: Parsed LLM outputs for a fixed scenario set, risk threshold τ
Ensure: Model-level qualitative risk characterization
1: Group all scenario windows by language model m
2: for all models m do
3:   µ_risk ← mean overall risk level
4:   ρ_high ← proportion of windows with risk ≥ τ (threshold-based)
5:   µ_evidence ← mean number of distinct evidence signals
6:   F ← frequency distribution of dominant risk factors (primary category per window)
7:   if µ_risk is high then
8:     Mark model as risk-conservative
9:   else
10:    Mark model as risk-tolerant
11:  end if
12:  if ρ_high is high then
13:    Indicate elevated sensitivity to critical situations
14:  end if
15:  if µ_evidence is low then
16:    Indicate narrow or underspecified reasoning
17:  else
18:    Indicate broader contextual grounding
19:  end if
20:  if F is concentrated on few factors then
21:    Indicate specialized risk perception
22:  else
23:    Indicate diversified risk awareness
24:  end if
25: end for

The conditional logic summarized in Algorithm 1 serves as an interpretive bridge between the raw structured outputs and the comparative analyses discussed in the following subsections. By explicitly defining how risk severity, threshold-based alert escalation, evidence usage, and dominant causal attribution are jointly interpreted, the analysis ensures that observed differences across models reflect distinct cognitive risk profiles rather than arbitrary metric fluctuations.

B. Overall Risk Assessment Behavior

1) Mean Overall Risk Level: The mean overall risk level captures the baseline risk posture adopted by each language model when interpreting identical near-people scenarios. This metric reflects how conservatively or permissively a model classifies safety-critical situations under fixed semantic and contextual constraints. The evaluated models exhibit clear stratification in their baseline risk interpretations. The multimodal model consistently assigns higher average risk levels across scenario windows, indicating a more precautionary stance. In contrast, the text-only models show lower mean risk values, suggesting a more restrained interpretation of comparable situations despite operating on the same structured evidence.

Although the number of evaluated scenario windows is limited, the observed ordering remains consistent across the dataset, indicating a stable baseline tendency rather than random variation. Notably, this divergence emerges even though all models operate under the same semantic schema and prompt specification, with the multimodal model differing solely through visual grounding rather than additional symbolic inputs.

As illustrated in Fig. 7, these differences indicate that baseline risk interpretation varies systematically across models, even when scenario content and prompt structure are held constant. This initial posture establishes distinct starting points for subsequent reasoning stages, thereby influencing how each model approaches subsequent alert-escalation and causal-attribution analyses.
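The conditional analysis of Algorithm 1 maps directly onto a short routine over one model's parsed outputs; a minimal sketch (the numeric thresholds for "high" and "low" are illustrative assumptions, since the paper leaves them qualitative):

```python
from collections import Counter
from statistics import mean

def characterize_model(windows, tau=4, risk_hi=3.0, evid_lo=2.0):
    """Conditional semantic analysis (Algorithm 1) for one model's
    parsed outputs over a fixed scenario set. Thresholds risk_hi and
    evid_lo are illustrative, not values from the paper."""
    mu_risk = mean(w["overall_risk_level"] for w in windows)
    rho_high = sum(w["overall_risk_level"] >= tau for w in windows) / len(windows)
    mu_evid = mean(len(set(w["evidence_signals"])) for w in windows)
    factors = Counter(w["dominant_risk_factor"] for w in windows)

    profile = {
        "posture": "risk-conservative" if mu_risk >= risk_hi else "risk-tolerant",
        "escalation": "elevated" if rho_high >= 0.5 else "selective",
        "reasoning": "narrow" if mu_evid < evid_lo else "broad",
        "attribution": "specialized" if len(factors) <= 2 else "diversified",
    }
    return mu_risk, rho_high, mu_evid, factors, profile
```

Running this routine per model on identical scenario windows yields the qualitative behavioral profiles compared in the following subsections.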
2) High-Risk Escalation Tendencies: Beyond baseline risk interpretation, the threshold-based proportion of scenario windows escalated to high risk provides insight into how often each model signals that situations require elevated attention. In this analysis, high risk is defined as scenario windows with an overall risk level of four or higher, indicating elevated or more severe safety conditions. This metric captures alert-escalation tendencies rather than the average risk posture, approximating how often a cognitive module would trigger strong driver-facing warnings under identical conditions.

Fig. 7: Mean overall risk level assigned by each evaluated language model across all scenario windows.

Fig. 8: Percentage of scenario windows classified as high risk (risk level ≥ 4) by each model.

The evaluated models exhibit clear differences in escalation behavior. The multimodal model escalates a larger fraction of scenario windows to high risk, indicating heightened sensitivity to potentially critical situations. In contrast, the text-only models adopt more selective escalation strategies, classifying fewer windows as high risk despite operating on the same structured evidence and prompt constraints. This pattern suggests that visual grounding is associated with both higher perceived severity and more frequent escalation of situations to alert-worthy states.

Importantly, escalation frequency and mean risk level reflect related but distinct dimensions of cognitive behavior. A model may maintain a comparatively elevated baseline risk assessment while selectively escalating, or, conversely, escalate frequently despite moderate average risk scores. The observed divergence indicates that alert escalation constitutes an independent behavioral axis, shaping how aggressively a model prioritizes safety-relevant events. From an ADAS design perspective, these differences suggest practical implications.
Higher escalation rates may enhance early hazard salience in complex urban environments, but they also increase the risk of alert fatigue if not calibrated using context-aware thresholds or multi-stage alert policies. More conservative escalation strategies may reduce unnecessary alerts, but risk delaying the emphasis of marginal yet safety-relevant situations. These trade-offs are reflected in the distribution of high-risk classifications summarized in Fig. 8.

3) Evidence Usage and Reasoning Breadth: The mean number of distinct evidence signals explicitly referenced by each model provides a proxy for reasoning breadth and for externalizing justification. Operationally, this metric is computed as the number of unique evidence signals listed in the structured evidence priority field of each model output. In this evaluation, evidence signals correspond to structured contextual elements available in the scenario representation, such as ego-vehicle dynamics, object detections and distances, lane relations, road attributes, and environmental cues. Because the prompt enforces a constrained JSON schema, this metric reflects which signals the model chooses to externalize as explicit justification rather than the full set of information it may have processed internally.

Fig. 9: Mean number of distinct evidence signals explicitly referenced in model outputs, reflecting reasoning breadth and justification externalization.

The text-only models exhibit higher average evidence counts, indicating a more enumerative explanation style that verbalizes a broader set of supporting cues. In contrast, the multimodal model references fewer distinct signals, producing more compact justifications. This difference suggests that access to visual input may reduce the need for explicit textual enumeration of contextual cues, as some salient information can be internally grounded in the image.
As a result, a trade-off emerges between conciseness and traceability: broader evidence externalization improves auditability, whereas more compact outputs favor brevity at the potential cost of transparency.

From a methodological perspective, these findings emphasize that explainability in LLM-based ADAS extends beyond decision correctness to encompass an externalization policy for evidence. Multimodal grounding may enhance perceptual richness and risk sensitivity, while simultaneously reducing the explicit articulation of non-visual cues. For driver-facing assistance, additional prompting or post-processing may be required to ensure that critical signals, such as object distance, lane-relative position, and ego dynamics, are consistently reported even when visual context is available. These trends are quantitatively summarized in Fig. 9.

C. Risk Factor Attribution Patterns

Beyond magnitude and escalation, the attribution of risk provides insight into how models frame causal explanations under identical evidence. While all evaluated models operate on the same structured scenario inputs, they differ in how they assign primacy to specific risk sources, revealing distinct patterns of causal emphasis. In this analysis, the dominant risk factor corresponds to the primary causal category explicitly selected by the model for each scenario window.

Fig. 10: Distribution of dominant risk factors identified by each model. The six most frequent factors are shown explicitly, with the remaining factors grouped as Other.

Across all models, pedestrian-related risk emerges as the dominant attribution, which is expected given that the test set is intentionally curated around near-people interactions and reflects strong safety priors toward vulnerable road users. This shared primary attribution suggests a consistent baseline alignment with ADAS safety objectives when vulnerable road users are present.
However, once the primary factor is fixed, the attribution structure diverges substantially across models. Although only a single dominant factor is selected per scenario window, the relative frequency of non-primary categories reveals systematic differences in secondary attribution tendencies across models. The DeepSeek text-only model most frequently elevates rear-end conflicts as the next dominant category, suggesting a tendency toward interaction geometry and collision archetypes inferred from object proximity and ego-vehicle dynamics. The GPT-4o-mini text-only model, in contrast, more often selects lateral conflict as its dominant non-pedestrian category, indicating a stronger emphasis on side interactions and spatial lane-relative relations. The multimodal GPT-4o-vision model exhibits a qualitatively different secondary attribution pattern, more frequently elevating traffic density as the prominent non-pedestrian factor, reflecting a broader contextual framing in which global scene complexity becomes salient once visual cues are available. These attributional differences are summarized quantitatively in Fig. 10, which reports the distribution of dominant risk factors selected by each model across all evaluated scenario windows. The figure makes explicit that, beyond the shared pedestrian-centric primary risk, each model systematically privileges different secondary explanations given the same evidence.

From a cognitive ADAS perspective, these observed differences are nontrivial. Even when models agree that a scenario is risky and converge on the presence of a vulnerable road user, they may guide driver attention toward different aspects of the environment.
A model that foregrounds collision archetypes (rear-end or lateral conflict) may produce alerts that are more directly actionable for immediate defensive maneuvers, whereas a model that foregrounds traffic density may emphasize situational vigilance and scene-level caution. These findings indicate that model choice affects not only how much risk is communicated, but also how surrounding contextual risks are framed, foreshadowing deeper sources of inter-model ambiguity in VRU presence interpretation.

D. Inter-Model Ambiguity and Composite Uncertainty Across Scenarios

The analyses presented in the previous subsections reveal consistent inter-model behavioral differences in baseline risk assessment, threshold-based escalation, evidence externalization, and dominant causal attribution, even under identical scenario inputs and fixed prompt constraints. While each of these dimensions provides an isolated view of model behavior, a more comprehensive picture emerges when these divergences are jointly considered. In this work, inter-model ambiguity is therefore treated as a composite phenomenon arising from the combined interaction of multiple semantic and decision-related dimensions, rather than from disagreement in any single metric.

To operationalize this notion of ambiguity, a two-stage formulation is adopted. First, ambiguity is detected using a minimal divergence criterion, in which any scenario window where at least one model deviates from inter-model consensus is flagged as ambiguous. Second, the degree of ambiguity is quantified using continuous disagreement measures jointly derived from the four previously analyzed dimensions: overall risk severity, high-risk escalation, evidence usage breadth, and dominant risk factor attribution.
This formulation enables a scenario-centric characterization of uncertainty that captures not only whether models disagree, but also how strongly and consistently such disagreement manifests across multiple dimensions. The resulting composite uncertainty score is normalized to the [0, 1] interval, where 0 denotes full inter-model agreement across all evaluated dimensions and models, and 1 denotes maximal divergence, corresponding to consistent disagreement across all dimensions. While ambiguity detection itself follows a minimal criterion, the associated uncertainty magnitude reflects graded, structured disagreement rather than a purely binary outcome.

Across the evaluated test set of sixteen near-people scenarios, the resulting composite uncertainty scores reveal a structured distribution of inter-model alignment rather than uniform disagreement. Based on the magnitude of the composite uncertainty, the scenario set partitions naturally into three uncertainty tiers. The low-uncertainty subset, denoted as S_low = {S3, S2, S14, S15, S4}, corresponds to cases in which all evaluated models remain largely aligned across the considered dimensions, reflecting situations in which semantic interpretation and risk assessment are consistently resolved. The intermediate-uncertainty subset, S_med = {S5, S9, S16, S1, S8, S11}, reflects partial divergence typically confined to one or two dimensions, such as differences in evidence prioritization or threshold-based escalation. Finally, the high-uncertainty subset, S_high = {S6, S12, S7, S13, S10}, concentrates scenarios exhibiting pronounced inter-model divergence across multiple dimensions simultaneously, representing structurally ambiguous situations in which multiple semantic and decision-related interpretations remain plausible. In aggregate, these subsets satisfy (|S_low| = 5, |S_med| = 6, |S_high| = 5), highlighting that inter-model ambiguity is graded and scenario-dependent rather than uniformly distributed.

Fig. 11: Radar-based summary of inter-model behavior across near-people scenarios, highlighting systematic differences in risk attribution, high-risk escalation, VRU presence interpretation, evidence signal count (reasoning breadth), uncertainty expression, and relative token usage. The visualization illustrates how identical scenario artifacts yield divergent yet internally consistent interpretations across text-only and multimodal LLM configurations.

Figure 12 summarizes this analysis using a compact heatmap representation. Each row corresponds to a scenario, labeled by its original scenario identifier and ordered from lowest to highest composite uncertainty, while each column corresponds to one of the evaluated language models. Color intensity reflects the per-model contribution to composite uncertainty, aggregated across the four dimensions. By construction, the composite uncertainty score is normalized to the [0, 1] interval, where values close to 0 indicate strong inter-model alignment across all dimensions, and values approaching 1 indicate cumulative disagreement across multiple dimensions. This ordering enables a visual "certainty-to-ambiguity" trajectory, highlighting how inter-model divergence increases gradually across scenarios rather than appearing as isolated outliers. From a methodological perspective, this composite uncertainty representation serves as an explicit audit layer that bridges individual metric comparisons and holistic scenario interpretation. Rather than treating inter-model disagreement as noise or isolated anomalies, the proposed formulation reveals structured, reproducible ambiguity patterns under controlled evaluation conditions.
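A composite uncertainty score of this kind can be sketched as a mean of per-dimension disagreement terms, each normalized so the total stays in [0, 1]; the specific normalizations below are illustrative assumptions, not the paper's exact formulation:

```python
from collections import Counter
from statistics import pstdev

def composite_uncertainty(per_model):
    """Composite inter-model uncertainty for one scenario window.
    `per_model` maps model name -> {"risk": 0-6, "n_evidence": int,
    "factor": numeric risk-type code}. Each dimension contributes a
    disagreement term in [0, 1]; 0 = full agreement, higher = more
    divergence (normalizations are illustrative)."""
    risks = [v["risk"] for v in per_model.values()]
    esc = [1.0 if v["risk"] >= 4 else 0.0 for v in per_model.values()]  # threshold tau = 4
    evid = [v["n_evidence"] for v in per_model.values()]
    factors = [v["factor"] for v in per_model.values()]
    n = len(per_model)

    d_risk = pstdev(risks) / 3.0               # risk scale 0-6, so pstdev <= 3
    d_esc = pstdev(esc) * 2.0                  # binary indicator, pstdev <= 0.5
    d_evid = pstdev(evid) / max(max(evid), 1)  # relative spread in evidence counts
    d_factor = 1.0 - Counter(factors).most_common(1)[0][1] / n  # attribution dissent

    score = (d_risk + d_esc + d_evid + d_factor) / 4.0
    return min(max(score, 0.0), 1.0)
```

Under this sketch, full agreement on all four dimensions yields exactly 0, while simultaneous divergence in severity, escalation, evidence breadth, and attribution pushes the score toward 1, matching the graded tiering of scenarios into low-, intermediate-, and high-uncertainty subsets.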
From an ADAS and SOTIF perspective, these findings emphasize that safety-relevant uncertainty arises not only from perceptual limitations but also from divergent semantic interpretations of partially specified scenarios. Explicitly modeling and visualizing such graded ambiguity is therefore essential for scenario-centric validation, model auditing, and the design of robust decision-support pipelines that remain resilient under semantic indeterminacy.

E. Discussion: Implications for Scenario-Centric Auditing and SOTIF-Relevant Validation

Fig. 12: Compact heatmap of composite inter-model uncertainty across the evaluated scenarios. Rows correspond to scenario identifiers, ordered from lowest to highest uncertainty (bottom to top), and columns correspond to the evaluated language models. Color intensity indicates the per-model contribution to composite uncertainty, aggregated over overall risk severity, high-risk escalation, evidence usage, and dominant risk factor attribution.

The radar visualization in Fig. 11 provides a consolidated behavioral signature of the inter-model patterns previously discussed, aggregating the primary risk-related dimensions evaluated across the near-people scenario set. Rather than introducing new measurements, this figure synthesizes the previously presented outcomes, enabling direct comparison of model-level risk profiles across identical scenario windows and fixed prompt constraints. The summary highlights systematic divergence across models, most notably in baseline severity attribution and high-risk escalation frequency. Consistent with the earlier analysis, the multimodal configuration adopts a more precautionary stance, exhibiting a higher mean risk and more frequent escalations beyond the high-risk threshold, whereas the text-only models remain comparatively conservative despite operating on the same structured scenario artifacts.
This synthesis supports the core methodological premise that controlled scenario reuse reveals stable, model-specific risk-interpretation profiles that are both comparable and reproducible. These divergences are consistent with SOTIF-style performance limitations, where hazardous outcomes can arise without malfunction when evidence is incomplete or semantically underdetermined.

Beyond the primary risk dimensions, the visualization further incorporates uncertainty expression (as encoded in the JSON output) and relative token usage as secondary behavioral indicators, offering deeper insight into how models operationalize justification and confidence. Taken together, text-only models more explicitly externalize uncertainty while producing longer outputs, reflecting a strategy that favors broader evidence enumeration and explicit justification. In contrast, the multimodal model exhibits lower relative token usage and more compact uncertainty signaling, suggesting a reliance on internalized perceptual grounding rather than extensive textual elaboration. This contrast reveals a trade-off between conciseness and audit traceability: while shorter responses may enhance efficiency and decisiveness, more verbose outputs can facilitate post-hoc inspection and accountability. In practice, this trade-off can be treated as a controllable design variable through prompt constraints and post-processing policies that enforce a minimum level of evidence disclosure for safety-critical alerts. Importantly, the joint analysis of risk severity, uncertainty, and token usage demonstrates that these dimensions are not independent; they jointly characterize distinct cognitive styles across models, supporting the role of LLMs as auditable cognitive probes in scenario-centric driving-assistance evaluation.
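A post-processing policy of the kind mentioned here, gating safety-critical alerts on a minimum level of evidence disclosure, could look like the following minimal sketch. The JSON field names ("risk", "evidence"), the high-risk threshold, and the minimum evidence count are hypothetical placeholders, not the paper's closed schema.

```python
# Hypothetical gate applied to a model's parsed JSON output: high-risk
# alerts that do not disclose enough supporting evidence are rejected,
# enforcing audit traceability for safety-critical decisions.
MIN_EVIDENCE_ITEMS = 2

def accept_alert(output, high_risk=7.0, min_evidence=MIN_EVIDENCE_ITEMS):
    """Reject high-risk alerts lacking sufficient disclosed evidence.

    Non-critical outputs pass through unchecked; only alerts at or
    above the high-risk threshold must justify themselves."""
    if output.get("risk", 0) < high_risk:
        return True
    return len(output.get("evidence", [])) >= min_evidence
```

A rejected alert would then be routed to a fallback (e.g., re-prompting for justification) rather than surfaced directly to the driver.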
Notably, these findings do not motivate using LLM outputs as ground-truth perception, but rather as auditable interpretive probes whose divergences can be measured, compared, and managed.

VII. CONCLUSION

This study analyzed how LLMs interpret identical urban driving scenarios under structured, scenario-centric evaluation. Despite identical inputs and prompts, models exhibited systematic divergence in severity assessment, escalation behavior, evidence use, and causal attribution. Disagreement extended to the interpretation of vulnerable road users, reflecting intrinsic semantic indeterminacy rather than isolated failure. These findings motivate explicit ambiguity modeling when integrating LLM-based reasoning into safety-aligned ADAS validation. Future work will extend this framework to large-scale scenario corpora and investigate how structured ambiguity metrics can inform adaptive alert policies and safety assurance processes under SOTIF-oriented validation.