Dual-Stage LLM Framework for Scenario-Centric Semantic Interpretation in Driving Assistance

Jean Douglas Carvalho, Hugo Taciro Kenji, Ahmad Mohammad Saber, Glaucia Melo, Max Mauro Dias Santos, and Deepa Kundur

Abstract—Advanced Driver Assistance Systems (ADAS) increasingly rely on learning-based perception, yet safety-relevant failures often arise without component malfunction, driven instead by partial observability and semantic ambiguity in how risk is interpreted and communicated. This paper presents a scenario-centric framework for reproducible auditing of LLM-based risk reasoning in urban driving contexts. Deterministic, temporally bounded scenario windows are constructed from multimodal driving data and evaluated under fixed prompt constraints and a closed numeric risk schema, ensuring structured and comparable outputs across models. Experiments on a curated near-people scenario set compare two text-only models and one multimodal model under identical inputs and prompts. Results reveal systematic inter-model divergence in severity assignment, high-risk escalation, evidence use, and causal attribution. Disagreement extends to the interpretation of vulnerable road user presence, indicating that variability often reflects intrinsic semantic indeterminacy rather than isolated model failure. These findings highlight the importance of scenario-centric auditing and explicit ambiguity management when integrating LLM-based reasoning into safety-aligned driver assistance systems.

Index Terms—Advanced Driver Assistance Systems, Large Language Models, Scenario-Based Evaluation, Semantic Risk Interpretation, Explainable, Multimodal Perception

I. INTRODUCTION

Advanced Driver Assistance Systems (ADAS) are increasingly deployed to enhance road safety by supporting drivers in perception, hazard anticipation, and decision-making.
Despite substantial progress in sensing technologies and learning-based perception, the reliability of ADAS remains fundamentally constrained in real-world urban environments, where occlusions, illumination variability, and dense interactions with vulnerable road users (VRUs) are common [1, 2, 3]. In such settings, safety risks frequently arise not from component failures but from limitations in how complex traffic situations are perceived, interpreted, and communicated to drivers.

From a safety engineering standpoint, these limitations are not fully addressed by traditional Functional Safety frameworks, such as ISO 26262, which primarily address hazards arising from system malfunctions [4]. Instead, many critical ADAS risks arise when systems operate as intended yet exhibit insufficient performance under ambiguous, uncertain, or unforeseen conditions. This class of fault-free hazardous behavior is explicitly addressed by the Safety of the Intended Functionality (SOTIF) framework standardized in ISO 21448, which highlights perception limitations, semantic ambiguity, and foreseeable misuse as primary safety concerns [5]. Consequently, understanding and managing semantic uncertainty in perception and interpretation has become a central challenge for safety-oriented ADAS validation.

Addressing this challenge increasingly requires moving beyond raw sensor streams toward structured, scenario-centric representations. Scenario abstraction enables the decomposition of complex driving interactions into discrete, comparable, and auditable units, thereby supporting systematic and reproducible evaluation at scale. Frameworks such as MetaScenario formalize this concept by organizing multimodal driving data into semantically indexed scenario representations, facilitating consistent comparison across heterogeneous conditions while preserving real-world complexity [6].
Importantly, such representations focus on scenario description and organization, remaining largely agnostic to how higher-level reasoning components interpret risk and intent within those scenarios.

Within this scenario-centric paradigm, recent advances in Large Language Models (LLMs), including multimodal variants, introduce new opportunities for high-level semantic interpretation of driving situations. Unlike conventional perception models that operate at the signal or object-detection level, LLMs can integrate heterogeneous contextual cues, such as object relations, ego-vehicle dynamics, infrastructure attributes, and environmental context, into structured, human-interpretable assessments of traffic situations [7, 8, 9]. In this work, LLMs are not treated as perception, control, or decision-making modules, but as interpretative and analytical components capable of reasoning over scenario-bounded contextual evidence [8, 10].

However, introducing LLMs into safety-critical driving contexts raises fundamental questions regarding reliability, consistency, and safety alignment. Different models may produce divergent semantic interpretations when exposed to identical traffic scenarios, particularly under partial or ambiguous perception, leading to variability in risk attribution and VRU assessment [11, 12]. Rather than representing isolated model errors, such divergences may reflect intrinsic semantic ambiguity in the available evidence. Without a structured and reproducible evaluation framework, these differences remain difficult to quantify, compare, or audit in a way that supports safety-oriented development.

Motivated by these challenges, this work investigates how LLM-based semantic reasoning behaves when exposed to identical real-world driving scenarios under deterministic, scenario-centric representations and constrained semantic outputs.
By grounding the analysis in reproducible, scalable scenario abstractions derived from multimodal driving data, the proposed framework enables systematic, large-scale comparison of inter-model differences in risk interpretation, evidence use, and sensitivity to vulnerable road users. In doing so, this study positions LLMs as interpretable cognitive probes for ADAS evaluation, providing a controlled methodology to expose semantic ambiguity and assess its implications for safety-aligned system design.

The remainder of this paper is organized as follows. Section II establishes the technical and conceptual foundations of the work, characterizing limitations in conventional ADAS pipelines and formalizing the problem of context-dependent perception. Section III describes the multimodal data platform and the preprocessing pipeline used to generate the normalized 1 Hz semantic substrate. Section IV details the system architecture, focusing on the deterministic scenario construction and the prompted LLM evaluation framework. Section V presents the experimental setup, including the collection protocol for safety-critical "near-people" scenarios and the language models evaluated. Section VI provides a quantitative and qualitative analysis of the experimental results, followed by the concluding remarks in Section VII.

II. BACKGROUND AND PROBLEM STATEMENT

This section outlines the technical foundations of the proposed framework and motivates scenario-centric auditing of semantic risk interpretation in ADAS.

ADAS rely fundamentally on accurate perception and timely interpretation of complex traffic scenes to support safe driving decisions [13, 14]. Modern perception pipelines predominantly employ camera- and radar-based sensing combined with computer vision and signal processing algorithms to detect objects, estimate motion, and infer driving context [15, 16].
Despite significant progress in deep learning–based perception, current ADAS architectures remain largely reactive and context-agnostic under rapidly changing environmental conditions such as low illumination, adverse weather, occlusions, or dense urban traffic [1, 2, 3].

Beyond raw perception performance, a core practical limitation is that many ADAS deployments provide limited transparency regarding how warnings are triggered and how risk is framed in ambiguous situations. This becomes particularly problematic in mixed traffic and partial observability conditions, where human drivers expect context-aware and interpretable feedback [17, 18]. As a result, the safety relevance of an ADAS output is not solely determined by whether objects are detected, but also by how risk is semantically interpreted and communicated under uncertainty.

A. Limitations of Conventional ADAS Pipelines

Most existing ADAS pipelines decouple perception, sensor control, and decision-making into independent modules [19]. Vision models such as YOLO or semantic segmentation networks operate on raw sensor inputs with limited awareness of sensor configuration or environmental semantics [1, 2, 20]. Sensor calibration, when present, is typically performed offline or through predefined rule-based logic, which constrains adaptability and generalization across diverse conditions.

From an engineering perspective, this modular decoupling yields two persistent gaps. First, downstream risk reasoning (whether implemented as rules, heuristics, or learned modules) may inherit perceptual uncertainty without a systematic mechanism to expose how ambiguity propagates into safety-relevant interpretations. Second, explainability often remains limited: the system may trigger warnings or interventions without conveying the underlying rationale in a form that supports user trust and post hoc analysis [17, 18].
Formally, conventional approaches optimize perception models for detection accuracy under fixed assumptions:

    \min_{\phi} \; \mathbb{E}\left[ \mathcal{L}\left( f_{\phi}(y_t), x_t \right) \right]    (1)

where f_ϕ denotes a perception model with parameters ϕ. Notably, this formulation treats sensor behavior as exogenous, omitting adaptive control over θ_s. While sensor adaptation and perception robustness are active research topics, they are orthogonal to the present work: this paper does not propose sensor adaptation or vehicle control mechanisms. Instead, it focuses on the subsequent layer, namely how higher-level semantic risk interpretation can diverge even when the input evidence is held fixed.

To address structural limitations in scenario handling, scenario-centric representations have been proposed in the literature, decomposing complex driving situations into discrete, comparable, and auditable units. Frameworks such as MetaScenario [6] formalize scenarios as first-class entities, enabling systematic description, indexing, and retrieval of driving situations across datasets and experiments. While such representations significantly improve reproducibility and comparability at the scenario level, they remain largely agnostic to how higher-level reasoning components interpret risk, intent, and causality within those scenarios. Consequently, the use of scenario-centric abstractions to explicitly expose semantic ambiguity and inter-model divergence in risk interpretation remains an open problem.

B. Perception as a Context-Dependent Estimation Problem

From a systems perspective, perception in ADAS can be modeled as a state estimation problem in which the vehicle infers a latent scene state x_t from noisy sensor observations y_t [21]:

    y_t = h(x_t, \theta_s) + \nu_t    (2)

where h(·) represents the sensor observation model, θ_s denotes sensor configuration parameters (e.g., camera exposure, ISO, shutter speed, radar range), and ν_t captures measurement noise.
In practice, θ_s is commonly fixed or adjusted via offline calibration and handcrafted heuristics, implicitly assuming quasi-stationary environmental conditions [20]. Real-world driving violates this assumption. Environmental factors such as illumination, weather, road geometry, and traffic density dynamically alter the effective observation process, thereby degrading detection confidence and segmentation fidelity when parameters remain static. This degradation propagates downstream, increasing the likelihood of missed detections, false alerts, or semantically ambiguous evidence. Importantly, even when perception outputs are available, their interpretation at the level of risk, intent, and hazard type may remain underdetermined, a property that becomes critical when assessing safety behavior in complex urban scenes [1, 2, 3].

In this work, these perception-related uncertainties are treated as upstream conditions that motivate controlled analysis. The goal is not to optimize h(·) or adapt θ_s, but to study how fixed, scenario-bounded evidence can nonetheless yield divergent high-level risk interpretations.

C. Emergence of Semantic Reasoning and Language Models

Recent advances in Large Language Models (LLMs), particularly multimodal variants, introduce a new capability: semantic reasoning over heterogeneous evidence streams [7, 8]. Unlike traditional perception models that operate primarily in feature space, LLM-based components can integrate visual cues, vehicle telemetry, and contextual knowledge to generate structured interpretations of scene semantics, risk factors, and intent [9, 10]. Prior work has explored LLMs for post hoc interpretation, human–machine interfaces, and offline analysis in autonomous driving contexts [7, 8].
Although the broader literature has also discussed leveraging high-level semantic representations for adaptive sensing or orchestration, such control-oriented roles are explicitly outside the scope of this paper. Here, LLMs are treated strictly as interpretative and analytical components: given identical, scenario-bounded contextual evidence, the objective is to characterize how different models frame risk, select evidence, and attribute causes in the face of semantic ambiguity.

This framing emphasizes a distinct validation challenge: if LLM-based interpretative modules are to support safety arguments, their behavior must be consistent enough to be audited, or at a minimum, their divergences must be measurable and explainable under controlled conditions.

D. Safety, SOTIF, and Cybersecurity Foundations for LLM-Enabled ADAS

The integration of LLM-based components into ADAS must be framed within established automotive safety and security standards. In safety-critical vehicular systems, nominal correctness is insufficient; behavior must be demonstrably safe, predictable, and robust under both fault conditions and performance limitations. Accordingly, the analysis of LLM-augmented ADAS architectures is commonly discussed through the combined lenses of Functional Safety (ISO 26262) [4], Safety of the Intended Functionality, SOTIF (ISO 21448) [5], and Cybersecurity (ISO 21434) [22].

Rather than asserting that these standards "fail" to address perception limitations, a more precise interpretation is that they acknowledge uncertainty and performance limitations (particularly under SOTIF), while leaving open the question of how semantic ambiguity should be exposed, compared, and audited in concrete engineering workflows. In this sense, the present work is complementary: it neither proposes a certified safety solution nor modifies the standards.
Instead, it provides a controlled analytical methodology that makes ambiguity and interpretative divergence explicit, thereby supporting more structured safety arguments when LLM-based interpretative modules are considered.

E. Problem Statement and Research Gap

The considerations above reveal a critical gap in current ADAS research and validation practices. While perception pipelines and emerging semantic reasoning components can produce rich interpretations of driving scenes, there is no established framework that systematically exposes, compares, and audits these interpretations under identical real-world conditions. In particular, without structured, scenario-centric representations, it becomes difficult to isolate how semantic ambiguity in upstream evidence propagates into higher-level risk reasoning, and whether observed disagreements reflect intrinsic ambiguity or model-specific bias.

This paper addresses this gap by adopting scenario-based semantic abstraction as the fundamental unit of analysis. Driving situations are represented as deterministic, temporally bounded scenarios that can be reproduced, queried, and evaluated consistently across experiments. This design enables systematic, scalable evaluation, allowing large collections of identical traffic scenarios to be analyzed under controlled conditions without confounding variation in input evidence.

Within this framework, identical scenarios are presented to different LLMs under fixed prompts and bounded contextual inputs. Rather than treating disagreement as an isolated model failure, divergences in semantic interpretation are explicitly captured as measurable analytical signals. The following sections operationalize this approach through a multimodal data platform, a deterministic scenario-construction pipeline, and an evaluation protocol designed to support reproducible, large-scale inter-model comparison.

III. DATASET DESCRIPTION AND PRE-PROCESSING PIPELINE

The effectiveness of the proposed framework relies on transforming raw, high-frequency sensor data into structured, scenario-centric representations that can be audited for semantic consistency. To bridge the gap between low-level signal processing and high-level cognitive reasoning, we have developed a data-centric architecture that normalizes and aligns heterogeneous input streams. This section details the multi-layered pipeline for consolidating these sources into a unified temporal substrate, ensuring that subsequent scenario-based evaluations remain reproducible across diverse driving environments and model configurations.

A. Multimodal Data Platform

The platform is built upon a proprietary multi-modal dataset collected from instrumented vehicles operating in real urban environments, available at carcara.com. The dataset integrates visual perception, vehicle telemetry, and external contextual information, and deliberately adopts a sensing configuration that excludes LiDAR sensors. This architectural choice reflects a trade-off that prioritizes reduced instrumentation cost, simplified deployment, and high replicability across heterogeneous vehicles and urban contexts, while remaining sufficient to support semantic scene understanding and risk-oriented interpretation tasks. Lowering the marginal cost and complexity of sensing is widely recognized as a key factor in enabling the collection of representative real-world driving data at scale, particularly when data acquisition spans multiple vehicles and long operational periods.

The primary contribution lies in the architectural organization of heterogeneous data into a scenario-centric semantic abstraction that is independent of the specific sensing configuration. In this sense, the proposed pipeline is not tied to a particular dataset or sensor suite; rather, it defines a generalizable representation strategy that can be applied to any multimodal driving dataset that provides synchronized perception, vehicle state, and contextual information.

Visual modalities operate at higher acquisition rates and provide the primary source of perceptual grounding. Vehicle-mounted cameras capture video at approximately 15 fps. These visual streams are processed to extract object detections, semantic scene segmentation, and lane geometry estimates. Although computed at video frequency, these outputs are not retained as raw frame-level data, but as compact semantic descriptors suitable for multimodal alignment.

Despite heterogeneity in sampling rates across modalities, all data streams are ultimately aligned to a common temporal resolution of 1 Hz. This temporal alignment constitutes a deliberate abstraction step that enables consistent multimodal fusion and subsequent scenario-based reasoning, while avoiding unnecessary dependence on high-frequency raw signals. Rather than treating the dataset as a collection of independent sensor logs, the platform organizes all available information into a temporally structured semantic substrate, explicitly designed to support scenario-level interpretation and reproducible evaluation. Figure 1 summarizes this abstraction process, illustrating how heterogeneous multimodal inputs are consolidated into a unified 1 Hz semantic representation.

B. Processing and Fusion

1) Ingestion and Offline Heavy Processing: Visual perception tasks include semantic scene segmentation, lane geometry extraction, and object detection, augmented by a tracking process that associates successive detections over time.
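The tracking association just described can be sketched as greedy nearest-neighbor matching over detection positions. This is only an illustrative sketch: the 2-D point representation, the distance threshold, and the fresh-identity policy are assumptions, not the platform's actual tracker.

```python
def associate(tracks, detections, max_dist=2.0):
    """Greedily associate detections to persistent track identities.

    tracks:     dict mapping track id -> last known (x, y) position;
                updated in place with the matched/created positions.
    detections: list of (x, y) positions at the current time step.
    Returns a list of track ids aligned with `detections`. A detection
    with no track closer than `max_dist` opens a new identity (the
    unmatched case, written as the empty symbol in the formalism).
    """
    assigned = []
    free = dict(tracks)  # tracks still available at this time step
    for det in detections:
        best_id, best_d = None, max_dist
        for tid, pos in free.items():
            d = ((det[0] - pos[0]) ** 2 + (det[1] - pos[1]) ** 2) ** 0.5
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None:
            best_id = max(tracks, default=0) + 1  # unmatched: new identity
        else:
            del free[best_id]                     # each track matches once
        tracks[best_id] = det                     # persist updated position
        assigned.append(best_id)
    return assigned
```

For example, a detection near an existing track inherits its identity, while a distant one spawns a new persistent entity; run offline per frame, this induces the temporal continuity the fusion stage later relies on.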
This tracking stage introduces temporal continuity into the perceptual representation by establishing persistent object identities across frames, thereby enabling the construction of dynamic entities whose evolution can subsequently be reasoned about at the semantic level. All visual processing stages are executed entirely offline, as they operate at video frequency, incur high computational cost, and are therefore intentionally decoupled from any real-time or online reasoning loop.

Fig. 1: Unified Multimodal Dataset: Data from heterogeneous vehicles are abstracted into a unified multimodal representation that jointly encodes visual perception, vehicle telemetry, and external contextual information within the multimodal data platform.

Let D_t = { d_t^1, ..., d_t^{N_t} } be the set of detections at time t. Tracking defines an association function a_t : D_t → I ∪ {∅}, which assigns each detection to a persistent identity i ∈ I (or ∅ for unmatched detections), thereby inducing temporal continuity across frames.

2) Internal Fusion: Following modality-specific processing, the platform performs multimodal fusion to integrate complementary information sources into a unified semantic representation of the driving scene. This fusion stage operates deterministically, relying on temporal alignment and spatial consistency across modalities rather than on learned or heuristic decision logic, and is designed to consolidate heterogeneous observations into a coherent semantic structure that can serve as a stable substrate for subsequent scenario abstraction and higher-level interpretation. The fusion of object detection and tracking outputs yields persistent semantic entities, each associated with spatial attributes, temporal extent, and categorical information, and explicitly linked to lane geometry estimates.
This association enables the derivation of spatial relationships such as lane affiliation, relative position with respect to the ego vehicle, and proximity to road boundaries, transforming isolated detections into context-aware scene elements that retain temporal continuity and spatial meaning.

3) External Contextual Enrichment: In addition to dynamic perceptual inputs, the scene representation is enriched with external contextual information obtained through external APIs. Static and semi-static road attributes, including road type, number of lanes, and sidewalk presence, are integrated from digital map services such as OpenStreetMap, while basic environmental descriptors, including weather conditions, are retrieved from external weather APIs and provide complementary situational context. Importantly, the multimodal fusion stage does not perform semantic reasoning, risk assessment, or interpretive judgment; its role is strictly limited to organizing and aligning heterogeneous data into a consistent semantic substrate, on which scenario abstraction and cognitive evaluation can subsequently be conducted in a controlled, reproducible, and safety-oriented manner.

C. Data Layer, Services and Platform Consumption

1) Normalized Data Layer: Following multimodal fusion and temporal alignment, all scene representations are consolidated into a single unified data layer organized at a fixed temporal resolution of 1 Hz. Each temporal unit within the unified layer corresponds to a semantic snapshot of the driving scene. These snapshots aggregate ego-vehicle dynamics, persistent object representations, road and infrastructure attributes, environmental context, and acquisition-level metadata into a coherent structure that can be queried and recomposed deterministically.
By concentrating all relevant modalities into a single temporally indexed representation, the platform enables complete scene reconstruction and interpretation from a single query, supporting both interactive exploration and automated analysis workflows. The explicit separation between raw data, processed perceptual outputs, and the normalized semantic layer ensures that computationally expensive perception stages are not repeated, while preserving flexibility for complex querying and scenario recomposition. In this organization, indexing is treated as a semantic and temporal construct rather than as a database-specific mechanism, enabling efficient access to context elements and consistent reuse of identical scene representations across multiple evaluation campaigns. This layered design supports scalability while preserving the methodological rigor required for reproducible scenario-based analysis.

2) Backend Data Services: Built directly on top of the normalized data layer, the platform backend provides online data services that expose multimodal scene representations in a controlled, efficient, and reproducible manner. Rather than coupling consumers to implementation-specific storage details, the service layer can be formalized as a deterministic query operator over the unified temporal representation. The service layer supports context-oriented querying, enabling retrieval of complete temporal scenes filtered by semantic, spatial, and dynamic criteria, including road characteristics, object configurations, environmental conditions, and ego-vehicle behavior. In addition to fine-grained filtering, the backend ensures consistent retrieval and organization of temporal contexts, enabling identical scenario definitions to be accessed, reused, and compared across different analytical processes.

Let U denote the normalized scene state at a fixed temporal resolution of 1 Hz, indexed by acquisition identifier a and discrete time t.
Let θ denote a structured scenario query encoding the semantic, spatial, behavioral, and temporal constraints that define a scenario of interest. The backend executes this query over the normalized representation and materializes the corresponding scenario context S, which is a structured collection of normalized scene states indexed by the acquisition-time pairs selected by the query. This operation can be expressed as

    S = B( U(a, t), θ ),    (3)

where B(·) denotes the deterministic backend query operator responsible for scenario construction. The set of valid (a, t) indices defining the temporal extent of S is implicitly induced by the query specification θ, ensuring that the scenario S materializes all database time instants whose normalized scene states satisfy the query constraints, such as specific vehicle types, illumination conditions, weather context, road environment, or traffic density, while preserving their original temporal ordering.

By operating on the normalized 1 Hz representation, the service layer leverages a temporally sparse yet semantically dense abstraction that is significantly lighter than the original high-frequency sensor streams. This design creates the computational margin required for efficient online querying, scenario reuse, and scalable consumption across multiple platform interfaces.

3) Consumption Interfaces: The backend data and services are accessed via complementary platform interfaces that support both interactive and automated workflows.

Fig. 2: Deterministic scenario construction and consumption workflow. Normalized scene states U(a, t), indexed at 1 Hz, are queried by backend data services to materialize scenario snapshots S(a, t) through the deterministic operator B(·) under a structured query specification θ. Scenario construction combines semantic filters spanning visual perception, vehicle telemetry, weather context, and map information. The resulting scenarios are exposed consistently to both the web-based authoring interface for selection and inspection and the local testing module for scenario evaluation.

The web-based interface provides a user-oriented environment for visual exploration of normalized scene representations, enabling synchronized inspection of perceptual outputs, contextual attributes, and temporal information for each acquisition. Within this interface, users can execute context-oriented queries to retrieve complete temporal scenes and curate selected temporal units into persistent scenario groupings, which explicitly reflect analytical interests and evaluation objectives while preserving the deterministic semantics of the underlying representation. In parallel, a local consumption interface supports external analytical modules that operate outside the web platform. This interface enables the systematic consumption of predefined scenario groupings, exposing structured multimodal inputs to automated workflows such as model-driven analyses or language-model-based evaluation pipelines.

Formally, the web interface materializes the scenario context as a canonical artifact S, which serves as the unified representation of a curated driving scenario. This artifact is subsequently consumed by a local processing pipeline through a dedicated consumption operator, without altering its semantic or temporal structure. The relationship between interface, scenario, and consumption can be expressed as

    I_web → S →^{C_local} O,    (4)

where I_web denotes the web-based authoring interface, C_local(·) denotes a deterministic local consumption operator, and O represents the structured outputs produced by automated analysis pipelines.

Once scenarios are selected and materialized through the web interface, they transition from data-access abstractions into executable experimental units.
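The deterministic query operator B(·) of Eq. (3) can be sketched as a pure filter over the 1 Hz snapshot store. In this sketch, snapshots are modeled as flat attribute dictionaries and θ as an equality constraint map; the field names are hypothetical, since the platform's actual schema is richer than shown here.

```python
def backend_query(U, theta):
    """Deterministic scenario construction S = B(U, theta) (cf. Eq. 3).

    U:     dict mapping (acquisition_id, t) -> normalized 1 Hz scene
           snapshot, each snapshot a flat dict of semantic attributes
           (illustrative field names, not the platform's real schema).
    theta: dict of attribute -> required value, encoding the query.
    Returns the scenario context S: all matching snapshots, keyed by
    (a, t) and kept in their original temporal ordering.
    """
    def satisfies(snapshot):
        return all(snapshot.get(k) == v for k, v in theta.items())

    # Sorting keys by (a, t) preserves temporal ordering, as required of B(.)
    return {key: U[key] for key in sorted(U) if satisfies(U[key])}
```

Because the operator has no hidden state or randomness, repeating the same query over the same normalized layer always materializes byte-identical scenario contexts, which is what makes scenario reuse across evaluation campaigns auditable.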
In this stage, each canonical scenario S is treated as an input to local processing workflows, where additional structural transformations are applied to support controlled analysis and evaluation. In particular, scenario execution operates over temporally extended contexts derived from anchor instants selected during interactive exploration, enabling the systematic reconstruction of dynamic scene evolution around moments of interest.

IV. SYSTEM ARCHITECTURE

The system architecture defines how user-selected normalized multimodal data are transformed into executable scenario instances and systematically consumed by local analysis workflows. Rather than operating directly on full acquisitions, the platform adopts a scenario-based abstraction in which curated scenario contexts S are treated as atomic experimental units. These units are subsequently expanded in time and enriched with controlled contextual structure to support reproducible testing, model comparison, and behavioral evaluation.

A. Temporal Scenario Extension

Although scenario definitions are selected through the web-based interface, temporal scenario extension is executed locally over the canonical scenario contexts provided by the platform consumption layer. Each scenario S contains one or more normalized scene snapshots selected during interactive exploration and treated as semantic anchors for subsequent construction. A selected snapshot corresponding to a moment of interest is denoted as a temporal anchor t_0. For each temporal anchor, the platform generates a scenario window by applying a configurable temporal expansion specified in seconds. In particular, users define how many seconds of context are included before and after the anchor instant by selecting the pre-event and post-event extents k and m, respectively.
The resulting temporal window is defined as

\( \Delta T = [\, t_0 - k,\; t_0 + m \,], \)

where k and m denote the pre-event and post-event temporal extents, respectively. This windowing strategy introduces controlled temporal context around the anchor moment, enabling the inclusion of dynamic evolution immediately preceding and following the event of interest. By explicitly defining temporal bounds during scenario generation, all scenarios follow consistent and reproducible structural rules, while allowing asymmetric temporal configurations when required by the analysis task.

The resulting scenario window is composed of a compact sequence of normalized temporal units centered on the user-defined anchor moment and constrained to the predefined temporal span. All multimodal information available in the normalized data layer within this interval is included, allowing each scenario to capture a coherent and self-contained representation of the traffic situation without introducing additional preprocessing or inference stages. The scenario generation procedure is fully deterministic, meaning that identical temporal configurations always produce identical scenario instances, which is essential for ensuring reproducibility across repeated analyses and experimental executions [23, 24]. As illustrated in Fig. 3, this controlled temporal expansion establishes each scenario as a stable and reusable analytical unit, forming the foundation for subsequent interpretation and evaluation stages.

Fig. 3: Scenario window generation from a single user-defined key moment (t_0), illustrating controlled temporal expansion before and after the anchor instant.

B. Scenario Interpretation via Prompted LLM Evaluation

In addition to the scenario-generation stage, the platform incorporates an explicit prompt-engineering layer that primarily operates through the local consumption interface.
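The deterministic window expansion described above amounts to a pure selection over timestamped units; a minimal sketch (function and field names such as `expand_window` and `"t"` are illustrative assumptions, not the platform's actual API):

```python
def expand_window(units, anchor, k, m):
    """Select all normalized temporal units whose timestamp falls
    inside the closed window [anchor - k, anchor + m]; identical
    inputs always yield identical windows (no inference, no
    preprocessing), which is what makes scenarios reproducible."""
    lo, hi = anchor - k, anchor + m
    return [u for u in units if lo <= u["t"] <= hi]

# Asymmetric configuration: k = 2 s before, m = 4 s after anchor t0 = 10
units = [{"t": t, "objects": []} for t in range(20)]
window = expand_window(units, anchor=10, k=2, m=4)
assert [u["t"] for u in window] == [8, 9, 10, 11, 12, 13, 14]
```

Because the selection depends only on the anchor and the (k, m) extents, repeated executions over the same normalized data reproduce the same scenario instance, as Eq. (4)'s deterministic consumption requires.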
This layer is responsible for transforming structured temporal scenarios into controlled analytical tasks suitable for language-model-based interpretation while maintaining strict alignment with the underlying normalized multimodal data. Prompt construction operates over fixed temporal scenario windows.

Based on this windowed representation, the platform supports different prompt configurations that define how each scenario is presented to the language model. These configurations include fixed prompt templates, designed to enforce standardized analytical tasks across all scenarios, and customizable prompt formulations, which allow alternative objectives or perspectives to be specified while preserving the same underlying scenario structure. In all cases, prompt content is systematically derived from the normalized data associated with the temporal window, including ego-vehicle dynamics, detected objects, road attributes, and environmental context. Decoupling scenario definition from prompting enables reuse across models and configurations.

C. Prompt Specification for Cognitive Risk

The platform adopts a structured prompt specification to constrain how language models represent and express traffic risk. Rather than producing free-form textual explanations, models must operate within a closed semantic space defined by numeric codes, ensuring consistency, comparability, and machine-readability across all evaluated scenarios and models. The risk representation space encodes multiple complementary dimensions of traffic safety, including overall severity, conflict types, ego-vehicle behavior, vulnerable road user presence, temporal dynamics, uncertainty, and evidence attribution. Each dimension is expressed exclusively through predefined numeric identifiers, preventing semantic drift and subjective reinterpretation.
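One way to realize the deterministic prompt construction described above is a fixed template filled only from the normalized window data; a minimal sketch (the template wording and field names are illustrative assumptions, not the platform's actual prompts):

```python
import json

# Hypothetical fixed template enforcing a standardized analytical task
FIXED_TEMPLATE = (
    "Assess traffic risk for the following scenario window.\n"
    "Ego dynamics per second: {ego}\n"
    "Detected objects per second: {objects}\n"
    "Road/environment context: {context}\n"
)

def build_prompt(window_units, template=FIXED_TEMPLATE):
    """Derive prompt content systematically from a fixed temporal
    window; the same window and template always yield the same
    prompt string (sort_keys makes serialization deterministic)."""
    return template.format(
        ego=json.dumps([u.get("ego", {}) for u in window_units], sort_keys=True),
        objects=json.dumps([u.get("objects", []) for u in window_units], sort_keys=True),
        context=json.dumps(window_units[0].get("context", {}) if window_units else {},
                           sort_keys=True),
    )
```

Swapping in a customizable template changes only the analytical framing, while the underlying scenario structure, and hence comparability across models, is preserved.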
Risk Encoding and Numeric Taxonomy

Overall Risk Severity (overall_risk_level): 0 Not identified; 1 Safe mode; 2 Low; 3 Moderate; 4 Elevated; 5 High; 6 Critical
Risk Occurrence Indicator (window_has_risk): 0 No risk identified (overall_risk_level = 0); 1 Risk identified (overall_risk_level ≥ 1)
Evidence Attribution Signals (evidence_signals): 1 Object presence; 2 Object distance; 3 Object–lane relation; 4 Speed; 5 Steering; 6 Braking; 7 Road/environment; 8 Image input
Risk Category Types (risk_types): 2 Pedestrian; 3 Cyclist; 4 Rear-end; 5 Lateral conflict; 6 Intersection; 7 Speed; 8 Visibility; 9 Infrastructure; 10 Traffic density

Once the semantic space is fixed, a global prompt constraint layer is applied to ensure deterministic behavior and strict epistemic discipline across all evaluated language models. This layer defines the role of the model, limits admissible evidence sources, and enforces a rigid output format, independently of any specific risk semantics. Concretely, it operates as a fixed prompt prefix that constrains all responses to a strictly structured JSON representation using only predefined numeric codes, thereby transforming model outputs into machine-readable and directly comparable analytical artifacts.

Prompt Constraints and JSON Output Enforcement

You are an expert in urban traffic risk assessment. Use ONLY the information explicitly provided in the input (structured CAN per second, YOLO detections including dist_m and lane_rel, basic context, and IF PRESENT any provided images).
IMPORTANT:
• Output MUST be valid JSON only. No markdown. No extra text.
• Use ONLY the numeric codes provided (do not output strings).
• Do NOT invent missing data.

This structured schema transforms model outputs into directly comparable analytical artifacts. D.
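Because the semantic space is closed, conformance of a model response can be checked mechanically; a minimal validator sketch built from the numeric taxonomy above (field names follow the listed identifiers, while the validation logic itself is an assumption about how the platform might enforce the schema):

```python
import json

# Closed numeric code spaces, taken from the taxonomy above
SCHEMA = {
    "overall_risk_level": set(range(0, 7)),   # 0-6
    "window_has_risk": {0, 1},
    "evidence_signals": set(range(1, 9)),     # 1-8
    "risk_types": set(range(2, 11)),          # 2-10
}

def validate_output(raw):
    """Parse a model response and reject anything outside the closed
    semantic space: non-JSON text, string labels, or unknown codes."""
    out = json.loads(raw)  # raises ValueError on markdown or extra text
    for field, codes in SCHEMA.items():
        value = out[field]
        values = value if isinstance(value, list) else [value]
        if not all(isinstance(v, int) and v in codes for v in values):
            raise ValueError(f"{field}: codes {values} outside schema")
    return out
```

Rejecting non-conforming outputs at this boundary is what turns free-form model responses into directly comparable analytical artifacts.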
Result Logging and Traceable Evaluation

The final stage of the platform is dedicated to the systematic recording, aggregation, and analysis of results from language-model-based workflows. This stage consolidates automatically captured execution metrics with structured evaluation outputs, enabling a comprehensive assessment of model behavior under controlled, reproducible, and scalable experimental conditions.

Fig. 4: Scenario-driven workflow of the platform, from web-based identification of key moments to local temporal expansion and structured prompt construction. Curated scenarios are transformed into bounded temporal windows and independently evaluated by multiple language models under identical conditions, thereby establishing a deterministic, reproducible path from interactive scenario selection to controlled model inference.

All results are persistently stored and explicitly linked to their corresponding scenarios, temporal windows, and prompt configurations, ensuring full traceability across evaluation runs and supporting longitudinal analysis. Formally, the outcome of an evaluation run is represented as

\( R = L(S_e, M_e, P_e), \)  (5)

where \( L(\cdot) \) denotes the integrated logging and evaluation operator, and \( S_e \), \( M_e \), and \( P_e \) correspond to the sets of scenarios, language models, and prompt specifications effectively selected for a given experimental execution e. These sets are not assumed to be fixed or exhaustive; instead, they are defined according to the objective of each evaluation run. This abstraction deliberately leaves the internal execution order and pairing strategy unspecified. Within this formulation, experimental designs may isolate individual factors, such as comparing different models under fixed scenarios and prompts or evaluating prompt sensitivity for a single model, or may jointly vary multiple dimensions or repeat identical configurations to assess consistency and robustness across executions.
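The logging operator L of Eq. (5) can be pictured as iterating over whichever scenario, model, and prompt sets a run selects; a minimal sketch using a full cross-product as one possible pairing strategy (the `infer` callable and the record fields are hypothetical):

```python
import itertools
import time

def run_evaluation(scenarios, models, prompts, infer):
    """Integrated logging and evaluation operator L: for each
    (scenario, model, prompt) combination selected for run e, record
    the structured output together with full provenance, so every
    result stays traceable to its inputs and configuration.
    `infer(scenario, model, prompt)` is a hypothetical inference call."""
    records = []
    for s, m, p in itertools.product(scenarios, models, prompts):
        output = infer(s, m, p)
        records.append({
            "scenario_id": s,
            "model": m,
            "prompt_id": p,
            "output": output,
            "timestamp": time.time(),
        })
    return records  # R: persistently storable, fully traceable results
```

Since the formulation leaves the pairing strategy unspecified, the cross-product here is only one instantiation; a repeated-configuration or single-factor design would simply enumerate the tuples differently while keeping the same provenance fields.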
By decoupling the conceptual definition of results from implementation-specific scheduling details, the formulation preserves flexibility while maintaining strict traceability and comparability across overlapping evaluations.

V. EXPERIMENTAL SETUP

To evaluate the proposed dual-stage framework in interpreting complex driving environments, we conduct a series of experiments focused on "near-people" urban scenarios. This section details the experimental configuration, beginning with the protocol for selecting scenarios and curating data from the multimodal platform. We then describe the specific LLM configurations utilized as cognitive probes, the design of the deterministic prompt templates, and the structured evaluation metrics used to quantify model performance. By standardizing these experimental parameters, we establish a controlled environment to assess how effectively the framework handles semantic ambiguity and risk communication in safety-critical contexts.

Fig. 5: Mosaic of sixteen representative scene anchors that serve as temporal seeds for subsequent scenario expansion, illustrating the diversity of near-people situations across urban layouts, traffic densities, lighting conditions, and pedestrian configurations.

Fig. 6: Online visualization of the evaluated test sets available in the platform. Each entry corresponds to a near-people key-moment collection evaluated under identical conditions by different language models, including two text-only configurations and one multimodal model.

A. Collection and Evaluation Protocol

To ensure a controlled and semantically meaningful evaluation of language model behavior, all experiments were conducted over the same curated collection of key moments specifically designed to emphasize safety-critical urban interactions.
Rather than operating over full-length acquisitions or arbitrarily sampled time windows, the evaluation focuses on temporally anchored scenario fragments centered around the presence of vulnerable road users in close proximity to the ego vehicle. In particular, a dedicated near-people key-moment collection was constructed by identifying temporal anchors corresponding to situations in which pedestrians are spatially close to the ego vehicle. For each anchor time instant, a fixed temporal window spanning three seconds before and three seconds after the anchor was extracted, resulting in a standardized seven-second scenario window. This anchoring strategy follows the temporal expansion formulation introduced in the scenario generation framework, ensuring that each scenario captures both the contextual lead-up and the immediate evolution of the interaction. A total of sixteen distinct scenario windows were selected following this criterion and are available at carcara.com/69460/scenario-selected. While larger-scale evaluations involving hundreds or thousands of scenarios are feasible within the platform, the present study deliberately adopts a small-scale experimental setting. This choice enables clearer interpretation of model behavior, facilitates detailed qualitative and quantitative analysis, and serves as an initial validation of the proposed evaluation pipeline before large-scale deployment. Formally, the evaluation set is defined as a collection of N = 16 temporally anchored scenario windows, each centered on a key moment \( t_0^{(i)} \) and symmetrically expanded by three seconds before and after the anchor,

\( \Delta T^{(i)} = \left[\, t_0^{(i)} - 3,\; t_0^{(i)} + 3 \,\right], \quad i = 1, \ldots, 16, \)  (6)

yielding fixed-duration seven-second scenarios with the interaction event centrally aligned.
All selected scenarios were evaluated under identical conditions by three different language models: a lightweight text-only model, a higher-capacity text-only model, and a multimodal model with visual grounding. Each model received the same structured scenario representation and prompt specification, ensuring that observed behavioral differences arise from intrinsic model characteristics rather than variations in input data or experimental setup.

B. Evaluated Language Models

Scenario evaluations in this work adopt an organized batch-based execution strategy as a deliberate methodological choice, rather than as a fixed or inherent constraint of the platform. This strategy is selected to enable controlled, repeatable, and scalable comparison of language model behavior across heterogeneous driving scenarios, while preserving strict isolation between individual analytical contexts.

Both text-only and multimodal models are considered, enabling a systematic investigation of how access to visual information influences semantic scene interpretation, risk perception, and the generation of assistance-oriented recommendations. In the multimodal setting (GPT-4o-vision), the model receives raw visual inputs (images) jointly with structured, normalized textual descriptors, enabling internal encoding of visual cues such as object appearance, spatial relationships, and scene layout, and their direct integration into the reasoning process. In contrast, the text-only models (GPT-4o-mini and DeepSeek-Chat) operate exclusively on symbolic, structured textual representations derived from perception modules, relying on external scene descriptions rather than direct visual evidence.

VI. RESULTS AND DISCUSSION

This section reports the empirical outcomes of a standardized, scenario-based evaluation that compares multiple language models under identical input and prompt constraints.
For every model, the same prompt specification and the same curated set of near-people temporal windows are provided, ensuring identical inputs and output constraints across runs. An online view of the evaluated test sets is available at carcara.com/6946/llm-tests.

A. Model-Level Risk Interpretation: Analysis Logic and Derived Metrics

Before presenting the quantitative comparisons, it is important to clarify the analytical logic used to interpret the structured outputs produced by the evaluated language models. Rather than treating the reported metrics as isolated statistics, the analysis follows a conditional reasoning scheme that maps numerical tendencies to qualitative behavioral profiles. This procedure enables a consistent interpretation of model behavior across identical scenario windows and prompt conditions.

Algorithm 1: Conditional Semantic Analysis of LLM Risk Outputs
Require: Parsed LLM outputs for a fixed scenario set, risk threshold τ
Ensure: Model-level qualitative risk characterization
1: Group all scenario windows by language model m
2: for all models m do
3:   µ_risk ← mean overall risk level
4:   ρ_high ← proportion of windows with risk ≥ τ (threshold-based)
5:   µ_evidence ← mean number of distinct evidence signals
6:   F ← frequency distribution of dominant risk factors (primary category per window)
7:   if µ_risk is high then
8:     Mark model as risk-conservative
9:   else
10:    Mark model as risk-tolerant
11:  end if
12:  if ρ_high is high then
13:    Indicate elevated sensitivity to critical situations
14:  end if
15:  if µ_evidence is low then
16:    Indicate narrow or underspecified reasoning
17:  else
18:    Indicate broader contextual grounding
19:  end if
20:  if F is concentrated on few factors then
21:    Indicate specialized risk perception
22:  else
23:    Indicate diversified risk awareness
24:  end if
25: end for

The conditional logic summarized in Algorithm 1 serves as an interpretive bridge between the raw structured outputs and the comparative analyses discussed in the following subsections. By explicitly defining how risk severity, threshold-based alert escalation, evidence usage, and dominant causal attribution are jointly interpreted, the analysis ensures that observed differences across models reflect distinct cognitive risk profiles rather than arbitrary metric fluctuations.

B. Overall Risk Assessment Behavior

1) Mean Overall Risk Level: The mean overall risk level captures the baseline risk posture adopted by each language model when interpreting identical near-people scenarios. This metric reflects how conservatively or permissively a model classifies safety-critical situations under fixed semantic and contextual constraints. The evaluated models exhibit clear stratification in their baseline risk interpretations. The multimodal model consistently assigns higher average risk levels across scenario windows, indicating a more precautionary stance. In contrast, the text-only models show lower mean risk values, suggesting a more restrained interpretation of comparable situations despite operating on the same structured evidence.

Although the number of evaluated scenario windows is limited, the observed ordering remains consistent across the dataset, indicating a stable baseline tendency rather than random variation. Notably, this divergence emerges even though all models operate under the same semantic schema and prompt specification, with the multimodal model differing solely through visual grounding rather than additional symbolic inputs.

As illustrated in Fig. 7, these differences indicate that baseline risk interpretation varies systematically across models, even when scenario content and prompt structure are held constant. This initial posture establishes distinct starting points for subsequent reasoning stages, thereby influencing how each model approaches subsequent alert-escalation and causal-attribution analyses.
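The conditional analysis of Algorithm 1 maps directly onto a short routine over one model's parsed outputs; a minimal sketch (the numeric thresholds for "high" and "low" are illustrative assumptions, since the paper leaves them qualitative):

```python
from collections import Counter
from statistics import mean

def characterize_model(windows, tau=4, risk_hi=3.0, evid_lo=2.0):
    """Conditional semantic analysis (Algorithm 1) for one model's
    parsed outputs over a fixed scenario set. Thresholds risk_hi and
    evid_lo are illustrative, not values from the paper."""
    mu_risk = mean(w["overall_risk_level"] for w in windows)
    rho_high = sum(w["overall_risk_level"] >= tau for w in windows) / len(windows)
    mu_evid = mean(len(set(w["evidence_signals"])) for w in windows)
    factors = Counter(w["dominant_risk_factor"] for w in windows)

    profile = {
        "posture": "risk-conservative" if mu_risk >= risk_hi else "risk-tolerant",
        "escalation": "elevated" if rho_high >= 0.5 else "selective",
        "reasoning": "narrow" if mu_evid < evid_lo else "broad",
        "attribution": "specialized" if len(factors) <= 2 else "diversified",
    }
    return mu_risk, rho_high, mu_evid, factors, profile
```

Running this routine per model on identical scenario windows yields the qualitative behavioral profiles compared in the following subsections.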
2) High-Risk Escalation Tendencies: Beyond baseline risk interpretation, the threshold-based proportion of scenario windows escalated to high risk provides insight into how often each model signals that situations require elevated attention. In this analysis, high risk is defined as scenario windows with an overall risk level of four or higher, indicating elevated or more severe safety conditions. This metric captures alert-escalation tendencies rather than the average risk posture, approximating how often a cognitive module would trigger strong driver-facing warnings under identical conditions.

Fig. 7: Mean overall risk level assigned by each evaluated language model across all scenario windows.

Fig. 8: Percentage of scenario windows classified as high risk (risk level ≥ 4) by each model.

The evaluated models exhibit clear differences in escalation behavior. The multimodal model escalates a larger fraction of scenario windows to high risk, indicating heightened sensitivity to potentially critical situations. In contrast, the text-only models adopt more selective escalation strategies, classifying fewer windows as high risk despite operating on the same structured evidence and prompt constraints. This pattern suggests that visual grounding is associated with both higher perceived severity and more frequent escalation of situations to alert-worthy states.

Importantly, escalation frequency and mean risk level reflect related but distinct dimensions of cognitive behavior. A model may maintain a comparatively elevated baseline risk assessment while selectively escalating, or, conversely, escalate frequently despite moderate average risk scores. The observed divergence indicates that alert escalation constitutes an independent behavioral axis, shaping how aggressively a model prioritizes safety-relevant events. From an ADAS design perspective, these differences suggest practical implications.
Higher escalation rates may enhance early hazard salience in complex urban environments, but they also increase the risk of alert fatigue if not calibrated using context-aware thresholds or multi-stage alert policies. More conservative escalation strategies may reduce unnecessary alerts, but risk delaying the emphasis of marginal yet safety-relevant situations. These trade-offs are reflected in the distribution of high-risk classifications summarized in Fig. 8.

3) Evidence Usage and Reasoning Breadth: The mean number of distinct evidence signals explicitly referenced by each model provides a proxy for reasoning breadth and for externalizing justification. Operationally, this metric is computed as the number of unique evidence signals listed in the structured evidence priority field of each model output. In this evaluation, evidence signals correspond to structured contextual elements available in the scenario representation, such as ego-vehicle dynamics, object detections and distances, lane relations, road attributes, and environmental cues. Because the prompt enforces a constrained JSON schema, this metric reflects which signals the model chooses to externalize as explicit justification rather than the full set of information it may have processed internally.

Fig. 9: Mean number of distinct evidence signals explicitly referenced in model outputs, reflecting reasoning breadth and justification externalization.

The text-only models exhibit higher average evidence counts, indicating a more enumerative explanation style that verbalizes a broader set of supporting cues. In contrast, the multimodal model references fewer distinct signals, producing more compact justifications. This difference suggests that access to visual input may reduce the need for explicit textual enumeration of contextual cues, as some salient information can be internally grounded in the image.
As a result, a trade-off emerges between conciseness and traceability: broader evidence externalization improves auditability, whereas more compact outputs favor brevity at the potential cost of transparency.

From a methodological perspective, these findings emphasize that explainability in LLM-based ADAS extends beyond decision correctness to encompass an externalization policy for evidence. Multimodal grounding may enhance perceptual richness and risk sensitivity, while simultaneously reducing the explicit articulation of non-visual cues. For driver-facing assistance, additional prompting or post-processing may be required to ensure that critical signals, such as object distance, lane-relative position, and ego dynamics, are consistently reported even when visual context is available. These trends are quantitatively summarized in Fig. 9.

C. Risk Factor Attribution Patterns

Beyond magnitude and escalation, the attribution of risk provides insight into how models frame causal explanations under identical evidence. While all evaluated models operate on the same structured scenario inputs, they differ in how they assign primacy to specific risk sources, revealing distinct patterns of causal emphasis. In this analysis, the dominant risk factor corresponds to the primary causal category explicitly selected by the model for each scenario window.

Fig. 10: Distribution of dominant risk factors identified by each model. The six most frequent factors are shown explicitly, with the remaining factors grouped as Other.

Across all models, pedestrian-related risk emerges as the dominant attribution, which is expected given that the test set is intentionally curated around near-people interactions and reflects strong safety priors toward vulnerable road users. This shared primary attribution suggests a consistent baseline alignment with ADAS safety objectives when vulnerable road users are present.
However, once the primary factor is fixed, the attribution structure diverges substantially across models. Although only a single dominant factor is selected per scenario window, the relative frequency of non-primary categories reveals systematic differences in secondary attribution tendencies across models. The DeepSeek text-only model most frequently elevates rear-end conflicts as the next dominant category, suggesting a tendency toward interaction geometry and collision archetypes inferred from object proximity and ego-vehicle dynamics. The GPT-4o-mini text-only model, in contrast, more often selects lateral conflict as its dominant non-pedestrian category, indicating a stronger emphasis on side interactions and spatial lane-relative relations. The multimodal GPT-4o-vision model exhibits a qualitatively different secondary attribution pattern, more frequently elevating traffic density as the prominent non-pedestrian factor, reflecting a broader contextual framing in which global scene complexity becomes salient once visual cues are available. These attributional differences are summarized quantitatively in Fig. 10, which reports the distribution of dominant risk factors selected by each model across all evaluated scenario windows. The figure makes explicit that, beyond the shared pedestrian-centric primary risk, each model systematically privileges different secondary explanations given the same evidence.

From a cognitive ADAS perspective, these observed differences are nontrivial. Even when models agree that a scenario is risky and converge on the presence of a vulnerable road user, they may guide driver attention toward different aspects of the environment.
A model that foregrounds collision archetypes (rear-end or lateral conflict) may produce alerts that are more directly actionable for immediate defensive maneuvers, whereas a model that foregrounds traffic density may emphasize situational vigilance and scene-level caution. These findings indicate that model choice affects not only how much risk is communicated, but also how surrounding contextual risks are framed, foreshadowing deeper sources of inter-model ambiguity in VRU presence interpretation.

D. Inter-Model Ambiguity and Composite Uncertainty Across Scenarios

The analyses presented in the previous subsections reveal consistent inter-model behavioral differences in baseline risk assessment, threshold-based escalation, evidence externalization, and dominant causal attribution, even under identical scenario inputs and fixed prompt constraints. While each of these dimensions provides an isolated view of model behavior, a more comprehensive picture emerges when these divergences are jointly considered. In this work, inter-model ambiguity is therefore treated as a composite phenomenon arising from the combined interaction of multiple semantic and decision-related dimensions, rather than from disagreement in any single metric.

To operationalize this notion of ambiguity, a two-stage formulation is adopted. First, ambiguity is detected using a minimal divergence criterion, in which any scenario window where at least one model deviates from inter-model consensus is flagged as ambiguous. Second, the degree of ambiguity is quantified using continuous disagreement measures jointly derived from the four previously analyzed dimensions: overall risk severity, high-risk escalation, evidence usage breadth, and dominant risk factor attribution.
This formulation enables a scenario-centric characterization of uncertainty that captures not only whether models disagree, but also how strongly and consistently such disagreement manifests across multiple dimensions. The resulting composite uncertainty score is normalized to the [0, 1] interval, where 0 denotes full inter-model agreement across all evaluated dimensions and models, and 1 denotes maximal divergence, corresponding to consistent disagreement across all dimensions. While ambiguity detection itself follows a minimal criterion, the associated uncertainty magnitude reflects graded, structured disagreement rather than a purely binary outcome.

Across the evaluated test set of sixteen near-people scenarios, the resulting composite uncertainty scores reveal a structured distribution of inter-model alignment rather than uniform disagreement. Based on the magnitude of the composite uncertainty, the scenario set partitions naturally into three uncertainty tiers. The low-uncertainty subset, denoted as S_low = {S3, S2, S14, S15, S4}, corresponds to cases in which all evaluated models remain largely aligned across the considered dimensions, reflecting situations in which semantic interpretation and risk assessment are consistently resolved. The intermediate-uncertainty subset, S_med = {S5, S9, S16, S1, S8, S11}, reflects partial divergence typically confined to one or two dimensions, such as differences in evidence prioritization or threshold-based escalation. Finally, the high-uncertainty subset, S_high = {S6, S12, S7, S13, S10}, concentrates scenarios exhibiting pronounced inter-model divergence across multiple dimensions simultaneously, representing structurally ambiguous situations in which multiple semantic and decision-related interpretations remain plausible. In aggregate, these subsets satisfy (|S_low| = 5, |S_med| = 6, |S_high| = 5), highlighting that inter-model ambiguity is graded and scenario-dependent rather than uniformly distributed.

Fig. 11: Radar-based summary of inter-model behavior across near-people scenarios, highlighting systematic differences in risk attribution, high-risk escalation, VRU presence interpretation, evidence signal count (reasoning breadth), uncertainty expression, and relative token usage. The visualization illustrates how identical scenario artifacts yield divergent yet internally consistent interpretations across text-only and multimodal LLM configurations.

Figure 12 summarizes this analysis using a compact heatmap representation. Each row corresponds to a scenario, labeled by its original scenario identifier and ordered from lowest to highest composite uncertainty, while each column corresponds to one of the evaluated language models. Color intensity reflects the per-model contribution to composite uncertainty, aggregated across the four dimensions. By construction, the composite uncertainty score is normalized to the [0, 1] interval, where values close to 0 indicate strong inter-model alignment across all dimensions, and values approaching 1 indicate cumulative disagreement across multiple dimensions. This ordering enables a visual "certainty-to-ambiguity" trajectory, highlighting how inter-model divergence increases gradually across scenarios rather than appearing as isolated outliers. From a methodological perspective, this composite uncertainty representation serves as an explicit audit layer that bridges individual metric comparisons and holistic scenario interpretation. Rather than treating inter-model disagreement as noise or isolated anomalies, the proposed formulation reveals structured, reproducible ambiguity patterns under controlled evaluation conditions.
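A composite uncertainty score of this kind can be sketched as a mean of per-dimension disagreement terms, each normalized so the total stays in [0, 1]; the specific normalizations below are illustrative assumptions, not the paper's exact formulation:

```python
from collections import Counter
from statistics import pstdev

def composite_uncertainty(per_model):
    """Composite inter-model uncertainty for one scenario window.
    `per_model` maps model name -> {"risk": 0-6, "n_evidence": int,
    "factor": numeric risk-type code}. Each dimension contributes a
    disagreement term in [0, 1]; 0 = full agreement, higher = more
    divergence (normalizations are illustrative)."""
    risks = [v["risk"] for v in per_model.values()]
    esc = [1.0 if v["risk"] >= 4 else 0.0 for v in per_model.values()]  # threshold tau = 4
    evid = [v["n_evidence"] for v in per_model.values()]
    factors = [v["factor"] for v in per_model.values()]
    n = len(per_model)

    d_risk = pstdev(risks) / 3.0               # risk scale 0-6, so pstdev <= 3
    d_esc = pstdev(esc) * 2.0                  # binary indicator, pstdev <= 0.5
    d_evid = pstdev(evid) / max(max(evid), 1)  # relative spread in evidence counts
    d_factor = 1.0 - Counter(factors).most_common(1)[0][1] / n  # attribution dissent

    score = (d_risk + d_esc + d_evid + d_factor) / 4.0
    return min(max(score, 0.0), 1.0)
```

Under this sketch, full agreement on all four dimensions yields exactly 0, while simultaneous divergence in severity, escalation, evidence breadth, and attribution pushes the score toward 1, matching the graded tiering of scenarios into low-, intermediate-, and high-uncertainty subsets.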
From an ADAS and SOTIF perspective, these findings emphasize that safety-relevant uncertainty arises not only from perceptual limitations but also from divergent semantic interpretations of partially specified scenarios. Explicitly modeling and visualizing such graded ambiguity is therefore essential for scenario-centric validation, model auditing, and the design of robust decision-support pipelines that remain resilient under semantic indeterminacy.

E. Discussion: Implications for Scenario-Centric Auditing and SOTIF-Relevant Validation

Fig. 12: Compact heatmap of composite inter-model uncertainty across the evaluated scenarios. Rows correspond to scenario identifiers, ordered from lowest to highest uncertainty (bottom to top), and columns correspond to the evaluated language models. Color intensity indicates the per-model contribution to composite uncertainty, aggregated over overall risk severity, high-risk escalation, evidence usage, and dominant risk factor attribution.

The radar visualization in Fig. 11 provides a consolidated behavioral signature of the inter-model patterns previously discussed, aggregating the primary risk-related dimensions evaluated across the near-people scenario set. Rather than introducing new measurements, this figure synthesizes the previously presented outcomes, enabling direct comparison of model-level risk profiles across identical scenario windows and fixed prompt constraints. The summary highlights systematic divergence across models, most notably in baseline severity attribution and high-risk escalation frequency. Consistent with the earlier analysis, the multimodal configuration adopts a more precautionary stance, exhibiting a higher mean risk and more frequent escalations beyond the high-risk threshold, whereas the text-only models remain comparatively conservative despite operating on the same structured scenario artifacts.
This synthesis supports the core methodological premise that controlled scenario reuse reveals stable, model-specific risk-interpretation profiles that are both comparable and reproducible. These divergences are consistent with SOTIF-style performance limitations, where hazardous outcomes can arise without malfunction when evidence is incomplete or semantically underdetermined.

Beyond the primary risk dimensions, the visualization further incorporates uncertainty expression (as encoded in the JSON output) and relative token usage as secondary behavioral indicators, offering deeper insight into how models operationalize justification and confidence. Taken together, text-only models more explicitly externalize uncertainty while producing longer outputs, reflecting a strategy that favors broader evidence enumeration and explicit justification. In contrast, the multimodal model exhibits lower relative token usage and more compact uncertainty signaling, suggesting a reliance on internalized perceptual grounding rather than extensive textual elaboration. This contrast reveals a trade-off between conciseness and audit traceability: while shorter responses may enhance efficiency and decisiveness, more verbose outputs can facilitate post-hoc inspection and accountability. In practice, this trade-off can be treated as a controllable design variable through prompt constraints and post-processing policies that enforce a minimum level of evidence disclosure for safety-critical alerts. Importantly, the joint analysis of risk severity, uncertainty, and token usage demonstrates that these dimensions are not independent; they jointly characterize distinct cognitive styles across models, supporting the role of LLMs as auditable cognitive probes in scenario-centric driving-assistance evaluation.
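A post-processing policy of the kind mentioned here, gating safety-critical alerts on a minimum level of evidence disclosure, could look like the following minimal sketch. The JSON field names ("risk", "evidence"), the high-risk threshold, and the minimum evidence count are hypothetical placeholders, not the paper's closed schema.

```python
# Hypothetical gate applied to a model's parsed JSON output: high-risk
# alerts that do not disclose enough supporting evidence are rejected,
# enforcing audit traceability for safety-critical decisions.
MIN_EVIDENCE_ITEMS = 2

def accept_alert(output, high_risk=7.0, min_evidence=MIN_EVIDENCE_ITEMS):
    """Reject high-risk alerts lacking sufficient disclosed evidence.

    Non-critical outputs pass through unchecked; only alerts at or
    above the high-risk threshold must justify themselves."""
    if output.get("risk", 0) < high_risk:
        return True
    return len(output.get("evidence", [])) >= min_evidence
```

A rejected alert would then be routed to a fallback (e.g., re-prompting for justification) rather than surfaced directly to the driver.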
Notably, these findings do not motivate using LLM outputs as ground-truth perception, but rather as auditable interpretive probes whose divergences can be measured, compared, and managed.

VII. CONCLUSION

This study analyzed how LLMs interpret identical urban driving scenarios under structured, scenario-centric evaluation. Despite identical inputs and prompts, models exhibited systematic divergence in severity assessment, escalation behavior, evidence use, and causal attribution. Disagreement extended to the interpretation of vulnerable road users, reflecting intrinsic semantic indeterminacy rather than isolated failure. These findings motivate explicit ambiguity modeling when integrating LLM-based reasoning into safety-aligned ADAS validation. Future work will extend this framework to large-scale scenario corpora and investigate how structured ambiguity metrics can inform adaptive alert policies and safety assurance processes under SOTIF-oriented validation.