A Comprehensive Review of Safety-Critical Object Recognition, Prediction, and Planning
📝 Abstract
Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.
📄 Content
Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated impressive performance across a wide spectrum of vision-language tasks, including image captioning, visual reasoning, and visual question answering (VQA) [1,38,62,71]. These capabilities have naturally extended into autonomous driving, where MLLMs are being explored for end-to-end driving pipelines, scenario generation, and vision-language action modeling [13,15,17,21,38,46]. Among these efforts, Driving Question Answering (Driving QA) has emerged as a promising paradigm for equipping autonomous agents with high-level scene understanding and decision-making capabilities.
In visual question answering, the design of the dataset is the primary factor that determines the measurable capacity of models [3,24,41]. Early Driving QA datasets, such as DRAMA [39] and DriveLM [49], adopted a single-view setting. However, the increasing complexity of real-world driving and the advent of multi-sensor perception systems have highlighted the importance of multi-view scene understanding. This led to the development of datasets like NuScenes-QA [44] and NuPlan-QA [43], which integrate multi-camera inputs to better capture spatial relationships and inter-agent dynamics, reflecting the perceptual richness of modern autonomous systems.
Despite this progress, existing datasets [14,31,35] still focus primarily on normal scenarios and overlook safety-critical events [63]. These events are rare but highly consequential. Examples include pedestrians suddenly emerging from occlusions, abrupt braking triggered by unforeseen obstacles, or conflicts between multiple agents at intersections. The true reliability of an autonomous driving system is determined not by its performance in normal scenarios but by its behavior in these unpredictable, dangerous moments, namely safety-critical scenarios. Recent works such as NAVSAFE [50] have recognized this and emphasized the importance of evaluating models on safety-critical scenarios. Accordingly, recent benchmarks such as DVBench [63] and VRU-Accident [27] developed safety-critical driving QA benchmarks.
However, these works [27,63] have three key limitations. (L1) They are designed exclusively for evaluation and do not include training data, preventing MLLMs from learning safety-critical reasoning directly from examples; a dedicated training dataset is essential for acquiring this ability. (L2) Like the early Driving QA works, they rely on single-view inputs, which limits the understanding of complex safety-critical scenarios: a single-view system can miss blind spots or occluded agents. (L3) Consequently, and most critically, they lack support for high-level reasoning across the full autonomy stack, from perception to behavior prediction and risk-aware motion planning. As shown in Fig. 2, the same scene yields opposite decisions under different view coverage when asked "Is lane changing possible? Which is safer, slowing down or changing lanes?". With a single front view, the model concludes that a lane change is both feasible and safer. With multi-view input, the rear-right vehicle becomes visible, the lane change becomes infeasible, and slowing down becomes safer. This failure arises because limited view coverage produces an incomplete feasibility set and an incorrect risk ranking, which manifests as missing action and consequence reasoning. Such high-level reasoning, which is essential in safety-critical scenes, is largely missing from existing approaches.
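The view-coverage failure described above can be sketched as a toy feasibility check. This is an illustration only, not the paper's method; the agent labels and the rear-right rule are invented for the example.

```python
# Toy illustration: how view coverage changes the feasibility set from
# the Fig. 2 scenario. Visible-agent sets are invented for the example.

def feasible_actions(visible_agents):
    """Judge a lane change feasible only when no vehicle is seen
    occupying the rear-right lane; slowing down is always available."""
    actions = {"slow_down"}
    if "rear_right_vehicle" not in visible_agents:
        actions.add("change_lane")
    return actions

# A single front view misses the rear-right vehicle entirely.
single_front_view = set()
# Multi-view coverage reveals it.
multi_view = {"rear_right_vehicle"}

print(feasible_actions(single_front_view))  # lane change wrongly looks feasible
print(feasible_actions(multi_view))         # only slowing down remains
```

The incomplete feasibility set under the single front view is exactly what drives the incorrect risk ranking: an action that should be excluded is still on the table.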
To address these limitations, we introduce WaymoQA, the first training-enabled, safety-critical multi-view driving QA dataset. Built from long-tail scenarios in the Waymo End-to-End dataset [59] and filtered using U.S. National Highway Traffic Safety Administration (NHTSA) safety criteria, WaymoQA consists of 35,000 human-written question-answer pairs covering safety-critical videos and manually selected key frames.
To address (L1), the absence of a training set, we release WaymoQA with 28,585 training and 6,415 test QA pairs. The dataset supports two formats: multiple-choice questions for testing and open-ended questions for training. Both formats are included in the training set to enable in-distribution validation.
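The two question formats above could be handled by filtering records on a format field. This is a hypothetical sketch: the field names (`format`, `question`, `choices`, `answer`) are assumptions, since the paper does not specify the dataset's actual schema.

```python
# Hypothetical sketch of splitting WaymoQA-style records by question format.
# Field names are assumed, not the dataset's documented schema.

def split_by_format(records):
    """Separate multiple-choice and open-ended QA records."""
    mcq = [r for r in records if r["format"] == "multiple_choice"]
    open_ended = [r for r in records if r["format"] == "open_ended"]
    return mcq, open_ended

records = [
    {"format": "multiple_choice",
     "question": "Is a lane change feasible?",
     "choices": ["yes", "no"], "answer": "no"},
    {"format": "open_ended",
     "question": "Which action is safer, and why?",
     "answer": "Slowing down; a vehicle occupies the rear-right lane."},
]

mcq, open_ended = split_by_format(records)
print(len(mcq), len(open_ended))  # → 1 1
```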
To tackle (L2), the information loss from a single front-view frame, WaymoQA resolves scenes using multi-view inputs at each time step. We aggregate eight synchronized views (front left, front center, front right, side left, side right, rear left, rear center, rear right), reducing occlusions and expanding the spatial field of view. This multi-view configuration admits more evidence and yields more accurate scene understanding, thereby providing the evidential basis required for reliable high-level reasoning. In short, the multi-view design
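The eight-view aggregation can be sketched as ordering per-camera frames into a fixed layout. The view names follow the text; the per-view "frame" here is just a label standing in for real image data, and the fixed-slot layout is an assumption about how such inputs might be organized.

```python
# Hypothetical sketch: ordering the eight synchronized camera views into a
# constant layout so downstream models see a fixed multi-view structure.

VIEWS = ["front_left", "front_center", "front_right",
         "side_left", "side_right",
         "rear_left", "rear_center", "rear_right"]

def stack_views(frames_by_view, placeholder=None):
    """Return one slot per camera in a constant order, filling any
    missing view with a placeholder."""
    return [frames_by_view.get(name, placeholder) for name in VIEWS]

# Synthetic frames for seven of the eight views at a single timestep.
frames = {name: f"frame:{name}" for name in VIEWS if name != "rear_center"}
views = stack_views(frames, placeholder="<missing>")
print(len(views), views[6])  # 8 <missing>
```

Keeping a fixed slot order (with explicit placeholders) is what lets a model attribute evidence, such as the rear-right vehicle in Fig. 2, to a specific camera.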