๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.20299
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.

📄 Full Content

Autonomous driving, a key application of AI, serves as a comprehensive testbed for 3D perception, prediction, reasoning, and planning capabilities. Current state-of-the-art autonomous driving systems mainly follow two paradigms: (a) end-to-end autonomous driving, which leverages 3D feature extraction to directly decode planning trajectories; and (b) Vision-Language-Action (VLA) models, which integrate perception and control through multi-modal large language models. However, an ideal autonomous driving system must operate in open, dynamic, and highly uncertain environments, requiring decisions that are safe, reliable, and explainable while adhering to human social values and norms, which poses significant challenges to existing paradigms.

Figure 1. Comparison of End-to-End, VLA, and our KnowVal. Our 3D vision system enables visual-language reasoning through the mutual guidance of perception and retrieval, understands law and morality via knowledge retrieval, and integrates values through a dedicated value model.

The first challenge lies in enabling effective interaction between vision and language to achieve genuine visual-language reasoning. Existing end-to-end autonomous driving models lack language-grounded reasoning capabilities, while VLA models typically confine reasoning to chain-of-thought or purely linguistic processes, without allowing the reasoning outcomes to influence perception. As illustrated in Figure 1, we propose a visual-language reasoning system in which perception and retrieval mutually guide each other: the model retrieves relevant knowledge from open-world perception, and when retrieval identifies missing information, it prompts the perception module to refine its observations accordingly.

The second challenge lies in the absence of an interpretable and structured knowledge base. Existing end-to-end and VLA models primarily rely on data-driven learning, attempting to infer complex decision logic from the limited variability of human behavior, an inherently difficult task. Some approaches introduce handcrafted conditional rules or reinforcement learning rewards, but these manually designed components cover only narrow scenarios and fail to generalize to complex, dynamic environments. In contrast, autonomous driving already has a rich and authoritative source of knowledge: traffic laws and regulations, which provide a comprehensive foundation for safe and lawful driving. Furthermore, abundant online resources, such as instructional driving videos, offer valuable insights into driving techniques and principles. Hence, we fully exploit these resources to construct a knowledge graph of traffic laws, moral principles, and defensive driving behaviors, enabling dynamic knowledge retrieval via perception queries. This allows the model to interpretably reason about traffic laws, moral principles, and defensive driving behaviors.

The third challenge concerns the autonomous driving system’s worldview and values. Recent advances highlight the importance of a world model, a module that predicts future world states from current observations and actions, reflecting the predictive and reflective abilities of autonomous driving systems. However, possessing such a worldview (i.e., world model) alone is insufficient for effective decision-making: after forecasting possible futures, the system must also judge whether those outcomes are desirable, requiring a dedicated value assessment mechanism. Existing paradigms often approximate this ability through data-driven learning or handcrafted rules, both of which limit generalization and interpretability. To address this, we introduce a Value Model grounded in retrieved knowledge and trained on a human-preference dataset, enabling value-aligned and interpretable trajectory evaluation.

To sum up, we propose a novel autonomous driving system in this work, in which perception and knowledge retrieval mutually guide each other, supplying richer contextual information to the decision-making module that integrates both a world model and a value model. The key contributions of this work are:

• We propose a new autonomous driving system, KnowVal, that enables visual-language reasoning through the synergistic interaction between perception and knowledge retrieval.

Autonomous Driving Systems. The prevailing paradigms for autonomous driving systems can be broadly categorized into three classes: the traditional modular pipeline, the end-to-end approach, and the Vision-Language-Action (VLA) model. Several models have been benchmarked on real-world datasets such as nuScenes [1]. ST-P3 [6] first integrates perception, prediction, and planning into an end-to-end framework. UniAD [7] follows a BEV-based feature extraction strategy similar to BEVFormer [15], with intricate feature propagation across task-specific decoders. VAD [12] adopts a fully vectorized scene representation for trajectory planning, while PARA-Drive [25] further investigates multi-task supervision and module interdependencies within end-to-end architectures. GenAD [32] utilizes BEV representations to obtain sparse detection and mapping features for planning, whereas SparseDrive [23] extracts such features directly from image inputs to mitigate cumulative errors and improve prediction accuracy. Recently, MomAD [22] introduces momentum-aware planning for enhanced temporal consistency, and BridgeAD [31] refines planning results by leveraging historical prediction and trajectory information. Finally, DiffusionDrive [16] employs a truncated diffusion model to generate diverse trajectory candidates, while HENet++ [28] jointly extracts foreground instance and dense panoramic features using attention mechanisms to improve prediction and planning. With the advancement of 3D physical simulators, many recent works [7,12,16] leverage synthetic environments such as CARLA [2] and Bench2Drive [10] for training and evaluation. Despite this progress, these methods still largely rely on imitation learning paradigms. TCP [26] introduces a dual-branch architecture that integrates trajectory and control signals to enhance planning accuracy. ThinkTwice [9] proposes a two-stage planner that first generates candidate trajectories and then refines them based on scene context, partially reflecting the concept of a world model. DriveAdapter [8] utilizes an expert planner for knowledge distillation, improving planning robustness. AD-MLP [30] extracts priors from historical trajectories using multilayer perceptrons, while DriveTransformer [11] adopts parallel queries for sparse interactions with sensor data in streaming multi-task learning. With the emergence of Vision-Language Models (VLMs), several methods have adapted them for driving command generation, among which ORION [4] is representative. SimLingo [19] further aligns natural-language prompts (e.g., “adjust speed,” “change lane”) with corresponding action sequences, advancing interpretability and language-conditioned control.

However, these approaches lack vision-language reasoning, explicit knowledge utilization, and value-based judgment. Our proposed system overcomes these limitations through perception-guided knowledge retrieval and knowledge-guided value assessment, enabling more interpretable and value-aligned decision-making.

Retrieval Method. To align systems with complex knowledge principles, integrating external knowledge is essential. Early retrieval-based methods, such as [18], adopt a simple linear pipeline: indexing documents, retrieving the chunks most similar to a query, and generating responses based on them. The introduction of Retrieval-Augmented Generation (RAG) [13] improves this process by incorporating pre-retrieval and post-retrieval stages to enhance contextual relevance and generation quality. Modular RAG [20,29] extends this idea by decomposing the linear RAG pipeline into a flexible architecture with specialized modules arranged in dynamic configurations. GraphRAG [3] constructs a summarized knowledge graph over text corpora, providing global semantic structure and enabling reasoning beyond vector search. More recently, LightRAG [5] introduces a dual-level retrieval mechanism over a graph-enhanced index that supports incremental updates, allowing efficient and adaptive RAG without expensive re-indexing.

However, the aforementioned retrieval-augmented methods, though effective for language models, cannot be directly applied to autonomous driving systems. They inevitably paraphrase and compress retrieved knowledge, which can introduce hallucinations and distort factual content. Moreover, these approaches are primarily designed for chat-based AI, where retrieval is driven by user-issued textual prompts. To meet the unique demands of autonomous driving, we propose a perception-guided retrieval mechanism that grounds knowledge access in visual perception rather than user input. In addition, we refine the retrieval process to ensure that retrieved entries remain faithful to the original texts, thereby preserving factual accuracy and reliability in downstream reasoning and decision-making.

As illustrated in Figure 3, KnowVal comprises three main components: (i) retrieval-guided open-world perception, (ii) perception-guided retrieval from the knowledge graph, and (iii) planning with world prediction and value assessment. KnowVal achieves visual-language reasoning through the mutual guidance between perception and retrieval, thereby enriching both perceptual understanding and knowledge utilization. Information from these two modules is subsequently propagated to the planning component as structured representations and feature embeddings, ensuring comprehensive decision-making while maintaining an end-to-end architecture.

Retrieval-guided Open-world Perception. Open-world perception in KnowVal encompasses three key capabilities. The first is Specialized Perception, a standard component in most end-to-end autonomous driving models, which enables the recognition and localization of common semantic categories such as vehicles, pedestrians, and drivable areas. The second is Open-ended 3D Perception, which identifies and localizes long-tail or uncommon objects (e.g., fire trucks, standing water) without requiring explicit user prompts. This capability is implemented using VL-SAMv2 [17] and OpenAD [27]. The third is Abstract Concept Understanding, which captures scene-level or contextual attributes beyond direct object perception, such as determining whether the environment is a bridge or tunnel, or whether it is daytime or nighttime, achieved via the VLM within VL-SAMv2 [17]. Additionally, this module supports retrieval guidance: the retrieval component from the previous timestep identifies elements requiring further perception, which are then passed to the VLM for targeted processing.

Figure 2. Pre-construction of the knowledge graph and retrieval process in KnowVal. We collect diverse driving-related resources, including laws, regulations, defensive driving principles, moral guidelines, and experiential knowledge, to construct an initial knowledge forest based on textual structures. Large language models (LLMs) are then used to extract entities and define vertices and edges, forming a structured knowledge graph. During inference, KnowVal generates queries enriched with 3D perception information to retrieve and rank relevant entries from the knowledge graph by descending relevance. Example entries shown in the figure: “When turning, ease off the gas and cover the brake for pedestrians.” “Drive the car slowly through puddles to avoid splashing others…” “Slow down and honk for unaware pedestrians…”

Perception-guided Retrieval from the Knowledge Graph. We collect traffic laws, regulations, and driver interview transcripts, and employ multimodal large language models (MLLMs) to extract experiential knowledge, ethical considerations, and defensive driving principles from online video sources. Subsequently, we utilize LLMs to construct a knowledge graph, as described in Section 3.2.1. During inference, a Perception Verbalizer transforms structured perception outputs into query texts, which are then used by an LLM-based retrieval module to obtain knowledge entries ranked in descending order of relevance. Simultaneously, the retrieval module identifies elements requiring additional perception, which are forwarded to the perception component in the next timestep for refinement. The retrieval process is further detailed in Section 3.2.2.

Planning with world prediction and value model. Current planning modules predominantly adopt Transformer or RNN architectures, which inherently possess varying degrees of capability for future-state prediction. Building upon this foundation, we enhance both the planning and world models to generate diverse candidate trajectories and forecast corresponding future states. We then introduce a Value Model that evaluates the desirability of these candidate trajectories and their predicted world states based on retrieved knowledge entries, as detailed in Section 3.3.1.

To train this model, we construct a preference dataset comprising extensive trajectory-outcome samples, further described in Section 3.3.2.

3.2. Perception-guided Knowledge Retrieval.

We design a knowledge graph construction method that preserves the complete fidelity of original knowledge clauses while simultaneously revealing latent semantic relationships between these clauses and query entities.

Step 1: Knowledge collection and structured forest construction. We gather information from a variety of sources, including traffic laws and regulations, principles extracted from educational videos on defensive driving, transcripts from interviews with various drivers, and compiled ethical guidelines. For educational videos on defensive driving, we employ Gemini 2.5-Pro with carefully designed prompts to extract key driving principles. The extracted texts are then categorized and organized into chapters according to the driving scenarios and strategies they describe.

Figure 4. Planning process of KnowVal. We design a trajectory generation method compatible with existing Transformer- and RNN-based planners. By combining future world state prediction with value model assessment, KnowVal produces diverse candidate trajectories and makes interpretable final decisions.

For driver interview transcripts, we classify the content by driving behaviors and contexts (e.g., left turns, intersections) to produce a structured textual representation. In the case of laws and regulations, their inherent hierarchical organization already exhibits a forest-like structure, which we directly adopt as the foundation of our knowledge forest. Each knowledge clause node stores the raw, unaltered text, while non-leaf nodes contain the corresponding titles and descriptive information. All nodes in the forest are labeled as native. This tree-based organization preserves the original hierarchical structure of the knowledge corpus and serves as the initial form of the knowledge graph.
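The forest construction described above maps naturally onto a simple tree data structure. Below is a minimal sketch, assuming a single node class whose fields (title, raw clause text, native label, children) mirror the description; all class, field, and example names are illustrative rather than taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KnowledgeNode:
    """One node of the knowledge forest (later extended into a graph)."""
    node_id: str
    title: str                       # chapter/section title, or a short label for leaf clauses
    text: str = ""                   # raw, unaltered clause text (leaf nodes only)
    label: str = "native"            # every forest node starts out tagged 'native'
    children: List["KnowledgeNode"] = field(default_factory=list)
    parent: Optional["KnowledgeNode"] = None

    def add_child(self, child: "KnowledgeNode") -> "KnowledgeNode":
        child.parent = self
        self.children.append(child)
        return child

# Example: a tiny, hypothetical fragment of the traffic-law tree.
root = KnowledgeNode("law", "Road Traffic Safety Law")
chapter = root.add_child(KnowledgeNode("law.c3", "Chapter 3: Road Passage Rules"))
chapter.add_child(KnowledgeNode(
    "law.c3.s3.a3",
    "Yielding to Pedestrians & Non-Motorized Vehicles",
    text="Slow down for pedestrians/cyclists on roads with puddles, mud, or gravel.",
))
```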

Step 2: Entity Linking via LLM to Transform the Forest into a Graph. To break down the “knowledge silos” among independent trees, we leverage LLMs to automatically discover and link semantic relationships across the disparate branches of the knowledge forest. We iterate through each knowledge clause leaf node and employ a powerful LLM (e.g., GPT-4 or Qwen) to perform a controlled entity extraction task. Our prompt instructs the model to identify key entities within the clause text corresponding to predefined conceptual categories, including: Traffic-Sign-Device (e.g., traffic signal, stop sign), Road-User (e.g., pedestrian, bicycle), Driving-Maneuver (e.g., turning, merging), and Road-Condition (e.g., wet pavement, construction zone). After entity extraction, we create a new entity node in the knowledge graph for each unique entity and connect it to all corresponding native nodes containing that entity. The LLM then assigns semantic weights and relationship descriptions to the edges. Inspired by [5], we further employ an LLM-empowered profiling function to generate text keys for both nodes and edges. Each key is a concise phrase facilitating efficient retrieval, and multiple keys can be generated to encode broader semantic themes across connected entities. This process transforms the initially isolated knowledge forest into a highly interconnected, semantically rich knowledge graph, forming the structural backbone for the online retrieval mechanism described in Section 3.2.2.
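A hedged sketch of this linking pass, continuing the node structure above: `extract_entities` stands in for the controlled LLM extraction (the paper uses GPT-4 or Qwen with a category-constrained prompt), and the keyword stub is purely illustrative. Edge weights, relationship descriptions, and the LLM-generated profiling keys are omitted here.

```python
from collections import defaultdict

ENTITY_TYPES = ["Traffic-Sign-Device", "Road-User", "Driving-Maneuver", "Road-Condition"]

def extract_entities(clause_text: str) -> list[tuple[str, str]]:
    """Stand-in for the controlled LLM extraction step.
    Returns (entity_name, entity_type) pairs restricted to ENTITY_TYPES."""
    # A real system would prompt an LLM here; this stub keys on a toy lexicon.
    lexicon = {"pedestrian": "Road-User", "crosswalk": "Traffic-Sign-Device",
               "turning": "Driving-Maneuver", "wet pavement": "Road-Condition"}
    return [(k, t) for k, t in lexicon.items() if k in clause_text.lower()]

def link_entities(leaf_nodes):
    """Create one entity node per unique entity and connect it to every native
    clause node that mentions it (leaf_nodes: KnowledgeNode leaves from the sketch above)."""
    entity_to_clauses = defaultdict(list)
    for node in leaf_nodes:
        for name, etype in extract_entities(node.text):
            entity_to_clauses[(name, etype)].append(node.node_id)
    # Adjacency view of the new graph edges: entity key -> native node ids.
    return dict(entity_to_clauses)
```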

We design a retrieval process that ranks results by importance and relevance. This process guarantees that retrieved results remain faithful to the original text, thereby safeguarding against model hallucinations.

Step 1: Perception verbalizer. First, a templated verbalizer converts perception outputs of varying structures into a unified structured natural language. For the 3D instances, the verbalizer directly converts the bounding box and semantic information into textual descriptions. For the semantic map or occupancy results, the verbalizer first uses the Breadth-First Search algorithm to identify semantically connected blocks, handling discontinuous boundaries via a dynamic connectivity threshold. Upon obtaining connected blocks, it also converts their positional and semantic information into textual descriptions, which are then combined with abstract conceptual perception results, navigational data, and user instructions to form the retrieval query.
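A minimal sketch of the occupancy-map branch of the verbalizer, assuming a 2D semantic grid in ego-centered coordinates. The fixed `thresh` parameter approximates the dynamic connectivity threshold mentioned above, and the template wording and grid resolution are assumptions.

```python
from collections import deque

def connected_blocks(grid, target, thresh=1):
    """BFS over a 2D semantic grid; cells within `thresh` steps count as connected,
    which tolerates small discontinuities in block boundaries."""
    H, W = len(grid), len(grid[0])
    seen, blocks = set(), []
    for sy in range(H):
        for sx in range(W):
            if grid[sy][sx] != target or (sy, sx) in seen:
                continue
            queue, block = deque([(sy, sx)]), []
            seen.add((sy, sx))
            while queue:
                y, x = queue.popleft()
                block.append((y, x))
                for dy in range(-thresh, thresh + 1):
                    for dx in range(-thresh, thresh + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and (ny, nx) not in seen \
                                and grid[ny][nx] == target:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            blocks.append(block)
    return blocks

def verbalize_block(block, cls, cell_m=0.5):
    """Template step: convert a block's mean cell position into an ego-relative phrase."""
    cy = sum(y for y, _ in block) / len(block) * cell_m   # longitudinal offset (m)
    cx = sum(x for _, x in block) / len(block) * cell_m   # lateral offset (m)
    return f"{cls}: {cy:.1f}m ahead, {cx:.1f}m right."
```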

Step 2: Entity Extraction. We leverage a lightweight LLM to perform a two-layer keyword extraction from the retrieval query. The first layer extracts macro-level context keywords (e.g., Driving Security, Scene Analysis), while the second layer extracts specific entity and event keywords (e.g., Pedestrians, Rainy Weather). Subsequently, these keywords are used to query initial relevant entity nodes and edges within the knowledge graph via the index keys. To capture richer contextual information, we perform a Top-K nearest-neighbor expansion on the graph, incorporating nodes closely connected to the initial set.

Step 3: Knowledge Item Filtering and Embedding. From all extracted nodes, we filter for only those explicitly tagged as native. These nodes link directly to the original, verbatim knowledge clauses that have not been processed or summarized by any LLM. This filtering step is the fundamental differentiator of our system from conventional RAG, as it completely eliminates the risk of information distortion introduced by LLM processing. The top-N_K relevant filtered items are converted into vector representations in this process, making them suitable for subsequent use in the Value Model. Meanwhile, the retrieval LLM outputs a list of items for supplementary perception, which are then fed back into the subsequent perception stage.
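Putting Steps 1 through 3 together, the sketch below shows one plausible shape of the retrieval routine. The graph layout (`keys`, `edges`, `nodes` dictionaries), the `extract_keywords` and `embed` callables, and the simple keyword-count relevance ranking are assumptions made for illustration; only the native-node filtering and the Top-K neighbor expansion follow the description above.

```python
def retrieve(query_text, graph, extract_keywords, embed, top_k=8, n_k=16):
    """Perception-guided retrieval sketch (names and graph layout are illustrative).
    graph = {"keys":  {index_key: [node_id, ...]},
             "edges": {node_id: [neighbor_id, ...]},
             "nodes": {node_id: {"label": "native" | "entity", "text": str}}}"""
    context_kw, entity_kw = extract_keywords(query_text)       # two-layer keyword extraction
    keywords = context_kw + entity_kw
    hits = set()
    for kw in keywords:                                        # match index keys on the graph
        hits.update(graph["keys"].get(kw, []))
    expanded = set(hits)
    for nid in hits:                                           # Top-K nearest-neighbor expansion
        expanded.update(graph["edges"].get(nid, [])[:top_k])
    # Keep only nodes tagged 'native': verbatim clauses never rewritten by an LLM.
    native = [graph["nodes"][nid] for nid in expanded
              if graph["nodes"][nid]["label"] == "native"]
    # Crude relevance proxy: count keyword mentions in the clause text.
    native.sort(key=lambda n: sum(kw.lower() in n["text"].lower() for kw in keywords),
                reverse=True)
    items = native[:n_k]
    return [(n["text"], embed(n["text"])) for n in items]      # verbatim text + embedding
```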

As illustrated in Figure 4, we first extend the planner to function as both a planning and world model, generating diverse trajectories and predicting future states. Taking HENet++ [28] as an example, we inject N_T distinct Gaussian noise vectors into the ego-vehicle features, thereby generating N_T ego-specific queries that decode into N_T distinct trajectories. The baseline end-to-end model is fine-tuned with these modifications and an additional diversity constraint: when the L2 distance e between two planned trajectories falls below a predefined threshold τ, the term (τ - e) is added to the loss function as a penalty. This encourages the planner to explore a wider range of feasible trajectories while maintaining consistency and realism.
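A small PyTorch sketch of the two modifications described here: injecting Gaussian noise into the ego feature to obtain N_T distinct queries, and the pairwise penalty (τ - e) applied when two planned trajectories come closer than τ. The noise scale and the use of the mean waypoint distance as e are assumptions.

```python
import torch

def make_ego_queries(ego_feat: torch.Tensor, n_t: int = 20, sigma: float = 0.1) -> torch.Tensor:
    """Create N_T ego-specific queries from one ego feature by adding Gaussian noise."""
    return ego_feat.unsqueeze(0) + sigma * torch.randn(n_t, ego_feat.shape[-1])

def diversity_penalty(trajs: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """trajs: (N_T, T, 2) candidate trajectories decoded from the noisy ego queries.
    Adds (tau - e) for every pair whose mean L2 waypoint distance e falls below tau."""
    n = trajs.shape[0]
    penalty = trajs.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            e = torch.norm(trajs[i] - trajs[j], dim=-1).mean()
            penalty = penalty + torch.clamp(tau - e, min=0.0)
    return penalty
```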

Next, for each candidate trajectory T_i, KnowVal performs item-wise value assessment based on the retrieved knowledge entries K_j. The Value Model follows a Transformer encoder-MLP decoder architecture. For the i-th trajectory and the j-th knowledge entry feature f_{K_j} ∈ R^{N_tokens×C}, we take max(f_{K_j}) ∈ R^C and concatenate it with the positional encoding of T_i and its ego-vehicle features to form the query Q_{i,j}. Through this process, N_K query tokens are generated for each candidate trajectory. These tokens are projected linearly and fed into a Transformer layer, while the future-state tokens S_i, which incorporate ego-vehicle, instance, and occupancy-grid features, each with a positional embedding corresponding to T_i, serve as key-value pairs. After L Transformer iterations, the updated N_K features are decoded by the MLP to produce N_K scalar values s_{i,j} ∈ [-1, 1]. A value near -1 indicates behavior inconsistent with knowledge-based reasoning, a value near 1 represents positive compliance, and 0 denotes irrelevance to the rule.
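A sketch of the Value Model head under the shapes implied above. The paper describes a Transformer encoder followed by an MLP decoder, with the future-state tokens acting as keys and values; here that cross-attention is approximated with a standard nn.TransformerDecoder stack, which is an implementation assumption, as are all dimensions and names.

```python
import torch
import torch.nn as nn

class ValueModel(nn.Module):
    """Per trajectory T_i: N_K query tokens (one per retrieved knowledge entry K_j)
    cross-attend to the future-state tokens S_i; an MLP maps each to s_{i,j} in [-1, 1]."""
    def __init__(self, c: int = 256, n_layers: int = 3, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(3 * c, c)                  # [max(f_Kj); pos-enc(T_i); ego feature]
        layer = nn.TransformerDecoderLayer(d_model=c, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, 1), nn.Tanh())

    def forward(self, f_k, traj_pe, ego_feat, future_states):
        # f_k: (B, N_K, N_tokens, C) knowledge-entry features
        # traj_pe, ego_feat: (B, C); future_states S_i: (B, N_S, C)
        k_pooled = f_k.max(dim=2).values                          # (B, N_K, C)
        n_k = k_pooled.shape[1]
        ctx = torch.cat([traj_pe, ego_feat], dim=-1)              # (B, 2C)
        ctx = ctx.unsqueeze(1).expand(-1, n_k, -1)                # (B, N_K, 2C)
        q = self.proj(torch.cat([k_pooled, ctx], dim=-1))         # queries Q_{i,j}: (B, N_K, C)
        q = self.decoder(q, future_states)                        # cross-attend to S_i as K/V
        return self.mlp(q).squeeze(-1)                            # (B, N_K) scores in [-1, 1]

# scores = ValueModel()(f_k, traj_pe, ego_feat, future_states)  # one call per candidate trajectory
```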

We then apply a weighted-decay scoring strategy to aggregate the per-rule evaluations into a single score for each trajectory. With the retrieved rules sorted by descending relevance, the total value of trajectory T_i is defined as the normalized weighted average

V_i = (1 / Z) * Σ_{j=1}^{N_K} γ^{j-1} s_{i,j},

where Z = Σ_{j=1}^{N_K} γ^{j-1} is a normalization constant and γ is a decay factor. This weighting ensures that the most relevant knowledge clauses contribute most significantly to decision-making, enabling interpretable, knowledge-driven trajectory evaluation. Finally, the trajectory with the highest overall score is selected as the final planning output.
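A direct transcription of this aggregation as a small helper; the only assumption is that the per-rule scores arrive already sorted by retrieval relevance.

```python
def aggregate_scores(scores, gamma=0.7):
    """scores: per-rule values s_{i,j}, sorted by descending retrieval relevance.
    Returns the normalized, relevance-decayed total value V_i for trajectory i."""
    z = sum(gamma ** j for j in range(len(scores)))               # Z = sum_j gamma^(j-1)
    return sum(gamma ** j * s for j, s in enumerate(scores)) / z

# Selecting the final plan among candidates:
# best = max(range(len(candidates)), key=lambda i: aggregate_scores(per_rule_scores[i]))
```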

To train and evaluate the Value Model, we construct the Preference Dataset for Values, which provides quantitative supervision for the abstract notion of whether a driving behavior complies with a specific knowledge clause. The dataset consists of 160K trajectory-knowledge pairs, each accompanied by corresponding scene states (feature vectors and BEV renderings) and ground-truth value annotations. We use 80% of the data for training and reserve the remaining 20% for validation and testing.

Step 1: Future States and Retrieved Basis. Our data construction pipeline begins by employing the retrieval engine described in Section 3.2.2 to obtain a set of relevant knowledge clauses Basis = {K_1, K_2, . . .} for each driving scenario in nuScenes [1]. We then collect the corresponding future-state vectors, BEV renderings of perception and prediction results, and other contextual information for each scenario.

Step 2: Annotation Scores for Supervision. We design precise prompts and employ Qwen-VL-Max to automatically generate preliminary compliance scores for each data sample. These scores serve as supervision signals, indicating how well each driving behavior aligns with the corresponding knowledge clauses. To ensure annotation reliability, we conduct a manual review to identify and correct any apparent errors or outliers. During training, the loss function is defined as the mean squared error (MSE) between the predicted value and the annotated score in the constructed dataset.
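A minimal training-step sketch for the Value Model under the MSE objective described here, reusing the ValueModel interface from the earlier sketch; the batch layout and the commented-out learning rate are assumptions.

```python
import torch
import torch.nn.functional as F

def value_model_step(value_model, batch, optimizer):
    """One optimization step: regress predicted per-rule values onto annotated scores."""
    pred = value_model(batch["f_k"], batch["traj_pe"], batch["ego_feat"],
                       batch["future_states"])                # (B, N_K), values in [-1, 1]
    loss = F.mse_loss(pred, batch["scores"])                  # annotated compliance scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(value_model.parameters(), lr=1e-4)           # lr is illustrative
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```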

We evaluate our approach on two widely used autonomous driving benchmarks: the open-loop real-world dataset nuScenes [1] and the closed-loop simulation benchmark Bench2Drive (B2D) [10]. nuScenes includes sensor data from six surround-view cameras, radar, and LiDAR across 1,000 driving scenes. Its planning metrics consist of the L2 error between predicted and human trajectories and the collision rate. Bench2Drive (B2D) provides 10,000 sampled driving segments for training and offers closed-loop evaluation environments with detailed simulator configurations. B2D employs two key metrics: Driving Score and Success Rate. The Driving Score is a composite measure that integrates Route Completion and Infraction Penalty, while the Success Rate reflects the percentage of successfully completed driving tasks.

We employ Qwen2.5-3B for retrieval and knowledge embedding, with the number of retrieved knowledge entries set to N_K = 16. For the planner, the number of candidate trajectories is N_T = 20, and the Value Model performs L = 3 iterative reasoning steps. The decay factor used in the total value assessment is γ = 0.7. To increase trajectory diversity, baseline models on nuScenes are fine-tuned for three epochs using eight A100 GPUs, while models on Bench2Drive are fine-tuned for one epoch under the same configuration. The Value Model is trained separately on eight V100 GPUs for 50 epochs using the AdamW optimizer with a cosine annealing learning rate schedule.

KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art performance on Bench2Drive. These results demonstrate the effectiveness of our knowledge-grounded planning framework, enabling the autonomous driving system to make safer, more robust decisions in complex and dynamic driving environments.

Figure 5 illustrates a representative example of the inference process in our proposed KnowVal, highlighting its reasoning effectiveness. Comparative results with baseline methods are presented in Figure 6. Compared to existing approaches, our paradigm exhibits a superior ability to interpret and act upon ethical and legal considerations during decision-making. It is important to note that current benchmarks lack both sufficient evaluation data and appropriate metrics to assess many forms of improper driving behavior.

In the examples shown in Figure 6, actions such as failing to yield to pedestrians or overtaking within tunnels do not affect standard metrics like collision rate or driving score, underscoring the limitations of existing evaluation protocols.

Ablation on Overall System. As shown in the ablation results, integrating the knowledge retrieval and value model components, as well as incorporating open-world perception and retrieval-guided supplementary perception, consistently reduces the collision rate. Although the deviation from human driving trajectories increases slightly, this does not imply a decline in planning quality. As shown in the middle and lower sections of the ablation table, we further analyze the effects of varying the numbers of knowledge entries and candidate trajectories. In addition, we augmented the dataset with numerous randomly generated low-quality trajectories as negative samples, significantly increasing both scale and diversity.

This paper presents KnowVal, a new autonomous driving system that achieves visual-language reasoning through the synergistic interaction between open-world perception and knowledge retrieval. Specifically, we construct a comprehensive knowledge graph for autonomous driving that integrates traffic laws, defensive driving principles, and ethical considerations, along with an efficient LLM-based retrieval mechanism tailored to driving contexts. To enable value-aligned decision-making, we curate a human-preference dataset and train a Value Model that provides an interpretable foundation for trajectory evaluation. Extensive experiments demonstrate that KnowVal substantially enhances planning performance while maintaining full compatibility with existing methods, achieving the lowest collision rate on nuScenes and establishing state-of-the-art performance on Bench2Drive.

To better illustrate the role of our proposed and trained Value Model, Figure 7 showcases several examples and visualizes ego vehicle trajectories across two scenarios, where trajectories 2 and 4 were manually constructed as negative samples and the black trajectory in the center represents the ego vehicle. In the first case (top), the first trajectory successfully avoids pedestrians and the leading vehicle, whereas the trajectory in the second image heads directly toward them. A comparison of the predicted scores for knowledge entries 1, 3, and 5 demonstrates that the Value Model has successfully learned the concepts of “maintaining a safe following distance” and “respecting and yielding to pedestrians.”

In the second scenario (bottom, images 3 and 4), a large truck is present ahead of the ego vehicle. According to defensive driving principles, one should maintain a distance from large, risky vehicles. The comparison of predicted scores for knowledge entries 1 and 4 indicates that the Value Model has effectively learned the concepts of “maintaining a safe following distance” and “avoiding large vehicles.” In both scenarios, the scores predicted by our Value Model closely align with the GT.

This demonstrates the model's ability to effectively gauge knowledge compliance across various contexts and assign reasonable scores accordingly.

Figure 7 lists the retrieved knowledge entries for both scenarios, together with the predicted and ground-truth (GT) scores:

  1. Yield to pedestrians and non-motor vehicles.

  2. Do not honk to rush vulnerable pedestrians, including the elderly, disabled, or pregnant women.

  3. Maintain a safe following distance at all times to prevent rear-end collisions.

  4. Treat unexplained slow driving or frequent braking as a risk. Increase distance and overtake when safe.

  5. Watch for sudden crossings, especially on roads without crosswalks. Cover the brake upon seeing pedestrians and be ready to stop.

The prompt used for constructing the GT scores (Listing 1) specifies the required output format as follows. The object must have a single key "evaluations", which is a list. Each item in the list must adhere to the following structure:

  1. A "rule" key with the original rule text.

  2. A "reasoning" key, an object with the following fields:

  - "positive_evidence": Visual cues that the rule is being followed. State "None" if none.

  - "negative_evidence": Visual cues that the rule is being violated. State "None" if none.

  - "risk_level": A string describing the risk for violations ("Negligible", "Low", "Moderate", "High"). For non-violations, state "N/A".

  - "conclusion": A summary of your 4-step analysis, explicitly stating the final judgment (e.g., "Not Applicable", "Perfect Adherence", "Moderate Risk Violation").

  3. A "score" key, a float number that is a direct consequence of your "conclusion" and "risk_level".

Example of your required output format:

```json
{{
  "scene_description": "A vehicle is approaching an intersection with a pedestrian waiting to cross at a marked crosswalk. The traffic light is green.",
  "evaluations": [
    {{
      "rule": "Yield to pedestrians at the crosswalk.",
      "reasoning": {{
        "positive_evidence": "The vehicle is slowing down as it approaches the crosswalk.",
        "negative_evidence": "The vehicle does not come to a complete stop and proceeds through the crosswalk, forcing the pedestrian to wait.",
        "risk_level": "High",
        "conclusion": "High Risk Violation. The rule is applicable and evidence is clear. The vehicle failed to yield to a pedestrian ready to cross, creating immediate danger."
      }},
      "score": -0.8
    }},
    {{
      "rule": "Obey traffic signals.",
      "reasoning": {{
        "positive_evidence": "The traffic light for the vehicle's direction of travel is green, and the vehicle is proceeding through the intersection.",
        "negative_evidence": "None",
        "risk_level": "N/A",
        "conclusion": "Perfect Adherence. The rule is applicable, and the vehicle is correctly following the green light signal."
      }},
      "score": 1.0
    }},
    {{
      "rule": "No Honking.",
      "reasoning": {{
        "positive_evidence": "None",
        "negative_evidence": "None",
        "risk_level": "N/A",
        "conclusion": "Applicable, but no evidence. The rule is relevant in a city scene, but there is no visual or contextual information to determine if the horn was used."
      }},
      "score": 0.0
    }},
    {{
      "rule": "Stop at Railroad Crossing.",
      "reasoning": {{
        "positive_evidence": "None",
        "negative_evidence": "None",
        "risk_level": "N/A",
        "conclusion": "Not Applicable. There are no railway tracks or crossings visible in the scene."
      }},
      "score": 1.0
    }}
  ]
}}
```

Listing 1. The prompt used for constructing the GT of the Value Model dataset.

—Goal—
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities. Use {language} as output language.

—Steps—
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalize the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other. For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document. Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in {language} as a single list of all the entities and relationships identified in steps 1, 2, and 3. Use **{record_delimiter}** as the list delimiter.

Some samples from the value model dataset are shown in Figure 7. We will make the entire dataset and the model publicly available in the future.

Closed-loop Evaluation on Bench2Drive. As shown in Table 2, KnowVal achieves a further 3.35 improvement in Driving Score and a 1.76% gain in Success Rate over the VLA-based method SimLingo, establishing new state-of-the-art performance on Bench2Drive.


Figure 5. Example inference process in KnowVal. The verbalized scene analysis (e.g., "Instances: Car (4.5x1.9m): 39.8m ahead, 9.2m left. Car (4.4x1.8m): 34.6m ahead, 8.2m left. Pedestrian: 2.8m ahead, 0.4m right. Occupancy: Drivable Area: ahead-center. Sidewalk: 10.2m ahead, 9.4m left.") forms the retrieval query, and the retrieved knowledge entries include, among others: C8S1 "Respecting Pedestrians"; C8S1A1 "Drive slowly through puddles to avoid splashing others. Keep a safe distance and avoid sudden maneuvers. Stay in the lane center."; C3S3 "Yielding to Pedestrians & Non-Motorized Vehicles"; C3S3A3 "Slow down for pedestrians/cyclists on roads with puddles, mud, or gravel."; C1S1A1 "Obey traffic police over signals. With no signals, proceed with caution."

Ablation table legend. K&V: integrating the knowledge retrieval and value model components; OW: open-world perception; RgP: using retrieval results to guide supplementary perception; N_K: the number of retrieved knowledge entries; W_K: applying weighting to the total score; N_T: the number of candidate trajectories. Metrics: L2 (m) ↓ and collision rate (Col.) ↓.
