Semantic-Drive: A Local-First Autonomous Driving Data Mining Framework

Reading time: 5 minutes
...

📝 Abstract

The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of “Long-Tail” training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neurosymbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a “System 2” inference-time alignment strategy, utilizing a multi-model “Judge-Scout” consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
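The "Judge-Scout" consensus mechanism is only named in the abstract, not specified. As a rough sketch of the general idea (the report format, median fusion, and majority voting here are assumptions of this illustration, not the paper's actual protocol), multiple "scout" models can each emit a risk score and a hazard list, with the fused output made robust to a single hallucinating scout:

```python
from collections import Counter
from statistics import median

def consensus(scout_reports):
    """Fuse per-scout reports of the form {'risk': float, 'hazards': [str]}.
    Risk: median of scout risks (robust to one outlier/hallucinating scout).
    Hazards: kept only if a majority of scouts independently report them."""
    risks = [r["risk"] for r in scout_reports]
    counts = Counter(h for r in scout_reports for h in set(r["hazards"]))
    majority = len(scout_reports) // 2 + 1
    hazards = sorted(h for h, c in counts.items() if c >= majority)
    return {"risk": median(risks), "hazards": hazards}

reports = [
    {"risk": 0.8, "hazards": ["jaywalker", "construction"]},
    {"risk": 0.7, "hazards": ["jaywalker"]},
    {"risk": 0.1, "hazards": ["phantom_truck"]},  # a hallucinating scout
]
print(consensus(reports))  # {'risk': 0.7, 'hazards': ['jaywalker']}
```

Note how the invented "phantom_truck" hazard and the outlier 0.1 risk score are both voted out, which is the kind of inference-time alignment effect the abstract attributes to the consensus stage.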

📄 Content

Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

Antonio Guillen-Perez, Independent Researcher (antonio_algaida@hotmail.com, antonioalgaida.github.io)

1 Introduction

The fundamental challenge in scaling autonomous perception is the imbalanced distribution of training data. As illustrated in Figure 1, driving scenarios follow a heavy-tailed (Zipfian) distribution. The “Head” of the distribution comprises the vast majority of collected logs (≈99%), representing nominal driving conditions such as highway cruising or stopped traffic. While abundant, this data offers diminishing returns for improving model robustness.
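As a back-of-the-envelope illustration of how a Zipf-style law concentrates data in the head (a toy distribution with an invented exponent, not the paper's measured statistics):

```python
# Toy Zipf-like distribution over 1,000 hypothetical scenario types.
# The exponent s is invented for illustration, not fit to real logs.
def zipf_mass(n_types, s=2.0):
    """Normalized probability mass p(k) proportional to 1/k^s, k = 1..n_types."""
    weights = [1.0 / (k ** s) for k in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = zipf_mass(1000)
head_share = sum(p[:10])    # mass of the 10 most common scenario types
tail_share = sum(p[100:])   # combined mass of the 900 rarest types
print(f"head (top 10): {head_share:.1%}, long tail (rarest 900): {tail_share:.2%}")
```

With these toy numbers the ten most common scenario types carry over 94% of the mass while the 900 rarest together carry under 1%, mirroring the ≈99% head / <1% tail split described above.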
The critical value for Level 4 safety validation lies in the “Long Tail”: rare, high-entropy events such as construction zones with conflicting lane markings, erratic vulnerable road users (VRUs), or sensor degradation due to sudden weather changes Caesar et al. [2019]. Identifying these samples within petabyte-scale “Data Lakes” constitutes a “Dark Data” crisis: manual review is cost-prohibitive at this scale, and heuristic metadata tags (e.g., weather=rain) lack the semantic granularity to distinguish between a wet road and a dangerous hydroplaning risk.

Mining these safety-critical scenarios from archival footage therefore remains a bottleneck. Traditional methods rely on brittle heuristics (e.g., querying CAN bus data for hard braking) or on metadata keyword search, which suffers from poor temporal granularity. While recent Vision-Language Models (VLMs) like GPT-4V offer promising semantic understanding, relying on closed-source cloud APIs for data curation is impractical for the automotive industry due to strict data privacy regulations (GDPR), bandwidth constraints, and the prohibitive cost of processing video streams at scale.

Under Review. arXiv:2512.12012v2 [cs.CV], 16 Dec 2025.

[Figure 1 schematic: frequency (log scale) vs. scenario complexity/entropy. Left panel, “The Head” (~99% of data): highway cruising, clear weather, stopped at red light. Right panel, “The Long-Tail” (dark data): construction lane diversions, erratic/hesitant VRU interaction, glare/droplet sensor-fidelity issues; annotations mark the current human labeling limit and the Semantic-Drive mining target.]

Figure 1: The “Dark Data” Crisis in Autonomous Driving. The distribution of driving scenarios follows a Power Law (Zipfian) distribution. (Left) The “Head”: Represents 99% of data logs, consisting of nominal, low-entropy driving (e.g., highway cruising) which provides diminishing returns for model training.
(Right) The “Long Tail”: Contains rare, safety-critical edge cases defined by the Waymo Open Dataset (WOD-E2E) taxonomy, such as erratic VRUs or sensor degradation. Traditional human annotation is cost-prohibitive for mining this region. Semantic-Drive automates the retrieval of these high-value samples.

To bridge this gap, we introduce Semantic-Drive, a privacy-preserving, local-first framework for semantic data mining. Unlike end-to-end driving agents (e.g., DriveGPT4 Xu et al. [2024]) that utilize VLMs for control, Semantic-Drive focuses on retrieval, acting as a “Cognitive Indexer” that transforms raw, unstructured video logs into a queryable semantic database.

Semantic-Drive is a novel Neuro-Symbolic Architecture designed to run efficiently on consumer-grade hardware (e.g., a single NVIDIA RTX 3090). Pure VLMs often suffer from hallucination and “small object blindness.” To mitigate this, our framework separates perception into two distinct pathways: (1) a symbolic “Grounding” stage using real-time Open-Vocabulary Object Detection to generate a high-recall inventory of hazards, and (2) a cognitive “Reasoning” stage where a Reasoning VLM performs forensic scene analysis.
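The two-pathway split can be sketched as a minimal pipeline skeleton. Both stages are stubbed: in the paper they are YOLOE and a local Reasoning VLM, whose real APIs are not reproduced here, and the prompt set, scoring rule, and record format below are all invented for illustration:

```python
# Toy skeleton of the grounding -> reasoning split (stubbed stages).
OPEN_VOCAB_PROMPTS = {"traffic cone", "jaywalking pedestrian", "debris"}  # hypothetical

def ground(frame):
    """Stage 1 (symbolic): open-vocabulary detection producing a
    high-recall hazard inventory. Stubbed as set membership."""
    return [obj for obj in frame["objects"] if obj in OPEN_VOCAB_PROMPTS]

def reason(frame, inventory):
    """Stage 2 (cognitive): VLM-style scene analysis, anchored on the
    symbolic inventory so small objects are not silently ignored.
    Stubbed with a trivial risk score proportional to inventory size."""
    risk = min(1.0, 0.3 * len(inventory))
    return {"frame": frame["id"], "hazards": inventory, "risk": risk}

frame = {"id": "log_a/frame_0042",
         "objects": ["sedan", "traffic cone", "jaywalking pedestrian"]}
record = reason(frame, ground(frame))
print(record)
# {'frame': 'log_a/frame_0042',
#  'hazards': ['traffic cone', 'jaywalking pedestrian'], 'risk': 0.6}
```

The design point the sketch preserves is that stage 2 never sees the raw frame alone: its attention is anchored by the symbolic inventory from stage 1, which is the paper's stated mitigation for hallucination and small-object blindness.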

This content is AI-processed based on ArXiv data.
