📝 Original Info
- Title: Semantic-Drive: A Local-First Autonomous Driving Data Mining Framework
- ArXiv ID: 2512.12012
- Date: 2025-12-12
- Authors: Antonio Guillen-Perez
📝 Abstract
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neurosymbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
📄 Full Content
Semantic-Drive: Democratizing Long-Tail Data
Curation via Open-Vocabulary Grounding and
Neuro-Symbolic VLM Consensus
Antonio Guillen-Perez
Independent Researcher
antonio_algaida@hotmail.com
antonioalgaida.github.io
Abstract
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the
scarcity of "Long-Tail" training data. While fleets collect petabytes of video
logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction
diversions) remains a manual, cost-prohibitive process. Existing solutions rely on
coarse metadata search, which lacks precision, or cloud-based VLMs, which are
privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-
symbolic framework for semantic data mining. Our approach decouples perception
into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector
(YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM
that performs forensic scene analysis. To mitigate hallucination, we implement
a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-
Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the
Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of
0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared
to the best single scout models. The system runs entirely on consumer hardware
(NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
1 Introduction
The fundamental challenge in scaling autonomous perception is the imbalanced distribution of
training data. As illustrated in Figure 1, driving scenarios follow a heavy-tailed (Zipfian) distribution.
The "Head" of the distribution comprises the vast majority of collected logs (≈99%), representing
nominal driving conditions such as highway cruising or stopped traffic. While abundant, this data
offers diminishing returns for improving model robustness.
The critical value for Level 4 safety validation lies in the "Long Tail": rare, high-entropy events
such as construction zones with conflicting lane markings, erratic vulnerable road users (VRUs),
or sensor degradation due to sudden weather changes Caesar et al. [2019]. Currently, identifying
these samples within petabyte-scale "Data Lakes" constitutes a "Dark Data" crisis. Manual review is
cost-prohibitive at this scale, and heuristic metadata tags (e.g., weather=rain) lack the semantic
granularity to distinguish between a wet road and a dangerous hydroplaning risk.
Currently, mining these safety-critical scenarios from archival footage is a bottleneck. Traditional
methods rely on brittle heuristics (e.g., querying CAN bus data for hard braking) or metadata keyword
search, which suffers from poor temporal granularity. While recent Vision-Language Models (VLMs)
like GPT-4V offer promising semantic understanding, relying on closed-source cloud APIs for data
curation is impractical for the automotive industry due to strict data privacy regulations (GDPR),
bandwidth constraints, and the prohibitive cost of processing video streams at scale.
Under Review
arXiv:2512.12012v2 [cs.CV] 16 Dec 2025
[Figure 1 plot: Frequency (log scale) vs. Scenario Complexity / Entropy, contrasting "The Head" (nominal driving, ~99% of data) with "The Long-Tail" (safety-critical edge cases, <1% of data), annotated with the current human-labeling limit and the Semantic-Drive mining target.]
Figure 1: The "Dark Data" Crisis in Autonomous Driving. The distribution of driving scenarios
follows a Power Law (Zipfian) distribution. (Left) The "Head": Represents 99% of data logs,
consisting of nominal, low-entropy driving (e.g., highway cruising) which provides diminishing
returns for model training. (Right) The "Long Tail": Contains rare, safety-critical edge cases defined
by the Waymo Open Dataset (WOD-E2E) taxonomy, such as erratic VRUs or sensor degradation.
Traditional human annotation is cost-prohibitive for mining this region. Semantic-Drive automates
the retrieval of these high-value samples.
To bridge this gap, we introduce Semantic-Drive, a privacy-preserving, local-first framework for
semantic data mining. Unlike end-to-end driving agents (e.g., DriveGPT4 Xu et al. [2024]) that
utilize VLMs for control, Semantic-Drive focuses on retrieval, acting as a "Cognitive Indexer" that
transforms raw, unstructured video logs into a queryable semantic database.
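To make the "Cognitive Indexer" idea concrete, here is a minimal sketch of what a queryable semantic database over mined frames might look like. The schema, field names, and tag values are illustrative assumptions, not the authors' implementation:

```python
import sqlite3

# Hypothetical schema: each frame is reduced to a structured semantic
# record (tags, risk score) that can be queried without re-running any
# model over the raw footage.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE frame_index (
        frame_id   TEXT PRIMARY KEY,
        timestamp  REAL,
        tags       TEXT,   -- comma-separated scene tags from the VLM stage
        risk_score REAL    -- consensus risk estimate in [0, 1]
    )
""")

# Toy records standing in for mined nuScenes frames.
rows = [
    ("scene-0103/frame-012", 4.1, "construction,lane_diversion", 0.82),
    ("scene-0751/frame-044", 9.0, "highway,clear", 0.05),
    ("scene-0553/frame-031", 2.7, "jaywalking,vru,erratic", 0.91),
]
conn.executemany("INSERT INTO frame_index VALUES (?, ?, ?, ?)", rows)

# Retrieve long-tail candidates: high-risk frames involving VRUs.
hits = conn.execute(
    "SELECT frame_id FROM frame_index "
    "WHERE risk_score > 0.5 AND tags LIKE '%vru%'"
).fetchall()
print(hits)  # [('scene-0553/frame-031',)]
```

Once indexed this way, a retrieval query over petabytes of logs becomes a cheap database lookup rather than a re-inference pass.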
Semantic-Drive is a novel Neuro-Symbolic Architecture designed to run efficiently on consumer-
grade hardware (e.g., a single NVIDIA RTX 3090). Pure VLMs often suffer from hallucination
and "small object blindness." To mitigate this, our framework separates perception into two distinct
pathways: (1) A symbolic "Grounding" stage using real-time Open-Vocabulary Object Detection
to generate a high-recall inventory of hazards, and (2) A cognitive "Reasoning" stage whe
…(Full text truncated)…
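The decoupling described above, where a high-recall symbolic inventory from the open-vocabulary detector anchors the VLM's reasoning, can be sketched roughly as follows. Function names, the prompt format, and the stand-in models are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Detection:
    label: str        # open-vocabulary class name
    confidence: float

def build_grounded_prompt(detections: list[Detection]) -> str:
    """Stage 1 -> Stage 2 handoff: the symbolic inventory is injected
    into the VLM prompt so the reasoning model attends to small or rare
    objects it might otherwise hallucinate away."""
    inventory = ", ".join(f"{d.label} ({d.confidence:.2f})" for d in detections)
    return (
        "Detected objects (symbolic grounding): " + inventory + ".\n"
        "Analyze the scene for safety-critical behavior and rate the risk."
    )

def mine_frame(frame, detector: Callable, vlm: Callable) -> str:
    # Stage 1: symbolic grounding (high-recall hazard inventory).
    detections = detector(frame)
    # Stage 2: cognitive analysis, anchored on the inventory.
    return vlm(frame, build_grounded_prompt(detections))

# Toy stand-ins for YOLOE and the reasoning VLM.
fake_detector = lambda f: [Detection("pedestrian", 0.91), Detection("traffic_cone", 0.64)]
fake_vlm = lambda f, prompt: prompt.splitlines()[0]
print(mine_frame(None, fake_detector, fake_vlm))
```

The key design choice this sketch captures is that the detector's output is treated as a symbolic constraint on the VLM, rather than fusing the two models end to end.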