TAU-R1: Visual Language Model for Traffic Anomaly Understanding
Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
💡 Research Summary
The paper tackles the problem of Traffic Anomaly Understanding (TAU), which goes beyond simple anomaly detection to require a detailed description of what happened, why it happened, and which road users were involved. Existing video‑anomaly datasets focus on binary labels or coarse categories and are often collected from edited or web‑mined sources, limiting their usefulness for fine‑grained, real‑world roadside surveillance. To fill this gap, the authors introduce the Roundabout‑TAU benchmark, built from 342 real‑world roundabout video clips captured by 28 fixed cameras in collaboration with the City of Carmel, Indiana. The dataset includes 2,064 question‑answer pairs that cover five aspects: environment perception, object grounding, anomaly time window, anomaly reasoning, and anomaly description. Each clip is also assigned to one of four anomaly classes (no anomaly, direction/manoeuvre violation, near‑collision/collision, abnormal road use).
Based on this benchmark, the authors propose TAU‑R1, a two‑layer hierarchical vision‑language framework designed for edge deployment. The first layer is a lightweight VLM (≤8 B parameters) that performs coarse anomaly classification in real time, filtering out normal traffic streams. When an anomaly is detected, the raw clip and its predicted class are passed to the second layer, a larger VLM (≈30 B parameters) that generates a comprehensive event summary in natural language. The summary is required to cover four perspectives: (1) environment (time of day, weather, road topology), (2) object grounding (vehicle type, color, location), (3) event description (movements and interactions), and (4) event analysis (underlying cause).
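The two-layer cascade described above can be sketched as a simple dispatch loop. This is an illustrative sketch only: the function names (`classify_clip`, `summarize_clip`), the label string, and the control flow are assumptions, not the authors' actual API; in a real deployment each function would wrap a VLM inference call.

```python
# Hypothetical sketch of the TAU-R1 two-layer pipeline: a lightweight
# classifier screens every clip, and only anomalous clips are forwarded
# to the larger summarizer. Function names and labels are illustrative.

NORMAL = "no anomaly"

def classify_clip(clip):
    """Layer 1: small VLM (<=8B params) returns a coarse anomaly class."""
    # Placeholder: a real system would run the lightweight VLM here.
    return NORMAL

def summarize_clip(clip, anomaly_class):
    """Layer 2: larger VLM (~30B params) writes the four-part summary
    (environment, object grounding, event description, event analysis)."""
    # Placeholder: a real system would prompt the large VLM with the
    # clip and its predicted class.
    return f"[{anomaly_class}] environment / grounding / description / analysis"

def process(clip):
    label = classify_clip(clip)
    if label == NORMAL:
        return label, None  # normal traffic is filtered out at layer 1
    return label, summarize_clip(clip, label)
```

The design choice worth noting is that the expensive reasoner only runs on the (rare) anomalous clips, which is what makes the pipeline viable on edge hardware.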
Training proceeds in two stages. Stage 1 uses decomposed‑QA enhanced supervised fine‑tuning (SFT), where the overall TAU task is split into five sub‑tasks corresponding to the five QA types. This forces the model to learn intermediate scene knowledge before attempting full‑sentence generation. Stage 2 introduces TAU‑GRPO, a domain‑specific extension of Group Relative Policy Optimization (GRPO). Two custom reward functions are defined: one for anomaly classification accuracy (weighted to handle class imbalance) and one for summarization quality (a weighted combination of BLEU‑4, ROUGE‑L, METEOR and human‑rated coherence). The reinforcement‑learning step fine‑tunes the large VLM to better align its generation with the TAU objectives.
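The summarization reward can be sketched as a weighted sum over normalized metric scores. The paper only states which metrics are combined; the equal weights below and the function name `summary_reward` are assumptions for illustration, and the individual metric scores are taken as precomputed inputs rather than computed here.

```python
# Hedged sketch of the TAU-GRPO summarization reward: a weighted
# combination of BLEU-4, ROUGE-L, METEOR, and a coherence score.
# The weights are illustrative; the paper does not report exact values.

def summary_reward(bleu4, rouge_l, meteor, coherence,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine metric scores (each normalized to [0, 1]) into one scalar
    reward used by the GRPO policy-update step."""
    scores = (bleu4, rouge_l, meteor, coherence)
    assert all(0.0 <= s <= 1.0 for s in scores), "scores must be in [0, 1]"
    return sum(w * s for w, s in zip(weights, scores))
```

In GRPO-style training, such a scalar reward is computed for each sampled summary in a group, and the per-sample advantages are taken relative to the group mean.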
Experimental results show that the lightweight classifier achieves >94 % accuracy in distinguishing normal from anomalous clips and >92 % F1 on each of the three anomaly categories. The large summarizer improves BLEU‑4 by 12‑18 % over baseline VLMs and receives higher human scores for “understanding” and “explanation completeness.” The full pipeline runs at ~28 FPS on a modest GPU, demonstrating suitability for real‑time edge deployment.
In summary, the paper makes three major contributions: (1) the Roundabout‑TAU dataset, the first real‑world roadside traffic‑anomaly benchmark with multi‑aspect QA annotations; (2) the TAU‑R1 two‑layer framework that balances efficiency and reasoning depth; and (3) a two‑stage training regime combining decomposed‑QA SFT with TAU‑GRPO reinforcement learning. The work advances traffic safety systems by enabling not only rapid anomaly screening but also rich, explainable narratives that can support faster incident response and better traffic management. Future directions include expanding the dataset to other intersection types and further compressing the large VLM for even tighter edge constraints.