CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis

Reading time: 6 minutes

📝 Original Info

  • Title: CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis
  • ArXiv ID: 2512.18878
  • Date: 2025-12-21
  • Authors: Kaidi Liang, Ke Li, Xianbiao Hu, and Ruwen Qin

📝 Abstract

Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at https://github.com/Liangkd/CrashChat.
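The abstract reports gains of 0.18-0.41 in BLEU and 0.18-0.42 in ROUGE for the description and reasoning tasks. As a rough illustration of how such overlap metrics are computed (this is a minimal stdlib sketch, not the paper's evaluation code, and the example sentences are invented):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(1, sum(cand_counts.values()))
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    a, b = candidate.split(), reference.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

ref = "the ego vehicle rear-ends a braking sedan in heavy rain"
hyp = "the ego vehicle rear-ends a sedan that brakes in rain"
print(round(bleu(hyp, ref), 3), round(rouge_l_f1(hyp, ref), 3))
```

In practice, published results typically use standard implementations (e.g., NLTK or the rouge-score package) with smoothing; the sketch above only shows the underlying arithmetic.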

📄 Full Content

CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis

Kaidi Liang¹ (0009-0001-9129-2744), Ke Li¹ (0009-0001-4958-3302), Xianbiao Hu² (0000-0002-0149-1847), and Ruwen Qin¹ (0000-0003-2656-8705)
¹ Stony Brook University, Department of Civil Engineering, Stony Brook, NY 11794, USA. {kaidi.liang,ke.li,ruwen.qin}@stonybrook.edu
² The Pennsylvania State University, Department of Civil and Environmental Engineering, University Park, PA 16802-1408, USA. xbhu@psu.edu

Keywords: Multimodal Large Language Model · Video Understanding · Visual Risk Perception · Multitask Learning · Autonomous Driving

arXiv:2512.18878v1 [cs.CV], 21 Dec 2025

1 Introduction

Video data are increasingly recognized as a readily accessible and informative source for traffic crash analysis, especially for autonomous vehicles. For instance, retrieving and analyzing crash footage captured by in-vehicle cameras can provide objective evidence about adverse events, facilitating timely and accurate crash reporting. Given the vast volume of captured video data and the complexity of crash events, automating crash-related video analysis has become a critical need. Specifically, a tool is needed to automatically identify anomalous segments from massive amounts of video data, ground them in their temporal evolution, and generate a semantic understanding of the events.

Crash video analysis is inherently a multitask problem. The desired capabilities, including crash recognition, temporal grounding, and understanding, can be enabled by six core tasks, as illustrated in Fig. 1. Crash recognition determines whether the input video contains a crash event. Crash localization identifies the precise interval during which the crash occurred, while pre-crash localization further determines when visual cues first appeared, signaling that a crash is imminent. Crash description provides a structured narrative of the event.
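Crash localization and pre-crash localization both predict a time interval, and such predictions are conventionally scored by temporal intersection-over-union against the annotated interval. A minimal sketch of that overlap measure (intervals in seconds; this is an illustration of the standard metric, not necessarily the paper's exact evaluation protocol):

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))   # overlap length, clamped at 0
    union = (pe - ps) + (ge - gs) - inter          # combined extent of both intervals
    return inter / union if union > 0 else 0.0

# Hypothetical example: predicted crash window vs. annotated ground truth.
print(temporal_iou((12.0, 15.0), (11.5, 14.0)))  # → ~0.571
```

Reported localization gains (e.g., the 176% improvement above) would then correspond to higher mean IoU or higher recall at a fixed IoU threshold over the test videos.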
Causal reasoning infers the underlying causes of the crash, and prevention reasoning identifies conditions or actions that could have avoided it.

Fig. 1. CrashChat - a multitask multimodal large language model performing six core tasks in support of crash video analysis in a unified way.

While focusing on different aspects of crash video analysis, the six tasks are intertwined. Crash recognition narrows down to the target segment, thereby enhancing the efficiency of temporal grounding. Further unfolding the identified segment into normal, pre-crash, and crash phases assists in distinguishing causal actions from subsequent effects. Conversely, the improved understanding of crash events will further enhance crash recognition. Therefore, a multitask framework can perform better than its monotask counterparts. Positive outcomes from multitasking rely on a good understanding of the relationship among those tasks and appropriate designs of the training strategy [2]. Yet, this remains largely unexplored for crash video analysis, where task interactions are complex. Multimodal large lang
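The excerpt describes a training strategy that decouples the six tasks and groups them so joint learning helps within and across groups while limiting negative transfer, but it does not spell out the grouping or the sampling schedule. The sketch below is therefore purely illustrative: it assumes a hypothetical split into a grounding group and an understanding group, and alternates groups round-robin while mixing tasks within each group. All names and the schedule are assumptions, not the paper's method.

```python
import random

# Hypothetical grouping of the six tasks; the paper's actual grouping
# is not given in this excerpt.
TASK_GROUPS = {
    "grounding": ["crash_recognition", "crash_localization", "pre_crash_localization"],
    "understanding": ["crash_description", "causal_reasoning", "prevention_reasoning"],
}

def sample_batches(data_by_task, steps, seed=0):
    """Build a training schedule that alternates between task groups and
    mixes tasks within each group, so related tasks share gradient steps."""
    rng = random.Random(seed)
    groups = list(TASK_GROUPS)
    schedule = []
    for step in range(steps):
        group = groups[step % len(groups)]      # round-robin across groups
        task = rng.choice(TASK_GROUPS[group])   # mix tasks within the group
        batch = rng.choice(data_by_task[task])  # pick a batch for that task
        schedule.append((group, task, batch))
    return schedule

data = {t: [f"{t}_sample_{i}" for i in range(3)]
        for tasks in TASK_GROUPS.values() for t in tasks}
for group, task, batch in sample_batches(data, steps=4):
    print(group, task, batch)
```

In a real fine-tuning loop, each scheduled batch would be formatted as an instruction-response pair and fed to the MLLM; the point of the sketch is only the decouple-then-group sampling structure.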

…(Full text truncated)…

📸 Image Gallery

  • model_architecture_overview.png
  • model_comparison_plot_final.png
  • qualitative_analysis.png
  • six_task_illustration_300.png

Reference

This content is AI-processed based on ArXiv data.
