CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving
Authors: Erick Silva, Rehana Yasmin, Ali Shoker
Erick Silva [0009-0003-7352-039X], Rehana Yasmin [0009-0006-9039-0648], and Ali Shoker [0000-0002-4898-9394]
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
{firstname.lastname}@kaust.edu.sa

Preprint — Currently Under Review

Abstract. As AVs grow in complexity and diversity, identifying the root causes of operational failures has become increasingly difficult. The heterogeneity of system architectures across manufacturers, ranging from end-to-end to modular designs, together with variations in algorithms and integration strategies, limits the standardization of incident investigations and hinders systematic safety analysis. This work examines real-world AV incidents reported in the NHTSA database. We curate a dataset of 2,168 cases reported between 2021 and 2025, representing more than 80 million miles driven. To process this data, we introduce CRASH, Cognitive Reasoning Agent for Safety Hazards, an LLM-based agent that automates reasoning over crash reports by leveraging both standardized fields and unstructured narrative descriptions. CRASH operates on a unified representation of each incident to generate concise summaries, attribute a primary cause, and assess whether the AV materially contributed to the event. Our findings show that (1) CRASH attributes 64% of incidents to perception or planning failures, underscoring the importance of reasoning-based analysis for accurate fault attribution; and (2) approximately 50% of reported incidents involve rear-end collisions, highlighting a persistent and unresolved challenge in autonomous driving deployment. We further validate CRASH with five domain experts, achieving 86% accuracy in attributing AV system failures.
Overall, CRASH demonstrates strong potential as a scalable and interpretable tool for automated crash analysis, providing actionable insights to support safety research and the continued development of autonomous driving systems.

Keywords: Safety · LLM · Autonomous Vehicles · ADS · ADAS.

1 Introduction

The safety premise of Autonomous Vehicles (AVs) remains far off, given the increasing number of failures and fatal incidents. Although simulation and formal safety architectures aim to prevent incidents, rare or catastrophic events can undermine public confidence, erode trust [26], and provoke debates over accountability and liability [10, 18]. Reflections on repeated AV missteps warn that without structured learning from past incidents, the industry risks repeating failures [12]. Systematic narrative analysis is therefore essential, both for technical improvement and for restoring public trust. Meanwhile, research on AV crash causality, such as analyses of disengagements and collision classifications, has identified key failure modes, including challenges related to road surface conditions [7, 13]. However, these studies often rely on human-intensive analysis and specialist expertise, which limits their scalability and depth. Emerging tools such as Large Language Models (LLMs) have the potential to transform the analysis of large-scale, unstructured narratives related to safety-critical scenarios [23], provided their inherent biases are carefully addressed.

This paper introduces CRASH, a reasoning-centric LLM-based agent for scalable analysis of AV incident reports. As AV incident databases continue to grow in size and complexity, traditional expert-driven analysis does not scale: manual review is time-consuming, difficult to standardize, and limits the ability to extract cross-incident insights.
Rather than replacing human expertise, CRASH reframes the analysis workflow by shifting the primary burden of structured reasoning and synthesis to an LLM-based agent, while retaining humans as reviewers, validators, and decision-makers. In doing so, we demonstrate a new interaction paradigm with large, text-heavy incident report databases that enables systematic, interpretable, and efficient safety analysis. Our contributions are summarized as follows:

– A reasoning-driven analysis methodology that operationalizes expert safety reasoning into a structured, multi-step LLM pipeline, enabling consistent interpretation of heterogeneous incident narratives rather than proposing a new data format.
– A causal attribution agent that performs structured decomposition of incident reports by assigning primary causes, identifying failed AV subsystems, detecting delayed AI perception or response, and generating concise, human-readable summaries to support expert review.
– A taxonomy-guided analysis framework coupled with automatic aggregation of model outputs, enabling the discovery of recurring AV failure patterns and providing data-driven insights to inform safer system design and policy discussions.

By leveraging automated narrative reasoning, we aim to help the AV field become more robust, not only by engineering safer systems but also by institutionalizing learning from failures.

The remainder of this paper is organized as follows. Section 2 reviews related work; Section 3 describes our data processing pipeline and details the architecture of our LLM-based agent; Section 4 presents evaluation results, followed by our findings on the dataset analysis in Section 5; and Section 6 concludes with lessons and future directions.
2 Related Works

2.1 Early AV crash analysis

Early AV crash analyses (2014–2021) relied on structured, report-level statistical aggregation of California DMV data to characterize disengagement causes, collision types, and attributed fault [1–3, 7, 11, 13]. While effective for trend identification, these approaches operate on predefined categorical labels without systematically inspecting the underlying narrative descriptions for causality chains or system-level interactions, a distinction central to our methodological contribution.

2.2 Qualitative analysis of reports

In contrast to purely statistical analyses, works such as [21, 22] conducted qualitative examinations of individual crash cases, manually analyzing narrative descriptions to extract contributing factors. By design, these studies rely on close reading and expert interpretation of a limited number of reports, enabling richer contextual insights but restricting scalability and reproducibility. Visualization platforms such as [4] further support exploratory qualitative analysis by curating narratives into searchable and interactive formats. However, these tools remain centered on manual inspection rather than automated, systematic narrative processing. These efforts demonstrate the value of unstructured crash narratives. Nevertheless, they lack a scalable framework for extracting and organizing causal and system-level information across large corpora, an explicit methodological gap addressed in our work.

2.3 Text classification and summarization using NLP

Natural language processing (NLP) has increasingly been applied to large-scale transportation safety reports to move beyond purely manual or aggregate analyses.
For example, [28] proposed an NLP-based pipeline for DMV disengagement reports (2014–2020), automatically labeling disengagement causes using predefined taxonomies and a supervised classification architecture. Their methodology formalizes narrative processing but remains tied to explicit label engineering and fixed output categories. Beyond transportation, [8] demonstrates that grounding LLM prompts in structured contextual signals improves root-cause recommendations and incident classification, highlighting the importance of domain-specific contextualization for reliable large-scale reasoning. Similarly, [27] benchmarked transformer-based narrative-mining approaches on Kentucky police reports (2015–2022), comparing fine-tuned models such as RoBERTa [14] with zero-shot LLMs including DeepSeek-R1:70B [9]. Their evaluation emphasizes performance–cost trade-offs and the role of model selection in scalable deployment.

Collectively, these works formalize narrative analysis through supervised learning, fine-tuning, or predefined classification schemes. However, they primarily frame the task as label prediction or severity estimation, rather than structured reasoning over causal chains and multi-stage system interactions. Our work builds on this methodological progression by treating crash narratives as inputs to a reasoning-oriented agent rather than solely a classification task. Using a more recent and comprehensive dataset (2021–2025), we generalize narrative-based analysis at scale without relying on rigid, manually engineered label sets. This enables systematic extraction of system-level failure patterns, bridging the gap between descriptive statistics, case-by-case qualitative analysis, and conventional supervised NLP pipelines.
Table 1 compares selected prior studies and CRASH across five methodological dimensions: use of statistical analysis, expert-defined rules, NLP/LLM techniques, scalability, and system-level reasoning. The comparison highlights methodological differences in how incident data are analyzed, and whether approaches move beyond descriptive statistics toward automated, structured reasoning.

Table 1: Methodological comparison of selected prior work and CRASH.

Work                | Stats | Expert Rules | NLP/LLM | Scalable | Sys. Reason
Favaro et al. [7]   |   ✓   |      –       |    –    |    –     |     –
Houseal et al. [11] |   ✓   |      –       |    –    |    –     |     –
Shah et al. [21]    |   –   |      ✓       |    –    |    –     |     ✓
Zhang et al. [27]   |   ✓   |      –       |    ✓    |    ✓     |     –
CRASH (Ours)        |   ✓   |      ✓       |    ✓    |    ✓     |     ✓

3 CRASH: Cognitive Reasoning Agent for Safety Hazards

To address the gaps identified in Section 2, namely the limited scalability of human-driven analysis and the absence of systematic causal reasoning over narrative reports, we introduce the CRASH agent architecture. Rather than treating the LLM as an isolated reasoning component, CRASH is designed as a modular, reproducible processing pipeline that ensures traceability from raw reports to insights into data distributions and simulation-ready outputs. The architecture separates data conditioning, language-based reasoning, and structured interpretation into distinct stages to avoid black-box behavior and preserve analytical rigor. The proposed CRASH architecture, depicted in Fig. 1, consists of three main modules: Preprocessing, where we digest and filter the incidents database; Processing, where the filtered data are provided to the LLM for structured causal extraction; and Postprocessing, where we analyze the resulting data distributions, check LLM outputs for alignment, and generate structured inputs for simulation tools to recreate incident scenarios. The workflow of these modules is described in detail below.
Fig. 1: CRASH Architecture. During preprocessing, the database is filtered for incomplete entries and unified into four columns. Each row is then sent to Processing through the LLM. We finally aggregate all information into a CSV file and send it to Postprocessing, where the data are used to generate our analysis and simulation descriptions.

3.1 Preprocessing

Dataset and Filtering. CRASH is designed to operate on any structured incident report database that provides standardized metadata fields alongside free-text narrative descriptions. For this study, we instantiate the pipeline on the National Highway Traffic Safety Administration (NHTSA) database [16]. This crash reporting program provides a comprehensive, multi-manufacturer, and geographically diverse collection of ADS and ADAS incidents. Alternative sources, such as voluntary fleet summaries [15] or state-level ADS registries [3], tend to be either self-reported and fleet-specific or limited to a single region, making them less suited for uncovering generalizable system-level trends. The distribution of incidents by reporting entity for the NHTSA dataset is shown in Table 2. The original data, distributed across separate CSV files, was merged into a single dataset. Each report was restructured into four columns: Report ID (a unique identifier), Reporting Entity and Make (categorical features), and a Full Text field that concatenates all remaining structured metadata with the original narrative, providing the LLM a unified context window per incident. Entries with redacted or missing narratives were filtered out; pre- and post-filtering distributions are shown in Table 3.
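A minimal sketch of this unification step, assuming illustrative column names (the real NHTSA schema differs):

```python
# Sketch of the preprocessing step: each surviving report becomes four
# columns (Report ID, Reporting Entity, Make, Full Text). Column names and
# the redaction check are assumptions, not the authors' exact schema.
META = ("Report ID", "Reporting Entity", "Make")

def unify(reports):
    unified = []
    for rec in reports:
        narrative = rec.get("Narrative")
        # Filter out entries with redacted or missing narratives.
        if not narrative or "REDACTED" in narrative.upper():
            continue
        # Concatenate all remaining structured metadata with the narrative
        # into one Full Text field: a unified context window per incident.
        extra = {k: v for k, v in rec.items() if k not in META}
        full_text = " | ".join(f"{k}: {v}" for k, v in extra.items())
        unified.append({**{k: rec[k] for k in META}, "Full Text": full_text})
    return unified

rows = [
    {"Report ID": "R1", "Reporting Entity": "Waymo LLC", "Make": "Jaguar",
     "Roadway Type": "Intersection", "Narrative": "AV was rear-ended while stopped."},
    {"Report ID": "R2", "Reporting Entity": "Cruise LLC", "Make": "Chevrolet",
     "Roadway Type": "Highway", "Narrative": "[REDACTED]"},
]
print(unify(rows))  # only R1 survives, with a single Full Text field
```

In the full pipeline the same transformation would be applied to the merged CSV rows before they are handed to the Processing module.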
Table 2: Distribution of Incident Cases by Reporting Entity

Model             | Make                | Cases | %
Jaguar I-Pace     | Waymo LLC           | 1015  | 46.83%
Cruise AV         | GM LLC / Cruise LLC | 355   | 16.37%
Jaguar I-Pace     | Transdev            | 172   | 7.93%
Honda Civic       | Honda               | 113   | 5.21%
Toyota Highlander | Zoox, Inc.          | 88    | 4.06%
Others            | Various             | 425   | 19.60%
Total             |                     | 2168  | 100.00%

Table 3: Case distribution before and after filtering

Dataset Category | Original Cases | Final Cases
ADS              | 1790           | 1764
ADAS             | 2582           | 352
Others           | 3520           | 52
Total            | 7892           | 2168

3.2 Processing

LLM prompt construction. As shown in Fig. 1, our data processing technique comprises a Prompting Script, written in Python, that interfaces with the LLM. The prompt was developed using Prompt Engineering and In-Context Learning (ICL) [6], a paradigm in which the model is provided with a specific persona, a constrained rule set, and illustrative examples to guide its output. Preliminary testing showed that this approach outperformed the Chain-of-Thought (CoT) method [25]. While CoT encourages "step-by-step" reasoning, it significantly increased output token length and frequently caused the model to deviate from the required JSON schema due to context-window bottlenecks and "hallucinated" conversational filler. The finalized prompt design, detailed in Fig. 2, favors constrained classification over open-ended generation to maximize reliability across large datasets. By explicitly defining the "Rules for AV Failed", we inject domain-specific expert knowledge directly into the model's inference path, preventing the LLM from relying solely on its internal, and potentially biased, training weights regarding accident liability. The decision to use one-shot examples was made to anchor the model's understanding of the short-hand coding system (e.g., PE, PL), which serves as a compression technique to reduce latency and cost.
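The prompt assembly described above can be sketched as a single chat exchange per incident; the message layout and the abbreviated rule text are assumptions (the full prompt is shown in Fig. 2):

```python
# Sketch: assembling the one-shot ICL prompt. The system turn carries the
# persona, rules, and one-shot example; the user turn carries the unified
# Full Text field. Strings are abbreviated from Fig. 2 (assumption).
SYSTEM_PROMPT = """You are an autonomous vehicle (AV) incident analyst.
Perform all reasoning internally and only output the final structured result.
Causes: S (Sys), H (Hum), E (Env), N (None)
Systems: PE (Perc), PL (Plan), CO (Control), SW, HW, HA, N
Example: AV rear-ended while stopped at a red light.
Output: {"AV_Failed": "N", "Cause": "H", "System": "N", "Late": false}"""

def build_messages(full_text):
    """One chat exchange per incident report."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": full_text},
    ]

msgs = build_messages("ADS engaged; vehicle braked 0.5s after pedestrian entered lane.")
print(msgs[0]["role"], "+", msgs[1]["role"])
```

Keeping the rules in the system turn means only the per-incident Full Text changes between calls, which is what makes the short-hand coding (PE, PL, ...) pay off as a compression technique.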
Alternatives such as Fine-Tuning were considered but ultimately rejected due to the "black-box" nature of weights and the high computational cost of retraining when new AV failure modes are identified. Similarly, Zero-Shot prompting was found insufficient for the technical nuances of this task, as the model occasionally confused "stationary" rear-endings with "active" contributions without explicit guidance.

System Prompt: CRASH Agent

Role: You are an autonomous vehicle (AV) incident analyst. Perform all reasoning internally and only output the final structured result.

Tasks –
1. Decide if AV contributed.
2. Select primary cause.
3. Identify failed system (if S).
4. Check if AI response was late.
5. Assign secondary cause.

Rules for AV Failed –
– Moving AV action contributed → Y
– Parked/Stationary rear-ended → N (unless avoidable → Y)
– Delayed detection/reaction → Y and Late AI = true
– Insufficient info → I

Causes: S (Sys), H (Hum), E (Env), N (None)
Systems: PE (Perc), PL (Plan), CO (Control), SW, HW, HA, N

Secondary Cause Rules –
– Provide only if multiple factors; must differ from primary (S, H, E, N)

Examples
Ex 1: AV rear-ended while stopped at a red light.
Output: {"AV_Failed": "N", "Cause": "H", "System": "N", "Late": false}
Ex 2: AV failed to detect pedestrian; emergency braking engaged 0.5s after impact.
Output: {"AV_Failed": "Y", "Cause": "S", "System": "PE", "Late": true}

Fig. 2: System prompt for the CRASH Agent, incorporating heuristic rules and one-shot examples.

CRASH Taxonomy of AV Incident Causes. To enable consistent causal attribution and downstream simulation reconstruction, we develop a structured taxonomy and standardized output format tailored to AV incident analysis.
Existing taxonomies either focus narrowly on disengagement reporting or emphasize high-level behavioral abstractions, limiting their suitability for system-level reasoning and simulation-ready representations [20, 28]. In particular, prior categorizations often center on the Sense–Plan–Act (SPA) decomposition without explicitly modeling AI-specific failure modes, cross-module interactions, or environmental and human co-factors. Our taxonomy extends beyond SPA by organizing causes into three unified categories: System Failures, Human Factors, and Environmental Conditions, as depicted in Fig. 3. System failures span perception, prediction, planning/control errors, software faults, latency, delayed handover, and hardware and communication issues. Human factors cover both AV operators (e.g., inattention, premature intervention) and other road users (e.g., reckless driving). Environmental conditions include adverse roadway, weather, and complex traffic scenarios. Incidents may involve multiple interacting factors; this study emphasizes primary cause attribution, though secondary causes are also captured.

Fig. 3: Compact taxonomy of AV incident causes.

Choosing the LLM model. We evaluated models from the Qwen3 and DeepSeek families [9, 19], deployed locally via Ollama [24], seeking a model whose context window accommodates the full prompt, incident text, and structured response while consistently producing valid JSON. Decoding was fixed at temperature = 0 and top_p = 1 for determinism. Models with fewer than 14B parameters proved unreliable for structured output; 7B variants required multi-generation voting, matching the latency of a single 32B run. The best configuration was DeepSeek-R1 (32B, Q4_K_M quantization) deployed locally through Ollama, averaging ~30 seconds per case.
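The structured-output loop around the model call can be sketched as follows; the LLM call itself is abstracted behind a `generate` callable (in the paper's setup this would wrap an Ollama request with temperature = 0 and top_p = 1; the exact client code is not given, so this is an assumption):

```python
import json

# Keys every structured answer must contain (matching the Fig. 2 schema).
REQUIRED_KEYS = {"AV_Failed", "Cause", "System", "Late"}

def analyze_incident(full_text, generate, max_retries=3):
    """Query the model, validate the JSON reply, and retry on malformed output.

    `generate` is any callable mapping the incident's Full Text to a raw
    model reply string; it abstracts the local LLM call (assumption).
    """
    for _ in range(max_retries):
        raw = generate(full_text)
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry the generation
        if isinstance(result, dict) and REQUIRED_KEYS <= result.keys():
            return result
    return None  # give up after max_retries invalid replies

# Toy generator that fails once before producing a valid structured answer.
replies = iter([
    "Sure! The cause is probably human error.",  # conversational filler: rejected
    '{"AV_Failed": "N", "Cause": "H", "System": "N", "Late": false}',
])
out = analyze_incident("AV rear-ended while stopped at a red light.",
                       lambda _: next(replies))
print(out)  # {'AV_Failed': 'N', 'Cause': 'H', 'System': 'N', 'Late': False}
```

Validating against a fixed key set is what lets the pipeline stay model-agnostic: any backend that returns the Fig. 2 schema can be dropped in.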
Local deployment eliminates API costs and aids reproducibility; the pipeline remains model-agnostic and extensible to cloud-hosted models.

3.3 Postprocessing and Human-in-the-Loop Validation

The postprocessing stage serves a dual role: structuring causal distributions for quantitative analysis and assessing output quality through expert alignment. The standardized taxonomy enables reproducible aggregation and traceable mapping from narratives to structured causal representations. We also incorporate a human-in-the-loop validation protocol: domain researchers evaluated correctness, clarity, and causal consistency of outputs via a structured survey, and their aggregated feedback was used to iteratively refine prompt instructions, output constraints, and attribution guidelines. This lightweight alignment mechanism [5, 17] improves consistency and domain fidelity without updating model weights.

4 Evaluation

We evaluate CRASH as a system-level analysis pipeline to determine whether structured reasoning over unstructured crash narratives can be performed reliably, reproducibly, and at scale. The evaluation is structured across three primary axes:

1. System reliability: Validating the consistency and formatting stability of the LLM outputs (Sec. 4.1).
2. Expert agreement: Establishing a qualitative ground truth across 50 canonical incidents to measure reasoning accuracy and comparing against traditional NLP baseline heuristics (Sec. 4.2).
3. Runtime efficiency: Assessing computational inference bottlenecks and scalability compared to manual analysis (Sec. 4.3).

4.1 System Reliability

From a systems perspective, CRASH demonstrates stable, structured generation and deterministic behavior under fixed decoding parameters.
JSON outputs remain consistent, with formatting failures occurring in only 2% of cases; these were automatically corrected through a retry mechanism.

4.2 Expert Agreement

To assess output quality, five researchers from our group independently reviewed 50 representative cases sampled from the dataset. All evaluators actively work on automotive systems and collectively bring decades of academic and industry experience in autonomous driving and intelligent transportation. Each case received two independent evaluations through a 10-case overlap between reviewers. Evaluators relied exclusively on the Full Text field of each report, without external knowledge about manufacturers or prior events. They assessed four dimensions: AV responsibility, late AI response, primary cause, and failed subsystem. For each dimension, they assigned one of three labels: Correct, Incorrect, or Insufficient Context. To evaluate accuracy against these subjective human judgments, we derived a lenient, proxy "gold label" dataset. If at least one human evaluator deemed the model's output Correct, that output string was accepted as the gold label for scoring purposes. If all evaluating researchers agreed that providing a definitive answer was impossible, the case was marked Insufficient Context. This criterion reflects realistic investigative settings, where incident narratives naturally contain incomplete or ambiguous information. Under this derived gold standard, CRASH achieves 86% accuracy for AV responsibility, 84% for late AI detection, 76% for primary cause attribution, and 46% for failed subsystem identification. The results indicate strong alignment with expert interpretation on high-level causal dimensions. At the same time, subsystem classification remains more challenging because perception, planning, and control failures often interact and cannot be cleanly separated in narrative reports.
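The lenient gold-label derivation described above can be sketched as a small decision rule; the label strings are taken from the survey design, while the function name is illustrative:

```python
# Sketch of the lenient proxy gold-label rule. Each case dimension receives
# two independent expert judgments: Correct, Incorrect, or Insufficient Context.
def derive_gold(model_output, judgments):
    """Accept the model's answer as gold if any evaluator marked it Correct;
    mark the case unanswerable only if all evaluators agreed context was
    insufficient."""
    if any(j == "Correct" for j in judgments):
        return model_output
    if all(j == "Insufficient Context" for j in judgments):
        return "Insufficient Context"
    # Reviewers rejected the output but gave no agreed alternative label.
    return None

print(derive_gold("PE", ["Correct", "Insufficient Context"]))    # accepted as gold
print(derive_gold("PL", ["Insufficient Context", "Insufficient Context"]))
```

The same derived labels are later reused to score the baselines, so all methods face identical unanswerable cases.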
Human Agreement and Dataset Ambiguity. Reviewer agreement ranged from 53–67% across dimensions, highlighting the inherent ambiguity of incident narratives. Evaluators frequently assigned Insufficient Context labels, particularly for subsystem attribution, where up to 68% of cases lacked enough information for a definitive judgment. This observation reflects a structural limitation of real-world crash reports: descriptions often omit timing details, sensor behavior, and internal autonomy stack decisions. Consequently, subsystem attribution requires stronger evidence than higher-level causal interpretation, explaining the lower accuracy observed for this dimension.

Table 4: Baseline comparison on the 50 expert-evaluated cases (lenient criterion).

Method         | AV Fail | Late AI | Cause | Sys. Fail
Majority class | 54%     | 34%     | 42%   | 22%
Keyword rules  | 48%     | 46%     | 44%   | 18%
CRASH (ours)   | 86%     | 84%     | 76%   | 46%

Baseline Comparison. To determine whether CRASH genuinely performs contextual reasoning or merely mimics simpler statistical priors, we evaluate the system against two reference baselines mapped to the traditional NLP methodologies identified in Section 2:

1. Majority Class (Descriptive Statistics): A naive statistical predictor that blindly outputs the most frequent label observed across the entire 2,168-case dataset (e.g., always predicting AV Failed: Yes or Primary Cause: System).
2. Keyword Rules (Manual Extraction): A deterministic system that relies on Regular Expression heuristics searching the report's Full Text. For instance, finding "ADAS engaged" asserts AV failure, while counting terms like "rain" or "weather" asserts an environmental cause.

To ensure fairness, both baselines were scored against the same proxy gold labels generated from the human evaluation.
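The keyword-rule baseline (item 2) can be sketched as follows; the regex patterns and precedence order are illustrative assumptions, as the paper does not list its exact rule set:

```python
import re

# Sketch of the deterministic keyword-rule baseline. Patterns and their
# precedence are assumptions; only the two example cues from the text
# ("ADAS engaged", weather terms) are taken from the paper.
def keyword_baseline(full_text):
    text = full_text.lower()
    # "ADAS engaged" in the narrative asserts AV failure.
    av_failed = "Y" if re.search(r"adas engaged", text) else "N"
    # Weather terms assert an environmental cause; otherwise fall back on
    # the AV-failure flag to pick system vs. human.
    if re.search(r"\b(rain|weather|fog|snow)\b", text):
        cause = "E"
    elif av_failed == "Y":
        cause = "S"
    else:
        cause = "H"
    return {"AV_Failed": av_failed, "Cause": cause}

print(keyword_baseline("ADAS engaged; vehicle struck a barrier."))
print(keyword_baseline("Heavy rain reduced visibility before the crash."))
```

Because the rules only match surface cues, the baseline cannot weigh competing factors in a narrative, which is precisely the gap the contextual-reasoning comparison is designed to expose.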
If reviewers deemed a report to have Insufficient Context, baseline answers automatically scored zero, penalizing them equally alongside CRASH for guessing unanswerable queries. Similarly, if human reviewers marked CRASH's output as incorrect but lacked sufficient detail to formulate an opposing ground-truth label, baseline predictions were conservatively penalized. This scoring scheme mirrors the constraints under which CRASH was evaluated without artificially inflating baseline difficulty. Table 4 contextualizes the resulting performance.

CRASH consistently outperforms both baselines across all dimensions. Relative to the strongest baseline for each task, the system reduces classification error by approximately 70% for AV responsibility and late AI detection, 57% for primary cause attribution, and 31% for subsystem identification. Because the majority predictor captures dataset priors and the keyword heuristic reflects simple lexical pattern matching, the improvement over both baselines suggests that CRASH performs structured reasoning over incident narratives rather than relying solely on surface cues. Taken together, these results show that CRASH achieves strong alignment with expert interpretation on high-level causal dimensions while remaining appropriately conservative when reports lack sufficient technical detail. This behavior suggests that the system captures meaningful causal structure in crash narratives rather than relying solely on lexical patterns.

Fig. 4: Primary cause distribution and subsystem breakdown across 2,168 AV incidents.

4.3 Runtime Efficiency

Inference constitutes the primary computational bottleneck, with an average processing time of approximately 30 seconds per report on two NVIDIA A4500 GPUs (40GB VRAM).
Even with this inference cost, the system processes incident reports substantially faster than manual expert review, which often requires several minutes and repeated readings of the narrative.

5 Findings on AD Safety Incidents in NHTSA Data

We summarize in Fig. 4 the principal quantitative trends extracted from 2,168 NHTSA incident reports using CRASH. System-related failures, shown in tones of blue, dominate the dataset, accounting for 1,497 incidents (69%), far exceeding human factors (red, 27.3%) and environmental causes (green, 2.8%). This dominance is consistent with prior studies showing that perception errors and scene interpretation remain major challenges in complex urban environments. Within system-related incidents, perception failures dominate (1,245 cases), followed by planning (149), control (49), and handover (32). This trend is expected because perception modules must interpret highly variable real-world scenes, including occlusions, unusual vehicle behavior, and complex urban environments, which remain challenging even for modern sensor fusion pipelines. Hardware and software malfunctions occur only rarely, suggesting that most failures stem from algorithmic limitations in scene interpretation rather than physical component faults.

Late AI responses, shown by the outer ring in Fig. 4, appear in 57.1% of all reports and are strongly coupled with system failures: 79% of system-level failure cases occur alongside late responses. This observation is consistent with the tight real-time constraints of AV stacks, where perception delays propagate to planning and control modules, reducing the available reaction time. This pattern suggests that latency frequently amplifies localized perception or decision errors into full collision outcomes, highlighting timing as a structural constraint in AV safety.

Fig. 5: Overlap between late AI behavior, AV failures, and rear-end collisions (N = 2,168).

To further examine timing-related failures, we analyze how delayed AI behavior intersects with AV failures and collision types. Rear-end collisions account for approximately half of all incidents in the dataset. They are common in mixed-autonomy environments because AVs often adopt conservative braking policies that may not align with the expectations of human drivers following behind. While prior studies often report that surrounding drivers predominantly rear-end AVs, our structured analysis identifies 583 (26.8%) rear-end cases in which delayed perception or decision-making plausibly contributed to the outcome. As illustrated in Fig. 5, delayed detection or reaction substantially overlaps with both AV failures and rear-end collisions, reinforcing the interpretation of latency as an amplifying factor rather than an isolated fault category.

Overall, these findings indicate that perception limitations and timing-related degradation structurally dominate AV safety incidents. The dataset-level trends complement the system-level evaluation by showing how structured reasoning at scale reveals recurring latency-driven constraints within the autonomy stack.

6 Conclusion

This work presented CRASH, a reasoning-centered agent designed to structure and interpret AV incident reports at scale. By decomposing crash narratives into meaningful causal dimensions and aligning its outputs with expert judgment, CRASH enables transparent attribution of responsibility and consistent system-level analysis. Both qualitative and quantitative evaluations show that the agent reliably captures high-level causes while remaining cautious in ambiguous cases.
The human review process also highlighted the cognitive effort required to man u- ally analyze inciden t rep orts, with ev aluators frequen tly rereading cases to reach consisten t conclusions. In contrast, CRASH pro cesses each report in roughly 30 CRASH: Cognitiv e Reasoning Agent for Safety Hazards 13 seconds on mo dest hardw are, pro ducing coheren t and con text-aw are reasoning in a fraction of the time. Bey ond efficiency gains, CRASH reveals important safet y patterns. Most notably , the prev alence of timing-related failures, suc h as rear-end collisions, suggests that perception and planning latency ma y constitute a critical, un- derexamined b ottlenec k in A V safety . By systematically identifying such cross- cutting trends, CRASH mov es incident analysis b ey ond descriptive summaries to ward actionable, system-lev el insigh ts. As with any LLM-based system, se- man tic hallucination remains a risk: the mo del ma y plausibly attribute causes not fully grounded in the narrative. The constrained output schema, domain- sp ecific rules, and deterministic decoding mitigate this by limiting generative freedom, but systematic v alidation on larger annotated subsets is a priority for future w ork. Overall, this work demonstrates that LLM-based reasoning agents can meaningfully augment safety auditing in complex, text-intensiv e domains. By combining interpretabilit y , adaptabilit y , and scalable analysis, CRASH pro- vides a practical foundation for future researc h on automotiv e perception, safet y v alidation, and human–AI collab oration in safety-critical systems. References 1. Alam b eigi, H., McDonald, A.D., T ank asala, S.R.: Crash themes in automated ve- hicles: A topic mo deling analysis of the california departmen t of motor vehicles automated v ehicle crash database. arXiv preprint arXiv:2001.11087 (2020) 2. 
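The hallucination mitigation described in the conclusion, a constrained output schema combined with deterministic decoding, can be illustrated with a minimal output validator. The category vocabulary and field names below are assumptions made for illustration; they are not CRASH's actual schema.

```python
import json

# Hypothetical closed vocabulary of primary causes; the real CRASH
# schema may differ. A closed set means the model cannot invent a
# novel, ungrounded cause label.
ALLOWED_CAUSES = {"perception", "planning", "control", "other_road_user", "unknown"}
REQUIRED_FIELDS = {"summary", "primary_cause", "av_contributed"}

def validate_output(raw: str) -> dict:
    """Reject any model output that strays outside the fixed schema,
    forcing a retry (or a fallback to 'unknown') rather than silently
    accepting a hallucinated attribution."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["primary_cause"] not in ALLOWED_CAUSES:
        raise ValueError(f"cause outside schema: {data['primary_cause']!r}")
    if not isinstance(data["av_contributed"], bool):
        raise ValueError("av_contributed must be boolean")
    return data

# Deterministic decoding would be requested from the model side
# (e.g. temperature 0), so repeated runs of the same report yield
# identical attributions; the validator guards the output side.
ok = validate_output(
    '{"summary": "AV rear-ended while braking conservatively", '
    '"primary_cause": "planning", "av_contributed": true}'
)
```

Together, the two mechanisms limit generative freedom from both directions: decoding determinism removes run-to-run variance, and schema validation bounds what a single run may assert.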