ST-GraphNet: A Spatio-Temporal Graph Neural Network for Understanding and Predicting Automated Vehicle Crash Severity
Understanding the spatial and temporal dynamics of automated vehicle (AV) crash severity is critical for advancing urban mobility safety and infrastructure planning. In this work, we introduce ST-GraphNet, a spatio-temporal graph neural network framework designed to model and predict AV crash severity using both fine-grained and region-aggregated spatial graphs. Using a balanced dataset of 2,352 real-world AV-related crash reports from Texas (2024), including geospatial coordinates, crash timestamps, SAE automation levels, and narrative descriptions, we construct two complementary graph representations: (1) a fine-grained graph with individual crash events as nodes, where edges are defined via spatio-temporal proximity; and (2) a coarse-grained graph where crashes are aggregated into Hexagonal Hierarchical Spatial Indexing (H3)-based spatial cells, connected through hexagonal adjacency. Each node in the graph is enriched with multimodal semantic, spatial, and temporal attributes, including textual embeddings of crash narratives produced by a pretrained Sentence-BERT model. We evaluate various graph neural network (GNN) architectures, such as Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Dynamic Spatio-Temporal GCN (DSTGCN), to classify crash severity and predict high-risk regions. Our proposed ST-GraphNet, which utilizes a DSTGCN backbone on the coarse-grained H3 graph, achieves a test accuracy of 97.74%, substantially outperforming the best fine-grained model (64.7% test accuracy). These findings highlight the effectiveness of spatial aggregation, dynamic message passing, and multi-modal feature integration in capturing the complex spatio-temporal patterns underlying AV crash severity.
💡 Research Summary
The paper introduces ST‑GraphNet, a spatio‑temporal graph neural network framework designed to model and predict the severity of crashes involving automated vehicles (AVs). Using a balanced dataset of 2,352 real‑world AV‑related crash reports from Texas in 2024, the authors construct two complementary graph representations. The first, a fine‑grained graph, treats each individual crash as a node and connects nodes based on spatio‑temporal proximity (e.g., within a certain distance and time window). The second, a coarse‑grained graph, aggregates crashes into spatial cells defined by the Hexagonal Hierarchical Spatial Indexing (H3) system; each H3 cell becomes a super‑node, and adjacency is defined by hexagonal neighbors.
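The fine-grained edge rule above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the distance and time thresholds (`max_km`, `max_hours`) and the tuple layout are hypothetical placeholders, since the paper summary does not state the exact values.

```python
import math
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def proximity_edges(crashes, max_km=5.0, max_hours=24.0):
    """Connect crashes i and j when they fall inside BOTH thresholds.

    `crashes` is a list of (lat, lon, hours) tuples; thresholds are
    assumed values for illustration only.
    """
    edges = []
    for i, j in combinations(range(len(crashes)), 2):
        lat1, lon1, t1 = crashes[i]
        lat2, lon2, t2 = crashes[j]
        if (abs(t1 - t2) <= max_hours
                and haversine_km(lat1, lon1, lat2, lon2) <= max_km):
            edges.append((i, j))
    return edges

crashes = [
    (30.2672, -97.7431, 0.0),   # Austin, t = 0 h
    (30.2800, -97.7400, 6.0),   # ~1.4 km away, 6 h later -> edge
    (29.7604, -95.3698, 2.0),   # Houston, far away -> no edge
]
print(proximity_edges(crashes))  # [(0, 1)]
```

The coarse-grained graph replaces this pairwise rule with H3 cell membership: crashes sharing a cell collapse into one super-node, and edges come from the hexagonal neighbor relation rather than raw distances.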
Each node is enriched with multimodal features: structured attributes such as geographic coordinates, timestamps, SAE automation levels, and roadway characteristics, as well as unstructured textual narratives. The narratives are transformed into 768‑dimensional embeddings using a pretrained Sentence‑BERT model, then concatenated with the structured features and normalized. This multimodal representation enables the network to capture both quantitative traffic factors and qualitative context (e.g., “failure to control speed”).
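The fusion step described above, concatenation of structured attributes with the narrative embedding followed by normalization, can be sketched as below. The feature layout is hypothetical, and a toy 4-dimensional vector stands in for the 768-dimensional Sentence-BERT embedding; z-score normalization is one plausible reading of "normalized."

```python
import math

def fuse_node_features(structured, narrative_embedding):
    """Concatenate structured attributes with a text embedding."""
    return list(structured) + list(narrative_embedding)

def zscore_columns(rows):
    """Column-wise z-score normalization across all node vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    stds = [math.sqrt(sum((r[k] - means[k]) ** 2 for r in rows) / n) or 1.0
            for k in range(d)]
    return [[(r[k] - means[k]) / stds[k] for k in range(d)] for r in rows]

# Hypothetical structured layout: (lat, lon, hour-of-day, SAE level),
# plus a toy stand-in for the 768-dim narrative embedding.
nodes = [
    fuse_node_features((30.27, -97.74, 14.0, 2.0), (0.1, -0.3, 0.8, 0.05)),
    fuse_node_features((29.76, -95.37, 23.0, 4.0), (0.4, 0.2, -0.5, 0.30)),
    fuse_node_features((32.78, -96.80, 8.0, 3.0), (-0.2, 0.6, 0.1, -0.40)),
]
normalized = zscore_columns(nodes)
```

Normalizing after concatenation keeps the small structured block from being drowned out by the much wider embedding block during message passing.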
The authors evaluate three graph neural network architectures—Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Dynamic Spatio‑Temporal GCN (DSTGCN). DSTGCN extends static GCNs by learning a time‑varying adjacency matrix and incorporating a GRU‑based temporal module, thereby capturing evolving traffic patterns and non‑stationary spatial dependencies.
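For intuition, here is one propagation step of the static GCN building block that DSTGCN extends: H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W). This is a minimal pedagogical sketch in pure Python (real implementations use tensor libraries), not the authors' DSTGCN, which additionally learns a time-varying A and a GRU temporal module.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN step: H' = ReLU(D^{-1/2} (A+I) D^{-1/2} H W).

    adj: n x n 0/1 adjacency without self-loops; feats: n x f; weight: f x f_out.
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features: norm @ feats.
    f = len(feats[0])
    agg = [[sum(norm[i][k] * feats[k][c] for k in range(n)) for c in range(f)]
           for i in range(n)]
    # Linear transform + ReLU: relu(agg @ weight).
    out = len(weight[0])
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(f)))
             for o in range(out)] for i in range(n)]

# Three super-nodes in a line (0 - 1 - 2), identity weight for illustration.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(adj, feats, weight)
```

GAT replaces the fixed normalization coefficients with learned attention weights, and DSTGCN lets the adjacency itself change over time, which is what the paper credits for the accuracy gap.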
Experimental results show a striking performance gap between the two graph granularities. When the DSTGCN backbone is applied to the coarse‑grained H3 graph, the model achieves a test accuracy of 97.74%, far surpassing the best fine‑grained model (64.7%). GCN and GAT on the H3 graph also perform well (≈91% and ≈93%, respectively) but remain below DSTGCN, indicating that dynamic message passing is crucial for this task. Precision, recall, and F1‑score metrics follow the same trend.
Interpretability analyses—visualizing attention weights and gradient‑based importance scores—reveal that high‑risk regions correspond to intersections, highway on‑ramps, and cells with a high proportion of low‑automation‑level (SAE 1‑2) vehicles. Textual keywords such as “speed control failure” and “run‑through stop sign” receive strong attention, confirming that the model effectively leverages narrative information.
The study acknowledges several limitations: (1) the dataset is confined to a single state, raising questions about geographic generalizability; (2) class balancing was achieved by undersampling the majority “no‑injury” class, potentially discarding useful information; (3) sensitivity to H3 cell resolution was not explored; and (4) real‑time deployment considerations (model size, inference latency) are not addressed.
Future work is suggested in three directions. First, expanding the dataset to multiple regions or countries to test transferability. Second, incorporating multi‑scale graph hierarchies (e.g., road‑segment → H3 cell → city) to capture both fine‑grained interaction and broader spatial trends. Third, integrating reinforcement‑learning or probabilistic GNN extensions to provide actionable risk warnings for AV control systems and urban planners.
In summary, ST‑GraphNet demonstrates that spatial aggregation via H3 cells, dynamic spatio‑temporal message passing, and multimodal feature fusion together yield a highly accurate and interpretable model for AV crash‑severity prediction. The approach advances beyond traditional statistical or static machine‑learning methods, offering a promising tool for safety‑critical decision‑making in the emerging era of automated transportation.