Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Mayur Patil¹, Qadeer Ahmed¹, Shawn Midlam-Mohler¹, Stephanie Marik², Allen Sheldon³, Rajeev Chhajer³, Nithin Santhanam³

Abstract—Reliable multi-horizon traffic forecasting is challenging because network conditions are stochastic, incident disruptions are intermittent, and effective spatial dependencies vary across time-of-day patterns. This study is conducted on the Ohio Department of Transportation (ODOT) traffic count data and corresponding ODOT crash records. This work utilizes a Spatio-Temporal Transformer (STT) model with Adaptive Conformal Prediction (ACP) to produce multi-horizon forecasts with calibrated uncertainty. We propose a piecewise Coefficient of Variation (CV) strategy that models hour-to-hour travel-time variability using a log-normal distribution, enabling the construction of a per-hour dynamic adjacency matrix. We further perturb edge weights using incident-related severity signals derived from the ODOT crash dataset (incident clearance time, weather conditions, speed violations, work zones, and roadway functional class) to capture localized disruptions and peak/off-peak transitions. This dynamic graph construction replaces a fixed-CV assumption and better represents changing traffic conditions within the forecast window. For validation, we generate extended trips via multi-hour loop runs on the Columbus, Ohio, network in SUMO simulations and apply a Monte Carlo simulation to obtain travel-time distributions for a Vehicle Under Test (VUT). Experiments demonstrate improved long-horizon accuracy and well-calibrated prediction intervals compared to other baseline methods.

Index Terms—Traffic flow prediction, adaptive adjacency matrices, spatio-temporal transformer, uncertainty quantification
I. INTRODUCTION

URBAN traffic networks operate under continual incidental changes such as ever-changing weather, road constructions, breakdowns, crashes, special events, and heterogeneous driver behaviors. Due to these variables, forecasting models must adapt to changes that can occur within short time periods throughout the day. According to the 2024 INRIX Global Traffic Scorecard [1], an average driver in the U.S. spends about 43 hours stuck in traffic, resulting in an estimated loss of $771 in time for each driver. New York and Chicago ranked the highest, with drivers in those cities losing 102 hours each, highlighting the severity of the issue. At the same time, roadway safety remains a major concern; the most recent national crash totals are reported for 2023 (6.14 million police-reported crashes, 40,901 fatalities, and about 2.44 million injuries) [2]. Full-year crash totals for 2024 are not yet finalized; however, NHTSA's early estimates indicate that about 39,345 people died in motor vehicle traffic crashes in 2024 [3]. These numbers motivate predictive models that are aware of incident-driven disruptions in traffic operations, beyond recurrent congestion patterns. Weather further complicates operations by amplifying delays and risks during peak periods. Federal guidance indicates that adverse weather is the second-largest source of non-recurrent congestion, where even light rain can inflate travel-time delays by 12-20% [4].

¹Mayur Patil, Qadeer Ahmed, and Shawn Midlam-Mohler are with the Department of Mechanical and Aerospace Engineering and the Center for Automotive Research, The Ohio State University (e-mail: patil.151@osu.edu; ahmed.358@osu.edu; midlam-mohler.1@osu.edu). ²Stephanie Marik is with the Ohio Department of Transportation. ³Allen Sheldon, Rajeev Chhajer, and Nithin Santhanam are with Honda Research Institute USA, Inc.
The practical need is clear: agencies and mobility providers require reliable, long-horizon forecasts with calibrated confidence to plan operations under variability and disruptions. From a modeling standpoint, long-horizon/multi-hour traffic forecasting has always been a challenge because traffic itself evolves throughout the day (off-peak to peak hour and back). Furthermore, exogenous factors such as roadway incidents, weather, and construction work disrupt traffic flow patterns. Conventional sequence models, such as recurrent neural networks (RNNs), are capable of handling temporal features, but they often run into issues with error accumulation and have less freedom to perform parallel processing over longer time periods. Transformer-based designs, by contrast, use a self-attention mechanism to capture long-range temporal relationships and can be extended to include spatial relationships. Recent developments in transportation applications show that the attention-based approach can improve multi-hour accuracy and stabilize error by explicitly learning which time steps and locations matter most as the horizon extends [5]-[7]. The purpose of leveraging a transformer model for long-horizon forecasting is to look far back in history without encountering the vanishing-gradient problem. This provides an adaptable mechanism to share information spatially and scale it efficiently using parallel attention. Simultaneously, long-horizon accuracy significantly depends on how the spatial dependencies are represented, as the relevance of locations can vary with time-of-day variability and disruptions caused by exogenous incidents. In our earlier work [8], we leveraged a Graph Attention Network with Long Short-Term Memory (GAT-LSTM) with Adaptive Conformal Prediction (ACP), constructing the spatial adjacency matrix A_ij from a log-normal distribution representing travel-time information.
The log-normal distribution is widely used to represent travel-time stochasticity and has been reported to fit observed travel times across operating regimes in both roadway and transit settings [9], [10]. We parameterized the distribution using a fixed coefficient of variation (CV), drawing random travel-time values to make A_ij dynamic in nature. In essence, it governed the dispersion of sampled travel times that we mapped onto the edge weights of the GAT, and the same graph effectively governed all hours, as shown in Figure 1a. In practice, however, the fixed-CV assumption implicitly treats the network's variability as time-invariant. In this work, we replace the fixed-CV assumption with a piecewise design by estimating an hour-of-day CV profile CV(h) from historical traffic flow data (higher CV represents peak times, while lower CV represents off-peak/shoulder times). The intent is to sample log-normal edge travel times for each hour h using CV(h), then introduce an incident-aware perturbation using a crash-derived severity signal calculated from incident clearance time, weather factors, speed violations, work zones, and roadway functional class. To set the correlation scale, we estimate edge-wise correlations between the sampled travel times and the crash-derived attributes. We then modulate the adjusted travel times into hour-conditioned adjacency weights A_ij(h) through a decaying kernel and normalization. This yields a set of 24 hour-conditioned graphs A_ij(h) fed to the model, where each training window carries its hour tag and applies the corresponding graph, as visualized in Figure 1b. To quantify uncertainty in our predictions, we retain the ACP methodology, leveraging the conformal calibration core to provide reliable confidence bounds.
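The hour-of-day CV profile described above can be estimated directly from historical data. The sketch below is our own minimal illustration, not the paper's exact implementation; the function name, the clipping bounds, and the sample-standard-deviation estimator are assumptions:

```python
import numpy as np

def hourly_cv_profile(timestamps_hour, travel_times, cv_min=0.1, cv_max=1.0):
    """Estimate CV(h) for h = 0..23 from historical observations.

    timestamps_hour : array of ints, hour-of-day of each observation
    travel_times    : array of floats, observed travel times (same length)
    Returns a length-24 array, clipped to [cv_min, cv_max].
    """
    timestamps_hour = np.asarray(timestamps_hour)
    travel_times = np.asarray(travel_times, dtype=float)
    cv = np.full(24, cv_min)
    for h in range(24):
        x = travel_times[timestamps_hour == h]
        if len(x) > 1 and x.mean() > 0:
            # coefficient of variation: std / mean for that hour's samples
            cv[h] = x.std(ddof=1) / x.mean()
    return np.clip(cv, cv_min, cv_max)
```

With such a profile, a dispersed evening-peak hour produces a large CV(h) while a quiet early-morning hour is clipped to the lower bound, matching the piecewise behavior described in the text.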
To accomplish this task, we adopt a Spatio-Temporal Transformer (STT) architecture for efficient, parallel sequence modeling and integrate hour-conditioned graphs derived from the proposed hour-conditioned adjacency matrices. The contributions of this paper are as follows:

• We derive an hourly CV profile from historical traffic flow data, where each hour's CV parameterizes a log-normal distribution sample of travel times. This yields a set of hour-conditioned adjacency matrices A_ij(h) for multi-hour forecasting.
• We introduce crash-aware perturbations of edge weights using incident-related severity signals derived from a crash dataset (incident clearance, weather factors, work zone areas, speed violations, and roadway functional class), enabling localized disruption sensitivity in the dynamic graph.
• We retain ACP for finite-sample coverage guarantees on forecasts, yielding reliable confidence around forecasted values.
• We run multi-hour SUMO simulations via Monte Carlo sampling for a Vehicle-Under-Test (VUT) trajectory and validate it against INRIX travel-time data.

The rest of the paper is organized as follows: Section II summarizes prior work on traffic flow forecasting. Section III presents the proposed framework and methodology. Section IV reports the experimental setup and comparative results against baseline models. Section V concludes the paper and outlines future research directions.

Fig. 1. Static vs. piecewise graph priors: 1a) one fixed CV ĉ drives a log-normal travel-time prior; the resulting adjacency A_ij is reused across hours; 1b) an hourly profile CV(h) yields hour-conditioned priors and distinct adjacencies A_ij(h). Edge colors/thicknesses are schematic.

II. RELATED WORK

A. Traffic Forecasting Models and Graph Learning

In the past, traffic forecasting was conducted using conventional univariate and multivariate time-series models that considered each sensor stream independently.
Because of its simplicity and interpretability, the original Box-Jenkins family of AutoRegressive Integrated Moving Average (ARIMA) models continues to be a typical baseline for short-term forecasting [11]. Similarly, exponential-smoothing methods such as Simple Exponential Smoothing (SES) [12], Holt's linear trend method [13], Holt-Winters seasonality [14], and Error-Trend-Seasonality (ETS) state-space models [15] are other common baselines for short-term traffic speed/flow prediction [16]. However, these approaches are limited to short-horizon prediction: their linear dynamics struggle with nonlinear congestion patterns, they do not encode spatial interactions, and their mechanisms are too restrictive to handle disruptions such as road incidents, demand surges, and weather.

The field advanced significantly when deep sequence models were introduced, as they were shown to be capable of learning nonlinear dynamics directly from the data. Early stacked feedforward and autoencoder-based models showed good performance for short-term traffic forecasting [17]. Recurrent neural networks (RNNs) like LSTM/GRU variants improved multi-step prediction by maintaining temporal state and reducing the vanishing-gradient problem compared to vanilla RNNs [18], [19]. Long-horizon scenarios, however, may result in the accumulation of error due to repeated rollouts, and unless explicitly introduced by design, spatial relationships remain implicit. The bottleneck for long-horizon forecasting is that the model has to decide "what to remember" from the history and "how to roll forward" without "losing attention". First, many temporal modeling approaches were proposed specifically to reduce this distraction.
For instance, Seq2Seq encoder-decoder models addressed this problem by separating historical encoding from future decoding, so that the forecast depends on a concise representation of the historical data with the option to attend back to it when needed. Similarly, convolution-based approaches like LSTNet combine temporal convolutions with recurrent skip connections to capture important patterns at multiple temporal scales [20], whereas Temporal Convolutional Networks (TCNs) use dilated causal convolutions to model long-horizon aspects without distinct recurrent states [21]. Furthermore, attention-augmented recurrent models like the dual-stage attention-based recurrent neural network (DA-RNN) modify the attention weights based on the current context, hence handling off-peak and peak condition changes [22].

Second, spatio-temporal forecasting models improved when they coupled the temporal model with an explicit graph-based framework to share information spatially. The key idea is that the model should be able to pass messages between correlated road segments rather than expecting the temporal module to infer spatial relationships. Early baselines implemented this by integrating a temporal module with graph convolutions on a graph prior. For instance, STGCN fuses temporal convolutions with graph convolutions to capture general space-time relationships [23]. The DCRNN approach proposed a diffusion process on a directed graph together with a gated recurrence to modify node states over time [24], while the T-GCN approach similarly integrates convolution-based spatial integration with gated temporal dynamics [25]. Furthermore, an attention-based extension was put forth by the ASTGCN model, which relaxed rigid neighborhood influence by learning spatial-temporal attention, thus capturing periodic information (hourly/daily/weekly) and heterogeneous node impacts [26]. Beyond these common baselines, many variants refine the temporal module, the graph structure, or both.
For example, traffic graph convolutional recurrent designs bolster the prediction framework by explicitly communicating information along the graph edges rather than over-utilizing the temporal module to re-learn spatial information from scratch [27]. Spatio-temporal GNNs then matured by introducing dynamics into the construction of the adjacency matrix, shifting from a static distance-based adjacency to a more adaptive adjacency learned from data. Graph WaveNet introduced a data-driven spatial module that learns the adjacency matrix alongside dilated causal convolutions, greatly reducing error accumulation across multi-step horizons in the case of incomplete/misspecified sensor information [28]. Other models, such as GMAN, used an encoder-decoder attention architecture to learn when and where to accumulate information, replacing the custom-built diffusion layer with learned spatio-temporal aggregation [29]. A parallel line of work studied explicitly dynamic graphs, e.g., dynamic graph convolution formulations for multi-step traffic forecasting [30], multi-weight graph architectures that put together multiple information types [31], and graph-learning STGCN variants that treat the adjacency matrix as part of training [32]. Other work emphasized fusing various features/attributes, integrating multiple aspects of the network state instead of relying on a single traffic attribute, to improve robustness under dynamic operating conditions [33]. Interestingly, meta-learning has also been explored to improve transferability across networks, e.g., spatio-temporal meta-learning frameworks capable of generalizing under distribution shifts [34], and multivariate time-series graph forecasting models such as MTGNN that learn graph structure and message passing specific to high-dimensional correlated series [35].
Despite significant progress, most of the above studies considered short-horizon forecasting (typically 15-60 minutes) and/or very smooth connectivity assumptions. Multi-hour forecasting (e.g., predicting 2, 4, 8, or 12 hours ahead), on the other hand, is harder because the spatial influence pattern is not time-invariant across the day: peak build-up, peak dissipation, and localized disruptions all affect which links dominate the changes. This is precisely the gap we target. We prioritize multi-hour forecasting by providing hour-conditioned, variability-aware spatial information instead of a single static graph, so that long-horizon performance is not driven purely by extrapolation from a stationary connectivity assumption.

B. Long-horizon forecasting

Only a small but rising number of papers have tackled the long-horizon forecasting problem. The Graph Pyramid Autoformer family (X-GPA/GPA) was proposed to explicitly evaluate multi-hour settings (e.g., 2/4/8/12 hours), using autocorrelation-based attention with a pyramidal multi-scale temporal representation to stabilize long-range shifts [36], [37]. Similarly, long-term graph learning studies work in the same avenue and support the claim that extending the forecasting horizon beyond several hours requires stronger inductive bias to prevent model drift, including formulations that explicitly evaluate long-range prediction on graphs [38], [39]. The latest works push design capability and extend prediction horizons. One such model, TrafFormer, specifically defines the task of predicting traffic up to 24 hours ahead using a Transformer framework tailored for long-term sequences [40]. Another model, HUTFormer, similarly performs long-term forecasting beyond 1 hour, with an outlook of up to 1 day, by leveraging hierarchical multi-scale representations to handle long input/output horizons comprehensively [41].
Additionally, system-oriented work such as Foresight Plus focuses on spatio-temporal traffic forecasting in a serverless setting and discusses practical approaches for running the latest long-horizon models in a production pipeline [42]. Another approach uses common traffic baselines to assess robustness in multi-hour regimes explicitly: using roughly 2-hour and 4-hour prediction settings on 5-minute datasets (e.g., PEMS-BAY), SpikeSTAG reports long-horizon results, emphasizing that multi-hour horizons reveal vulnerabilities in short-horizon-tuned architectures and that performance gains are frequently linked to maintaining cross-node propagation structure as the horizon expands [43]. Lastly, a GAT-Informer model was proposed that combines graph attention with Informer-like temporal modeling to perform tests at long horizons (e.g., up to 2 hours on a 15-minute dataset) [44].

C. Transformer Architectures for Traffic Forecasting

The introduction of self-attention marked a turning point for sequence modeling. Transformer architectures showed that, instead of relying on recurrent cells or fixed-receptive-field convolutions, the standard Transformer design [45] can use multi-head attention to connect distant time steps and locations in a single layer, making these models especially attractive for multi-hour horizons and highly non-stationary urban regimes. The Traffic Transformer approach [46] was one of the first traffic-specific Transformer designs, extensively exploiting the continuity and periodicity of traffic time series. The architecture simultaneously models spatial correlations between the traffic stations and incorporates temporal aspects into an encoder module.
Large-scale urban dataset experiments clearly outperform RNN/CNN baselines, especially when long-range periodic patterns are critical. Along the same lines, the Spatio-Temporal Transformer Networks (STTN) model decoupled spatial and temporal processing into separate Transformer modules that handle various traffic sensor graphs [47]. Here, the spatial module captures dynamic multi-head dependencies between the nodes, and the temporal module handles the long-range time-series history. This encoder-decoder design showed state-of-the-art performance on the widely used PeMS dataset, particularly at long horizons where small temporal modules and static graphs performed inadequately.

A dynamic, hierarchical spatial-temporal Transformer formulation was also proposed [48], stacking Transformer blocks with hierarchical attention patterns to capture short-term (local) interactions and long-term (global) congestion evolution simultaneously. The outcome was an architecture that outperforms STGCN/DCRNN baselines on multi-step traffic speed prediction and also provides interpretable attention maps showing which roads and time periods influence performance. A bidirectional spatial-temporal adaptive Transformer (Bi-STAT) was later proposed, applying self-attention in both forward and backward temporal directions and, notably, making the spatial attention pattern completely adaptive instead of using a fixed adjacency matrix [49]. This raised particular interest in how and why to integrate incidents, lane closures, and other regime shifts, which is critical for capturing real-world behavior.

A number of studies focused on improving the temporal module of the Transformer architecture for traffic flow prediction. Specifically for traffic flow states, a Transformer-based deep neural network was proposed that enhanced the fundamental self-attention process along with the embedding and decoder structure [50].
To better capture complex spatio-temporal correlations while maintaining efficiency, a Fast Pure Transformer Network (FPTN) was proposed that uses sequential traffic data and introduces multiple embeddings for sensor position and temporal/positional aspects [51]. An improved version of the Transformer for traffic flow prediction was later proposed, modifying the input embedding and temporal attention to explicitly model both long-range and short-range dependencies [52]. Then, a Transformer-based short-term traffic forecasting model was introduced that systematically compared attention-based designs with conventional deep networks, confirming that self-attention can produce more stable performance across different congestion levels and short horizons [53].

Transformers and graph-based spatial modeling are closely integrated in another study [54], which introduces a spatial-temporal Transformer network that improves traffic flow prediction using a pre-trained language model, injecting contextual information through Transformer blocks along with spatio-temporal inputs. ST-TransNet, a spatial-temporal Transformer-based traffic flow prediction model, was a distinct model for bridge networks that jointly encodes correlations among bridges and temporal dynamics using stacked self-attention layers [55]. Furthermore, a graph-enhanced spatio-temporal Transformer was introduced that integrates graph convolutions with a Transformer layer to learn spatial representations and attention-based temporal modeling [56]. Another approach took this further by separating graph-based spatial encoding and Transformer-based temporal encoding, then recombining them [57].
Around the same time, a hybrid model was proposed that incorporated a spatio-temporal Transformer with graph convolutional networks (GCNs), showing that joint attention is much better at capturing regional interactions along with temporal dependencies [58]. As Transformer-based approaches multiplied, the Transformer-Enhanced Adaptive Graph Convolutional Network (TEA-GCN) was introduced, where Transformer modules enhance adaptive graph learning, allowing the spatial framework to be refined in a data-driven way during training [59]. At a more comprehensive scale, a learnable long-range graph Transformer (LLGformer) was published that improves traditional Transformer-style models with techniques intended for more efficient learning of long-horizon dependencies in traffic flow data on large graphs [60].

Across the above studies, we find three common points: i) multi-head self-attention offers a flexible temporal mechanism to extract longer/multi-scale histories, reducing the vanishing-gradient problem and the long-range training difficulties of RNNs and TCNs; ii) to allow cross-link effects to change with time and traffic conditions rather than being fixed by a static graph, many topologies moved from fixed, hand-crafted adjacency matrices to data-driven or adaptive spatial attention; and iii) in multi-step rollouts, encoder-decoder designs and partially parallel decoding techniques reduce exposure bias and error accumulation. However, the majority of current works still treat the spatial structure as relatively slowly varying, and they rarely encode variability caused by exogenous factors directly into the spatial module. In this work, we leverage the Transformer-based framework but fuse the encoder-decoder architecture with a variability-aware, piecewise, hour-conditioned graph derived from coefficient-of-variation choices.
By injecting this connectivity into the Transformer architecture, the model has stochasticity available from the adaptive adjacency matrix purely from a data-driven source.

D. Incident context features

Non-recurrent congestion caused by crashes is a notable source of forecasting error, whether short-term or long-term. Its consequences depend largely on factors such as the crash/incident clearance time, weather/road conditions at the time of the crash, the type of roadway (functional class), construction or work-zone disruption, and speed infractions [61]. As crash incident response and recovery can last from minutes to hours, the effects are especially relevant for multi-hour horizons, where traffic states can transition from off-peak to peak to disrupted conditions within a single forecast window.

As a result, a substantial amount of research directly predicts event dynamics, mostly through the prediction of incident duration. Statistical and machine learning techniques such as hazard models, tree-based models, mixture models, and deep learning architectures can predict the duration of events and quantify the impact of related factors, as shown by recent studies and review works [62]-[64]. These studies repeatedly emphasize that characteristics controlling how disruptions develop, such as response situations and clearance-related variables, influence incident impacts in addition to the event itself. This body of research encourages handling incident-related data as a prime signal rather than as a rare event that a general forecasting model should automatically account for.

Simultaneously, road-safety analytics have constructed rich models for crash frequency, crash risk, and collision severity. These models explicitly use exogenous descriptors such as weather, roadway class/type, construction zones, and speed-related variables.
The occurrence and impact of crashes are structured in both location and time, as shown by spatio-temporal crash modeling and risk mapping [67], [68]. Weather and highway context (including road hierarchy/functional class) are among the most relevant factors for explaining crash outcomes, according to severity-prediction studies [69]-[72]. Speed-related factors are of particular interest: NHTSA reports that speeding contributed to almost 29% of traffic fatalities in 2023, emphasizing why speed-violation indicators are important when characterizing crash severity [65]. Transportation officials have long stressed that bad weather exacerbates both safety and mobility problems; as noted earlier, travel time can increase by 12-20% [4], making adverse weather a leading contributor to non-recurrent delays.

To improve traffic speed prediction during disruptions, [66] employs event detection and representation learning to convert raw incident information into latent features. These factors are frequently integrated with graph-based models to capture propagation patterns. Although these models confirm that incident factors lower error during disruption periods, incident information is typically incorporated as additional inputs in the form of time series or learned embeddings, rather than being coupled explicitly with the spatial dependency structure.

In this study, we combine crash context features with a dynamic spatial structure. We employ incident clearance time, weather, speed violations, work zone areas, and functional class to construct an incident-severity signal that modifies edge weights in an hour-conditioned adjacency matrix. This approach aligns with the logic of the above studies, where disruption impact depends on contextual factors and its network influence is reflected in the effective connectivity used for prediction.
By embedding these disruption signals into the spatial module itself, the forecasting model begins each window with a connectivity that is already biased towards the expected propagation patterns, rather than learning all such effects purely from historical time series.

E. Uncertainty Quantification

Uncertainty quantification is critical for multi-hour traffic forecasting, as uncertainty is state-dependent and increases with the horizon window. Crash-related disruptions (and their associated context, such as weather, speed-related behavior, work zones, and roadway functional class) can introduce abrupt distribution shifts that are not well represented by point forecasts alone. Consequently, systems require calibrated confidence bounds in addition to point estimates. Earlier work in traffic forecasting has quantified uncertainty using Bayesian or approximate-Bayesian methods, quantile-based models, and resampling or ensemble strategies. Bayesian learning formulations have been applied to traffic speed prediction with uncertainty estimates [73]. Other practical approaches, such as Monte Carlo dropout [74] and deep ensembles [75], are widely used because they can reuse existing neural forecasters, but they come at the cost of increased computation and struggle to provide coverage guarantees under distribution shift.

To handle distribution shifts, conformal prediction (CP) offers a complementary, distribution-free route by calibrating interval widths from validation residuals, resulting in finite-sample coverage guarantees [76]. The CP family includes conformalized quantile regression (CQR), which improves efficiency by conformalizing learned quantiles [77]; for non-stationary settings, adaptive conformal methods were developed to update calibration online and maintain coverage under drift [78], [79].
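As a concrete illustration of the adaptive conformal idea, the sketch below follows the online update α_{t+1} = α_t + γ(α − err_t) from the adaptive conformal inference literature. It is a simplified stand-in under our own assumptions (symmetric intervals from absolute residuals, a fixed warmup window), not the exact calibration used in this work:

```python
import numpy as np

def acp_intervals(preds, actuals, alpha=0.1, gamma=0.01, warmup=50):
    """Online adaptive conformal intervals (simplified sketch).

    After a warmup of absolute residuals, each step emits pred +/- q,
    where q is the (1 - alpha_t) empirical quantile of past residuals,
    and alpha_t is nudged online toward the target miscoverage alpha:
        alpha_{t+1} = alpha_t + gamma * (alpha - err_t),
    with err_t = 1 if the actual fell outside the interval, else 0.
    """
    residuals = list(np.abs(preds[:warmup] - actuals[:warmup]))
    alpha_t = alpha
    lo, hi, covered = [], [], []
    for p, y in zip(preds[warmup:], actuals[warmup:]):
        q_level = min(max(1.0 - alpha_t, 0.0), 1.0)
        q = np.quantile(residuals, q_level)
        l, u = p - q, p + q
        err = 0.0 if l <= y <= u else 1.0
        alpha_t += gamma * (alpha - err)      # widen after misses, shrink after hits
        residuals.append(abs(p - y))
        lo.append(l)
        hi.append(u)
        covered.append(1.0 - err)
    return np.array(lo), np.array(hi), float(np.mean(covered))
```

The appeal of this update is that empirical coverage tracks the 1 − α target even when the residual distribution drifts, which is the property motivating its use for incident-driven traffic.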
In this work, we retain an adaptive conformal prediction (ACP) calibration layer while strengthening the base forecaster with crash-informed, time-varying graph connectivity. This pairing is adopted for incident-driven traffic: the dynamic graph reduces misspecification during disruptions, while ACP provides reliability-controlled uncertainty intervals for the residual uncertainty.

III. PROPOSED METHODOLOGY

A. Graph Formulation and Data Availability

Following our prior formulation [8], we model the sensor network as a directed graph G = (V, E), where nodes are stations and edges are feasible directed links. There are two types of stations: i) Continuous Count Stations (CCS), which provide traffic data in a rich manner (5/15-minute intervals, every day of the year), and ii) Non-Continuous Count Stations (N-CCS), which provide data sparsely (perhaps for only a few days/months). Reiterating the approach in this work, each node i is assigned a data-availability score a_i ∈ [0, 1]:

\[
a_i =
\begin{cases}
1, & i \text{ is CCS} \\[4pt]
\dfrac{C_i}{\max_j C_j}, & i \text{ is N-CCS}
\end{cases}
\tag{1}
\]

where C_i is the count observed at N-CCS node i and max_j C_j is the maximum count observed among all N-CCS stations. We then form an edge-wise reliability mask:

\[
A_{\text{avail}}[i, j] = a_i \, a_j
\tag{2}
\]

We apply A_avail as a multiplicative mask on the learned hour-conditioned adjacency in later sections. (Note that E denotes the edge set, distinct from ε > 0, which is used throughout the paper as a tiny constant to avoid division by zero.)

B. Hourly Adaptive Adjacency Matrix Formulation

Treating network variability as time-variant is one of the main objectives of this work. Practically, the spread of travel times increases significantly during peak hours and decreases during off-peak hours. To capture this effect, we construct an hour-of-day coefficient-of-variation profile:

\[
CV(h) \in [0.1, 1.0], \quad h \in \{0, 1, \ldots, 23\}
\tag{3}
\]

estimated from historical flow data.
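The data-availability score and edge-wise reliability mask of Equations (1) and (2) above can be computed directly. The following is a minimal sketch (function names are our own, not from the paper):

```python
import numpy as np

def availability_scores(is_ccs, counts):
    """Node availability a_i per Eq. (1): 1 for CCS nodes, otherwise
    C_i / max_j C_j taken over the N-CCS nodes."""
    is_ccs = np.asarray(is_ccs, dtype=bool)
    counts = np.asarray(counts, dtype=float)
    a = np.ones(len(is_ccs))
    nccs = ~is_ccs
    if nccs.any():
        a[nccs] = counts[nccs] / counts[nccs].max()
    return a

def availability_mask(a):
    """Edge-wise reliability mask A_avail[i, j] = a_i * a_j per Eq. (2)."""
    return np.outer(a, a)
```

Applied multiplicatively, the mask down-weights edges whose endpoints have sparse N-CCS coverage while leaving CCS-to-CCS edges untouched.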
Understandably, low-traffic conditions such as late night correspond to a small CV(h), and heavy-traffic periods such as the morning/evening peaks correspond to a larger CV(h). For instance, very early-morning hours such as 03:00-05:00 typically fall near $\mathrm{CV}(h) \approx 0.15$-$0.25$ (free flow with small spread), the mid-morning and early-afternoon periods around 08:00-10:00 and 13:00-14:00 rise to $\mathrm{CV}(h) \approx 0.35$-$0.50$, and the main peaks around 16:00-18:00 can reach $\mathrm{CV}(h) \approx 0.70$-$0.90$. After the evening peak, the coefficient gradually drops again, with late-night hours (21:00-01:00) returning toward $\mathrm{CV}(h) \approx 0.20$-$0.30$.

Let the baseline mean travel time between stations $i$ and $j$ be $T^{\text{mean}}_{ij}$. For a given hour $h$, we model the stochastic travel time $T^{(h)}_{ij}$ as log-normal:

$$T^{(h)}_{ij} \sim \mathrm{LogNormal}\left(\mu^{(h)}_{ij,\ln}, \; \sigma^{(h)}_{\ln}\right) \quad (4)$$

with the constraint that the distribution has mean $T^{\text{mean}}_{ij}$ and coefficient of variation $\mathrm{CV}(h)$. For hour-specific cases, we compute the parameters of the log-normal distribution as:

$$\sigma^{(h)}_{\ln} = \sqrt{\ln\left(\mathrm{CV}(h)^2 + 1\right)} \quad (5)$$

$$\mu^{(h)}_{ij,\ln} = \ln\left(T^{\text{mean}}_{ij}\right) - \tfrac{1}{2}\left(\sigma^{(h)}_{\ln}\right)^2 \quad (6)$$

We then draw samples from $T^{(h)}_{ij}$ and assemble them into an hour-specific travel-time matrix; repeating this for all $h \in \{0, \ldots, 23\}$ produces a piecewise (hour-indexed) collection of travel-time adjacency matrices:

$$\left\{ T^{(h)} \right\}_{h=0}^{23} \quad (7)$$

where each $T^{(h)}$ is explicitly aligned with the hour's variability via $\mathrm{CV}(h)$. This matrix collection serves as the stochastic travel-time "prior" that is later converted into hour-conditioned adjacency weights, enabling the spatial connectivity used by the model to change with time-of-day rather than remaining fixed across the entire forecasting window.
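A draw from this hour-specific log-normal prior, matching the parameterization of Eqs. (5)-(6), can be sketched as follows (a minimal illustration under our own naming; not the paper's code):

```python
import numpy as np

def sample_hourly_travel_times(T_mean, cv_h, rng=None):
    """Draw one hour-specific travel-time matrix T^(h).

    T_mean : (N, N) baseline mean travel times T^mean_ij
    cv_h   : scalar CV(h) for the hour
    Uses sigma_ln = sqrt(ln(CV^2 + 1)) and
    mu_ln = ln(T_mean) - sigma_ln^2 / 2 (Eqs. 5-6), so each draw has
    mean T_mean and coefficient of variation CV(h).
    """
    rng = np.random.default_rng() if rng is None else rng
    T_mean = np.asarray(T_mean, dtype=float)
    sigma_ln = np.sqrt(np.log(cv_h ** 2 + 1.0))
    mu_ln = np.log(T_mean) - 0.5 * sigma_ln ** 2
    return rng.lognormal(mean=mu_ln, sigma=sigma_ln)
```

Averaging many such draws recovers the baseline mean travel times, while the per-hour spread scales with CV(h).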
C. Crash Attributes Integration in Adjacency Matrices

We use crash data from the Ohio Department of Transportation (ODOT), which are aligned in space and time (along with their attributes) with the ODOT traffic count dataset. This enables their aggregation into the hour-conditioned risk signals used in our adjacency construction. Each crash record $n$ provides a location, timestamp, and incident clearance time, followed by contextual attributes such as weather/functional-class/work-zone codes, vehicle speed, and posted speed limit. Table I summarizes the notation used throughout the paper for these attributes.

TABLE I
CRASH DATA AND NOTATIONS

Symbol          Description                      Units / Type
(φ_n, λ_n)      latitude/longitude               degrees
t_n             timestamp                        datetime
C_n             incident clearance time          minutes
ω_n             weather condition code           categorical
f_n             roadway functional class code    categorical
v_n             observed vehicle speed           mph (or km/h)
v_n^lim         posted speed limit               mph (or km/h)
z_n             work-zone indicator              {0, 1}

The first step is spatial mapping, because crash locations do not coincide exactly with the traffic stations. We therefore associate each crash with its nearest station in $\mathcal{V}$:

$$\pi(n) = \arg\min_{i \in \mathcal{V}} d\left((\varphi_n, \lambda_n), (\varphi_i, \lambda_i)\right) \quad (8)$$

where $d(\cdot, \cdot)$ denotes a distance metric (e.g., geodesic distance or a local Euclidean approximation). We then assign each crash to its hour-of-day, which aligns the crash aggregation with the hour-conditioned travel-time matrices $\{T^{(h)}\}_{h=0}^{23}$ defined in Equation 7.

Our goal is not to treat each attribute as an independent factor in the adjacency matrix, but to build a single combined severity signal that reflects how disruptive a crash can be, via the following formulation.

Incident Clearance Time (ICT) factor: Let $\bar{C}$ denote the global mean clearance time across the crash dataset. For crash $n$:

$$c_n = \frac{C_n}{\bar{C} + \varepsilon} \quad (9)$$
Speed violation factor: We define a non-negative overspeed ratio as:

$$r_n = \max\left(0, \; \frac{v_n - v^{\lim}_n}{v^{\lim}_n + \varepsilon}\right) \quad (10)$$

and normalize it using the global mean $\bar{r}$:

$$\tilde{r}_n = \begin{cases} \dfrac{r_n}{\bar{r} + \varepsilon}, & \bar{r} > 0 \\ 1, & \bar{r} = 0 \end{cases} \quad (11)$$

Weather and functional-class factors: We formulate these factors by comparing mean clearance times by category. Let $\mathbb{E}[C \mid \omega]$ be the mean clearance for weather code $\omega$, and similarly $\mathbb{E}[C \mid f]$ for functional class $f$. We then define:

$$m^{\omega}_n = \frac{\mathbb{E}[C \mid \omega_n]}{\bar{C} + \varepsilon}, \qquad m^{f}_n = \frac{\mathbb{E}[C \mid f_n]}{\bar{C} + \varepsilon} \quad (12)$$

The conditional means are computed empirically as:

$$\mathbb{E}[C \mid \omega] \approx \frac{1}{N_{\omega}} \sum_{n : \omega_n = \omega} C_n \quad (13)$$

$$\mathbb{E}[C \mid f] \approx \frac{1}{N_{f}} \sum_{n : f_n = f} C_n \quad (14)$$

where $N_{\omega}$ and $N_{f}$ are the numbers of crashes in each weather and functional-class group.

Work-zone factor: The work-zone indicator $z_n \in \{0, 1\}$ is integrated using:

$$m^{z}_n = \frac{\mathbb{E}[C \mid z_n]}{\bar{C} + \varepsilon} \quad (15)$$

$$\mathbb{E}[C \mid z] \approx \frac{1}{N_{z}} \sum_{n : z_n = z} C_n \quad (16)$$

where $N_{z}$ denotes the number of crashes in work-zone group $z$.

Eventually, we arrive at the proposed crash-level combined severity:

$$s_n = c_n \cdot \tilde{r}_n \cdot m^{\omega}_n \cdot m^{f}_n \cdot m^{z}_n \quad (17)$$

D. Hourly Crash Risk Signals

Crash records are localized data with associated attributes, whereas forecasting models usually ingest a pairwise connectivity matrix (travel-time-based and hour-specific in our case). To bridge this gap, we first aggregate the crash influence at the node level by hour-of-day, and then project it to an edge-level signal that modulates the travel-time-based connectivity. Each crash record $n$ is mapped to its nearest station $\pi(n) \in \mathcal{V}$ and carries a combined severity score $s_n$. For each hour-of-day $h \in \{0, \ldots, 23\}$, we define the node risk as the accumulated severity at station $i$:

$$\mathcal{C}_{h,i} = \{ n \mid \mathrm{hour}(n) = h, \; \pi(n) = i \} \quad (18)$$

$$\mathrm{R}(h, i) = \sum_{n \in \mathcal{C}_{h,i}} s_n \quad (19)$$

To make this clear, $\mathrm{R}(h, i)$ depicts the cumulative impact of crashes near station $i$ at hour $h$.
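The combined severity of Eq. (17) and the node-risk accumulation of Eqs. (18)-(19) can be sketched as follows (an illustrative sketch; the helper names and the precomputed category factors passed in are our own):

```python
import numpy as np

EPS = 1e-8  # tiny constant, playing the role of epsilon in the paper

def crash_severity(C, C_bar, r, r_bar, m_w, m_f, m_z):
    """Combined severity s_n (Eq. 17) as a product of normalized factors.

    C, C_bar : clearance time and its global mean (Eq. 9)
    r, r_bar : overspeed ratio and its global mean (Eqs. 10-11)
    m_w, m_f, m_z : precomputed weather / functional-class / work-zone
                    factors (Eqs. 12, 15)
    """
    c = C / (C_bar + EPS)                                # ICT factor
    r_tilde = r / (r_bar + EPS) if r_bar > 0 else 1.0    # Eq. (11)
    return c * r_tilde * m_w * m_f * m_z

def node_risk(severities, crash_hours, crash_nodes, n_nodes):
    """Accumulate R(h, i): sum of severities mapped to (hour, station)."""
    R = np.zeros((24, n_nodes))
    for s, h, i in zip(severities, crash_hours, crash_nodes):
        R[h, i] += s
    return R
```

A crash with average clearance time, average overspeed, and unit category factors thus contributes a severity of about one, and stations with no mapped crashes keep a risk of zero.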
When no crash maps to $(h, i)$, the risk is $\mathrm{R}(h, i) = 0$, meaning no additional incident effect is applied at that station-hour; this does not imply that the resulting graph connectivity becomes binary, since in the subsequent adjacency construction edge weights remain dense and continuous due to the applied travel-time kernel. The next step is to standardize $\mathrm{R}(h, i)$:

$$\widehat{\mathrm{R}}(h, i) = \frac{\mathrm{R}(h, i) - \mu_R}{\sigma_R + \varepsilon} \quad (20)$$

where $\mu_R$ and $\sigma_R$ are the global mean and standard deviation of $\mathrm{R}(h, i)$, and $\varepsilon > 0$ ensures numerical stability. The subsequent graph creation builds on station pairs $(i, j)$, i.e., dense $N \times N$ interactions used to form an hour-based adjacency. We therefore project the node-level signal to a pairwise matrix by summing the risks of the endpoints:

$$\widehat{\mathrm{R}}_{\text{total}}(h, i, j) = \widehat{\mathrm{R}}(h, i) + \widehat{\mathrm{R}}(h, j) \quad \forall \, i, j \in \mathcal{V} \quad (21)$$

The intuition is that if the endpoint stations observe increased severity at hour $h$, the connectivity between the two stations is more likely to be affected (e.g., due to localized queues propagating upstream/downstream). Technically, $\widehat{\mathrm{R}}_{\text{total}}(h, i, j)$ portrays an hour-specific "incident effect" signal that can perturb sampled travel times before they are converted into the final hour-conditioned adjacency matrix.

E. Formulating the Incident-Aware Adjacency Matrix

Having formulated $\widehat{\mathrm{R}}_{\text{total}}(h, i, j)$, we now establish the relation between each node pair's $(i, j)$ aggregated crash-context signal and its sampled travel times. To accomplish this, we use Pearson correlation coefficients:

$$\rho_{ij} = \mathrm{Corr}_h\left( T^{(h)}_{ij}, \; \widehat{\mathrm{R}}_{\text{total}}(h, i, j) \right) \quad (22)$$

where $\mathrm{Corr}_h(\cdot, \cdot)$ denotes the Pearson correlation computed over the 24 hourly samples $h = 0, \ldots, 23$. We bound the correlation coefficient to maintain numerical stability:

$$\rho_{ij} \leftarrow \mathrm{clip}(\rho_{ij}, -\rho_{\max}, \rho_{\max}) \quad (23)$$
We then define the incident-aware effective travel time for hour $h$ as:

$$\widehat{T}^{(h)}_{ij} = T^{(h)}_{ij}\left(1 + \rho_{ij} \, \widehat{\mathrm{R}}_{\text{total}}(h, i, j)\right), \quad (i, j) \in \mathcal{E} \quad (24)$$

Finally, we convert the incident-aware travel times into hour-conditioned adjacency weights using a Gaussian kernel [23]:

$$A^{(h)}_{ij} = \exp\left( -\frac{1}{2\sigma^2} \left( \frac{\widehat{T}^{(h)}_{ij}}{T^{(h)}_{\max} + \varepsilon} \right)^2 \right) \quad (25)$$

$$T^{(h)}_{\max} = \max_{(u,v) \in \mathcal{E}} \widehat{T}^{(h)}_{uv} \quad (26)$$

where $T^{(h)}_{\max}$ is the maximum effective travel time at hour $h$, and $\sigma^2$ controls how sharply connectivity decays with increasing travel time. Finally, we obtain the adaptive adjacency matrix by applying the data-availability mask $A_{\text{avail}}$ defined in Equation 2 to the hour-conditioned adjacency:

$$A^{(h)}_{\text{adaptive}}[i, j] = A^{(h)}_{ij} \cdot A_{\text{avail}}[i, j] \quad (27)$$

This yields a collection of incident-aware, hour-conditioned adaptive adjacency matrices, where each $A^{(h)}_{\text{adaptive}}$ combines incident-aware travel-time perturbations (incident clearance, weather, functional class, speeding, and work-zone context) with the data-availability mask.

F. Model Architecture and Adaptive Conformal Prediction (ACP)

We leverage a Spatio-Temporal Transformer model [47] with an encoder-decoder design (STT-ED) to perform multi-horizon traffic forecasting, exploiting the inherently parallel sequence architecture of a Transformer. Let the traffic flow tensor be $F \in \mathbb{R}^{T \times N}$, where $N = |\mathcal{V}|$ sensors and $T$ is the number of time samples. For each training sample, we form an input window of length $L$ and predict the next $H$ steps:

$$X_t = F_{t-L+1:t} \in \mathbb{R}^{L \times N}, \qquad Y_t = F_{t+1:t+H} \in \mathbb{R}^{H \times N} \quad (28)$$

As our spatial information is hour-conditioned, each window is characterized by an hour-of-day index $h_t \in \{0, \ldots, 23\}$, extracted from the timestamp of the last step in the window. $A^{(h)}_{\text{adaptive}}$ is then injected into the spatial module of the model by selecting the appropriate matrix based on the window hour $h_t$.
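The adjacency construction of Eqs. (22)-(27), with its correlation, clipping, perturbation, Gaussian kernel, and availability masking, can be sketched end-to-end as follows (a numpy sketch under our own naming; the `sigma2` and `rho_max` defaults are illustrative, not the paper's tuned values):

```python
import numpy as np

def incident_aware_adjacency(T, R_total, A_avail, sigma2=0.1, rho_max=0.9, eps=1e-8):
    """Build hour-conditioned adaptive adjacency A^(h) (Eqs. 22-27).

    T       : (24, N, N) sampled travel-time matrices T^(h)
    R_total : (24, N, N) standardized pairwise risk signals
    A_avail : (N, N) data-availability mask
    """
    H, N, _ = T.shape
    # Pearson correlation over the 24 hourly samples, per edge (Eq. 22)
    Tc = T - T.mean(axis=0)
    Rc = R_total - R_total.mean(axis=0)
    num = (Tc * Rc).sum(axis=0)
    den = np.sqrt((Tc ** 2).sum(axis=0) * (Rc ** 2).sum(axis=0)) + eps
    rho = np.clip(num / den, -rho_max, rho_max)              # Eq. (23)
    T_hat = T * (1.0 + rho[None] * R_total)                  # Eq. (24)
    T_max = T_hat.reshape(H, -1).max(axis=1)[:, None, None]  # Eq. (26)
    A = np.exp(-0.5 / sigma2 * (T_hat / (T_max + eps)) ** 2)  # Eq. (25)
    return A * A_avail[None]                                 # Eq. (27)
```

Edges whose effective travel time approaches the hourly maximum receive kernel weights near $\exp(-1/(2\sigma^2))$, so connectivity weakens smoothly rather than switching off.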
A row-normalized version is then used for graph mixing:

$$\widetilde{A}_t = \mathrm{RowNorm}\left( A^{(h_t)}_{\text{adaptive}} + I \right) \quad (29)$$

$$\mathrm{RowNorm}(M)[i, j] = \frac{M[i, j]}{\sum_k M[i, k] + \varepsilon} \quad (30)$$

Temporal Tokenization and Temporal Encoder: Unlike recurrent models, the Transformer processes a sequence of tokens. For each node $i$, it treats the $L$-step history as a 1-D sequence and converts it into patch tokens using a temporal convolution with patch length $p$ [80] (e.g., $p = 6$ for 90-min patches with 15-min data). Let $x_{t,i} \in \mathbb{R}^L$ denote the input history for node $i$. The tokens are formed as:

$$Z^{(0)}_{t,i} = \mathrm{PatchConv}(x_{t,i}) \in \mathbb{R}^{L_p \times d}, \qquad L_p = \left\lfloor \frac{L}{p} \right\rfloor \quad (31)$$

where $d$ is the Transformer embedding dimension. We add sinusoidal positional encoding $\mathrm{PE}(\cdot)$ to preserve temporal ordering:

$$Z^{(0)}_{t,i} \leftarrow Z^{(0)}_{t,i} + \mathrm{PE} \quad (32)$$

The temporal encoder comprises $D_t$ stacked Transformer encoder blocks. For a generic encoder layer $\ell$, multi-head self-attention (MHSA) is computed as:

$$Q = Z^{(\ell-1)} W_Q \quad (33)$$
$$K = Z^{(\ell-1)} W_K \quad (34)$$
$$V = Z^{(\ell-1)} W_V \quad (35)$$

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V \quad (36)$$

With residual connections and layer normalization (LN), the temporal encoder update is:

$$\widehat{Z}^{(\ell)} = \mathrm{LN}\left( Z^{(\ell-1)} + \mathrm{Dropout}(\mathrm{MHSA}(Z^{(\ell-1)})) \right) \quad (37)$$

$$Z^{(\ell)} = \mathrm{LN}\left( \widehat{Z}^{(\ell)} + \mathrm{FFN}(\widehat{Z}^{(\ell)}) \right) \quad (38)$$

where $\mathrm{FFN}(\cdot)$ is a two-layer MLP with nonlinearity. The temporal encoder produces per-node temporal memories $M^{\text{time}}_{t,i} \in \mathbb{R}^{L_p \times d}$.

Adjacency Matrix Integration with the Spatial Module: To integrate the hour-conditioned adjacency matrix into the spatial module, temporal tokens are pooled using global average pooling:

$$u_{t,i} = \mathrm{Pool}\left( M^{\text{time}}_{t,i} \right) \in \mathbb{R}^d \quad (39)$$

We then learn a trainable node embedding $e_i \in \mathbb{R}^d$ to encode each station's identity:

$$h^{(0)}_{t,i} = u_{t,i} + e_i \quad (40)$$

Stacking all nodes yields $H^{(0)}_t \in \mathbb{R}^{N \times d}$.
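The patch tokenization and sinusoidal positional encoding of Eqs. (31)-(32) can be sketched with a linear projection standing in for the learned PatchConv (illustrative only; in the model a learned Conv1D replaces the matrix `W`):

```python
import numpy as np

def patch_tokens(x, p, W, pe=True):
    """Tokenize an L-step history into L//p patch tokens (cf. Eq. 31).

    x : (L,) input history for one node
    p : patch length (e.g., p = 6 for 90-min patches at 15-min resolution)
    W : (p, d) projection matrix, a linear stand-in for PatchConv
    Returns (L//p, d) tokens with sinusoidal positional encoding added.
    """
    x = np.asarray(x, dtype=float)
    Lp = len(x) // p
    d = W.shape[1]
    Z = x[: Lp * p].reshape(Lp, p) @ W        # non-overlapping patches -> tokens
    if pe:                                    # sinusoidal PE (Eq. 32)
        pos = np.arange(Lp)[:, None]
        i = np.arange(d)[None, :]
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
        Z = Z + np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return Z
```

With a 24-hour look-back at 15-min resolution ($L = 96$) and $p = 6$, this yields $L_p = 16$ tokens per node.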
We then inject the hour-conditioned adaptive adjacency using:

$$H^{\text{mix}}_t = \widetilde{A}_t H^{(0)}_t \in \mathbb{R}^{N \times d} \quad (41)$$

This is the key mechanism by which $A^{(h)}$ is used directly in the neural representation: before spatial attention is applied, the node embeddings are already shaped by the incident/variability-aware connectivity for the corresponding hour-of-day.

Spatial Transformer Encoder: After integrating the hour-conditioned adjacency matrix, we refine spatial interactions using a spatial Transformer encoder consisting of $D_s$ stacked encoder blocks operating over the nodes. The spatial encoder starts from the graph-mixed representation; that is, the output of the adjacency-based mixing step becomes the spatial encoder's initial hidden state, $H^{(0,s)}_t = H^{\text{mix}}_t$. This ensures that the hour-conditioned spatial graphs are embedded in the spatial tokens before attention is learned. After the spatial encoder, each node embedding is treated as a single spatial token:

$$s_{t,i} = H^{(D_s, s)}_t[i, :] \in \mathbb{R}^d, \quad i = 1, \ldots, N \quad (42)$$

Long-Horizon Decoding with Cross-Attention: After the temporal and spatial encoders, each node $i$ has temporal memory tokens $M^{\text{time}}_{t,i}$ and a spatial token $s_{t,i}$. A decoder memory is created by concatenation:

$$M_{t,i} = \left[ M^{\text{time}}_{t,i} \, \| \, s_{t,i} \right] \in \mathbb{R}^{(L_p + 1) \times d} \quad (43)$$

TABLE II
SAMPLE ODOT CRASH DATASET

Crash Date/Time    Weather  Latitude   Longitude   Functional Class  Vehicle Speed (mph)  Speed Limit (mph)  Work Zone  ICT (min)
11/19/2023 11:43   1        40.111327  -83.002978  1                 50                   65                 N          67
05/16/2023 06:50   2        39.832623  -82.736814  3                 40                   60                 N          115
06/19/2023 12:13   2        39.945520  -81.985450  1                 50                   50                 Y          64
...
11/22/2023 15:25   2        39.931881  -82.907888  1                 75                   70                 N          50

Weather code: 1=Clear, 2=Cloudy, 3=Fog/Smog/Smoke, 4=Rain, 5=Sleet/Hail, 6=Snow, 7=Severe Crosswinds, 8=Blowing Sand/Soil/Dirt/Snow, 9=Freezing Rain/Drizzle, 99=Other/Unknown
Functional class code: 1-2=Interstates/Freeways, 3=Principal Arterial, 4=Minor Arterial, 5-6=Collector Roads, 7=Local Roads
Work zone: Y=Yes, N=No

To forecast the next $H$ steps, we use an $H$-length learnable query seed $Q_0 \in \mathbb{R}^{H \times d}$ (tiled per node) and decode using a standard Transformer decoder. At each decoder layer, the query first attends to itself (to couple horizons) and then cross-attends to the memory block:

$$Q'_{t,i} = \mathrm{Attn}(Q_{t,i}, Q_{t,i}, Q_{t,i}) \quad (44)$$

$$Q_{t,i} \leftarrow \mathrm{Attn}(Q'_{t,i}, M_{t,i}, M_{t,i}) \quad (45)$$

Finally, the decoded representation is mapped to the $H$-step flow forecast for node $i$ by a linear head:

$$\widehat{Y}_t[:, i] = Q_{t,i} W_o + b_o, \qquad \widehat{Y}_t \in \mathbb{R}^{H \times N} \quad (46)$$

Adaptive Conformal Prediction (ACP) for Multi-Horizon Intervals: Following our prior work [8], we estimate uncertainty via ACP with epoch-wise adaptive calibration. After each training epoch (and once more after early stopping selects the final checkpoint), we compute calibration residuals on the held-out validation set:

$$r_{t,k,i} = \left| \widehat{Y}_{t,k,i} - Y_{t,k,i} \right| \quad (47)$$

and obtain horizon- and node-specific conformal radii as:

$$q^{(1-\alpha)}_{k,i} = \mathrm{Quantile}_{1-\alpha}\left( \{ r_{t,k,i} \}_{t \in \mathcal{D}_{\text{cal}}} \right) \quad (48)$$

The resulting prediction intervals are:

$$\widehat{Y}_{t,k,i} - q^{(1-\alpha)}_{k,i} \leq Y_{t,k,i} \leq \widehat{Y}_{t,k,i} + q^{(1-\alpha)}_{k,i} \quad (49)$$

Here $q^{(1-\alpha)}_{k,i}$ is computed per horizon and per node to reflect increasing uncertainty with $k$ and spatial heterogeneity.

The overall architecture of the proposed Spatio-Temporal Transformer with Adaptive Conformal Prediction (STT-ED-ACP) is shown in Figure 2. The STT takes as input a traffic flow window $X_t$ and the corresponding hour-of-day tag $h_t$, and produces multi-horizon point forecasts $\widehat{Y}_t$.
Crash attributes and baseline travel times are used to construct the hour-conditioned, incident-aware adjacency matrices $\{A^{(h)}\}_{h=0}^{23}$, from which the appropriate adjacency is selected based on $h_t$ and injected via graph-weighted mixing in the spatial encoder. ACP is applied as a post-forecast calibration layer to generate horizon- and node-specific prediction intervals.

Fig. 2. Spatio-Temporal Transformer (STT-ED) architecture with hour-conditioned adaptive adjacency and Adaptive Conformal Prediction (ACP) for multi-horizon traffic forecasting

IV. RESULTS AND DISCUSSION

We evaluate the proposed STT-ED framework on 2023 ODOT traffic data and the corresponding ODOT crash records (also reported to the Ohio Department of Public Safety (ODPS)). We set prediction horizons of 1 hr, 2 hr, 3 hr, and 4 hr, with a 24-hour historical look-back period. All models, including the ablation variants, were trained with early stopping (patience = 10) and a maximum of 20 epochs. All experiments are conducted on the full network with $N = 273$ nodes, yielding a $273 \times 273$ travel-time-based adjacency ($N^2$ node pairs). We found 20,046 total crashes in Ohio for 2023. Table II shows a sample of the crash dataset used to build the incident-aware attribute signals.

A. Baseline Methods and Evaluation Metrics

We selected the following baseline models for comparative analysis, using the same inputs and incident-aware adjacency as the proposed model:

1) Historical Average (HA) [81]: Predicts traffic flow using the historical mean computed from the training data.
2) Autoregressive Integrated Moving Average (ARIMA) [82]: Classical univariate statistical forecasting model capturing autoregressive and moving-average dynamics.
3) Feedforward Neural Network (FNN): Multilayer perceptron capturing non-linear relationships.
4) GCN-GRU [19]: Using gated recurrent units (GR U) for temporal modeling to reduce parameter count and improv e computational efficienc y . 5) GA T -LSTM [8]: Using graph attention (GA T) to learn adaptiv e spatial weights, follo wed by LSTM-based tem- poral forecasting. 6) STGCN [23]: Spatio-temporal graph con volutional net- work that jointly models spatial and temporal depen- dencies using graph con volutions and gated temporal con volutions. 7) DCRNN [24]: Diffusion conv olutional recurrent neural network that inte grates dif fusion-based graph con volu- tions with a sequence-to-sequence recurrent architecture for multi-step forecasting. 8) Graph W av eNet [28]: Dilated temporal con volutional network augmented with adaptive graph learning and diffusion graph con volutions for long-range spatio- temporal dependencies. 9) ASTGCN [26]: Attention-based spatio-temporal graph con volutional network that models both spatial and tem- poral attention mechanisms within a Chebyshe v graph con volution framework. W e report standard accuracy metrics for multi-horizon pre- diction, aggre gated o ver all nodes and forecast steps. Mean Absolute Error (MAE) is: MAE = 1 | Ω | X ( t,k,i ) ∈ Ω    Y t,k,i − b Y t,k,i    , (50) and Root Mean Squared Error (RMSE) is: RMSE = v u u t 1 | Ω | X ( t,k,i ) ∈ Ω  Y t,k,i − b Y t,k,i  2 , (51) where Ω denotes the ev aluation set over time indices t , horizons k ∈ { 1 , . . . , H } , and nodes i ∈ { 1 , . . . , N } . T o e v aluate calibrated uncertainty , we report Prediction Interval Coverage Probability (PICP) and Mean Prediction Interval W idth (MPIW) using ACP intervals: PICP = 1 | Ω | X ( t,k,i ) ∈ Ω 1  b Y L t,k,i ≤ Y t,k,i ≤ b Y U t,k,i  , (52) MPIW = 1 | Ω | X ( t,k,i ) ∈ Ω  b Y U t,k,i − b Y L t,k,i  , (53) where [ b Y L t,k,i , b Y U t,k,i ] is the conformal prediction interval at horizon k and node i . 
As in our prior work, we omit MAPE because near-zero flows can inflate percentage errors or render them undefined, while MAE/RMSE remain stable.

B. Adjacency Analysis and Performance Comparison

To ensure a controlled and interpretable comparison across all models, we evaluate performance on a fixed subset of $N = 50$ representative stations, corresponding to a connected route segment within the study network. Figure 3 illustrates the selected route, traffic count stations, and crash sites. We select this particular Columbus route segment because its length of approximately 47.5 miles (with traversal times typically varying from 1 hour to 1 hour 30 minutes) enables repeatable long-run experiments without requiring a much larger corridor-scale network. While a natural alternative would be to validate long-horizon forecasting on a single 4+ hour interstate corridor traversal, doing so would require continuous multi-state data coverage and consistent incident/ground-truth alignment across jurisdictions. Since our study is based on Ohio-only datasets (ODOT traffic counts and ODOT crash records), we adopt a loop-based experimental design on a Columbus corridor to emulate extended trips while remaining fully supported by the available data sources. Moreover, since our objective is long-horizon forecasting, we require extended scenarios to evaluate how prediction quality evolves as conditions transition across the day. We therefore imitate a trip that executes four repeated loop runs over the same segment, which effectively spans off-peak to peak (or peak to off-peak) regimes within a single experiment. This design provides a consistent and efficient way to generate long-duration test scenarios while preserving identical spatial structure, and it enables a controlled validation task for the SUMO Monte Carlo simulations.

Fig. 3.
Sampled Traffic Route

Beyond predictive accuracy, we analyze how incorporating crash context alters the underlying spatial connectivity used by the forecasting models. Figure 4 visualizes the change in adjacency weights:

$$\Delta A^{(h)} = A^{(h)}_{\text{crash}} - A^{(h)}_{\text{base}}$$

for four afternoon hours ($h = 14{:}00$-$17{:}00$), where $A^{(h)}_{\text{base}}$ is constructed using $\mathrm{CV}(h)$ travel-time sampling and $A^{(h)}_{\text{crash}}$ further incorporates crash severity. The predominantly negative $\Delta A^{(h)}$ values indicate that crash-induced increases in effective travel time reduce graph connectivity through the travel-time kernel, thereby weakening spatial coupling between affected nodes. Importantly, the magnitude and spatial pattern of these changes vary by hour, highlighting the time-dependent impact of incidents on the network.

Fig. 4. Crash-induced change in adjacency weights across selected hours

Table III presents the MAE and RMSE for each method across multi-hour prediction windows.

TABLE III
PREDICTION PERFORMANCE COMPARISON

Method          MAE (1h / 2h / 3h / 4h)          RMSE (1h / 2h / 3h / 4h)
HA              0.921/0.921/0.921/0.921          0.991/0.991/0.991/0.991
ARIMA           0.136/0.327/0.520/0.674          0.191/0.413/0.608/0.752
FNN             0.357/0.380/0.389/0.392          0.511/0.558/0.569/0.571
GCN-GRU         0.310/0.326/0.347/0.401          0.394/0.426/0.490/0.511
GAT-LSTM        0.251/0.265/0.362/0.426          0.363/0.370/0.477/0.514
STGCN           0.321/0.323/0.335/0.359          0.418/0.419/0.462/0.472
DCRNN           0.098/0.227/0.360/0.486          0.148/0.321/0.474/0.614
Graph WaveNet   0.122/0.233/0.350/0.485          0.157/0.282/0.422/0.601
ASTGCN          0.128/0.228/0.329/0.332          0.162/0.311/0.432/0.461
STT-ED          0.087/0.214/0.291/0.317          0.138/0.277/0.374/0.413

It can be observed that the strongest competitors to STT-ED are the graph-based baselines (notably DCRNN, Graph WaveNet, ASTGCN, and GAT-LSTM), which consistently outperform classical approaches by leveraging spatial correlations across stations.
However, most of these methods exhibit larger error growth as the horizon increases, indicating sensitivity to long-horizon uncertainty propagation and limited temporal modeling capacity under distribution shifts. In contrast, STT-ED achieves the best MAE/RMSE across all horizons and shows the most stable scaling with horizon length, suggesting that the encoder-decoder Transformer structure (with cross-attention decoding) is more effective for long-horizon forecasting.

C. Prediction Results

Figure 5 illustrates multi-horizon traffic flow predictions and the corresponding ACP uncertainty intervals for Node 1 (see Figure 3). As the forecast horizon increases from 1 to 4 hours, the prediction uncertainty widens, reflecting the accumulation of temporal uncertainty, while the point forecasts continue to capture the dominant patterns and peak/off-peak transitions. Notably, the ACP intervals expand around periods of higher variability, indicating effective uncertainty calibration without overly conservative bounds.

Fig. 5. Traffic Prediction with Uncertainty Bounds (Node 1)

D. Uncertainty Quantification

To evaluate the effectiveness of the proposed Adaptive Conformal Prediction (ACP) framework, we benchmark it against the following baselines:

1) Gaussian Process Regression (GPR) [83]: Probabilistic regression framework that yields predictive distributions via a kernel-based Bayesian formulation.
2) Mean-Variance Estimation (MVE) [84]: Jointly predicts the conditional mean and variance, enabling uncertainty-aware point forecasts.
3) Deep Ensembles (DE) [75]: Quantifies uncertainty by aggregating predictions from multiple independently trained neural networks.
4) Quantile Regression (QR) [85]: Estimates conditional quantiles, allowing prediction intervals to be constructed without distributional assumptions.
5) Bootstrap Aggregation (BA) [86]: Estimates predictive variability by training multiple models on resampled training data.
6) Conformal Prediction (CP) [76]: A distribution-free uncertainty quantification method constructing prediction intervals from residuals on a calibration set.
7) Conformalized Quantile Regression (CQR) [77]: Combines quantile regression with conformal calibration to produce adaptive prediction intervals.

These methods serve as strong baselines for assessing the calibration quality and interval efficiency of the proposed ACP approach under multi-horizon traffic forecasting settings. Table IV reports PICP and MPIW metrics.

TABLE IV
UNCERTAINTY QUANTIFICATION BOUNDS COMPARISON

Method   PICP % (1h / 2h / 3h / 4h)       MPIW (1h / 2h / 3h / 4h)
GPR      66.64/89.65/88.20/90.11          0.372/3.298/3.303/3.296
MVE      89.34/90.51/91.02/91.44          0.952/1.221/2.565/2.747
DE       71.03/73.10/75.65/77.81          0.701/0.882/1.034/1.175
QR       67.76/66.32/65.57/66.06          0.504/0.655/0.768/0.832
BA       43.74/37.22/36.74/35.50          0.226/0.246/0.251/0.258
CP       90.10/91.50/89.38/89.44          0.571/1.910/2.231/2.895
CQR      91.95/91.47/90.87/90.76          0.668/1.971/1.896/1.458
ACP      94.09/93.51/91.13/92.94          0.582/0.876/1.036/1.148

ACP provides the best coverage among the strongest competitors (CP and CQR). Compared to CP, ACP maintains similar (and often slightly higher) PICP while producing substantially tighter intervals, with MPIW reductions of roughly 53-60% for the 2h-4h horizons. Relative to CQR, ACP achieves comparable coverage but notably narrower bounds at 2h-3h (about 45-55% smaller MPIW), while remaining competitive at 4h. Overall, ACP delivers higher coverage while maintaining the tightest (or near-tightest) uncertainty bands among the conformal baselines.

E. Ablation Study

To isolate the contribution of each architectural block in the proposed STT framework, we perform an ablation study by removing components from the full encoder-decoder model. In particular, we compare five configurations:

1) STT-ED (Full): the complete encoder-decoder Transformer used in our main experiments.
2) STT-E (Encoder-only): removes the decoder cross-attention and directly maps the learned spatio-temporal embeddings to multi-horizon forecasts.
3) STT-D (Decoder-only): removes the encoder stack and relies on learned horizon queries with a lightweight memory representation to generate multi-step predictions.
4) STT-ED-NoPatch: removes the temporal patch tokenization (Conv1D tokenization) and instead uses step-wise tokenization, testing the effect of patch-based temporal embedding.
5) STT-ED-NoCrossAttn: removes the decoder cross-attention to the encoder memory, testing whether explicit encoder-decoder information flow is necessary for long-horizon decoding.

Table V reports MAE/RMSE across the multi-hour horizons.

TABLE V
ABLATION STUDY UNDER DIFFERENT STT CONFIGURATIONS

Configuration        MAE (1h / 2h / 3h / 4h)          RMSE (1h / 2h / 3h / 4h)
STT-ED               0.087/0.214/0.291/0.317          0.138/0.277/0.374/0.413
STT-ED-NoPatch       0.101/0.238/0.304/0.323          0.142/0.297/0.382/0.461
STT-ED-NoCrossAttn   0.736/0.720/0.719/0.722          0.773/0.789/0.786/0.791
STT-E                0.096/0.221/0.303/0.327          0.151/0.289/0.386/0.425
STT-D                0.137/0.249/0.297/0.322          0.178/0.351/0.383/0.421

The ablation findings confirm that the full STT-ED achieves the best accuracy and degrades gracefully with horizon, indicating that both the encoder memory and decoder cross-attention are key for stable long-horizon decoding. Removing patch tokenization (STT-ED-NoPatch) produces a consistent error increase across horizons from 1h to 4h, suggesting that temporal patching mainly improves representation efficiency and robustness at longer horizons.
The encoder-only variant (STT-E) remains close to the full model (MAE within +0.009 to +0.012; RMSE within +0.013 to +0.019), while the decoder-only variant (STT-D) is notably worse at 1h-2h (MAE +0.050/+0.035; RMSE +0.040/+0.074) but becomes competitive by 4h, implying that the encoder's temporal token memory is particularly beneficial for near-term dynamics. In contrast, removing decoder cross-attention (STT-ED-NoCrossAttn) causes a severe collapse (MAE ≈ 0.72-0.74; RMSE ≈ 0.77-0.79), demonstrating that explicit encoder-to-decoder information flow is essential for producing meaningful multi-horizon forecasts in this architecture.

F. Model Validation with Multi-Hour Simulation

To validate the proposed forecasting framework under realistic, time-varying conditions, we extend the simulation-based evaluation strategy of our prior work [87] to a multi-hour setting. Unlike single-hour validation, this experiment explicitly assesses the model's ability to propagate traffic dynamics as network demand evolves across consecutive hours.

Using the same sampled route described in Section IV-B, we conduct microscopic traffic simulations in SUMO driven by the model's predicted traffic flows for the selected time periods. For each experiment, the STT produces hourly forecasts for all stations along the route, and these predicted flows are injected into SUMO to construct a four-hour traffic scenario. As a result, network demand is updated at each simulated hour, enabling evaluation under realistic transitions between off-peak and peak conditions. The route characteristics are outlined in Table VI.
TABLE VI
ROUTE CHARACTERISTICS

Parameter                        Value
Route Length (mi)                47.5
Number of Intersections          84
Number of Traffic Lights         39
Number of Lanes (per direction)  1-5
Functional Class                 Interstates/freeways, arterials, collectors, and locals
Mean Speed (mph)                 74.5 (TOD range: 63.2-80.0)
INRIX congestion indicator (%)   95.9 (historic average: 96.3)

We focus on two representative peak regimes, shown in Figure 6, as observed in historical INRIX travel-time data: a daytime peak from 10:00 AM to 2:00 PM and an evening peak from 6:00 PM to 10:00 PM. The INRIX data represent single-trip travel times for the selected 47.5-mile route, aggregated by departure hour (i.e., one traversal of the corridor). For each scenario, we run a 200-run Monte Carlo simulation to cover the variability pattern.

Fig. 6. INRIX historical travel time data

Figure 7 shows INRIX corridor heatmaps for two representative departure times (12:00 and 21:00), illustrating the spatial heterogeneity of congestion along the route. We use these plots as qualitative context to identify recurrent hotspots and support the selection of peak regimes, rather than as direct simulation ground truth.

Fig. 7. INRIX Congestion Heatmaps: (a) 12:00 (daytime), (b) 21:00 (evening)

In each Monte Carlo run, a Vehicle Under Test (VUT) navigates the route four times; each run consists of multiple sequential traversals, during which traffic flow values are updated hourly based on the model's forecasts. Figure 8 compares the resulting simulated travel-time distributions for the two peak regimes, which explains the shift in the distribution support (approximately 220-330 minutes) relative to the INRIX single-trip statistics (approximately 55-85 minutes). The observed widening of the evening-peak distribution is consistent with the compounding effects of congestion and stop-and-go dynamics across successive peak hours.
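Whether such simulated travel-time samples are consistent with a log-normal law can be checked with a Kolmogorov-Smirnov statistic. Below is a minimal numpy-only sketch (our own illustration, fitting parameters from log-moments; it computes only the KS statistic, not the p-values reported in the paper):

```python
import numpy as np
from math import erf, sqrt

def ks_lognormal(samples):
    """KS statistic of samples against a moment-fitted log-normal.

    Fits mu, sigma from log(samples), then compares the empirical CDF
    with the fitted log-normal CDF; smaller values mean a better fit.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    logx = np.log(x)
    mu, sigma = logx.mean(), logx.std(ddof=1)
    # Log-normal CDF evaluated via the normal CDF of log(x)
    cdf = 0.5 * (1.0 + np.array([erf((v - mu) / (sigma * sqrt(2))) for v in logx]))
    n = len(x)
    ecdf_hi = np.arange(1, n + 1) / n   # empirical CDF just after each point
    ecdf_lo = np.arange(0, n) / n       # empirical CDF just before each point
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))
```

Samples drawn from an actual log-normal yield a small statistic, while clearly non-log-normal (e.g., bimodal) samples yield a much larger one.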
Kolmogorov-Smirnov (KS) tests further confirm that the simulated travel-time samples follow a log-normal distribution in both cases, validating the distributional assumption used in the hour-conditioned adjacency formulation.

Fig. 8. SUMO Travel time distribution (Simulated)

• Case I (Daytime Peak): log-normal (KS = 0.078, p = 0.156), normal (KS = 0.088, p = 0.081), gamma (KS = 0.0809, p = 0.137)
• Case II (Evening Peak): log-normal (KS = 0.066, p = 0.333), normal (KS = 0.083, p = 0.116), gamma (KS = 0.071, p = 0.245)

The lower KS statistic and $p > 0.05$ for the log-normal fit validate both the assumed distribution and the simulation fidelity.

V. CONCLUSION

In this work, we proposed an STT-ED forecasting framework with Adaptive Conformal Prediction (STT-ED-ACP) for long-horizon traffic prediction under time-varying and incident-driven conditions. The key idea is to replace static spatial connections with an hour-conditioned, incident-aware adjacency matrix by parameterizing travel-time variability using a piecewise $\mathrm{CV}(h)$ profile (log-normal sampling) and perturbing edge weights with crash-derived severity signals. The encoder-decoder Transformer leverages parallel temporal attention and cross-attention decoding, while the selected $A^{(h_t)}$ is injected into the spatial encoder to reflect prevailing network conditions within each forecasting window. Across all multi-hour horizons, STT-ED achieves the best MAE/RMSE with slower error growth, and ACP provides near-nominal coverage with tight, horizon-dependent prediction intervals, especially during regime transitions. Finally, multi-hour SUMO loop validation shows realistic travel-time distributions consistent with INRIX observations, supporting both the log-normal travel-time assumption and the model's ability to propagate network dynamics over extended horizons.

Despite these encouraging results, several limitations remain.
First, the hour-conditioned CV profile is estimated from historical data and treated as fixed during inference; while this captures systematic diurnal patterns, it does not adapt online to unexpected demand surges or large-scale events. Second, crash information is incorporated through aggregated severity signals rather than through explicit spatio-temporal incident evolution, which may oversimplify complex queue formation and dissipation processes. Third, the current study focuses on a single regional network and a fixed set of horizons; broader generalization across cities, seasons, and sensing resolutions remains to be explored. Finally, ACP calibration is performed epoch-wise rather than fully online, which may limit responsiveness under abrupt distribution shifts.

Future work will focus on scaling the approach to larger networks and longer horizons, as well as integrating the framework into real-time decision-support pipelines for traffic management and traveler information systems.
