📝 Original Info

  • ArXiv ID: 2512.19737

📝 Abstract

Reliable prediction of train delays is essential for enhancing the robustness and efficiency of railway transportation systems. In this work, we reframe delay forecasting as a stochastic simulation task, modeling state-transition dynamics through imitation learning. We introduce Drift-Corrected Imitation Learning (DCIL), a novel self-supervised algorithm that extends DAgger by incorporating distance-based drift correction, thereby mitigating covariate shift during rollouts without requiring access to an external oracle or adversarial schemes. Our approach synthesizes the dynamical fidelity of event-driven models with the representational capacity of data-driven methods, enabling uncertainty-aware forecasting via Monte Carlo simulation. We evaluate DCIL using a comprehensive real-world dataset from INFRABEL, the Belgian railway infrastructure manager, which encompasses over three million train movements. Our results, focused on predictions up to 30 minutes ahead, demonstrate superior predictive performance of DCIL over traditional regression models and behavioral cloning on deep learning architectures, highlighting its effectiveness in capturing the sequential and uncertain nature of delay propagation in large-scale networks.

📄 Full Content

Railway networks are vital infrastructure supporting sustainable, large-scale mobility globally, facilitating billions of passenger journeys annually. This extensive reliance makes service quality paramount; transport providers therefore prioritize reliable, efficient, and user-friendly operations to meet passenger expectations. As a result, accurate delay prediction has become a critical research area, enabling commuters to anticipate disruptions and allowing operational personnel to proactively manage service impacts.

Following (Rößler et al. 2021), we distinguish primary delays, which stem from operational problems such as rolling stock failures, signal malfunctions, or severe weather, from secondary delays, the knock-on effects that propagate these initial disruptions as late-running trains interfere with downstream slot allocations. Due to the nature of accessible data, the delay prediction literature focuses mostly on modeling secondary delays, whose dynamics are governed by intricate spatiotemporal dependencies among interconnected services and infrastructure elements.

As described in (Spanninger et al. 2022), these approaches can be grouped into two categories. The first, event-driven approaches, capture the interdependence between arrival and departure events using stochastic models (Graph Models (Goverde 2010), Markov Chains (Şahin 2017)), offering interpretability, uncertainty quantification and modest data requirements. The second, data-driven approaches, cast delay prediction as a supervised regression problem, allowing them to learn complex traffic dynamics from historical data using machine learning models (Linear Regression and Tree-based methods (Kecman and Goverde 2015), Neural Networks (Oneto et al. 2018), Transformers (Arthaud, Lecoeur, and Pierre 2024)). Yet event-driven dynamical models insufficiently represent complex interactions, while data-driven one-shot regressors may undervalue the inherent temporal dependencies of successive events.

To leverage the advantages of both methodologies, we propose to frame delay prediction as a stochastic simulation problem: a policy is trained by imitation learning to reproduce the state-transition dynamics p(s_{t+1} | s_t) observed in historical data. During roll-out, the policy stochastically predicts successive states, enabling uncertainty quantification through Monte Carlo simulation. We combine the sequential nature of event-driven models with the representational power of data-driven methods to capture traffic complexity.

We propose Drift-Corrected Imitation Learning (DCIL), a self-supervised extension of DAgger’s dataset-aggregation approach (Ross, Gordon, and Bagnell 2011). Rather than relying on an expensive external oracle, DCIL applies distance-based drift correction during roll-outs by evaluating each candidate action’s induced next-state and choosing the one that minimizes a distance ψ(·, ·) back toward the expert’s subsequent state. This strategy mitigates covariate shift without resorting to complex adversarial (Ho and Ermon 2016) or inverse reinforcement learning (Ng, Russell et al. 2000) frameworks, outperforming Behavioral Cloning (Torabi, Warnell, and Stone 2018).

Our evaluation uses an extensive real-world dataset derived from publicly accessible operational logs from INFRABEL, the Belgian railway infrastructure manager. This dataset encompasses over three million train operations across three years, covering 682 stations and diverse service types (regional, intercity, high-speed) under various conditions (peak, off-peak, weekday, weekend, disruptions). We benchmark simulation-based methods against regression across multiple architectures. Strikingly, we find that a 1.4-million-parameter Multi-Layer Perceptron trained with DCIL outperforms a 19-million-parameter Transformer trained with conventional regression on our task.

Our key contributions to delay forecasting are threefold:

  1. We model delay prediction in a stochastic simulation framework, focusing learning on short-term dynamics;
  2. We propose DCIL, a novel self-supervised imitation learning algorithm that effectively addresses covariate shift through drift correction;
  3. We conduct an extensive evaluation on a large-scale, real-world open dataset for predictions up to 30 minutes ahead, demonstrating substantial improvements in predictive accuracy over regression approaches.

Let train i be present in the rail network at time t. We define its feature vector s_t^{(i)}. Stations are embedded using the first eight non-trivial eigenvectors of the normalised graph Laplacian of the rail-network graph (Belkin and Niyogi 2003), then per-node (row-wise) L2-normalised to yield eight-dimensional spectral coordinates that compactly encode topology-preserving neighborhoods. Line embeddings are obtained by averaging the embeddings of their constituent stations.
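To make the embedding construction concrete, here is a minimal sketch of the spectral station embedding, assuming the rail network is available as a connected networkx graph; function and variable names are illustrative, not the authors' code:

```python
import numpy as np
import networkx as nx

def spectral_station_embeddings(G: nx.Graph, dim: int = 8) -> dict:
    """Embed stations via the first `dim` non-trivial eigenvectors of the
    normalised graph Laplacian, then L2-normalise each node's coordinates.
    Assumes G is connected, so exactly one eigenvector is trivial."""
    nodes = list(G.nodes())
    # Symmetric normalised Laplacian L = I - D^{-1/2} A D^{-1/2}
    L = nx.normalized_laplacian_matrix(G, nodelist=nodes).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    coords = eigvecs[:, 1:dim + 1]                 # skip the trivial eigenvector
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    coords = coords / np.clip(norms, 1e-12, None)  # row-wise L2 normalisation
    return {node: coords[k] for k, node in enumerate(nodes)}

def line_embedding(station_emb: dict, line_stations: list) -> np.ndarray:
    """A line is embedded as the mean of its constituent stations."""
    return np.mean([station_emb[s] for s in line_stations], axis=0)
```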

Let n denote the number of trains present at time t. Then, we define the full network state at time t as s_t = (s_t^{(1)}, ..., s_t^{(n)}), i.e. all per-train states at time t.

Let the itinerary of train i be defined as a sequence of m stations (l_1, ..., l_m). We denote the scheduled passage time at station l_j as τ_{l_j}^{(i)} ∈ R and the actual passage time by τ̂_{l_j}^{(i)} ∈ R. The resulting delay is d_{l_j}^{(i)} = τ̂_{l_j}^{(i)} − τ_{l_j}^{(i)}. Given the current network state s_t, our goal is, for every train i and every future station l, to model the conditional distribution

p(d_l^{(i)} | s_t),    (1)

or, in a point-prediction setting, some aspect of this distribution, such as its mean or median.

3 Background and Related Work

As noted in the introduction, train-delay research divides into two broad streams: event-driven approaches, which model the interdependence between arrival and departure events, and data-driven approaches, which learn predictive models directly from observations (Spanninger et al. 2022).

Recent years have seen a surge in the latter, with studies exploring deep-learning architectures including Convolutional Neural Networks and bidirectional LSTMs (Guo et al. 2022), Graph Neural Networks (Heglund et al. 2020), and Transformers (Arthaud, Lecoeur, and Pierre 2024). Although these models outperform the simple baselines they are compared against, they are evaluated on private datasets, so results cannot be compared across studies.

To our knowledge, (Yang et al. 2024) presents the first large-scale head-to-head comparison of data-driven methods for train-delay prediction. They find that transformer-based architectures perform best on an open-source dataset covering two Chinese high-speed routes.

Datasets Only a handful of refereed papers rely on public delay data. The best-curated example is the HSR-DELAY corpus (3399 trains, 727 stations, Oct 2019-Jan 2020) released in Scientific Data (Zhang et al. 2022). However, the dataset spans only four months on a limited number of trains, and the follow-up study of (Yang et al. 2024) evaluates just two lines. (Dekker et al. 2022) model delay using diffusion on the Belgian open-data feed (same dataset as our paper), but do not predict delays for individual trains; instead, they simulate the evolution of delay across clusters of stations over time. (Lapamonpinyo, Derrible, and Corman 2022) scrape the U.S. Amtrak API for a single corridor. To our knowledge, no train delay prediction study yet covers an entire mixed-traffic national network over multiple years, leaving generalisation beyond high-speed or single-corridor settings largely untested.

Past work on railway network simulation relies on handcrafted microscopic, mesoscopic, and macroscopic simulators such as RailSys (Bendfeldt, Mohr, and Muller 2000), OpenTrack (Nash and Huerlimann 2004), PETER (Koelemeijer et al. 2000), or PROTON (Sipilä 2023) that encode operating rules and dispatcher heuristics instead of learning from historical data. Macroscopic simulators neglect precise train dynamics but scale well to large networks, whereas microscopic ones demand extensive calibration and significant compute, and mesoscopic models target only critical sections (Tiong and Palmqvist 2023). Consequently, forecasts from these rule-driven tools often fail to match real-world delay patterns, especially on busy networks.

The work on rail-network simulation reflects that of the wider time-series forecasting literature. Train delay is deterministically linked to train trajectory (i.e., delay at t can be derived from state s_t), with trajectory prediction being a specific case of time-series forecasting (cf. Eq. (1)).

A traditional approach to time series forecasting, often used in the context of signal processing, is to implement a dynamical model p(s_{t+1} | s_t) whose parameters have a physical meaning (speed, acceleration, etc.), e.g. (Prevost, Desbiens, and Gagnon 2007; Martino et al. 2017).

Specifying reliable dynamical models with an expert is not always feasible (particularly when complicating external factors such as human behaviour are involved), but such models can instead be learned in a data-driven (and often model-agnostic) fashion via machine learning, essentially learning p(s_{t+1} | s_t) or similar from training pairs {(s_t, s_{t+1})}_t.

Data-driven simulation approaches have proven effective at predicting future events in various domains. In weather forecasting, (Price et al. 2025) introduced GenCast, a conditional diffusion model trained on decades of reanalysis data that generates 15-day global ensemble forecasts in minutes and surpasses the leading physics-based system (ECMWF-ENS) on 97% of evaluation targets. In road-traffic modeling, (Kuefler et al. 2017) applied Generative Adversarial Imitation Learning to train a driver-behavior simulator that reproduces realistic lane changes, speed profiles and collision-free trajectories, showing that imitation-learning-based simulation can successfully predict future traffic events.

Imitation learning seeks a policy π_θ that reproduces expert behaviour from demonstration trajectories τ = (s_0, a_0, ..., s_T, a_T). Four families dominate the literature:

Behavioral Cloning (BC). Treats imitation as supervised learning: fits π_θ(s) to expert actions a on recorded (s, a) pairs. Fast and data-efficient, but prone to covariate shift once the learner visits states absent from the dataset.

Dataset Aggregation (DAgger). Iteratively executes the current policy, queries the expert on the visited states, and augments the training set. This feedback loop curbs covariate shift but can be expensive when expert labels are costly.

Inverse Reinforcement Learning (IRL). Alternates between (i) fitting a reward function under which the expert is (near-)optimal and (ii) training the policy via RL using the learned rewards. This bi-level loop generalises beyond the demonstration manifold but is computationally heavy.

Generative Adversarial Imitation Learning (GAIL). Derived from IRL, GAIL sets up a GAN-style game: a discriminator distinguishes expert from learner state-action pairs, and its negative log-probability serves as the reward fed to the policy optimiser (e.g. TRPO/PPO). This removes the expensive bi-level loop but inherits typical GAN instabilities such as mode collapse.

We frame the railway network as a Markov Decision Process (MDP). Let s_t = (s_t^{(1)}, ..., s_t^{(n)}) be the state of the rail network at time t, defined in Section 2. Let A denote the action space, with a_t = (a_t^{(1)}, ..., a_t^{(n)}) the actions describing the movements of trains on the network at time t. As illustrated in Figure 1, action a^{(i)} ∈ {0, 1, 2} moves train i by that many scheduled stations during the next time step, with the cap at 2 ensuring a finite discrete action space. In this work, we set Δt = 30 seconds, so s_{t+1} represents a full snapshot of the network 30 seconds after s_t. No hand-crafted block-constraint rules are enforced; instead, the policy approximates them implicitly by imitation, as a precise specification is impractical. Actions and states remain discrete to fit the available data; a continuous version would require GPS.
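As a minimal illustration of these dynamics, the sketch below reduces a per-train state to an itinerary index and a clock; the actual state vector of Section 2 carries many more features, so this is a toy skeleton, not the paper's implementation:

```python
from dataclasses import dataclass, replace

DT = 30.0  # seconds per simulation step

@dataclass(frozen=True)
class TrainState:
    itinerary: tuple   # scheduled stations (l_1, ..., l_m)
    station_idx: int   # index of the last passed station
    clock: float       # network time in seconds

def phi(s: TrainState, a: int) -> TrainState:
    """Per-train dynamics: action a in {0, 1, 2} advances the train
    by that many scheduled stations during the next 30-second step."""
    assert a in (0, 1, 2)
    new_idx = min(s.station_idx + a, len(s.itinerary) - 1)
    return replace(s, station_idx=new_idx, clock=s.clock + DT)
```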

The environment dynamics, applied per train, are defined by the function ϕ : S^{(i)} × A^{(i)} → S^{(i)}, which maps a per-train state and action to the next per-train state.

For simplicity, we still denote the transformation as ϕ(s, a), mapping a set of train-action pairs (s, a) = ((s^{(1)}, ..., s^{(n)}), (a^{(1)}, ..., a^{(n)})) to their next state (ϕ(s^{(1)}, a^{(1)}), ..., ϕ(s^{(n)}, a^{(n)})). As a result, we obtain

p(s_{t+1} | s_t) = Σ_{a ∈ A} π(a | s_t) · 1[s_{t+1} = ϕ(s_t, a)],

whose sum has at most one non-zero term since ϕ is injective for a fixed s: by construction, distinct actions map to distinct states, with π(a | s) denoting the policy's action distribution.

As delay can be deduced from states, we rewrite delay prediction as prediction of future states:

p(d_l^{(i)} | s_t) = p(f(s_{t+1}, ..., s_{t+H}) | s_t),

where f retrieves the delay from the predicted states. Via the Markov assumption:

p(s_{t+1}, ..., s_{t+H} | s_t) = Π_{k=t}^{t+H−1} p(s_{k+1} | s_k).

Exact evaluation is infeasible, so we approximate the distribution with Monte Carlo rollouts of policy π. In this work, we obtain the delay point forecast by retrieving the median of the empirical distribution from the sampled trajectories.
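A sketch of this Monte Carlo procedure follows, assuming hypothetical interfaces `policy.action_probs` (per-train action probabilities), `phi` (per-train dynamics) and `delay_of` (the delay-retrieval function f above); 66 steps matches the evaluation protocol described later (33 minutes at Δt = 30 s):

```python
import numpy as np

def mc_delay_forecast(policy, phi, s0, delay_of, n_steps=66, n_rollouts=50):
    """Approximate p(d | s_0) with Monte Carlo rollouts of the policy and
    return the median of the empirical distribution as the point forecast."""
    samples = []
    for _ in range(n_rollouts):
        state = s0                                   # list of per-train states
        for _ in range(n_steps):
            probs = policy.action_probs(state)       # shape (n_trains, 3)
            actions = [np.random.choice(3, p=p) for p in probs]
            state = [phi(s, a) for s, a in zip(state, actions)]
        samples.append([delay_of(s) for s in state]) # delays at the horizon
    return np.median(np.array(samples), axis=0)      # per-train point forecast
```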

In summary, we have reduced delay prediction to learning a policy π that approximates p(a | s). The following section presents the process of learning this policy.

We assume an (implicit) expert policy π* has generated a logged dataset of train trajectories, D = {(s_0, a_0, ..., s_T, a_T)}.

The term “expert” is purely notional: it refers to the behavioural patterns encoded in historical train-movement data rather than to a conscious decision-maker. Each recorded trajectory is treated as if it were sampled from π * .

Our goal is to learn a policy π θ via imitation learning that closely matches the expert action distribution π * (a t | s t ).

The standard method within Imitation Learning is Behavioural Cloning (BC): it trains a parametric policy π θ by minimising the cross-entropy between the predicted action distribution π θ (•|s) and the one-hot expert labels from the demonstration set D. However, this method is notoriously prone to covariate shift: prediction errors push the policy into unseen states, where its performance quickly degrades. In order to overcome this limitation in our setting, we propose Drift-Corrected Imitation Learning (DCIL), a self-supervised extension of DAgger that injects simulator-generated corrective labels during training, counteracting covariate shift without requiring any additional expert queries or complex adversarial schemes.

Starting from a state s_0 drawn from a demonstration trajectory (s_0, ..., s_T) ∈ D, we roll out our policy π_θ for T steps to obtain a policy trajectory (s_0, ŝ_1, ..., ŝ_T). At each step t and for each train i, we compute the corrective action

a*_t^{(i)} = argmin_{a ∈ A^{(i)}} ψ(s_{t+1}^{(i)}, ϕ(ŝ_t^{(i)}, a)),

where ψ : S^{(i)} × S^{(i)} → R_{>0} is a distance in state space.

In this work, ψ computes the number of stations separating the two individual states s^{(i)} and s′^{(i)} along the itinerary of train i. At each step t, we choose for each train the action that minimizes the future distance: the corrective action a*^{(i)} is 0 when the train is at the right station or ahead, 1 when it is one station behind, and 2 when it is two or more stations behind.
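Because ψ counts stations along the itinerary, the corrective label reduces to clipping the station gap, as in this sketch (variable names are illustrative):

```python
def corrective_action(policy_idx: int, expert_idx: int) -> int:
    """Drift-corrected label: psi counts the stations separating the
    rolled-out state from the expert's next state along the itinerary.
    Returns 0 if the train is at the right station or ahead, 1 if it is
    one station behind, 2 if it is two or more stations behind."""
    gap = expert_idx - policy_idx   # > 0 means the rollout lags the expert
    return int(min(max(gap, 0), 2))
```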

This algorithm is off-policy. As a result, we store synthetic state-action pairs in a replay buffer and use them for multiple epochs to improve sample efficiency.

Because synthetic labels become less reliable as the rollout drifts farther from the expert trajectory, we down-weight their influence according to the distance between the step-t policy state ŝ_t^{(i)} and the next expert state s_{t+1}^{(i)}. One SGD step on the cross-entropy loss uses a gradient scaled by a distance-based weight (Eq. (2)) with hyperparameters α > 0 and β ≥ 1. Thus, trajectories very close to the expert receive nearly full weight, while poorly matching ones contribute only marginally.
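The extracted text does not preserve the exact functional form of this weight, so the sketch below assumes an exponential decay purely for illustration; any monotonically decreasing w(ψ) governed by α and β fits the description:

```python
import math

def drift_weight(psi: float, alpha: float, beta: float) -> float:
    """Illustrative down-weighting of synthetic labels by drift distance.
    NOTE: assumed form, not the paper's Eq. (2); it only reproduces the
    stated behaviour (near-full weight when psi is small, marginal weight
    when the rollout has drifted far), with alpha > 0 and beta >= 1."""
    return math.exp(-alpha * psi ** beta)
```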

The full training loop, described in Algorithm 1, starts by initializing π_θ and the replay buffer B. Then, for E epochs, it (i) rolls out the current policy to inject n_s synthetic samples labelled with the corrective actions a*_t, discarding the oldest samples to keep the buffer at capacity C, and (ii) performs Adam updates on mini-batches M ⊂ B of size m drawn from the entire buffer, where each gradient is down-weighted according to the distance-based weight in Eq. (2).
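Putting the pieces together, a compact sketch of the loop follows. All interfaces (`policy`, `phi`, `psi`, `demos`) are assumptions as before, the weight reuses the illustrative exponential form above, and only one mini-batch update per epoch is shown for brevity:

```python
import math
import random
from collections import deque
import torch
import torch.nn.functional as F

def train_dcil(policy, phi, psi, demos, epochs, n_synth,
               capacity, batch_size, alpha, beta, lr=1e-3):
    """Sketch of the DCIL loop (Algorithm 1), per-train view. Assumed
    interfaces: policy(s) -> logits over {0, 1, 2}; phi(s, a) -> next
    state; psi(s, s') -> station distance; demos is a list of expert
    trajectories (s_0, ..., s_T)."""
    buffer = deque(maxlen=capacity)  # replay buffer B; oldest samples evicted at capacity C
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    for _ in range(epochs):
        # (i) Roll out the current policy to inject n_synth synthetic samples.
        injected = 0
        while injected < n_synth:
            traj = random.choice(demos)
            s = traj[0]
            for t in range(len(traj) - 1):
                # Drift-corrected label: the action whose induced next
                # state lands closest to the expert's next state.
                a_star = min((0, 1, 2), key=lambda a: psi(traj[t + 1], phi(s, a)))
                # Distance-based weight (assumed exponential form, cf. Eq. (2)).
                w = math.exp(-alpha * psi(s, traj[t + 1]) ** beta)
                buffer.append((s, a_star, w))
                injected += 1
                # Advance the rollout with the policy's own sampled action.
                with torch.no_grad():
                    probs = F.softmax(policy(s), dim=-1)
                s = phi(s, torch.multinomial(probs, 1).item())

        # (ii) Adam update on a mini-batch drawn from the whole buffer
        # (one illustrative step; the paper iterates over M ⊂ B).
        batch = random.sample(list(buffer), min(batch_size, len(buffer)))
        states, labels, weights = zip(*batch)
        logits = torch.stack([policy(s) for s in states])
        ce = F.cross_entropy(logits, torch.tensor(labels), reduction="none")
        loss = (torch.tensor(weights, dtype=torch.float32) * ce).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```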

Our empirical evaluation uses raw operational logs provided by INFRABEL, the Belgian railway infrastructure manager, covering a three-year period from 1 January 2022 to 31 December 2024. Every 30 seconds, we build a network-wide snapshot that contains the vector encoding of each train, as per Section 2. Network density varies from fewer than 10 trains during late-night service to more than 400 at peak hour. To augment the schedule context, we insert each train at a placeholder station 5 min before its planned departure and keep it at a placeholder station for 5 min after its final observed stop. Records with missing or incoherent timestamps are discarded (less than 1% of trains). Finally, we uniformly subsample 10% of the snapshots, yielding 255k snapshots with 51 million train instances in total.

We adopt a strict temporal split to avoid information leakage and match industrial settings, reserving a full calendar year for testing to maximise seasonal and operational diversity. Evaluating a single snapshot with the simulation-based approach requires sampling 50 trajectories and simulating 66 time steps (33 minutes at Δt = 30 s); on an NVIDIA A100 this takes about 1.5 seconds of wall-clock time. To keep evaluation tractable, we down-sample the test split to 800 snapshots (≈ 200 trains per snapshot), yielding 1.6 million arrival-time predictions (≈ 10 per train) that are produced in roughly 20 min end-to-end. Test snapshots are uniformly drawn, but busy periods produce more predictions, effectively emphasizing peaks.

Reproducibility. All experiments are fully reproducible; code and appendix are available on GitHub (link on the first page).

In this work, we are interested in multi-station delay forecasting, i.e., predicting delay for the next n stations. This section provides implementation details for the evaluated methods (regression, BC, DCIL) and architectures (XGBoost, MLP, Transformer).

Regression target Following Section 2, we model delay via the conditional distribution p(d_l^{(i)} | s_t). For a train i travelling on itinerary l and currently located at station index j, we need a forecast for the next n stations. Rather than predicting absolute delays, we regress on the difference to the last known delay, Δd_{l_{j+1}:l_{j+n}}^{(i)} = d_{l_{j+1}:l_{j+n}}^{(i)} − d_{l_j}^{(i)}.

Our model outputs a point estimate of Δd_{l_{j+1}:l_{j+n}}^{(i)} conditioned on s_t; we then recover absolute delays by adding the last known delay d_{l_j}^{(i)}, yielding the prediction d̂_{l_{j+1}:l_{j+n}}^{(i)}.

Ensuring fair comparison To guarantee that regression and simulation models are evaluated on exactly the same targets, we restrict the predictive horizon to H = 30 min. In the final metrics, we therefore retain only station events whose observed arrival time lies within H of the reference snapshot s_0. Because the regression baselines output a fixed-length vector, we set the output length to the next K = 15 stations, a compromise that covers almost all instances within 30 min while avoiding an unnecessarily large head that would penalise regression. The simulator is forced to respect the same limit and produces at most K predictions per train. Should a simulated train fail to reach one of the K future stations within the simulated steps (e.g. it keeps choosing STAY: a = 0), we copy forward its most recent known delay and, if necessary, enlarge it so that the predicted arrival time never precedes the last simulated time step; a sketch of this rule follows. To allow the simulator to make mistakes by predicting arrival times beyond the predictive horizon, we simulate 10% more time steps.
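The copy-forward rule can be sketched as follows, with illustrative names (`sched` holds the scheduled arrival times of the K future stations, and `delays[j]` is None when the rollout never reached station j):

```python
def pad_unreached_stations(sched, delays, last_delay, t_end):
    """Fair-comparison padding (sketch): if a simulated train never
    reaches a future station, copy forward its last known delay and,
    if needed, enlarge it so that the implied arrival time
    (sched[j] + delay) never precedes the last simulated step t_end."""
    out = []
    for s_j, d_j in zip(sched, delays):
        if d_j is None:
            d_j = last_delay              # copy forward the last known delay
            d_j = max(d_j, t_end - s_j)   # arrival must not precede t_end
        out.append(d_j)
    return out
```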

Method-dependent station/line horizon The raw features of Section 2 are computed for the previous five visited stations and lines for all models. To capture information about upcoming stations and lines, we include the next five scheduled stations/lines when the input feeds the simulator and the next fifteen for a regression model.

Mitigating Stalled Policies in BC During roll-out, we observed that the BC-learned policy occasionally stalls: it assigns almost all probability mass to the STAY action (a = 0) at every step, so the train never advances. We keep for each train a running floor µ, defined as the largest value of π(a=1) since that train last advanced. At every step, we clamp π(a=1) ← max(π(a=1), µ) and then renormalise the action distribution; this simple trick, sketched below, reduces the fraction of stalled trains and improves long-horizon accuracy. This is not necessary for DCIL.
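A sketch of the clamp, with an assumed per-step interface (the paper does not publish this snippet):

```python
import numpy as np

def clamp_stay_bias(probs, mu, advanced):
    """BC stall mitigation (sketch): mu is the largest pi(a=1) observed
    since the train last advanced; reset it when the train moves, clamp
    pi(a=1) from below, then renormalise the action distribution."""
    if advanced:
        mu = 0.0                     # train moved: reset the running floor
    mu = max(mu, probs[1])           # update floor with the current pi(a=1)
    probs = np.asarray(probs, dtype=float).copy()
    probs[1] = max(probs[1], mu)     # clamp pi(a=1) <- max(pi(a=1), mu)
    probs /= probs.sum()             # renormalise
    return probs, mu
```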

To highlight the contrast between regression and simulation paradigms, we benchmark three architectures of ascending expressive power: a classical gradient-boosted tree ensemble (XGBoost), a lightweight multi-layer perceptron (MLP), and an encoder-only Transformer.

Transformer. Following (Arthaud, Lecoeur, and Pierre 2024), each input token represents one train, i.e. s_t^{(i)} as defined in Section 2. Through the attention mechanism, the model can propagate information between trains, predicting (i) delay for regression and (ii) a probability distribution over the action space for simulation, conditioned on the state of the network.

MLP/XGBoost Due to the fixed-length input nature of both the multi-layer perceptron (MLP) and XGBoost, we cannot feed them the variable-size set of trains present in a network snapshot the way the Transformer does. As a result, the model inputs are train-specific rather than snapshot-specific.

In order to give the model network-related information, we build features that we concatenate with those introduced in Section 2. For each train, we compute, at each of the five radii r ∈ {0.1, 0.3, 0.6, 1.0, 2.0} (chosen ad hoc to capture different ranges of network interactions):

  1. the count of neighbouring trains lying within Euclidean distance r in the embedding space, min-max normalised;
  2. the mean of the past delays of trains in the neighbourhood.

These ten scalars provide a compact summary of the local traffic context; appending them to the base feature vector yields the fixed-length representation required by both the MLP and XGBoost models, as sketched below.
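A sketch of this feature construction, assuming each train's position is its eight-dimensional spectral coordinate; the min-max normalisation of counts (fit on training data) is omitted here:

```python
import numpy as np

def context_features(train_pos, all_pos, past_delays,
                     radii=(0.1, 0.3, 0.6, 1.0, 2.0)):
    """Fixed-length traffic-context features for MLP/XGBoost (sketch):
    for each radius r, the count of trains within Euclidean distance r
    in the embedding space, and the mean past delay of those trains.
    Yields 5 radii x 2 statistics = 10 scalars per train."""
    dists = np.linalg.norm(all_pos - train_pos, axis=1)
    feats = []
    for r in radii:
        mask = dists <= r
        feats.append(mask.sum())  # neighbour count (normalisation omitted)
        feats.append(past_delays[mask].mean() if mask.any() else 0.0)
    return np.array(feats, dtype=float)
```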

Hyperparameter tuning Table 1 and Table 2 are produced with the following two-phase protocol. First, for each architecture we carry out a model-specific, multi-stage grid search over a subset of hyper-parameters, using mean absolute error (MAE) on the validation set as the selection metric. Second, the best configuration is run ten times with different random seeds; each run is retrained on the combined train and validation sets and evaluated on the held-out test set. We report these runs as mean ± standard deviation for each metric.

For the Transformer, moving from pure Regression to the simulation baseline BC already yields a clear improvement: MAE drops by 7.2% and RMSE by 8.1%. DCIL amplifies these gains, ultimately reducing MAE by 13.7% and RMSE by 17.1% relative to Regression, and by roughly half of that again relative to BC. The accompanying lower standard deviations hint at a more stable optimisation than BC when trajectory-level feedback is used. Regression fares surprisingly poorly on RMSE, hinting at a lack of robustness to outliers.

The same pattern holds for the MLP. BC narrows the gap to Regression (−8.1% MAE and −4.4% RMSE), while DCIL delivers the best absolute scores and the lowest variance (−10.2% MAE and −6.2% RMSE versus Regression). Thus, even a lightweight neural network benefits from simulation. Notably, the gap between BC and DCIL is smaller here.

In contrast, the tree-based XGBoost does not profit from the imitation-learning signal. Switching from Regression to BC actually increases the errors by 3-5% across the board. DCIL was not evaluated here because the iterative training scheme is not compatible with XGBoost.

With DCIL, the Transformer achieves the lowest overall errors (MAE = 52.24 s, RMSE = 96.34 s). Interestingly, even the simpler MLP architecture (1.4M parameters), when trained with imitation learning, surpasses the Transformer (19M parameters) trained purely by regression: MLP-BC attains an MAE of 59.10 s and an RMSE of 107.29 s, while MLP-DCIL improves further to 57.76 s and 105.23 s respectively, both lower than the Transformer-Regression baseline (MAE = 60.53 s, RMSE = 116.21 s).

Table 2 breaks the MAE down into six 5-minute predictive horizon bins. Predictions in the 0-5 minutes bin correspond to events where the observed arrival time is within 0 to 5 minutes of the snapshot. As a result, the trends observed in the aggregate metrics can be more sharply analysed.

For Transformers, BC reduces MAE by 21.2% at 0-5 min down to 2.9% at 25-30 min relative to Regression; DCIL further cuts the error by 27.2% and 8%, respectively, delivering the lowest values in every horizon bin. These gains indicate that DCIL has an advantage over BC, most notably for longer-horizon predictions, highlighting its mitigation of covariate shift.

With MLP models, BC reduces MAE by 23.4% at 0-5 min to 2.8% at 25-30 min compared with Regression; DCIL brings the gains to 26.3% and 4.3%, respectively, mirroring the Transformer’s pattern.

Using XGBoost models, BC reduces MAE by 10.7% at 0-5 min and 1% at 5-10 min, but increases it by 2.5% at 10-15 min up to 9.8% at 25-30 min relative to Regression: it performs better for short-range predictions and worse for the rest. Within the simulation setting, XGBoost's errors grow faster with horizon than the deep models', indicating a tree-specific sensitivity to distribution shift rather than a limitation of the simulation framework. In particular, the XGBoost policy is not explicitly trained as a stochastic policy or calibrated probabilistic model, so treating its outputs as action probabilities can amplify compounding errors under simulation.

DCIL with the Transformer remains the overall winner, yielding the best MAE across all predictive horizons. Notably, MLP-BC beats Transformer regression up to 20 minutes, while MLP-DCIL beats it all the way to 25 minutes, despite far lower model complexity. Taken together with the aggregate results, these horizon-wise findings confirm that simulation is more effective than regression for 30-minute delay prediction, and that DCIL effectively reduces the effect of covariate shift.

As noted in Section 3, imitation learning is not limited to behavioural cloning. Generative Adversarial Imitation Learning (GAIL) is the standard non-BC alternative. Yet all of our GAIL runs with Transformer models collapsed: despite using PPO and applying stability tricks such as label smoothing, BC pretraining of the policy, gradient-norm clipping, and reward shaping, the policy converged to a single constant action prediction, producing high validation MAE. We suspect the discriminator's reward couples token (train) contexts in the Transformer, creating noisy, non-local gradients on individual actions. Decoupling state-action pairs could stabilise learning, but would discard key inter-train dependencies. This failure illustrates the complexity of adversarial imitation learning and motivates simpler, more stable training schemes such as DCIL that still counteract covariate shift.

Experimental results underscore the superiority of simulation over regression for train delay prediction with deep learning methods. Most notably, simpler neural networks trained with an imitation-learning scheme outperform a Transformer trained by regression despite having ≈ 14× fewer parameters and a less expressive architecture. This highlights that the training objective, rather than raw model capacity, is the primary lever for accuracy on this task.

Additionally, DCIL’s advantage over BC is more pronounced for Transformers than for the MLP. This hints that DCIL scales with model expressiveness: the richer inductive bias of self-attention allows the Transformer to exploit the trajectory-level gradients that DCIL provides, whereas the lower-capacity MLP already harvests most of the benefit from simply matching expert actions step-by-step.

A natural next step is to test whether the simulation-based objectives retain their edge when labelled data are scarce. Because DCIL supplies a trajectory-level reward that is generated on-the-fly by the simulator, it can augment each real sequence with many synthetic roll-outs, effectively multiplying the learning signal. We therefore hypothesise that, under progressive down-sampling of the training set, the gap between DCIL and both BC and regression will widen-especially for the Transformer, whose larger capacity typically demands more data.

Table 2 shows that the absolute MAE difference between Regression and BC narrows as the forecast horizon grows (e.g. 6.5 s at 0-5 min vs. 2.6 s at 25-30 min). In contrast, the Regression-DCIL gap stays roughly constant (≈ 8 s), indicating that DCIL controls horizon-dependent drift more effectively than BC. Investigating whether the same trend holds for horizons longer than 30 minutes is an interesting direction for future work.

Following the procedure presented in (Kuleshov, Fenner, and Ermon 2018), we produce the calibration plot in Fig. 2; prediction intervals are formed from rollout percentiles, and their reliability is evaluated via percentile-based PIT calibration. The coverage curve lies consistently below the diagonal, indicating that the model is moderately overconfident: a nominal 80% prediction interval, for instance, contains the ground-truth delay only about 70% of the time. We attribute this under-coverage to difficulty modelling extreme-delay events, rare but operationally important, which the ensemble rarely samples, causing mass to concentrate in the right-most bin and leaving earlier bins under-populated. We hypothesise that this may, in part, stem from inputs that only weakly encode the precursors of extreme delays.
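For reference, a sketch of the percentile-based coverage computation behind such a calibration plot, assuming `rollout_samples` holds the sampled delays per event (one row per prediction, one column per rollout):

```python
import numpy as np

def empirical_coverage(rollout_samples, y_true,
                       levels=np.linspace(0.05, 0.95, 19)):
    """Percentile-based calibration check (after Kuleshov et al. 2018):
    for each nominal level q, form the central interval from rollout
    percentiles and measure how often the observed delay falls inside.
    Perfect calibration puts the curve on the diagonal."""
    coverage = []
    for q in levels:
        lo = np.percentile(rollout_samples, 50 * (1 - q), axis=1)
        hi = np.percentile(rollout_samples, 50 * (1 + q), axis=1)
        coverage.append(np.mean((y_true >= lo) & (y_true <= hi)))
    return levels, np.array(coverage)
```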

In this paper, we framed probabilistic delay prediction as a sequential simulation task and trained an autoregressive policy via imitation learning using Behavioural Cloning and our novel approach, Drift-Corrected Imitation Learning (DCIL), a self-supervised extension of DAgger that curbs the covariate shift that plagues Behavioural Cloning without relying on an external oracle or complex adversarial schemes.

Through extensive evaluation on large-scale real-world data from INFRABEL, covering over three million train movements spanning three years, DCIL demonstrated superior performance compared with traditional regression methods and Behavioural Cloning on deep learning architectures across all predictive horizons. Future work could refine the simulation by incorporating higher-resolution representations of the railway network, leveraging GPS-based data for added detail. A second line of inquiry involves investigating DCIL's applicability to other imitation learning scenarios beyond delay prediction, assessing its generalisability and robustness across diverse domains. Furthermore, examining the theoretical and practical connections between DCIL and reinforcement learning, particularly model-based reinforcement learning, could provide valuable insights. Additionally, investigating the integration of DCIL within model-predictive control frameworks, specifically regarding performance and scalability over extended prediction horizons, presents a promising avenue for enhancing operational decision-making in complex, dynamic environments.

Data processing and experiments are run on a high-performance cluster using Linux and Slurm.

Data processing Data processing is conducted using 40× Intel(R) Xeon(R) Gold 6248 CPUs @ 2.50 GHz and 200 GB of RAM, totalling 15 hours.

Transformer Transformer experiments are conducted using A100 GPUs, 8× AMD EPYC 7543 (Milan) processors and 64 GB of RAM, totalling 2000 hours.

MLP/XGBoost MLP and XGBoost experiments are conducted using V100 GPUs, 10× Intel(R) Xeon(R) Gold 6248 CPUs @ 2.50 GHz and 40 GB of RAM, totalling 1500 hours.

Across all three model families (Transformer, Multi-Layer Perceptron (MLP), and XGBoost) we perform a three-stage grid search. At each stage, we sweep the hyperparameters listed in Tables 3-5 over their full Cartesian product, fixing all other settings to the best configuration from the previous stage. Validation uses mean absolute error (MAE) on the delay prediction targets, with a variance-aware criterion: if two candidates have similar MAE, the one with lower training variance is selected. For Transformers and MLPs, we apply early stopping on the validation MAE with patience equal to 0.25 of the maximum epoch count (e.g., 20 epochs for an 80-epoch run).

After tuning, the best configuration is retrained on the union of train and validation data and evaluated on the test split using ten random seeds (0-9), with all randomness controlled via PyTorch Lightning’s global seeding. In all tables, a dash (-) indicates that the field is not applicable to the corresponding method.

Transformer. The Transformer sweep supports Regression, Behavioural Cloning (BC) and Drift-Corrected Imitation Learning (DCIL). All variants share the optimiser and architectural defaults listed at the top of Table 3. Phase 1 explores model dimension, number of layers and learning rate, Phase 2 fine-tunes dropout, batch size and learning rate, and Phase 3 (DCIL only) searches over trajectory length, α and β.

MLP. The MLP sweep supports Regression, Behavioural Cloning (BC) and Drift-Corrected Imitation Learning (DCIL). All variants share the optimiser and architectural defaults listed at the top of Table 4. Phase 1 explores hidden dimension sizes and learning rate, Phase 2 fine-tunes batch size and learning rate, and Phase 3 (DCIL only) searches over trajectory length, α and β.

XGBoost. The XGBoost sweep supports Regression and Behavioural Cloning (BC). All variants share the optimiser and architectural defaults listed at the top of Table 5. Phase 1 explores gamma, max_depth, min_child_weight, subsample and colsample_bytree, and Phase 2 fine-tunes the learning rate, number of estimators, reg_alpha and reg_lambda.

The final configurations are given in the accompanying appendix tables.
