How does the Performance of the Data-driven Traffic Flow Forecasting Models deteriorate with Increasing Forecasting Horizon? An Extensive Approach Considering Statistical, Machine Learning and Deep Learning Models
With rapid urbanization in recent decades, traffic congestion has intensified due to increased movement of people and goods. As planning shifts from demand-based to supply-oriented strategies, Intelligent Transportation Systems (ITS) have become essential for managing traffic within existing infrastructure. A core ITS function is traffic forecasting, enabling proactive measures like ramp metering, signal control, and dynamic routing through platforms such as Google Maps. This study assesses the performance of statistical, machine learning (ML), and deep learning (DL) models in forecasting traffic speed and flow using real-world data from California’s Harbor Freeway, sourced from the Caltrans Performance Measurement System (PeMS). Each model was evaluated over 20 forecasting windows (up to 1 hour 40 minutes) using RMSE, MAE, and R-Square metrics. Results show ANFIS-GP performs best at early windows with RMSE of 0.038, MAE of 0.0276, and R-Square of 0.9983, while Bi-LSTM is more robust for medium-term prediction due to its capacity to model long-range temporal dependencies, achieving RMSE of 0.1863, MAE of 0.0833, and R-Square of 0.987 at forecasting window 20. The degradation in model performance was quantified using logarithmic transformation, with slope values used to measure robustness. Among DL models, Bi-LSTM had the flattest slope (0.0454 for RMSE and 0.0545 for MAE, for flow), whereas ANFIS-GP had 0.1058 for RMSE and 0.1037 for flow MAE. The study concludes by identifying hybrid models as a promising future direction.
💡 Research Summary
The paper investigates how the forecasting accuracy of traffic‑flow prediction models degrades as the prediction horizon expands, covering statistical, machine‑learning (ML), and deep‑learning (DL) approaches. Real‑world traffic speed and flow data were obtained from the California Department of Transportation’s Performance Measurement System (PeMS) for the Harbor Freeway corridor. After cleaning and normalizing the 5‑minute interval time series, the authors constructed 20 forecasting windows ranging from 5 minutes to 100 minutes (1 hour 40 minutes) to evaluate short‑, medium‑, and long‑term performance.
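The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `make_horizon_dataset`, the choice of 12 lagged inputs, and the synthetic series are all assumptions introduced here. Only the 5-minute sampling interval and the 20 windows come from the summary.

```python
import numpy as np

STEP_MINUTES = 5   # sampling interval stated in the summary
N_WINDOWS = 20     # number of forecasting windows stated in the summary


def make_horizon_dataset(series, n_lags, horizon):
    """Pair each block of n_lags past values with the value `horizon` steps ahead.

    horizon=1 targets the next 5-minute reading, horizon=20 the reading
    100 minutes after the last observed value.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested lags and horizon")
    # Each row of X is one window of consecutive past observations.
    X = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    # The target sits `horizon` steps after the last value in each row.
    first_target = n_lags + horizon - 1
    y = series[first_target:first_target + n_samples]
    return X, y


# Hypothetical stand-in for a normalized flow series (not PeMS data).
demo_series = np.sin(np.linspace(0.0, 20.0, 500))

# One dataset per forecasting window, keyed by lead time in minutes.
datasets = {
    h * STEP_MINUTES: make_horizon_dataset(demo_series, n_lags=12, horizon=h)
    for h in range(1, N_WINDOWS + 1)
}
```

Building one dataset per window, rather than forecasting recursively, keeps the comparison across horizons direct: each model is trained and scored on exactly the lead time it is asked to predict. Whether the paper used direct or recursive multi-step forecasting is not stated in the summary.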
Three families of models were examined. The statistical baseline comprised ARIMA and SARIMAX. The ML suite included Random Forest, XGBoost, Support Vector Regression, and a hybrid Adaptive‑Neuro‑Fuzzy Inference System optimized by Genetic Programming (ANFIS‑GP). The DL group featured standard LSTM, bidirectional LSTM (Bi‑LSTM), GRU, and Temporal Convolutional Network (TCN). All models were trained, validated, and tested under identical data splits, with hyper‑parameters tuned via cross‑validation. Performance was measured using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²).
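For reference, the three metrics follow their standard definitions and can be computed as in the sketch below. The function names and the toy arrays are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy example (made-up numbers, purely to exercise the functions).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

RMSE penalizes large misses more heavily than MAE, so reporting both helps distinguish a model with occasional large errors from one with uniformly moderate errors. Because the reported values are computed on normalized data, they are not directly comparable to errors expressed in vehicles per hour or miles per hour.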
Results show a clear divergence between early‑horizon and long‑horizon behavior. For the first three windows (5–15 minutes), ANFIS‑GP achieved the best scores (RMSE = 0.038, MAE = 0.0276, R² = 0.9983), indicating that the fuzzy‑rule‑based structure combined with evolutionary optimization captures rapid, nonlinear fluctuations effectively. However, its error growth accelerates sharply after the 30‑minute mark, making it less suitable for extended forecasts.
Conversely, among DL models, Bi‑LSTM demonstrated the most robust long‑term performance. Its bidirectional architecture processes each input sequence in both the forward and backward directions, which helps it model long‑range temporal dependencies within the observed history. At the 100‑minute horizon (forecasting window 20), Bi‑LSTM recorded RMSE = 0.1863, MAE = 0.0833, and R² = 0.987, outperforming all other methods in the medium‑to‑long range. Standard LSTM and GRU performed reasonably well but exhibited steeper degradation curves.
To quantify degradation, the authors applied a logarithmic transformation to RMSE and MAE and fitted a linear regression against forecasting horizon. The slope of this regression served as a “robustness index”: the flatter the slope, the more slowly the error grows as the horizon extends. Among DL models, Bi‑LSTM displayed the flattest slopes for flow (0.0454 for RMSE, 0.0545 for MAE), confirming its resilience to horizon extension. ANFIS‑GP showed considerably steeper slopes (0.1058 for RMSE, 0.1037 for flow MAE), reflecting faster loss of accuracy beyond short horizons.
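One plausible reading of this robustness index is sketched below: take the natural logarithm of the per-window error and regress it on the window index, so the slope approximates the relative error growth per window. The summary does not say which logarithm base was used or whether the horizon itself was also transformed, so both choices here are assumptions, as are the function name `degradation_slope` and the synthetic error curves.

```python
import numpy as np


def degradation_slope(errors):
    """Slope of ln(error) regressed on window index 1..len(errors).

    Assumption: natural log of the error, untransformed window index.
    A smaller slope means the error grows more slowly with horizon.
    """
    errors = np.asarray(errors, dtype=float)
    if np.any(errors <= 0):
        raise ValueError("errors must be strictly positive to take the log")
    windows = np.arange(1, len(errors) + 1)
    # np.polyfit with degree 1 returns (slope, intercept).
    slope, _intercept = np.polyfit(windows, np.log(errors), 1)
    return float(slope)


# Synthetic error curves (NOT results from the paper): one model starts
# with a lower error but degrades faster than the other.
windows = np.arange(1, 21)
fast_decay_model = 0.04 * np.exp(0.10 * windows)
slow_decay_model = 0.08 * np.exp(0.05 * windows)

slopes = {
    "fast_decay_model": degradation_slope(fast_decay_model),
    "slow_decay_model": degradation_slope(slow_decay_model),
}
```

Under this reading, a slope of about 0.05 corresponds to the error growing by roughly 5% per additional 5-minute window, which is why a model with a higher starting error but a flatter slope can overtake a more accurate short-horizon model at longer lead times.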
The study therefore provides a practical decision framework: when operational needs focus on very short lead times (e.g., ramp metering, immediate signal adjustments), hybrid fuzzy‑evolutionary models like ANFIS‑GP are advantageous. For applications requiring medium to long lead times (e.g., dynamic routing, strategic traffic management), bidirectional recurrent networks such as Bi‑LSTM are preferable.
Finally, the authors argue that future research should explore hybridization of DL and fuzzy/evolutionary techniques, multi‑scale input representations, and the incorporation of exogenous variables (weather, incidents, events). Online learning and real‑time model adaptation are highlighted as essential for deploying these methods within live ITS environments. By systematically measuring performance decay across horizons, the paper offers actionable insights for transportation engineers, data scientists, and policymakers aiming to select or design the most appropriate forecasting tool for their specific temporal requirements.