Analyzing the retraining frequency of global forecasting models: towards more stable forecasting systems
Forecast stability, that is, the consistency of predictions over time, is essential in business settings where sudden shifts in forecasts can disrupt planning and erode trust in predictive systems. Despite its importance, stability is often overlooked in favor of accuracy. In this study, we evaluate the stability of point and probabilistic forecasts across several retraining scenarios using three large forecasting datasets and ten different global forecasting models. To analyze stability in the probabilistic setting, we propose a new model-agnostic, distribution-free, and scale-free metric that measures probabilistic stability: the Scaled Multi-Quantile Change (SMQC). The results show that less frequent retraining not only preserves but often improves forecast stability, challenging the need for frequent retraining. Moreover, the study shows that accuracy and stability are not necessarily conflicting objectives when adopting a global modeling approach. The study promotes a shift toward stability-aware forecasting practices, proposing a new metric to evaluate forecast stability effectively in probabilistic settings, and offering practical guidelines for building more stable and sustainable forecasting systems.
💡 Research Summary
The paper investigates how the frequency of retraining global time‑series forecasting models influences the stability of their predictions, both point‑wise and probabilistic. Forecast stability—defined as the temporal consistency of forecasts—has been largely ignored in favor of pure accuracy, despite its critical importance for business operations where sudden forecast shifts can disrupt planning, increase costs, and erode trust. The authors distinguish two dimensions of stability: vertical stability (consistency of forecasts for the same target date across different origins) and horizontal stability (smoothness across the forecast horizon). They focus on vertical stability because it directly reflects the impact of model updates.
To fill the methodological gap in evaluating probabilistic stability, the authors introduce a novel, model‑agnostic, distribution‑free metric called Scaled Multi‑Quantile Change (SMQC). SMQC normalizes multiple quantile forecasts to a common scale and computes the average absolute change between successive retraining origins, thereby capturing how much the predictive distribution drifts over time without relying on any distributional assumptions.
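The exact formula for SMQC is given in the paper; the description above can be illustrated with a minimal sketch. The function below assumes the metric takes the quantile forecasts for the same target date issued at two successive origins, divides by a per-series scale (here, the mean absolute value of the in-sample observations — an assumption, not necessarily the paper's choice), and averages the absolute change across quantiles:

```python
import numpy as np

def smqc_sketch(q_prev, q_curr, y_insample):
    """Illustrative sketch of a Scaled Multi-Quantile Change metric.

    q_prev, q_curr : arrays of shape (n_quantiles,) with the quantile
        forecasts for the same target date from two successive origins.
    y_insample     : in-sample observations used only to derive a
        series-level scale, making the metric scale-free.
    """
    scale = np.mean(np.abs(y_insample))  # assumed scaling choice
    return np.mean(np.abs(np.asarray(q_curr) - np.asarray(q_prev))) / scale

# Example: nine quantile forecasts (0.1, ..., 0.9) before and after a retrain
q_old = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120])
q_new = np.array([82, 86, 92, 96, 101, 107, 111, 116, 123])
drift = smqc_sketch(q_old, q_new, y_insample=np.array([95, 100, 105, 110]))
```

Averaging in both dimensions (over quantiles and over successive origins) yields a single number per series that grows when the predictive distribution drifts, without assuming any parametric form for that distribution.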
Empirically, the study uses three large, publicly available datasets: the daily subset of the M4 competition (4,227 series), the retail‑demand M5 competition (30,490 SKU‑level series), and the weekly VN1 competition (15,053 product series). Ten global forecasting models are evaluated, comprising five traditional machine‑learning algorithms (e.g., Random Forest, LightGBM) and five state‑of‑the‑art deep‑learning architectures (e.g., Temporal Fusion Transformer, N‑BEATS, DeepAR). For baseline comparison, two classic local models—ETS and ARIMA—are also included.
Five retraining regimes are examined: (1) continuous retraining at every new observation, (2) weekly, (3) monthly, (4) quarterly, and (5) no retraining after the initial fit. The authors assess point‑forecast accuracy with sMAPE and MASE, probabilistic stability with SMQC, and point‑forecast stability with a simple inter‑origin difference.
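For reference, the two accuracy metrics named above have standard definitions, sketched here in their common forms (the paper may use slightly different conventions, e.g. a different seasonal lag `m` for MASE):

```python
import numpy as np

def smape(y, yhat):
    """Symmetric mean absolute percentage error, in percent."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(2.0 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, m=1):
    """Mean absolute scaled error: out-of-sample MAE divided by the
    in-sample MAE of a (seasonal) naive forecast with lag m."""
    y, yhat, y_train = map(lambda a: np.asarray(a, float), (y, yhat, y_train))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(yhat - y)) / scale
```

A MASE below 1 means the model beats the in-sample naive benchmark on average; being scale-free, both metrics can be averaged across the thousands of series in each dataset.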
Key findings are:
- Stability Improves with Less Frequent Retraining – SMQC scores consistently decline as the retraining interval lengthens, indicating that forecasts become more vertically stable when models are updated less often.
- Accuracy Is Largely Unaffected – Point‑forecast accuracy shows no systematic degradation with longer retraining intervals; in several deep‑learning models, accuracy even improves slightly, likely due to reduced over‑fitting and more stable parameter estimates.
- Global Models Outperform Local Counterparts – Across all datasets and retraining scenarios, global models achieve higher accuracy and lower SMQC values than ETS and ARIMA, demonstrating that shared‑parameter learning confers both predictive power and robustness to updates.
- Deep‑Learning Global Models Benefit Most – Models that can capture complex cross‑series patterns (e.g., Temporal Fusion Transformer) exhibit the greatest stability gains when retraining is infrequent, suggesting that their learned representations are less sensitive to incremental data changes.
The authors argue that the conventional practice of retraining models at every possible opportunity may be unnecessary, and even counter‑productive, for many operational settings. Reducing retraining frequency can lower computational costs, simplify model‑deployment pipelines, and deliver more consistent forecasts without sacrificing accuracy.
Practical guidelines derived from the study include:
- Choose a retraining cadence aligned with the volatility of the underlying data rather than a fixed “as‑soon‑as‑new‑data‑arrives” rule.
- Monitor SMQC (or similar stability metrics) alongside traditional accuracy measures to detect undesirable forecast drift.
- Prefer global modeling approaches when multiple related series are available, as they naturally mitigate instability caused by frequent parameter re‑estimation.
In conclusion, the paper provides the first comprehensive analysis of forecast stability for global models, introduces a robust metric for probabilistic stability, and demonstrates that less frequent retraining can yield more stable—and sometimes more accurate—forecasting systems. This work encourages the forecasting community to adopt stability‑aware evaluation practices and offers actionable insights for practitioners seeking sustainable, trustworthy predictive pipelines.