Cross-Country Learning for National Infectious Disease Forecasting Using European Data
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Accurate forecasting of infectious disease incidence is critical for public health planning and timely intervention. While most data-driven forecasting approaches rely primarily on historical data from a single country, such data are often limited in length and variability, restricting the performance of machine learning (ML) models. In this work, we investigate a cross-country learning approach for infectious disease forecasting, in which a single model is trained on time series data from multiple countries and evaluated on a country of interest. This setting enables the model to exploit shared epidemic dynamics across countries and to benefit from an enlarged training set. We examine this approach through a case study on COVID-19 case forecasting in Cyprus, using surveillance data from European countries. We evaluate multiple ML models and analyse the impact of the lookback window length and cross-country 'data augmentation' on multi-step forecasting performance. Our results show that incorporating data from other countries can lead to consistent improvements over models trained solely on national data. Although the empirical focus is on Cyprus and COVID-19, the proposed framework and findings are applicable to infectious disease forecasting more broadly, particularly in settings with limited national historical data.


💡 Research Summary

The paper introduces a cross‑country learning framework for national infectious‑disease forecasting, using the COVID‑19 daily case counts of Cyprus as a test case while leveraging surveillance data from 46 European countries. The authors argue that single‑country time‑series are often too short and lack variability, limiting the performance of data‑driven models. By pooling time‑series from multiple countries into a single training set, the model can capture shared epidemic dynamics and benefit from a larger, more diverse dataset.

Data were sourced from the Oxford COVID‑19 Government Response Tracker and the Cypriot Ministry of Health, covering 01‑Jan‑2020 to 31‑Dec‑2022. Missing reporting days were interpolated, counts were log‑transformed, and per‑country standardisation was applied for the Transformer model. Countries with more than one‑ninth of reporting days missing were excluded; the remaining countries were assigned to training pools based on Spearman correlation with Cyprus — strongly correlated (|ρ| ≥ 0.3), weakly correlated (|ρ| < 0.3) — alongside a pool containing all countries.
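The preprocessing steps above can be sketched in a few lines (a minimal illustration using pandas and SciPy; the function names and the exact interpolation and grouping details are our own assumptions, not the authors' implementation):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def preprocess_series(s: pd.Series) -> pd.Series:
    """Fill missing reporting days by linear interpolation, then log-transform."""
    s = s.asfreq("D")                   # enforce a daily index; gaps become NaN
    s = s.interpolate(method="linear")  # fill missing reporting days
    return np.log1p(s)                  # log(1 + x) keeps zero counts finite

def group_by_correlation(series: dict, target: str, thresh: float = 0.3):
    """Split countries into high/low Spearman-correlation groups w.r.t. the target."""
    ref = series[target]
    high, low = [], []
    for name, s in series.items():
        if name == target:
            continue
        rho, _ = spearmanr(ref.values, s.values)
        (high if abs(rho) >= thresh else low).append(name)
    return high, low
```

The "all countries" pool in the paper would then simply be `high + low` plus the target country itself.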

The forecasting setup combined a look‑back window L of 7, 14, or 21 days with a fixed 7‑day‑ahead prediction horizon h. Input‑output pairs were constructed identically for each country, enabling a unified supervised learning problem. The study compared three families of models: (i) simple baselines (Naïve, Last‑Week‑Average, Seasonal Naïve, ARIMA), (ii) a gradient‑boosted tree ensemble (XGBoost), and (iii) a self‑attention based deep network (Transformer). XGBoost and Transformer were trained and evaluated over 15 random repetitions to capture variability.
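Constructing identical input‑output pairs per country and stacking them into one training set can be sketched as follows (illustrative code; the array layout and function names are assumptions, not the authors' code):

```python
import numpy as np

def make_windows(series: np.ndarray, lookback: int, horizon: int):
    """Build (input, target) pairs: each input is `lookback` consecutive values,
    each target the following `horizon` values."""
    X, y = [], []
    for t in range(len(series) - lookback - horizon + 1):
        X.append(series[t : t + lookback])
        y.append(series[t + lookback : t + lookback + horizon])
    return np.array(X), np.array(y)

def pool_countries(country_series: dict, lookback: int, horizon: int):
    """Apply the same windowing to every country, then stack into one dataset."""
    Xs, ys = [], []
    for s in country_series.values():
        X, y = make_windows(s, lookback, horizon)
        Xs.append(X)
        ys.append(y)
    return np.concatenate(Xs), np.concatenate(ys)
```

With L = 7 and h = 7, a series of length T yields T − 13 training pairs per country, so pooling N countries multiplies the training set size roughly N‑fold.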

Performance was measured with Mean Absolute Percentage Error (MAPE) and Mean Absolute Error (MAE) on a per‑day basis, as well as aggregated over the 7‑day horizon (MAPE 7‑day aggreg., MAE 7‑day aggreg.). Three train‑test splits were designed to test the models during different pandemic phases: (1) the first half of the largest wave plus the remainder, (2) low‑activity periods only, and (3) the later part of the series.
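The two error metrics are straightforward to implement (a sketch; the summary does not specify how the 7‑day aggregation is computed, so the sum‑over‑horizon variant below is an assumption):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def mape(y_true, y_pred, eps=1e-8):
    """Mean Absolute Percentage Error (in %), guarded against zero counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / np.maximum(np.abs(y_true), eps))) * 100)

def horizon_aggregate(y_true, y_pred, metric):
    """Aggregate over the 7-day horizon: inputs shaped (n_windows, 7).
    Here the per-window 7-day totals are compared (one common convention)."""
    return metric(np.sum(y_true, axis=1), np.sum(y_pred, axis=1))
```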

Results consistently showed that models trained on the pooled “All Countries” dataset outperformed those trained only on Cyprus data. For XGBoost with a 7‑day look‑back, the national‑only model achieved MAE = 431 and MAPE = 27.5 %, whereas the cross‑country model achieved MAE = 406 and MAPE = 28.3 % (a slightly higher percentage error, but a lower absolute error). Across all look‑back lengths, the cross‑country approach reduced MAE by roughly 5–10 % and improved aggregated MAPE. The Transformer exhibited higher variance when trained on a single country but showed comparable or better performance when trained on all countries, indicating that its data‑hungry nature benefits from larger, heterogeneous datasets.
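The pooled‑training comparison can be illustrated with a toy stand‑in model (ordinary least squares in place of XGBoost or a Transformer; the synthetic data and all names below are purely illustrative, not the paper's experiment):

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear forecaster, a stand-in for the paper's ML models."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ W

def train_pooled(country_X, country_y):
    """Cross-country training: stack every country's windows, fit one model.
    National-only training would pass a dict with a single country instead."""
    X = np.vstack(list(country_X.values()))
    y = np.concatenate(list(country_y.values()))
    return fit_linear(X, y)
```

When countries share the same underlying dynamics, the pooled fit sees more samples of that shared signal, which is the intuition behind the reported MAE reductions.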

Sensitivity analysis of the look‑back window revealed that extending beyond 7 days did not guarantee better performance; in several cases, longer windows led to over‑fitting and higher errors, reflecting the rapid, non‑stationary dynamics of pandemic data.

The authors conclude that cross‑country learning acts as an effective data‑augmentation strategy, especially valuable for nations with limited historical records or during the early stages of an outbreak. They acknowledge potential drawbacks, such as heterogeneity in testing policies, reporting delays, and intervention measures across countries, which could introduce bias. Future work is suggested to incorporate explicit country identifiers, employ meta‑learning or multi‑task architectures to disentangle shared versus country‑specific patterns, and to integrate exogenous signals (mobility, weather, policy indices) for richer modeling. Overall, the study demonstrates that a simple pooled‑training approach can yield tangible gains in infectious‑disease forecasting accuracy without requiring complex multi‑task or graph‑based models.
