Forecasting mortality using Google trend
This paper investigates a mortality model for a developed country, taking the United States, which has the largest economy in the world, as a representative case. Early surveillance of the causes of death is critical because it allows preventive steps to be prepared against serious diseases such as dengue fever. Studies have reported that some search queries on Google Trends, especially disease-related terms, are essential for this purpose. To this end, we use Google Trends search terms that name either a main cause of death or an extended or more general term for it, and decode the mortality-related terms with the Wiener Cascade Model. Using the time series and wavelet scalograms of the search terms, the patterns of search queries are categorized into different levels of periodicity. The results include the decoding trend, the feature importance, and the accuracy of the decoding patterns. Three predictor scenarios are considered: all 19 features, the ten most periodic predictors, and the ten predictors with the highest weights. All search queries span December 2013 to December 2018. The results show that search terms with both a higher weight and an annual periodic pattern contribute more to forecasting the word “die”; however, only predictors with a higher weight are valuable for forecasting the word “death”.
💡 Research Summary
The paper investigates whether publicly available Google Trends search volumes for disease‑related terms can be used to forecast mortality‑related language (“die” and “death”) in the United States. The authors collected weekly Google Trends indices from December 2013 to December 2018 for 19 predictor terms that represent major causes of death identified by the World Health Organization (e.g., “AIDS”, “Alzheimer”, “Heart Disease”, “Cancer”, “Flu”, “Sick”) and for two dependent variables, the search terms “die” and “death”.
Each predictor time series was subjected to a Morlet wavelet transform to compute spectral power in the 0.5–4 cycles per year band (corresponding to periods between two years and three months). This analysis allowed the authors to rank the predictors by the strength of their annual periodicity.
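The periodicity ranking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the Morlet parameter `w0 = 6`, the 0.1 cycles-per-year frequency grid, the ±0.2 window around one cycle per year, and the "annual share" score are all assumptions introduced here, since the summary does not say how spectral power was turned into a ranking.

```python
import numpy as np

WEEKS_PER_YEAR = 52                # assumed: one Google Trends value per week
BAND = np.linspace(0.5, 4.0, 36)   # assumed grid over 0.5-4 cycles per year


def morlet_power(x, freqs=BAND, w0=6.0, dt=1.0 / WEEKS_PER_YEAR):
    """Time-averaged Morlet wavelet power of x at each frequency (cycles/year)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the constant offset
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        s = w0 / (2.0 * np.pi * f)         # scale in years (approximate period match)
        half = int(np.ceil(4.0 * s / dt))  # truncate the wavelet at +/- 4 scales
        u = np.arange(-half, half + 1) * dt / s
        wavelet = np.pi ** -0.25 * np.exp(1j * w0 * u - 0.5 * u ** 2)
        # dt / s scaling: a sinusoid of fixed amplitude gives roughly the
        # same power at every scale (assumed normalisation choice).
        kernel = np.conj(wavelet)[::-1] * (dt / s)
        full = np.convolve(x, kernel, mode="full")
        coeffs = full[half:half + len(x)]  # keep coefficients aligned with x
        power[i] = np.mean(np.abs(coeffs) ** 2)
    return power


def annual_share(x, freqs=BAND):
    """Fraction of band power within 0.2 cycles/year of the annual cycle."""
    p = morlet_power(x, freqs)
    near_annual = np.abs(freqs - 1.0) <= 0.2
    return p[near_annual].sum() / p.sum()
```

Sorting the predictors by `annual_share` in descending order would give one possible periodicity ranking. With only five years of weekly data, the lowest frequencies in the band are strongly affected by edge effects, so the scores are best read as relative rather than absolute.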
For forecasting, a Wiener‑Cascade model was employed. The linear component is a multi‑input, single‑output filter with a 52‑week (one‑year) lag window, whose coefficients are estimated by ridge regression (regularization constant λ). The linear output is then passed through a static third‑order polynomial non‑linearity, yielding the final prediction. Model performance was assessed using five‑fold cross‑validation, with Spearman’s rank correlation (ρ) and mean‑squared error (MSE) as the primary metrics.
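A minimal sketch of such a Wiener cascade, assuming NumPy only, is given here to make the two stages concrete. The function names, the mean-centring, the default `lam=1.0`, and the lag-major column layout are assumptions of this sketch; the summary does not report the regularization constant or implementation details.

```python
import numpy as np


def lag_matrix(X, n_lags):
    """One row per week holding that week and the n_lags - 1 weeks before it."""
    X = np.asarray(X, dtype=float)
    T = X.shape[0]
    rows = [X[t - n_lags + 1:t + 1].ravel() for t in range(n_lags - 1, T)]
    return np.asarray(rows)  # shape (T - n_lags + 1, n_lags * n_predictors)


def fit_wiener_cascade(X, y, n_lags=52, lam=1.0, degree=3):
    """Ridge-regularised linear filter followed by a static polynomial."""
    Z = lag_matrix(X, n_lags)
    y = np.asarray(y, dtype=float)[n_lags - 1:]  # align targets with the rows of Z
    mu_z, mu_y = Z.mean(axis=0), y.mean()
    Zc = Z - mu_z
    # Stage 1: ridge solution w = (Z'Z + lam * I)^-1 Z'(y - mean(y))
    gram = Zc.T @ Zc + lam * np.eye(Zc.shape[1])
    w = np.linalg.solve(gram, Zc.T @ (y - mu_y))
    linear_out = Zc @ w + mu_y
    # Stage 2: static non-linearity fitted from the linear output to the target
    poly = np.polyfit(linear_out, y, degree)
    return {"w": w, "mu_z": mu_z, "mu_y": mu_y, "poly": poly, "n_lags": n_lags}


def predict_wiener_cascade(model, X):
    """Apply the linear filter, then the fitted polynomial."""
    Z = lag_matrix(X, model["n_lags"])
    linear_out = (Z - model["mu_z"]) @ model["w"] + model["mu_y"]
    return np.polyval(model["poly"], linear_out)
```

With 19 predictors and a 52-week window the linear stage has 988 coefficients but only about 200 usable weeks of data, which is why some form of regularization such as the ridge penalty is needed for the solution to be well defined.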
Three feature‑selection strategies were compared: (1) using all 19 predictors, (2) using the ten predictors with the strongest annual periodicity, and (3) using the ten predictors that received the highest weights in the full‑model solution. Results showed that the “die” term is more predictable than “death”. With all 19 predictors, “die” achieved ρ = 0.49 (p < 0.001) and MSE = 23.24, whereas “death” reached ρ = 0.25 (p < 0.001) and MSE = 52.61. When limited to the ten most periodic predictors, “die” retained a respectable ρ = 0.44 (p < 0.001) but “death” dropped to a non‑significant ρ = 0.07 (p = 0.32). Using the ten highest‑weight predictors restored “die” performance (ρ = 0.48, MSE = 25.50) and improved “death” to ρ = 0.31 (p < 0.001) with a lower MSE (47.93) than the full‑model case.
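The two reported metrics and the five-fold procedure can be sketched as follows. This is an illustration under stated assumptions: the summary does not say how the folds were formed, so contiguous blocks of weeks are assumed here, and `fit` and `predict` are hypothetical callables that operate on an already lag-expanded design matrix (one row per week).

```python
import numpy as np


def average_ranks(a):
    """1-based ranks, with tied values sharing the average of their ranks."""
    a = np.asarray(a, dtype=float)
    _, inverse, counts = np.unique(a, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)        # last rank occupied by each distinct value
    starts = ends - counts + 1      # first rank occupied by each distinct value
    return ((starts + ends) / 2.0)[inverse]


def spearman_rho(a, b):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def mse(y_true, y_pred):
    """Mean-squared error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


def contiguous_folds(n_samples, n_folds=5):
    """Yield (train_idx, test_idx); each test fold is a contiguous block."""
    idx = np.arange(n_samples)
    for test_idx in np.array_split(idx, n_folds):
        yield np.setdiff1d(idx, test_idx), test_idx


def cross_validate(fit, predict, Z, y, n_folds=5):
    """Average Spearman's rho and MSE over the held-out folds."""
    rhos, errors = [], []
    for train_idx, test_idx in contiguous_folds(len(y), n_folds):
        model = fit(Z[train_idx], y[train_idx])
        y_hat = predict(model, Z[test_idx])
        rhos.append(spearman_rho(y[test_idx], y_hat))
        errors.append(mse(y[test_idx], y_hat))
    return float(np.mean(rhos)), float(np.mean(errors))
```

Because Google Trends indices are integers, ties in the target are common, which is why the rank helper averages tied ranks. Note also that rows near a fold boundary share lagged weeks with the neighbouring fold, so some leakage between training and test data is possible under this split.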
Weight analysis revealed that “Cancer”, “Diabetes”, and “Heart Disease” contributed most to forecasting “die”, while “Cancer”, “Sick”, and “Diabetes” were most influential for “death”. Notably, predictors with strong annual cycles but low model weights contributed little to the “death” forecast, underscoring that weight magnitude, rather than periodicity alone, drives performance for that target.
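One way to turn the fitted filter into per-predictor weights, and then into the "ten highest-weight" scenario, is sketched below. The aggregation rule is an assumption: the summary does not state how the 52 lag coefficients of each predictor were combined, so this sketch sums their absolute values and normalises the totals to one. The lag-major layout of `w` and both function names are likewise hypothetical.

```python
import numpy as np


def predictor_importance(w, n_lags, names):
    """Share of total absolute filter weight carried by each predictor.

    Assumes w is laid out lag-major: all predictors at the oldest lag first,
    then all predictors at the next lag, and so on.
    """
    W = np.abs(np.asarray(w, dtype=float)).reshape(n_lags, len(names))
    totals = W.sum(axis=0)          # sum over lags, one total per predictor
    shares = totals / totals.sum()
    return {name: float(share) for name, share in zip(names, shares)}


def top_k(importance, k=10):
    """Names of the k predictors with the largest importance share."""
    return sorted(importance, key=importance.get, reverse=True)[:k]
```

Refitting the model on only the columns returned by `top_k` would correspond to the third scenario. Because the selection uses weights from the full-model solution, it should be made on training data only if the comparison with the other scenarios is to remain fair.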
The authors conclude that Google Trends data contain actionable signals for mortality surveillance. Periodic disease‑related searches are especially useful for predicting generic mortality language (“die”), whereas the magnitude of each predictor’s contribution (as captured by model weights) is more critical for the term “death”. This distinction suggests that a future mortality‑monitoring system should tailor feature‑selection criteria to the specific linguistic endpoint of interest.
Limitations include the aggregation of the entire United States into a single time series (masking regional heterogeneity), the lack of direct validation against official mortality statistics (raising concerns about confounding influences such as media coverage or policy changes), and the reliance on a Wiener‑Cascade architecture that assumes a static non‑linearity, potentially missing more complex temporal interactions. Future work should explore regional disaggregation, incorporate more flexible deep‑learning time‑series models (e.g., LSTM, Transformer), and establish quantitative links between predicted search‑term trends and actual death counts to enhance external validity.