Ensemble Learned Vaccination Uptake Prediction using Web Search Queries

Niels Dalum Hansen
University of Copenhagen / IBM Denmark
nhansen@di.ku.dk

Christina Lioma
University of Copenhagen
c.lioma@di.ku.dk

Kåre Mølbak
Statens Serum Institut
KRM@ssi.dk

ABSTRACT

We present a method that uses ensemble learning to combine clinical and web-mined time-series data in order to predict future vaccination uptake. The clinical data come from official vaccination registries, and the web data are query frequencies collected from Google Trends. Experiments with official vaccine records show that our method predicts vaccination uptake effectively (4.7 Root Mean Squared Error). Whereas performance is best when combining clinical and web data, using solely web data yields comparable performance. To our knowledge, this is the first study to predict vaccination uptake using web data (with and without clinical data).

1. INTRODUCTION AND RELATED WORK

Predicting public health events, e.g. how many people may get vaccinated in the near future, can reduce the reaction time of public health professionals, resulting in more efficient services and improved public health. Traditionally, public health event prediction relied on clinical data (e.g. microbiological results or patient registries) collected by designated bodies. In the last decade, however, non-clinical web data (e.g. search engine queries or microblog messages) has been shown to be useful for predicting public health events. Clinical and web data are complementary sources of evidence: whereas clinical data contributes expert and curated information to the prediction, web data contributes near real-time information on a large scale about e.g. symptoms or health concerns that may go undetected or unreported by the official clinical channels.
We present a method for predicting vaccination uptake by combining clinical and web data using ensemble learning. Combining such clinical and web search data for vaccination uptake prediction is novel. So far, research on vaccination uptake has focused on the effect of physician recommendations on vaccination uptake [4]; how combined sources of information (e.g. physician, television, friends) influence people's decisions about vaccination [5]; and the effects of media coverage on vaccination uptake with respect to influenza vaccination [11], HPV vaccination [8], and MMR vaccination [17]. To our knowledge, our study is the first to predict vaccination uptake using web data (with and without clinical data).

Web and/or clinical data have been used before for other types of health event predictions, e.g. influenza activity [6, 12, 13, 14, 16, 19], dengue fever [1] and cholera [3]. How the different types of data should be handled has evolved from using a unified model for both web and clinical data [9, 18], to using ensemble methods that model clinical and web data separately and then combine the outputs [15].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM'16, October 24-28, 2016, Indianapolis, IN, USA
© 2016 ACM. ISBN 978-1-4503-4073-1/16/10. $15.00
DOI: http://dx.doi.org/10.1145/2983323.2983882
When web search query frequencies are used for prediction [15, 16, 18], a single linear model is used to combine the query frequencies into a prediction. Methods using query frequencies select queries either by (i) timely correlation between query search frequency and the health event [6, 15, 16, 18], or by (ii) expert selection of queries [1, 14, 19]. Both approaches have disadvantages. Approach (i) relies on calculating the correlation between the health event time-series and all queries, which is computationally expensive. It also assumes that historic correlation equals predictive power in the future, which may not always be the case. Approach (ii) relies on human experts, which is costly and does not scale well. In this work we propose a third approach: we select queries based on web descriptions of the health event, in our case of the vaccine in question, and we use an ensemble learning approach, specifically stacking, to predict vaccination uptake.

2. ENSEMBLE LEARNING PREDICTION

Vaccination uptake prediction with time-series data can be formulated as: $\hat{E}(t) \approx E(t-1)$, where $\hat{E}(t)$ is the predicted vaccination uptake at time $t$, and $E(t-1)$ is the observed vaccination uptake at time $t-1$. We compute $\hat{E}(t)$ using ensemble learning by combining separate predictions on vaccination uptake based on clinical and web data into one prediction. Ensemble learning combines predictions from an ensemble of level-0 models into one prediction using a level-1 meta model. We use an ensemble method called stacking. First, all level-0 models are trained. Then, a level-1 model is trained to make a final prediction using all the predictions of the level-0 models as input. We experiment with three different types of level-1 models: a linear model, support vector regression (SVR) with a linear kernel, and SVR with a Gaussian kernel.
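The stacking step can be illustrated with a minimal sketch, assuming two level-0 predictions per month (one clinical, one web) and a linear level-1 model fitted by ordinary least squares. The function names, the small Gaussian-elimination solver and the numbers are hypothetical and for illustration only; the experiments in this paper were fitted with R packages, not with this code.

```python
"""Stacking sketch: combine two level-0 predictions with a linear level-1 model."""


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations update both sides at once.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: level-0 predictions are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_level1_linear(clinical_preds, web_preds, observed):
    """Least-squares fit of E(t) = mu + beta1 * Ec(t) + beta2 * Ew(t)."""
    rows = [[1.0, c, w] for c, w in zip(clinical_preds, web_preds)]
    # Normal equations: (X^T X) coef = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict_level1(coef, clinical_pred, web_pred):
    """Apply the fitted level-1 model to one pair of level-0 predictions."""
    mu, beta1, beta2 = coef
    return mu + beta1 * clinical_pred + beta2 * web_pred


# Hypothetical level-0 outputs (percent uptake); not data from the paper.
clinical = [80.0, 82.0, 79.0, 85.0, 90.0]
web = [70.0, 75.0, 77.0, 72.0, 80.0]
# Synthetic target built as 1 + 0.6 * clinical + 0.4 * web, so OLS can recover it.
uptake = [1.0 + 0.6 * c + 0.4 * w for c, w in zip(clinical, web)]

coef = fit_level1_linear(clinical, web, uptake)
print([round(v, 3) for v in coef])
print(round(predict_level1(coef, 84.0, 76.0), 3))
```

In a real run the level-0 predictions passed to the level-1 fit would be out-of-sample predictions from earlier months, so that the meta model does not learn from values the level-0 models had already seen.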
Both our clinical and web data are time-series, i.e. each data point has a temporal reference.

2.1 Level-1 models

Stacking with linear model. We define a linear model with two explanatory variables: $\hat{E}(t) = \mu + \beta_1 \hat{E}_c(t) + \beta_2 \hat{E}_w(t)$, where $\hat{E}_c(t)$ is the prediction based on clinical data at time $t$, $\hat{E}_w(t)$ is the prediction based on web data at time $t$, and $\mu$, $\beta_1$ and $\beta_2$ denote the coefficients that need to be optimized. We use ordinary least squares to find the coefficients that minimize:

$$\min_{\mu,\beta_1,\beta_2} \sum_t \left( E(t) - \mu - \beta_1 \hat{E}_c(t) - \beta_2 \hat{E}_w(t) \right)^2 .$$

Stacking with SVR. SVR solves the same problem as the linear model presented above, but with the possibility of using kernels to transform the input into another feature space. In addition, $\mu$, $\beta_1$ and $\beta_2$ are selected to minimize the following:

$$\min_{\mu,\beta_1,\beta_2} \sum_t V\left( E(t) - \mu - \beta_1 \hat{E}_c(t) - \beta_2 \hat{E}_w(t) \right) + \frac{\lambda}{2} \left( \mu^2 + \beta_1^2 + \beta_2^2 \right),$$

where $\lambda$ is a hyperparameter controlling the penalty for large coefficients, and $V(r)$ is defined as $0$ if $|r| < \epsilon$ and otherwise $|r| - \epsilon$. The parameter $\epsilon$ controls how precise the prediction has to be before it is treated as correct. We experiment with an SVR with a linear kernel and with a Gaussian kernel defined as: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, where $\gamma$ is a hyperparameter.

2.2 Level-0 models

Prediction with clinical data. As level-0 models we use three well-known time-series methods: autoregressive (AR) models [18, 9], ARIMA and Holt-Winters (HW). AR models estimate $\hat{E}(t)$ as:

$$\hat{E}(t) = \mu + \sum_{i=1}^{m} \beta_i E(t-i),$$

where $m$ is the number of autoregressive terms, $\mu$ is the intercept, and the $\beta$s control the weight that each past observation has on the prediction. AR models assume that future values of $E$ can be predicted by a linear combination of the $m$ most recently observed values of $E$.
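As a small illustration of the AR formula above, the sketch below computes a one-step forecast from already-estimated coefficients. The function name, the coefficients and the history values are hypothetical; estimating $\mu$ and the $\beta$s is a separate step that is not shown here.

```python
def ar_predict(history, mu, betas):
    """One-step AR forecast: mu + sum over i of betas[i-1] * E(t - i).

    `history` holds the observed values in time order, so history[-1] is
    E(t-1), history[-2] is E(t-2), and so on.
    """
    m = len(betas)
    if len(history) < m:
        raise ValueError("need at least m past observations")
    lagged = history[::-1][:m]  # E(t-1), E(t-2), ..., E(t-m)
    return mu + sum(b * e for b, e in zip(betas, lagged))


# Hypothetical monthly uptake values and coefficients, for illustration only.
observed = [60.0, 62.0, 64.0]
print(ar_predict(observed, mu=1.0, betas=[0.5, 0.3]))

# With mu = 0 and a single coefficient of 1, the AR forecast is simply
# the previous observation E(t-1).
print(ar_predict(observed, mu=0.0, betas=[1.0]))
```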
With enough autoregressive terms AR models can handle seasonal changes, but not general upwards or downwards trends.

An extension of the AR models are the ARIMA (AutoRegressive Integrated Moving Average) models. In addition to the autoregressive terms, these models also include a moving average, which is a weighted sum of the $q$ most recent forecasting errors. Let $m$ denote the number of autoregressive terms and $q$ the number of moving-average terms; then:

$$\hat{E}(t) = \mu + \sum_{i=1}^{m} \beta_i E(t-i) + \sum_{j=1}^{q} \phi_j \epsilon_{t-j} + \epsilon_t,$$

where $\epsilon_t = E(t) - \hat{E}(t)$. To handle trend, the original signal $E$ can be differenced one or more times [2].

HW forecasting is defined by three recursive equations controlling level, trend and seasonality. HW can forecast time-series with both trend and seasonal changes. Each equation is defined as a weighted sum in which the weight of historic observations decreases exponentially with time. HW forecasting with level, trend and seasonality is recursively defined as:

$$\begin{aligned}
\text{level:} \quad a_t &= \alpha \left( E(t) - s_{t-l} \right) + (1-\alpha)\left( a_{t-1} + b_{t-1} \right) \\
\text{trend:} \quad b_t &= \beta \left( a_t - a_{t-1} \right) + (1-\beta)\, b_{t-1} \\
\text{seasonality:} \quad s_t &= \gamma \left( E(t) - a_t \right) + (1-\gamma)\, s_{t-l}
\end{aligned} \tag{1}$$

where $l$ is the length of the season and $\alpha$, $\beta$ and $\gamma$ are the smoothing parameters which control the influence of the historic level, trend and seasonality. Predictions are made by combining level, trend and seasonality: $\hat{E}(t) = a_{t-1} + b_{t-1} + s_{t-l+1}$.

Prediction with web data. As level-0 models we use a linear model, bagging and weighted majority. Our web data consists of time-stamped query frequencies (described in Section 3).

Linear model. Given a collection of $n$ query frequency time-series, denoted $Q$, we define a simple linear model as: $\hat{E}(t) = \mu + \sum_{i=1}^{n} \alpha_i Q_i(t)$, where $\mu$ and $\alpha$ are coefficients to be estimated.
Such a model can be fitted using any of several methods, the most common being ordinary least squares. Another approach is to use LASSO regularization, which is commonly used for making predictions using query frequencies [15, 16, 18]. This approach adds a penalty to the optimization, namely that the sum of the absolute values of the coefficients should also be minimized. The weight of this sum is controlled by the hyperparameter $\lambda$. This approach can be used to avoid overfitting and to reduce the coefficients of non-informative features to zero and thereby induce a sparse model. This is a useful property in this context because the collection of queries might contain non-informative terms.

Bagging. With bagging, we consider the average of the predictions made on subsets of the training data. This helps to reduce variance and overfitting. We generate subsets of the training data by uniformly sampling with replacement $n$ datasets of size $m$. For each dataset a linear model, as defined above, is fitted using LASSO regularization, where the parameter $\lambda$ is found using 3-fold cross-validation. The prediction of the ensemble is the average of the $n$ predictions.

Weighted Majority. We extend the bagging approach to a boosting approach using a weighted majority (WM) algorithm [10]. The WM algorithm works by combining predictions from a collection of models using a weighted average. Each model is associated with its own weight related to its previous predictive performance. If the overall prediction is wrong by more than a constant $\epsilon$, the weights are updated. The updating works as follows: if the individual prediction of a model has an error $> \epsilon$, a new weight is calculated as $w_i = w_i \exp(-\eta)$, where $w_i$ is the weight for model $i$ and $\eta$ is a hyperparameter controlling the penalty for making wrong predictions.
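The weighted majority update can be sketched as follows. This is a minimal reading of the description above, not the authors' implementation: it assumes that "error" means absolute error, that the combined prediction is a weight-normalised average, and that weights stay unchanged when the combined error is at most $\epsilon$. The function names and example numbers are hypothetical; the defaults $\eta = 5$ and $\epsilon = 2$ are the values reported in Section 3.

```python
import math


def wm_predict(weights, predictions):
    """Weight-normalised average of the individual model predictions."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


def wm_update(weights, predictions, observed, eta=5.0, epsilon=2.0):
    """Return the weights to use after the true value has been observed."""
    combined = wm_predict(weights, predictions)
    if abs(combined - observed) <= epsilon:
        return list(weights)  # ensemble was close enough: no update
    decay = math.exp(-eta)
    # Penalise only the models whose own prediction missed by more than epsilon.
    return [
        w * decay if abs(p - observed) > epsilon else w
        for w, p in zip(weights, predictions)
    ]


# Hypothetical predictions from three models for one month.
weights = [1.0, 1.0, 1.0]
predictions = [80.0, 90.0, 100.0]
print(wm_predict(weights, predictions))

# Suppose the observed uptake turned out to be 81: models 2 and 3 are penalised.
weights = wm_update(weights, predictions, observed=81.0)
print(weights)
print(wm_predict(weights, predictions))
```

After the update the combined prediction moves towards the model that was accurate, which is the behaviour the weighting is meant to produce.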
Our collection of models is identical to the models used for the bagging approach described above.

3. EXPERIMENTAL EVALUATION

Data.¹ We evaluate the effectiveness of our approach in predicting vaccination uptake in Denmark for all official children vaccines: DiTeKiPol-1, DiTeKiPol-2, DiTeKiPol-3, DiTeKiPol-4, PCV-1, PCV-2, PCV-3, MMR-1, MMR-2(4), MMR-2(12), HPV-1, HPV-2 and HPV-3. We use as clinical data the actual vaccination uptake recorded by the country's official body, the Statens Serum Institut. Specifically, the vaccination uptake is the total number of vaccines given in a month divided by the number of people expected to be vaccinated that month (based on the size of the monthly birth cohorts published by Statistics Denmark). We use as web data web search queries that are related to each vaccine. We generate these queries from descriptions of each vaccine in: www.ssi.dk, www.patienthåndbogen.dk, and www.min.medicin.dk (authoritative medical health portals). We remove stopwords and collect terms that occur in at least two different descriptions of each vaccine. We treat each term as a query (i.e. we use only single term queries) and we submit it to Google Trends using Denmark as the geographical region and with the time period set to January 2011 - September 2015 (only limited coverage of Denmark is available prior to 2011).

| Vaccine | Terms in Danish (English) |
|---|---|
| MMR | levende (alive), mæslinger (measles), vaccine, vaccinen (the vaccine), udbrud (outbreak), alvorlige (serious), fåresyge (mumps), måneders (months), undersøgelser (examinations), beskyttelse (protection), voksne (adults), gravid (pregnant), kombineret (combined), dosis, hunde (dogs), alderen (the age), hjernebetændelse (inflammation of the brain), lungebetændelse (pneumonia), gives (is given), mfr (mmr), røde (red) |
| DiTeKiPol | mæslinger (measles), vaccinen (the vaccine), alvorlige (serious), beskyttelse (protection), kombineret (combined), vaccination, indeholder (contains), type, beskytter (protects), sygdomme (illness), meningitis, forårsaget (caused), dræbte (killed), b, kighoste (whooping cough), vare (lasts), polio, difteri (diphtheria), mindst (least), stivkrampe (tetanus) |
| PCV | vaccinen (the vaccine), alvorlige (serious), alderen (the age), lungebetændelse (pneumonia), vaccination, infektioner (infections), sygdomme (illness), forebygger (prevents), meningitis, forårsaget (caused), antal (number), blodforgiftning (blood poisoning) |
| HPV | beskyttelse (protection), gives (is given), vaccination, tilbuddet (the offer), kondylomer (condyloma), doser (doses), kønsvorter (genital warts), tilbydes (is offered), piger (girls), livmoderhalskræft (cervical cancer), forventes (is expected), indeholder (contains), januar (january), langvarig (long term), indført (introduced), tilbud (offer), type, human, beskytter (protects), effekten (the effect), skyldes (caused by), hpv, pigerne (the girls) |

Table 1: Our 58 queries.

¹ All our data is freely available at: https://sid.erda.dk/share_redirect/c7j6MdrscL
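The term-selection step (remove stopwords, keep terms that occur in at least two different descriptions of a vaccine) can be sketched as below. The tokenisation rule, the tiny stopword list and the three example descriptions are assumptions made for illustration; the source does not specify how the descriptions were tokenised or which stopword list was used.

```python
import re
from collections import Counter


def select_query_terms(descriptions, stopwords, min_sources=2):
    """Return terms found in at least `min_sources` different descriptions."""
    stopwords = set(stopwords)
    counts = Counter()
    for text in descriptions:
        # A set ensures each description counts a given term at most once.
        terms = set(re.findall(r"\w+", text.lower()))
        counts.update(terms - stopwords)
    return sorted(t for t, c in counts.items() if c >= min_sources)


# Hypothetical description snippets for one vaccine, one per web source.
descriptions = [
    "Vaccinen beskytter mod mæslinger og fåresyge",
    "MFR vaccinen gives mod mæslinger",
    "Vaccination mod røde hunde",
]
stopwords = {"mod", "og"}  # illustrative only, not a full Danish stopword list

print(select_query_terms(descriptions, stopwords))
```

Each selected term would then be submitted as a single-term query; terms without enough search volume are dropped at that stage rather than here.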
Only 58 out of 85 queries had enough coverage in Google Trends to return a result. We use these 58 queries for our predictions (shown in Table 1).

Training. We use as training data all data which is available prior to the data point being predicted. Hence if we are predicting the vaccination uptake in February 2014 we train on data from January 2011 – January 2014. All models are refitted for each time step. We use monthly time steps. To allow for inference of seasonality, the level-0 models are initialized with 24 months of available data (January 2011 – December 2012) as training data. For the level-1 models we start by using 12 months of data (January 2013 – December 2013). We evaluate our predictions using the root mean squared error (RMSE), which penalizes large errors more than small ones.

Our prediction methods are fitted using R packages with default settings at all times, except for the starting point for HW, where we manually select a starting point of the optimization if it cannot be completed with the default value. The AR model is trained using 12 autoregressive terms to capture seasonal variations. For bagging and weighted majority we use as many subsets as there are queries; each subset contains 10 randomly sampled queries. For the weighted majority we use $\eta = 5$ and $\epsilon = 2$ for all experiments.

Results. Table 2 shows the results when predicting vaccination uptake using either clinical or web data only (with the methods presented in Section 2). "Naive" refers to our naive baseline $\hat{E}(t) = E(t-1)$. Our methods outperform the naive baseline except for the HPV vaccines. This might be due to an intense debate in Denmark regarding the safety of this particular vaccine.
Such a debate is likely to boost query frequencies but not necessarily vaccination uptake (the fact that many more people talk about HPV does not mean that many more HPV vaccines are given). We see that methods using clinical data outperform the methods using web data for the majority of the vaccines. But interestingly this difference is not very big and for the vaccines DiTeKiPol-3 and DiTeKiPol-4 the methods based on web data perform best. DiTeKiPol-4 is especially interesting since a shortage in 2013 resulted in unusual vaccination behaviour for a few months. When making predictions from web data our two new approaches (bagging and WM) perform best for 9 of the 13 vaccines.

Clinical data: Naive, HW, AR12, ARIMA. Web data: WM, B, L, O.

| Vaccine | Naive | HW | AR12 | ARIMA | WM | B | L | O |
|---|---|---|---|---|---|---|---|---|
| MMR-1 | 20.704 | 18.149 | 18.606 | 15.574 | 16.609 | 16.597 | 16.605 | 30.387 |
| MMR-2 (4) | 20.582 | 13.110 | 16.566 | 16.284 | 15.841 | 15.635 | 15.500 | 29.288 |
| MMR-2 (12) | 20.637 | 19.592 | 20.600 | 18.726 | 21.631 | 20.815 | 21.112 | 31.897 |
| HPV-1 | 8.080 | 11.291 | 11.192 | 9.871 | 13.474 | 14.320 | 12.701 | 11.547 |
| HPV-2 | 8.704 | 12.522 | 12.806 | 11.276 | 18.154 | 18.025 | 18.423 | 15.404 |
| HPV-3 | 6.579 | 9.161 | 13.958 | 9.418 | 24.239 | 23.494 | 23.074 | 17.317 |
| DiTeKiPol-1 | 14.091 | 6.700 | 5.185 | 5.097 | 8.067 | 8.058 | 8.069 | 15.913 |
| DiTeKiPol-2 | 17.693 | 7.520 | 8.030 | 8.064 | 10.003 | 9.941 | 9.951 | 20.082 |
| DiTeKiPol-3 | 17.884 | 17.596 | 20.936 | 19.459 | 17.160 | 17.160 | 17.158 | 30.424 |
| DiTeKiPol-4 | 21.676 | 26.103 | 21.676 | 23.385 | 15.414 | 15.535 | 15.934 | 33.888 |
| PCV-1 | 13.323 | 6.897 | 6.394 | 6.623 | 7.745 | 7.797 | 7.845 | 14.014 |
| PCV-2 | 17.533 | 7.266 | 8.845 | 8.353 | 9.679 | 9.796 | 9.770 | 16.027 |
| PCV-3 | 18.405 | 7.877 | 7.781 | 7.634 | 10.410 | 10.364 | 10.368 | 15.582 |

Table 2: RMSE of predictions with only clinical or web data. WM: weighted majority, B: Bagging, L: linear model w. LASSO and O: linear model w. OLS. Blue: lowest RMSE per vaccine. Bold: better than naive.

Table 3 shows the results for the ensemble predictions using clinical and web data. Except for the three HPV vaccines, the ensemble approaches outperform all other methods using only one data source. We see that when using an SVR with a Gaussian kernel as level-1 model we obtain the best results, i.e. 7/13 lowest RMSE. When comparing within the methods using an SVR with a Gaussian kernel, HW+WM is the best performing method. The most improvements are obtained when combining predictions based on web data with predictions from either HW or AR12.

4. CONCLUSIONS

We presented a method that uses ensemble learning to combine clinical and web-mined time-series data to make predictions about future vaccination uptake. As clinical data we used official registries of vaccines in Denmark. As web data we used query frequencies collected from Google Trends. We created those queries by extracting terms from publicly available descriptions of the vaccines on the web. Experiments using all officially recommended children vaccines in Denmark for the period January 2011 – September 2015 showed that for 10/13 vaccines our ensemble learning methods that combined clinical with web data for prediction outperformed predictions using either clinical or web data alone. Though this combination yields the lowest overall error, using only web data gives predictions with an error only slightly worse than for the predictions made using only clinical data.
This indicates the potential usefulness of web data, such as query frequencies, to predict vaccination uptake in countries where there is no national vaccination registry. This work complements wider efforts in tackling medical and health problems computationally with machine learning or retrieval [20, 21].

Level-1 model: OLS

| Vaccine | HW+WM | HW+B | HW+L | HW+O | AR12+WM | AR12+B | AR12+L | AR12+O | ARIMA+WM | ARIMA+B | ARIMA+L | ARIMA+O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMR-1 | 15.190 | 15.476 | 15.187 | 16.842 | 17.697 | 17.572 | 17.457 | 18.151 | 16.296 | 16.492 | 15.968 | 17.877 |
| MMR-2 (4) | 12.875 | 13.497 | 13.349 | 13.121 | 16.305 | 16.108 | 16.195 | 16.032 | 18.872 | 16.220 | 21.085 | 15.871 |
| MMR-2 (12) | 18.082 | 17.650 | 17.541 | 17.221 | 18.711 | 19.523 | 18.762 | 18.909 | 18.469 | 19.409 | 18.961 | 18.693 |
| HPV-1 | 10.552 | 10.435 | 10.960 | 11.516 | 9.377 | 9.690 | 10.130 | 10.281 | 10.348 | 10.080 | 9.992 | 10.080 |
| HPV-2 | 12.743 | 12.923 | 14.384 | 12.191 | 11.883 | 12.201 | 13.240 | 10.503 | 10.708 | 10.655 | 10.279 | 9.220 |
| HPV-3 | 8.743 | 8.771 | 10.231 | 9.987 | 11.321 | 11.063 | 12.110 | 12.151 | 9.893 | 9.237 | 9.818 | 9.918 |
| DiTeKiPol-1 | 6.416 | 6.875 | 6.477 | 5.498 | 4.835 | 4.831 | 4.829 | 5.082 | 6.072 | 5.690 | 5.625 | 5.584 |
| DiTeKiPol-2 | 9.094 | 8.216 | 8.967 | 7.956 | 7.686 | 7.343 | 8.116 | 8.019 | 9.461 | 8.989 | 15.485 | 9.057 |
| DiTeKiPol-3 | 18.478 | 17.891 | 18.410 | 16.662 | 17.529 | 18.225 | 17.550 | 18.439 | 17.168 | 17.227 | 17.137 | 19.076 |
| DiTeKiPol-4 | 15.812 | 17.977 | 17.495 | 19.860 | 17.290 | 19.891 | 16.537 | 19.849 | 24.391 | 45.079 | 36.403 | 24.220 |
| PCV-1 | 7.042 | 5.783 | 5.716 | 5.391 | 6.201 | 5.950 | 5.830 | 5.973 | 10.785 | 6.569 | 9.174 | 6.169 |
| PCV-2 | 8.317 | 8.236 | 9.283 | 7.553 | 10.135 | 8.670 | 10.401 | 8.284 | 8.395 | 8.399 | 9.681 | 8.330 |
| PCV-3 | 7.345 | 7.436 | 8.199 | 7.759 | 6.825 | 7.014 | 7.108 | 6.736 | 7.931 | 8.364 | 8.240 | 7.670 |

Level-1 model: SVR linear

| Vaccine | HW+WM | HW+B | HW+L | HW+O | AR12+WM | AR12+B | AR12+L | AR12+O | ARIMA+WM | ARIMA+B | ARIMA+L | ARIMA+O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMR-1 | 16.541 | 15.969 | 15.478 | 16.905 | 17.298 | 17.252 | 17.399 | 18.496 | 19.406 | 16.793 | 16.857 | 18.204 |
| MMR-2 (4) | 12.648 | 12.981 | 12.388 | 12.952 | 16.148 | 15.373 | 16.663 | 15.095 | 14.969 | 16.101 | 15.419 | 15.620 |
| MMR-2 (12) | 17.906 | 18.054 | 17.872 | 17.374 | 19.302 | 19.020 | 18.248 | 19.075 | 19.123 | 18.193 | 19.195 | 17.731 |
| HPV-1 | 10.530 | 10.649 | 10.922 | 11.146 | 10.486 | 10.494 | 10.688 | 10.657 | 10.331 | 10.789 | 10.810 | 10.448 |
| HPV-2 | 13.489 | 12.344 | 12.130 | 12.425 | 10.147 | 10.467 | 11.984 | 11.062 | 9.738 | 9.906 | 10.482 | 8.727 |
| HPV-3 | 8.903 | 8.260 | 8.501 | 10.674 | 11.732 | 11.718 | 13.049 | 12.617 | 9.806 | 10.001 | 9.484 | 10.411 |
| DiTeKiPol-1 | 6.926 | 5.993 | 5.739 | 6.500 | 4.740 | 4.758 | 4.626 | 5.077 | 5.090 | 5.354 | 5.287 | 5.507 |
| DiTeKiPol-2 | 9.808 | 9.476 | 9.006 | 9.100 | 8.476 | 8.563 | 10.503 | 7.734 | 9.938 | 8.480 | 9.872 | 9.591 |
| DiTeKiPol-3 | 22.546 | 22.599 | 22.546 | 17.433 | 21.402 | 21.925 | 21.154 | 19.003 | 22.244 | 21.319 | 22.104 | 19.568 |
| DiTeKiPol-4 | 16.694 | 37.909 | 17.357 | 14.461 | 15.922 | 16.405 | 16.124 | 16.164 | 22.072 | 26.591 | 18.984 | 17.609 |
| PCV-1 | 7.351 | 7.186 | 6.175 | 6.198 | 5.282 | 6.143 | 6.335 | 5.510 | 6.765 | 7.413 | 6.840 | 6.830 |
| PCV-2 | 7.689 | 7.946 | 15.613 | 7.955 | 9.029 | 8.879 | 9.104 | 8.823 | 8.794 | 11.662 | 14.559 | 9.024 |
| PCV-3 | 7.648 | 8.261 | 8.388 | 7.784 | 6.904 | 6.758 | 6.994 | 6.633 | 9.649 | 9.384 | 9.058 | 8.491 |

Level-1 model: SVR Gaussian

| Vaccine | HW+WM | HW+B | HW+L | HW+O | AR12+WM | AR12+B | AR12+L | AR12+O | ARIMA+WM | ARIMA+B | ARIMA+L | ARIMA+O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMR-1 | 14.928 | 16.694 | 16.355 | 16.198 | 17.770 | 18.207 | 18.117 | 16.927 | 16.703 | 17.490 | 17.569 | 17.560 |
| MMR-2 (4) | 14.377 | 12.870 | 14.094 | 13.122 | 15.709 | 14.780 | 15.024 | 15.468 | 14.973 | 15.109 | 16.625 | 15.838 |
| MMR-2 (12) | 18.007 | 17.446 | 18.972 | 16.530 | 17.945 | 19.115 | 18.406 | 18.176 | 18.385 | 18.553 | 19.625 | 19.041 |
| HPV-1 | 10.748 | 11.606 | 11.918 | 11.289 | 11.002 | 10.902 | 11.403 | 9.130 | 11.664 | 10.684 | 11.505 | 9.987 |
| HPV-2 | 13.513 | 11.958 | 12.304 | 12.097 | 10.789 | 12.376 | 9.964 | 12.483 | 10.537 | 11.249 | 11.715 | 10.961 |
| HPV-3 | 12.889 | 13.784 | 14.775 | 13.352 | 13.016 | 14.521 | 15.222 | 14.374 | 12.818 | 12.172 | 12.147 | 12.682 |
| DiTeKiPol-1 | 5.204 | 5.008 | 5.098 | 5.331 | 5.486 | 5.467 | 5.486 | 6.227 | 5.725 | 6.393 | 5.577 | 6.615 |
| DiTeKiPol-2 | 7.927 | 8.017 | 8.172 | 9.661 | 7.149 | 7.443 | 7.818 | 8.078 | 9.009 | 9.870 | 9.848 | 8.721 |
| DiTeKiPol-3 | 16.639 | 16.448 | 16.433 | 17.275 | 18.650 | 18.442 | 19.009 | 18.355 | 18.380 | 17.545 | 18.298 | 18.962 |
| DiTeKiPol-4 | 15.616 | 14.877 | 15.246 | 16.543 | 16.865 | 16.038 | 16.687 | 15.653 | 15.938 | 15.647 | 15.923 | 15.932 |
| PCV-1 | 5.256 | 5.808 | 5.769 | 5.664 | 5.450 | 6.358 | 6.363 | 6.614 | 6.724 | 7.160 | 6.611 | 7.091 |
| PCV-2 | 6.463 | 7.665 | 7.366 | 7.470 | 8.811 | 7.450 | 6.952 | 9.026 | 8.672 | 9.062 | 8.991 | 9.148 |
| PCV-3 | 7.121 | 7.665 | 8.396 | 7.798 | 9.022 | 9.556 | 10.008 | 7.871 | 8.616 | 8.148 | 9.527 | 8.176 |

Table 3: RMSE of ensemble predictions (clinical and web data). Blue: lowest RMSE per vaccine. Bold: lower RMSE than for the individual ensemble components in Table 2.

5. REFERENCES

[1] E. H. Chan, V. Sahai, C. Conrad, and J. S. Brownstein. Using web search query data to monitor dengue epidemics: A new model for neglected tropical disease surveillance. PLoS Negl Trop Dis, 5(5):e1206, 2011.
[2] C. Chatfield. The analysis of time series: An introduction. CRC press, 2013.
[3] R. Chunara, J. R. Andrews, and J. S. Brownstein. Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak. The American Journal of Tropical Medicine and Hygiene, 86(1):39–45, 2012.
[4] L. M. Gargano, N. L. Herbert, J. E. Painter, J. M. Sales, C. Morfaw, K. Rask, D. Murray, R. DiClemente, and J. M. Hughes. Impact of a physician recommendation and parental immunization attitudes on receipt or intention to receive adolescent vaccines. Human vaccines & immunotherapeutics, 9(12):2627–2633, 2013.
[5] L. M. Gargano, N. L. Underwood, J. M. Sales, K. Seib, C. Morfaw, D. Murray, R. J. DiClemente, and J. M. Hughes. Influence of sources of information about influenza vaccine on parental attitudes and adolescent vaccine receipt. Human vaccines & immunotherapeutics, 2015.
[6] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, 2009.
[7] R. J. Hyndman and Y. Khandakar. Automatic time series for forecasting: The forecast package for R. Technical report, Monash University, Department of Econometrics and Business Statistics, 2007.
[8] B. J. Kelly, A. E. Leader, D. J. Mittermaier, R. C. Hornik, and J. N. Cappella. The HPV vaccine and the media: How has the topic been covered and what are the effects on knowledge about the virus and cervical cancer? Patient education and counseling, 77(2):308–313, 2009.
[9] D. Lazer, R. Kennedy, G. King, and A. Vespignani. The parable of Google flu: Traps in big data analysis. Science, 343(14 March), 2014.
[10] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
[11] K. Ma, W. Schaffner, C. Colmenares, J. Howser, J. Jones, and K. Poehling. Influenza vaccinations of young children increased with media coverage in 2003. Pediatrics, 117(2):e157–e163, 2006.
[12] D. J. McIver and J. S. Brownstein. Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS Comput Biol, 10(4):e1003581, 2014.
[13] M. J. Paul, M. Dredze, and D. Broniatowski. Twitter improves influenza forecasting. PLoS currents, 6, 2014.
[14] P. M. Polgreen, Y. Chen, D. M. Pennock, F. D. Nelson, and R. A. Weinstein. Using internet searches for influenza surveillance. Clinical infectious diseases, 47(11):1443–1448, 2008.
[15] M. Santillana, A. T. Nguyen, M. Dredze, M. J. Paul, E. O. Nsoesie, and J. S. Brownstein. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol, 11(10):e1004513, 2015.
[16] M. Santillana, E. O. Nsoesie, S. R. Mekaru, D. Scales, and J. S. Brownstein. Using clinicians' search query data to monitor influenza epidemics. Clinical Infectious Diseases, 59(10):1446–1450, 2014.
[17] M. J. Smith, S. S. Ellenberg, L. M. Bell, and D. M. Rubin. Media coverage of the measles-mumps-rubella vaccine and autism controversy and its relationship to MMR immunization rates in the United States. Pediatrics, 121(4):e836–e843, 2008.
[18] S. Yang, M. Santillana, and S. Kou. Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences, 112(47):14473–14478, 2015.
[19] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S. Brownstein. Monitoring influenza epidemics in China with search query from Baidu. PloS one, 8(5):e64323, 2013.
[20] R. Dragusin, P. Petcu, C. Lioma, B. Larsen, H. Jørgensen, and O. Winther. Rare Disease Diagnosis as an Information Retrieval Task. ICTIR, 356–359, 2011.
[21] R. Dragusin, P. Petcu, C. Lioma, B. Larsen, H. L. Jørgensen, I. J. Cox, L. K. Hansen, P. Ingwersen, and O. Winther. Specialized tools are needed when searching the web for rare disease diagnoses. Rare Diseases, 1(1):e25001, 2013.
