Leveraging Synthetic and Genetic Data to Improve Epidemic Forecasting

Lev eraging Syn thetic and Genetic Dat a to Impro v e Epidemic F orecasting Da v e Osth us 1 , Alexander C. Murph 1 , Emma E. Goldb erg 2 , Lauren J. Beesley 1 , William M. Fische r 2 , Nidhi K. Parikh 3 , and Lauren A. Castro 3 1 Statistics Group, L os Alamos National Lab orato ry 2 Theoretical Biolog y and Bioph ysics Group, Los Alamos National Lab oratory 3 Information Systems and Mo deling Group, Los Alamos National Lab oratory Marc h 26 , 2026 Abstract F orecasting infectious disease outbreaks is hard. F orecasting emerging infectious diseases with limited historical data is even harder. In th is pap er, we inv estigate w ays to improv e emerging infectious dise ase forecasting under op erational constraints . Sp eciﬁcally , w e explore tw o options lik ely to b e av ailable near the start of an emerging disease outbreak: synthetic data and genetic information. F or this investi gation, we conducted an exp eriment where we trained deep learning mod els on diﬀerent com binations of real and synthetic data, b oth with and without genetic in- formation, to explore how these models compare when forecasting COVID-19 cases for US states. All mo dels are d evelo p ed with an eye to w ards forecasting th e next pandemic. W e ﬁ nd that mo d- els trained with synthetic data h a ve b etter forecast accuracy than mo dels trained on real d ata alone, and mo dels that use genetic va riants ha ve better forecast accuracy compared to those that do n ot. All mo dels outp erformed a baseline p ersistence model (a feat only accomplished by 7 out of 22 real-time COVID-19 cases forecasting mo dels as rep orted in [ 38 ]) and multiple mo d- els outp erformed the COVIDHub -4 w eek en semble. This pap er demonstrates the val ue of these underutilized sources of information and provides a blu eprint for forecasting future pandemics. 1 In tro duction Over the past deca des, infectious disease for ecasting has trans formed fro m an aca demic cur iosity into a critical too l for public hea lth prepar edness a nd resp onse. This growth ha s b een driven by adv a nces in mo deling [ 9 , 46 , 48 ] and data av ailability [ 15 ], and b y the recognition that timely foreca sts can guide resource allo cation, inform p olicy , reduce the cost burden asso ciated wit h emerg ing health threats, and improv e public hea lth o utcomes [ 33 ]. Y et forecasting rema ins inherently diﬃcult: challenges include noisy and inco mplete data [ 50 ], lags in repo rting [ 4 ], reﬂexive human b ehavior [ 39 ], uneven p olicy decisions [ 31 ], and pathogen evolution [ 3 5 , 3 ]. These obstacles a re particularly pro nounced during rapidly evolving outbre aks, whe re even the b est mo dels struggle to keep pace [ 38 ]. The COVID-19 pandemic brought these struggles into acute fo cus [ 27 ]. O n one hand, it ma rked an unprecedented mobilization of forecasting talen t and infrastructure [ 13 , 51 ]; on the other, it re vealed gaps—par ticularly in int egr ating real-time signals, q uantif ying uncer taint y , a nd anticipating the emer gence of new vira l dynamics. Among the success stor ies o f the COVID-19 r esp onse was the rapid a nd widesprea d ado ption of genomic surveillance [ 37 ]. SARS-CoV-2 genomes w ere sequenced and shared glo bally at u nprecedented scale and sp eed, oﬀering a nea r rea l-time c ontin uously up dated feed of the virus’s evolution [ 5 2 , 34 ]. With re latively lo w technical barr iers and declining se quencing costs, g enomic s urveillance is p oised to remain a cor nerstone in future outbreak s. Crucia lly , it enabled the ea rly iden tiﬁcation of new v ariants, which often preceded observ able surge s in ca ses [ 1 7 ]. This makes v ariant tracking a p otential leading indicator—a rar e and p ow erful feature in infectious disease forecasting. Lo o king ahead, this strea m o f high-reso lution, biolog ically meaningful data o ﬀers a promising direction fo r improving fore cast mo del accuracy . 1 In genera l, for ecasting systems p erform b est when three conditions are met: (1) the system’s underlying mechanisms are w ell-characterized, (2) ther e is suﬃcient histor ical data to train mo dels, and (3) future trends don’t devia te to o sig niﬁcantly from past ones [ 26 ]. Emerg ing infectious dise ases t ypically v iolate the ﬁr st tw o co nditions, a nd sometimes a ll three. While mechanistic mo dels—such as compartmental mo dels [ 7 ], age nt -base d mo dels [ 5 3 ], or phylodynamic mo dels [ 20 ]—can capture transmission and/or evolutionary dynamics, they are often diﬃcult to ﬁt to incomplete and no isy real-time data, a nd hav e often be en outp erfor med by their more ﬂexible statistical or machine learning mo del counterparts in real-time foreca sting exe rcises [ 40 ]. T o p erform well, how ever, these high- po tent ial machine lea rning forecasting mo dels r equire substantial training data—data ra rely av ailable at the onset of an outbreak from a no vel pathogen. In the absence of adequate training data, a machine learning mo del’s ﬂex ibilit y can be its w eaknes s, resulting in nonsensic al forecas ts. In this paper, we seek to dev elop forecasting models that are accurate, scala ble, and easily deplo yed in a pandemic se tting where little to no historical data are av ailable for the pathogen o f interest. W e consider tra nsformer-ba sed deep lear ning mo dels f or forecasting b ecause, although often costly to train, they a re cheap to deploy , can main tain strong pe rformance on new data without freq uent retra ining; they allow sca ling to many data subs ets (e.g., geo graphies), a nd oﬀer hig h per formance ceilings . The conspic uous iss ue for for ecasting emer ging infectious diseases using deep learning mo dels? T raining data. T o lea rn, these mo dels re quire lar ge amounts of tr aining data, yet for an e merging pathogen (b y deﬁnition), datasets are limited. In this setting, essentially tw o types of tra ining da ta are av ailable: histor ical outbreak data from other pathoge ns, and sy nt hetic data. Neither da ta source will p erfectly r epresent the pa thogen of co ncern, yet b oth are av ailable at outbr e ak onset and so ca n be used to tra in deep lear ning mo dels in real-time. Historical o utbreak datas ets are ﬁnite, re stricted to outbreaks tha t hav e o ccur red and were mea- sured, a nd thus repres ent o nly a po rtion o f the space of p ossible outbreaks. Recent work has demon- strated succe ss in incor p orating data from diﬀerent pathogens and surveillance strea ms to improve forecasting [ 48 , 4 9 ]; this strateg y is enabled by the public dissemination of public hea lth da ta [e.g., 55 ]. Thus, while historic al data do no t repr esent the emerg ing pathoge n (the fo recast tar get), they do represent ostensibly r elev ant da ta, including data repo rting v agaries, us eful for mo del training. Synt hetic datasets address many of the shortcoming s of historical d ata. They ar e inﬁnit e in n um b er (in pr inciple): their genera tion is co nstrained primarily by compute reso urces. F urthermor e, they can represent a diversit y of o utbreaks limited only by the choice of simulation para meters; mea surement noise and biases can a dded separ ately to mimic r ealistic surveillance systems. Resea rchers have recently shown s uccess o f mo dels trained exclusively on synt hetic outbreak data [ 18 , 4 4 , 43 ]. This b eing said, the realism of these outbreaks and the p otential gains of using sy nt hetic data in mo deling are restricted by the ﬁdelity of the simulator and by the meas urement erro r pr o cesses. Synt hetic data m ust confor m to real or prosp ective data as it is (or will b e) measured. Ma ny infectious disea se mo dels ba sed on ﬁrst principles (e.g., compartmental mo dels or a gent-based models) can straig ht forwardly g enerate time series o f case -counts, hospitaliza tions and deaths. T o make use o f viral genetic v ar iant informa tion, as we do here, a synthetic infectious disease simulator m ust mo del outbreaks a t the v ariant level, and aggregate v ariant data to pro duce “obser ved ca se-count” time series. In t his pap er w e mak e use of one suc h simulator, MutA ntiGen [ 32 ] (discussed in detail in Sectio n 3.2.1 ). This simulator pro duces bo th time s eries of to tal observed case s (refer red to as total cases, or TCs) as well as time se ries of each co nstituent v ariant (re ferred to as v ariant-attributable ca ses, or V A Cs). Given this framing, we seek in this work to answer the following ﬁve r esearch questions : Q1: Do es t ra ining with r e al data or synthetic data pr o duc e b etter for e c ast p erformanc e? Q2: Do es joint tr aining with r e al and syn t hetic data impr ove COVID-19 c ase for e c asts r elative to tr aining with either sour c e individual ly? Q3: Two questions c omp aring synthetic TC and V A C t r aining data: Q3a: Do es tr aining with synthetic V A C data impr ove COVID-19 for e c asts r elative t o tr aining with synthetic TC data? Q3b: Do mo dels with matche d tra ining data and input data outp erform mo dels with mismatche d tr aining data and input data? Q4: Do es including SAR S-CoV-2 variant information impr ove COVID-19 c ase for e c asts? 2 Q5: How do these for e c ast s c omp ar e to r e al-time COVID-19 c ase for e c asts? T o answer these ﬁve q uestions, we ﬁt and compare eight fo recasting mo dels that diﬀer in bo th the data used to train the mo de ls and the inputs to the mo dels, describ ed in T able 1 . While eig ht mo dels are deﬁned in T able 1 , they corresp ond to four diﬀerent ﬁtt ed deep lear ning models trained on diﬀ erent data sets (real, synthetic TCs, synthetic V A Cs, or real plus s ynthet ic), each applied to tw o diﬀerent mo del input types (TCs o r V ACs). F or co ncreteness, mo dels M(r,t) and M(r,v) ar e using the same ﬁtted deep learning mo del (the mo del tra ined o nly with real training data) but M(r,t) pr edicts total cases directly and M(r,v) forecas ts ea ch v ariant-attributable c ase dir ectly and sums up the individual forecasts (details in Sectio n 4.2 ). T a ble 1 : F oreca st mo del conﬁguratio ns by training data s ource a nd input time series. The model naming conven tion is “M(training da ta, input t yp e)” where tr aining data ca n b e r = r eal, s t = synthetic total cases (TCs), sv = synthetic v aria nt -attributable case s (V ACs), and a = all s ources (real plus synthetic total cases plus s ynt hetic v ariant-attributable cases ) and input types ca n b e t = total cases and v = v ariant-attributable c ases. Mo del T raining Data Input Time Series M(0) N/A (per sistence bas eline) N/A M(r,t) Real TC M(st,t) Synt hetic, TC TC M(sv,t) Synt hetic, V AC TC M(a,t) Real + Synthetic TC M(r,v) Real V A C M(st,v) Synt hetic, TC V A C M(sv,v) Synthetic, V A C V AC M(a,v) Real + Synthetic V A C The mo dels deﬁned in T able 1 allows for a systematic ev aluation of how synthetic data and g enetic information can improv e foreca st accurac y by comparing for ecast p erforma nce of pairs of mo dels. Spec iﬁcally , • ( Q1 ) If mo dels tra ined with synthetic data outp erform mo dels tr ained with rea l data, we would exp ect M(st,t) > M(r,t) and M(sv,v) > M(r,v), wher e M(A) > M(B) means M(A) o utper formed M(B). • ( Q2 ) If joint training with rea l and synthetic data improves COVID -19 cas e fore casts r elative to training with either so urce individually , we would exp ect M(a,t) > [M(r,t), M(st,t)], and M(a,v) > [M(r,v), M(sv,v )]. • ( Q3a ) If training with synthetic V ACs improv es COVID -19 case forecas ts relative to training with synthetic TCs, we would e xp ect M(sv,v) > M(st,t). • ( Q3b ) If m o dels with matched tra ining data and input data outper form mo dels with mismatc hed training data a nd input data, we would exp ect M(st,t) > M(sv,t) and M(sv ,v) > M(st,v). • ( Q4 ) If including SARS-CoV-2 v ar iant information improv es CO VID-19 case forecasts, we w ould exp ect M(r,v) > M(r,t), M(st,v) > M(st,t), M(sv,v) > M(sv,t), and M(a,v) > M(a,t). Q5 will b e ev aluated with external mo dels later . In Section 2 , we descr ibe the scop e of the pro ject along with the data used in this exercise. In Section 3 , we present the synthetic data simulator and asso ciated details. W e describ e the infectious disease forecas ting mo del in Section 4 . In Section 5 , we prese n t the results of the exercise, including direct answers to a ll rese arch questio ns stated ab ove. Finally , in Section 6 we discuss implications, limitations, and future directions of work. 3 2 CO VID-19 Study Details and Data O ve rview 2.1 Scope of Study The study details are presented in T able 2 . COVID-19 ca ses were selected as the target b ecause they serve a s a lea ding indicator o f more severe o utcomes like hospitaliz ations and deaths a nd were empirically diﬃcult to for ecast [ 38 ]. The time rang e and cadence corresp ond to data a v ailability a nd public health rele v ance. U.S. states (plus P uerto Rico) were selected as a geogr aphically r elev a nt unit for public health. T here is a large volume of SARS-CoV- 2 genomes for US states b etw een June 2020 and Decem b er 2022, allowing us to test the v alue of g enetic information. F urthermor e, ma jor COVID-19 data collection and dissemination resour ces r amp ed down o pe ration in March of 2023 [ 29 ]. T a ble 2: Study deta ils, including the ev aluation metrics of mean a bsolute er ror (MAE) and weighted int erv al score (WIS). Category Details Disease and T arget COVID-19, Cases Time Range June 20 20 – December 2022 Time Cadence W eek ly Geographic Extent USA Geographic Resolution States /T erritories F o recast Horizon 1–4 weeks ahea d Ev a luation Metrics MAE, WIS, a nd c ov erage 2.2 Real COVID-19 Case Data Data on COVID-19 cases betw een December, 2019 and March, 2023 w ere obtained fro m the Johns Hop- kins Univ ersity Center for Systems Science and Engineering (CSSE) GitHub ( https: //git hub.com /CSSEGISandData/COVI [ 1 ]. T his databa se includes COVID-19 case da ta co mpiled from a v ar iety of sources ; o ur analy sis used the case counts rep orted by the s ource viewed as the most tr ustw orthy for each lo ca tion and date. Delays in r eal-time rep or ting of COVID-19 ca ses were ig nored in this analysis, a nd the data r ep orted as of September 20 23 (the da te the da ta were pulled) for a given date were treated as known as of that date. As ex amples, we show the weekly tota l COVID-19 case counts fo r Alabama and California ov er the study p erio d (Figure 1 ). 4 California 2020 2021 2022 2023 Alabama 2020 2021 2022 2023 0 50k 100k 0 200k 400k 600k 800k T otal Cases (a) California 2020 2021 2022 2023 Alabama 2020 2021 2022 2023 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Other Alpha Epsilon Iota Gamma Beta Mu Delta Omicron (BA.1) BA.2 BA.4/BA.5 XBB V ar iant Propor tions (b) California 2020 2021 2022 2023 Alabama 2020 2021 2022 2023 0 50k 100k 0 250k 500k 750k V ar iant Attributable Cases (c) Figure 1: COVID-19 data for Alabama a nd Califor nia. (a) W eekly to tal cas es (TCs). (b) Prop or tion of sampled vira l genomes assigned to each v ar iant. (c) V aria n t-attributable case s (V ACs), computed as TCs times the propo rtion o f genomes ass igned to each v a riant. V ACs summed ov er a ll v a riants equal the TCs. Note the sq uare-ro ot s cale on the y- axis for b etter visibility of low-count V ACs. 2.3 Real COVID-19 Genetic Data The CO VID-19 pandemic generated an unpre cedented global collection of viral genome seq uences, largely co o rdinated thro ugh r ep ositories like GISAID [ 52 ], which beca me a central h ub for shar ing SARS-CoV-2 genomes and metadata. In this pap er, we made use of appr oximately 4.5 millio n se- quences gathered in the United States betw een June 1st, 2020 and Decem b er 31 st, 20 22 a v ailable through GISAID. Rather than using raw genomic se quences, in this w ork we gr oup sequence v ariants by Pango lineag e desig nation [ 47 ] a s as signed in the GISAID meta data a s of September 202 3; each genome is assigne d to one of many discrete v ariant categories , which we aggre gate into coher ent en- compassing sup er groups. Agg regating these lab els acr oss time and loca tion yields v aria nt prop o rtion time s eries, such as those shown in Figur e 1 (b). W e combine v a riant pr op ortion time ser ies with total cases time series to compute v ariant attributable cas e (V A C) time ser ies (e.g., Figur e 1 (c)), where v ar iant-attributable cases for ea ch time po int equal to tal cases times v aria nt prop o rtion. W e note that there is a meaningful delay b etw een when a vir al s ample is colle cted and when its genome and metada ta b ecome publicly av a ilable. That la g — driven by lab turnar ound, qua lit y control, metada ta completion, and curation — v aries across geo graphy and time [ 8 ]. This paper , lik e many r etrosp ective studies, neglec ts this delay , but o per ational forecasting w ould need to mo del it. As such, the r esults in this pa per for the mo dels tha t use V A Cs as their input time serie s should b e viewed in their a ppropriate context. 3 T raining Data 3.1 Real, non-CO VID-19 Respiratory Disease Data As early as late 20 19, the outbreak la ter attributed to SARS-CoV-2 w as describ ed clinically as an acute respirator y illness (pneumonia) [ 57 ]. While little to no COVID-19 da ta would hav e b een av ail- able on January 1 st, 20 20, we could have known that COVID-19 pro duced symptoms co nsistent with r espirator y diseases . F o r this reason, w e use non-COVID-19, rea l res piratory data av ailable prior to January 1 st, 20 20 for training in this pap er . All data and co de used in this pa pe r are av ailable at https: //git hub.com/lanl/precog/tree/main/synthetic_and_genetic_forecasting , originally derived from data found at https: //git hub.com/lanl/precog/tree/main/infectious_timeseries_repo . Non-COVID-19 respira tory diseases include inﬂuenza, pneumonia , mumps, RSV, tub erculo sis, and diph theria (among other s). In this pap er, we will often use the shor thand “rea l data” or “rea l training 5 data” to mean “non- COVID -19 , rea l res piratory diseas e data.” F or ex ample, the “T raining Data = Real” in T able 1 mea ns “non-COVID-19, real r espirator y disease da ta.” T a ble 3: T r aining time series summaries. “Real” means “non-COVID-19, real respira tory ,” “TC” means “T otal Cases,” and “ V A C” means “V aria nt -Attributable Cases.” T raining Data # of Time Serie s # of T otal Obs . Avg. (Medi an) Ti me Se ries Leng th Real 2,167 2,169,7 60 1001 (551 ) Synt hetic, TC 36,600 9,460,880 258 (256) Synt hetic, V AC 36,570 10,66 4,618 292 (294) As can b e se en in T able 3 , ab out 2,00 0 non-COVID-19, real respir atory time ser ies are av ailable for training. Those time series hav e an average length of ab out 1000 observ ations, and a median leng th of ab out 5 50 observ ations. A se lection of the non-COVID-19, r eal r espirator y time ser ies are shown in Figure 2 . 0 250 500 750 2000 2005 2010 2015 2020 Cases Influenza Australia 0 50 100 150 200 2000 2005 2010 2015 2020 Cases Influenza South Africa 0 10 20 30 1930 1940 1950 1960 1970 Cases Diphtheria North Dakota, USA 0 10 20 30 40 50 1915 1920 Cases Pneumonia Essex County , Massachusetts, USA Figure 2 : Examples of non-COVID-19, real respirato ry data. O ver 2,000 time s eries are av ailable for training, amounting to ov er 2 million observ ations. 3.2 Sy n thetic Data In addition to the real data, we generated synthetic data to represe n t a wide ra nge o f p ossible disease behaviors. The inten t o f these synthetic data is to discov er b ehaviors that a re within scop e of p os sible disease dynamics, yet not ex plicitly repr esented in the av aila ble r eal data observ ations. W e gener ate synthetic disea se data via the MutAnt iGen agen t-based model (ABM) [ 2 , 32 , 11 ] beca use it can generate viral v ar iant turnov er dynamics. 3.2.1 MutAn tiGen In a gent-based mo deling, a la rge-sc ale ecolog ical system is simulated as a collection of autonomous decision-making ent ities ca lled agents [ 5 ]. In the MutAntiGen ABM, origina lly develop ed [ 32 ] a s an extension of an anteceden t ABM [ 2 ], “agents” ar e categorize d as either infected or non-infected individuals in a popula tion susceptible to disease spread. MutAn tiGen explicitly mo de ls the join t behavior of an evolving pa thogen and dynamic susceptibility/resistance of the host p opulation, allowing for non-seas onal c ase wav es. F urther details on MutAntiGen a re av ailable in the Supplement ary Materials Section S1 . Some example runs of MutAntiGen are s hown in Fig ure 3 . In addition to cases over time, the simulator reports vira l samples from a subset of infections. This yields time series of ca ses attributable to antigenic t y pe s, as s hown in the low er panels of Figure 3 . Viral sampling typically is prop ortiona l to nu mber o f cases , but this yields very few sa mples when cases ar e low b efore a new wa ve begins , which is exactly when da ta are critical for a forecas ting mo del. W e therefore mo diﬁed the MutAnt iGen co de to sample with gr eater int ensity when cas es a re low. As one migh t imagine, an ABM in tended to repres ent a mas sively complicated en viro nmen tal system includes many para meters to b e set by the user prior to running a simulation. These parameters grea tly 6 aﬀect the outcomes of the simulation, but it is still diﬃcult to predict the r esults o f the simulation due to the innate s to chasticit y of ABMs [ 2 1 ]. The required MutAntiGen parameter s and their biolo gical int erpre tations, as well as our computational mo diﬁca tions to the original co de that allow for b etter sampling at scale, a re discussed in Supplemen tary Mater ials Sections S1 & S2 . 3.2.2 Design to Pro duce MutAntiGen Runs Our g oal in g enerating synthetic data was to pro duce a rich and diverse set of training data for a forecasting mo del, emphasizing eﬀective r epresentation of b oth a nt igenic mutation (an individual prop erty) and turnov er in dominant antigenic type (a prop erty of p opulations). Since the a im is to forecast emergent disea ses, w e are unlik ely to know ahea d o f time wha t para meter choices will b es t reﬂect an impe nding outbre ak; therefore these simulations m ust c ast a s wide a net as is pra cticable, including a broa d r ange of mo del parameter s so that future outbre aks would be more lik ely to fall within that net (i.e., b e repr esented in the sample space). T o select the parameter v alues we run MutAn tiGen a t, we draw a Latin hypercub e sample (LHS) [ 42 ]. O ur defa ult MutAn tiGen parameter v alues and ranges were informed by the liter ature and calibrated to r epresent global inﬂuenza H3N2 phylodyna mics. W e expanded a subset of the parameters that control the evolutionary , epidemiolo gical, a nd immunological dynamics to yield more general simulations. F o r pa rameters with es tablished empirical estimates, we set plaus ible bo unds based on data from representativ e RNA v iruses including inﬂuenza A, inﬂuenza B, Measles, Nipah, Dengue, Zik a, and Hepatitis C viruse s. The sp eciﬁcs of these b ounds, our rationa les for s electing them, and a full list o f MutAnt iGen par ameters held consta nt a re av ailable in Supplementary Materials Sec tion S1 . By natur e of LHS’s indep endent sa mpling, many combinations of unrealistic or unknown vir uses could b e generated, with pa rameter combinations that do not r esp ect k nown biolo gical constra int s (e.g., erro r catastr ophe, mutation scaling ra tes [ 16 , 2 3 ]) or o utbreak conditions ( R 0 > 1). Howev er, we allow ed such combinations under the as sumption that biologica lly nonviable regimes would result in simulation failure o r additional data that is harmless. RunID = 674 RunID = 706 RunID = 1384 RunID = 1519 TC V AC 0 100 200 300 400 0 100 200 300 400 0 100 200 300 400 0 100 200 300 400 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 Normalized Cases Synthetic Runs Figure 3: MutAntiGen example r uns. MutAntiGen outputs bo th the total num be r of cases (top row, TC) a nd the time series of ca ses attributed to each v a riant (b ottom row, V A C; each line and color represents a diﬀeren t v ariant). F or ea ch time p o in t, the sum o f all v ariant-attributable ca ses (b ottom row) eq uals the total ca ses (top r ow). 3.2.3 Observ ation M o del for Synthetic Data As can b e seen in Figure 3 , the output of MutAntiGen can hav e unrealis tically low noise. Real da ta, in contrast, ar e nois y , biased, and often include outliers (compar e Figure 2 to Figure 3 ). That is , real data ca n b e thought of as imp er fect versions o f idealized epidemiolog ical data. In an eﬀort to make the sy nt hetic o utputs o f MutAntiGen mor e rea listic, we passed each MutAntiGen output through an observ ation mo del 20 times, res ulting in diﬀere n t imp erfect versions of ea ch MutAn tiGen time ser ies. This obse rv a tion mo del increases b o th the amount a nd the diversit y of the training data. Figure 4 7 shows diﬀerent r ealizations of the obs erv a tion mode l for a sing le MutAn tiGen o utput. After sending all MutAn tiGen runs through the obser v atio n mo del pro c ess, we generated appr oximately 36,000 total time ser ies fo r training for b oth the total ca ses and the v ariant-attributable cases (see T able 3 for sp eciﬁcs). Mor e details of the obse rv a tion mo del can b e found in Section S3 of the Supplementary Materials. Realization 1 Realization 2 Realization 3 Realization 4 Realization 5 Scaling + Outliers Scaling + Outliers + Noise 0 100 200 300 0 50 100 150 200 0 100 200 300 0 20 40 60 0 100 200 300 400 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Normalized Cases Figure 4: 10 of the 20 r ealizations from the obser v atio n mo del cor resp onding to a single MutAntiGen output. Realizations were generated by sub jecting the “ clean” MutAntiGen output to either scaling (random-mag nitude compr ession of the x- axis) and (poss ible) addition of outliers (top row), or to scaling plus addition o f noise and (p ossibly ) outliers (b ottom row). 4 F orecasting Mo del 4.1 T raining W e frame our pro babilistic for ecasting problem as a conditional qua nt ile reg ression pr oblem. Let y t + h | y in t ∼ F y in t ,h (1) be the conditional distribution for y t + h , the num ber of new infectio ns rep o rted at time t + h for t ∈ { 1 , 2 , . . . } and h ∈ { 1 , 2 , . . . , H } g iven the last C newly rep orted infections y in t = y ( t − C +1): t . In this pap er, we set C = 2 0 and H = 4. Tha t is, o ur 1-, 2-, 3-, or 4 -step-ahead forecasts ar e ba sed on the last 20 o bserv ations o f the time series whe n making a fo recast at time t . W e deﬁne the qua nt ile function a s follows: q τ ,h ( y in t ) := F − 1 y in t ,h ( τ ) (2) for τ ∈ [0 , 1]. That is , the conditional quantile function q τ ,h () — doubly indexed b y the quantile lev el τ and the step a head h — is the inv erse of the co nditional cumulative distribution function ev aluated at the quantile level τ , F − 1 y in t ,h ( τ ). W e a pproximate the inv erse of the conditional cumulativ e distribution function F − 1 y in t ,h with a set of quantile functions q τ ,h ( y in t ) e v alua ted over a grid of quantile levels τ ∈ T where T = { 0 . 0005 , 0 . 0 05 , 0 . 01 , 0 . 025 , 0 . 05 , 0 . 1 , . . . , 0 . 9 , 0 . 95 , 0 . 975 , 0 . 99 , 0 . 995 , 0 . 9995 } and |T | = 27. T constitutes a dense gr id of quantile levels, denser than we use for ev aluatio n (s ee Section 5.1 for details). This dense gr id, howev er, allows us to b etter a pproximate the tails of the forecast distributions which will b e needed in the forecas ting of mo dels M(r,v), M(st,v), M(sv,v), and M(a,v) (see Section 4.2 for details). The quantile function provides a point for ecast a nd fo recast int erv als. F o r insta nce, the p oint f orec ast is the ev aluated qua ntile function when τ = 0 . 5 (the median). The 95 % fo recast interv al lower and uppe r b ounds are the quantile functions ev a luated for q uantile levels τ = 0 . 0 25 and τ = 0 . 975 , res pe ctively . 8 Given our pr oblem statement, our next task is to estimate q τ ,h ( y in t ). T o do that, w e turn to deep learning [ 36 ]. Deep learning models ar e ﬂexible function appr oximators. Provided a n adequate amoun t of training da ta, deep learning mo dels can lea rn co nt inuous functions to high deg rees o f pr ecision [ 24 ]. Using the tra ining data describ ed in Sectio n 2 , we train a 2-layer trans former mo del [ 56 ] that takes the last 20 observ ations of a time series as input and predicts the quantile levels in T for h ∈ { 1 , 2 , . . . , H } . Pinball loss is used to perfo rm the quantile regres sion. T raining and deep lear ning mo del de tails can be found in Section S4 of the Supplemental Materia ls. The res ults of the mo del ﬁtting a re fo ur diﬀerent tra ined deep learning mo dels, diﬀering only in the training data used to lear n the model para meters: rea l da ta only (i.e., all non- COVID-19, re al respirato ry data detailed in Section 3.1 ), synthet ic total cases, syn thetic v ariant-attributable case s, and all tra ining data (i.e., no n-COVID-19, r eal r espirator y data , synthetic total cases data, and synt hetic v ar iant-attributable cas es). As is detailed in T able 4 , with C = 2 0, we ar e able to genera te b etw een 2 and 2 1 millio n input/output pairs of tra ining data where y in t is the input and y out t = y ( t +1):( t + H ) is the output. T o b e clear, if there is a training time series of length T = 100, that will pr o duce 100 − 20 − 4 + 1 = 77 input/output training data examples (e.g., { y in C , y out C } , { y in C +1 , y out C +1 } , . . . , { y in T − C − H +1 , y out T − C − H +1 } ). T a ble 4: T raining details for each training data set. Each model was presented with approximately 25 million training examples during learning. Ex ample views is the a verage num b er of times each t raining example was viewed by the mo del during tra ining (total num b er of training examples viewed by the mo del divided by the total num b er of unique av ailable tra ining exa mples). The clo ser e xample views is to 1 (or less), the less likely the mo del is to memorize the training da ta and thus the less likely the mo del is to ov erﬁt the tr aining da ta. T r aining time do es not scale with the num ber of training examples but ra ther the num ber of training exa mples presented to the mo del during tr aining, which was held ﬁxed at a pproximately 25 million for all mo dels. T raining times presented here use d a 2023 MacBo ok P ro with Apple M2 Ma x, using CPU-o nly training on 1 CPU core, with 64 GB ram. Co de was r un in R 4 .4.3 with mo dels ﬁt using the to rch pack age (version 0.16 .3). T raining Num b er of T raining Example T raining Time Data Examples (mill ions) Vi ews (min utes) Real 2.1 11.8 706 Synt hetic, TC 8.6 2.9 701 Synt hetic, V AC 9.8 2.5 701 All 20.6 1.2 702 T a ble 4 pr esents summaries of mo del training. Each model was trained on b etw een 2 and 21 million unique tr aining examples. E ach mo del was presented with approximately 2 5 million training examples during tr aining. Th us, ea ch training exa mple was viewed by the mo del b etw een 1 a nd 1 2 times (example views), dep ending on the n umber of av aila ble training examples. Deep learning mo dels run the risk of memorizing the training data and th us overﬁtting if they are presented the same training examples rep ea tedly . The la rge num be rs of tra ining examples shown in T able 4 help protect aga inst ov erﬁtting. The training time is a function o f the num b er of tr aining ex amples pr esented to the mo del (25 millio n), not the num b er of av ailable training examples. This is why the tr aining time do e s not scale w ith the num ber o f tr aining e xamples. The time to tr ain the mo del (almost 12 hours) w ould be considere d exp ensive (p ossibly prohibitive) if retra ining was req uired every week when new data bec ome av aila ble for rea l-time for ecasting. The foreca sting approa ch cons idered her e falls under the “exp ensive to train but cheap to deploy” para digm. While it takes on the order of half a day to train these mo dels, they only ne ed to b e trained once . F ore casting with these tra ined mo dels is measured on the sca le of seco nds (or less), making them app ealing in op era tional settings. It is worth noting that the tra ining time could b e s hortened by making us e of gr aphical pro cessing units (GPUs). 4.2 F orecasting F o recasting diﬀers dep ending o n the mo del input (r ecall T able 1 ). The mo dels that take total cases as the input ar e stra ightforw ard to for ecast. The ﬁtted transfo rmers pro duce foreca sts for all quantile 9 levels in T . W e only use seven of those quantile levels, τ ∈ { 0 . 025 , 0 . 1 , 0 . 25 , 0 . 5 , 0 . 75 , 0 . 9 , 0 . 975 } , allowing us to pr o duce a p oint forecas t (the median) and thr ee predictio n interv als: 50%, 80 % and 95%. These quantile predictio ns allow us to ev aluate forecasts with respect to m ultiple p opular metrics (see Section 5.1 for details). F o recasting the mo dels wher e the mo del input a re the v ar iant-attributable cas es (V ACs) requires one more step. F or a given state, fore cast date ( t ), and for ecast horizo n ( h ), we forecast each V AC a t the de nse gr id of 27 quantile levels in T . Using those 27 quantiles as an e stimate of the cumulativ e distribution function (CDF) for each V AC, we dr aw a realization from ea ch V AC’s CDFs by inv erting the CDF [ 14 ] and linearly in terp olating b etw een quantile estimates (t his is why T has quan tile estimates so far out in the tails). Given a draw from eac h V AC’s for ecast distr ibution, we sum those dr aws constituting a single draw from the t otal cases forecast distr ibution. W e repeat this sampling pro cess N = 10 0,000 times. Then, we compute the same seven quantile levels { 0 . 025 , 0 . 1 , 0 . 25 , 0 . 5 , 0 . 75 , 0 . 9 , 0 . 975 } as the s ample qua n tiles of the N draws from the total cases foreca st distribution, derived from each individual V A C for ecast distribution. Notice that pair s of mo dels in T able 1 use the same ﬁtted tr ansformer mo del to pro duce forecas ts. F o r exa mple, mo dels M(r,t) and M(r,v) each use the s ame ﬁtted tra nsformer mo del, trained with o nly non-COVID-19, real respiratory data, but will yield diﬀerent forecasts bec ause the inputs to the mo del are diﬀerent (reca ll Figure 1 (a) and (c)). 5 F orecasting Results In Section 5.1 , we detail the metr ics used to per form o ur ev a luation. In Section 5 .2 , we provide high- level ﬁndings from our ex ercise. In Sectio n 5.3 , we directly answer the rese arch questio ns stated in Section 1 . 5.1 Ev aluation Metrics Before jumping int o the res ults, we deﬁne the metrics we use to ev a luate the forecas ts: mea n abs olute error (MAE), weigh ted interv al score (WIS), a nd empirica l cov erage. MAE is deﬁned as MAE = 1 N N X i =1 | ˆ y i − y i | (3) where N is the num ber of for ecasts, y i is the obs erv a tion, and ˆ y i is the p oint foreca st. MAE is greater than or eq ual to 0 and is negatively o riented (smaller is better ). The p o int foreca st ˆ y for this work is the median forecast fr om the quantile regr ession. Most of the ev a luation results are presented r elative to a per sistence mo del — a mo del whose forec ast ˆ y t + h = y t for a ny h ≥ 1 (i.e., M(0 )). As such, we also deﬁne rela tive MAE (rMAE) as follows: rMAE(M i ) = MAE for M i MAE for M(0) (4) where M i is any model i listed in T able 1 . r MAE is gr eater than or equa l to 0 and negatively oriented. rMAE(M i ) < 1 indica tes that mo del M i has a b etter MAE than a p er sistence mo del. WIS is a wa y to ev a luate for ecasts in an interv a l format. Intuit ively , WIS p enalizes tw o things: int erv al widths (the wider the interv a l width, the la rger the p enalty) and observ ations that fall outside the foreca st interv al (the further the o bserv ation falls o utside the forecast interv al, the larger the pena lty). As such, WIS enco urages forecasts to be sharp a nd well-calibrated [ 22 ]. W e co nsider K = 3 central (1 − α ) × 1 00% forecast interv als: 5 0%, 80 % and 95 % (cor resp onding to α = 0 . 5 , 0 . 2 , a nd 0 . 05 , resp ectively). F ollowing [ 6 ], WIS is deﬁned as WIS = 1 K + 0 . 5 × w 0 × | y − q 0 . 5 | + K X k =1 ( w k × IS α k ) ! (5) where w 0 = 0 . 5, w k = α k / 2, q 0 . 5 = ˆ y (the median forecast), y is the observ ations, and IS α k is the int erv al score corr esp onding to α k , deﬁned a s 10 IS α k = ( u α k − l α k ) + 2 α k ( l α k − y ) ∗ I ( y < l α k ) + ( y − u α k ) ∗ I ( y > u α k ) ! (6) where l α k and u α k are the α k / 2 a nd 1 − α k / 2 q uantiles and I () is an indicator function equal to 1 if the ar gument is true and 0 o therwise. WIS is grea ter than or equal to zero and is negatively or iented. Similar to MAE, r elative WIS (rWIS) is deﬁned as follows: rWIS(M i ) = WIS for M i WIS for M(0) (7) where M i is mo del i . rWIS(M i ) < 1 means mo del M i has a b etter WIS than a p er sistence mo del. The per sistence model ( ˆ y t + h = y t ) do es not intrinsically pr o duce for ecast interv als. W e follow the procedure describ ed in the Methods Section of [ 12 ] where the forecast in terv a ls of the per sistence mo del are based on the historic al changes of the o bserved COVID-19 cases time series for e ach sta te. Empirical coverage for a (1 − α ) × 100% forecas t in terv al is the prop or tion of for ecast interv als that contain the o bserv ation y . Empir ical cov erag e is deﬁned as Cov erage( α ) = 1 N N X i =1 I ( l i,α ≤ y i ≤ u i,α ) (8) where l i,α and u i,α are the low er and upp er quantile estimates corres po nding to the (1 − α ) × 100% forecast interv al. Pr obabilistically well-calibrated forecasts ar e those whose empir ical coverages ar e nearly equal to their nominal coverages (i.e., 1 − α ) for all α levels. 5.2 Ov erview of Results Figure 5 shows selected for ecasts for New Mexico for all mo dels. As exp ected, for ecast interv al widths increase with incr easing horizon h . The forecasts for mo dels trained on only real data (M(r,t) and M(r,v)) hav e no ticeably low quantile es timates at the 0.0 25 level. The real data used for tra ining are fairly noisy , which may hav e led to this forecas t b ehavior. The other notable trend is the forecasts made at the p eak of the Omicr on (BA.1) wa ve in early 2022. The mo dels that take TC s as input, M(.,t) (left column o f Fig ure 5 ), pro duced ﬂa t to mildly dropping foreca sts a t the p eak in early 20 22, while the mo dels that ta ke V ACs as inputs, M(.,v) (right column of Figure 5 ), corr ectly pro duced a steep drop in their for ecasts. Figure 5 illustrates that ther e are sys tematic diﬀerences in the forecast mo dels that brea k down along training data and mo del input lines. 11 0 4,000 18,000 40,000 71,000 111,000 2020 2021 2022 2023 T otal Cases M(r ,t) 0 6,000 25,000 57,000 101,000 158,000 2020 2021 2022 2023 T otal Cases M(r ,v) 0 7,000 29,000 66,000 117,000 2020 2021 2022 2023 T otal Cases M(st,t) 0 4,000 16,000 37,000 66,000 102,000 2020 2021 2022 2023 T otal Cases M(st,v) 0 6,000 26,000 58,000 103,000 160,000 2020 2021 2022 2023 T otal Cases M(sv ,t) 0 7,000 26,000 59,000 104,000 2020 2021 2022 2023 T otal Cases M(sv ,v) 0 7,000 29,000 65,000 116,000 2020 2021 2022 2023 T otal Cases M(a,t) 0 5,000 18,000 41,000 74,000 115,000 2020 2021 2022 2023 T otal Cases M(a,v) New Me xico Figure 5: Sele cted 1 -week-ahead through 4 -week-ahead f ore casts for New Mexico for all mo dels. Black line: tota l cases time s eries. Co lored p oints: median forecas t cases . Ribb ons mark the 50%, 80%, and 95% forecast interv als. Note: y -axis is on a sq uare ro ot scale to b etter see the low-case-co unt forecasts. Figure 6 shows rMAE re sults. Overall, we see a ll mo dels had b etter MAE than the base line, the worst b eing M(sv,t) with 0.928 rMAE and the b est b eing M(a,v) with 0.77 7 rMAE . Uncer taint y int erv als are 95% conﬁdence interv als , obtained via b o otstra pping (see Sec tion S5 for details). Each mo del had r MAE less tha n or e qual to 1 for a ll forecast horizo ns. The b ottom of Figur e 6 shows a running r elative MAE where the r MAE on a given date cor resp onds to the ev a luation o f all fo recasts av ailable b etw een June 1st, 2020 thr ough the x-axis date. It is clear the results dep end on the ev a luation per io d, as mo dels perfo rm diﬀerently throughout diﬀer ent pha ses of the pandemic. That said, for all ev alua tion end times after J anuary 2 021, a ll mo dels had an rMAE less than 1 (except for a momentary blip for a few mode ls around Ja nu ary 20 22). 12 1 0.909 0.894 0.928 0.873 0.851 0.818 0.806 0.777 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 M(0) M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) Relative MAE Relative MAE b y Model 0.80 0.85 0.90 0.95 1.00 1 2 3 4 Horizon Relative MAE M(0) M(r,t) M(st,t) M(sv ,t) M(a,t) M(r,v) M(st,v) M(sv ,v) M(a,v) Relative MAE b y Model 0.80 0.90 1.00 1.10 1.20 Jul 2020 Oct 2020 Jan 2021 Apr 2021 Jul 2021 Oct 2021 Jan 2022 Apr 2022 Jul 2022 Oct 2022 Jan 2023 Relative MAE M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) Running Relative MAE Figure 6: (top left) Overall r elative mean absolute error (rMAE) with 95% conﬁdence interv als as error bars. Error bars were constructed via b o o tstrapping. (top r ight) Relative MAE by for ecast horiz on. (bo ttom) Running relative MAE. The running relative MAE for ea ch da te is the rela tive MAE if the ev alua tion p erio d ran fro m June 1st, 2 020 through the x -axis date. Figure 7 shows r esults for WIS relative to a pe rsistence mo del. Overall, all mo dels ha d r WIS le ss than 1 rang ing from 0.76 9 (M(r,t)) to 0.63 (M(a,v )). Each mo del also ha d r WIS les s than 1 (in fact, less than o r eq ual to 0 .80) for a ll considere d foreca st hor izons. The running r elative WIS is less tha n 1 for all mo dels and all ev a luation end dates b etw e en June 20 20 throug h D ecember 2022 pro viding strong evidence that all mo dels demonstr ated b etter forecast p erformance as measured by WIS co mpared to M(0). 13 1 0.769 0.74 0.767 0.728 0.697 0.676 0.653 0.63 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 M(0) M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) Relative WIS Relative WIS b y Model 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1 2 3 4 Horizon Relative WIS M(0) M(r,t) M(st,t) M(sv ,t) M(a,t) M(r,v) M(st,v) M(sv ,v) M(a,v) Relative WIS b y Model 0.60 0.70 0.80 0.90 1.00 Jul 2020 Oct 2020 Jan 2021 Apr 2021 Jul 2021 Oct 2021 Jan 2022 Apr 2022 Jul 2022 Oct 2022 Jan 2023 Relative WIS M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) Running Relative WIS Figure 7: (top left) Overall r elative weighted interv al sco re (WIS) with 95% conﬁdence interv als as err or bars. Error bar s were constructed via bo otstrapping. (top r ight) Relative WIS b y foreca st horizon, and (b ottom) running r elative WIS. The running relative WIS for each date is the r elative WIS if the ev alua tion p erio d ran fro m June 1st, 2 020 through the x -axis date. Figure 8 shows the empir ical coverage for a ll mo dels, overall and by ho rizon. All mo dels e xhibit undercov erag e (i.e., observ ations fell outside their predictive in terv als more often than exp ected), a common o ccurr ence in COVID-19 for ecasting [ 38 ]. All mo dels show ed empirica l coverage s imilar to that of a per sistence mo del for 50% forecast interv als , but empirical coverages closer to nominal than the p ersistence mo del for 95% for ecast in terv als. When we examine empirical co verage by forecast horizon (bottom of Figure 8 ), we genera lly see that all non-p ersistence models hav e near nominal cov erage a t horiz on 1, but that empirica l cov erage drifts b elow nominal coverage a s horizon incr eases to 4 weeks a head. 14 0.39 0.41 0.4 0.43 0.44 0.41 0.36 0.43 0.44 0.66 0.68 0.68 0.71 0.74 0.7 0.65 0.73 0.73 0.8 0.87 0.88 0.9 0.91 0.87 0.87 0.9 0.9 0.5 0.8 0.95 M(0) M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) M(0) M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) M(0) M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 Cov erage Cov erage b y Model 0.5 0.8 0.95 1 2 3 4 1 2 3 4 1 2 3 4 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 Horizon Cov erage M(0) M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r,v) M(st,v) M(sv ,v) M(a,v) Cov erage b y Model Figure 8 : (top) Overall empirical cov erag es for 5 0%, 80%, and 9 5% forecast in terv als. (b ottom) E m- pirical cov erage b y horizon. The das hed horizontal line indica tes the nominal cov erag e. Undercov erag e exists for all mo dels and horizons . 5.3 Answ ering the Researc h Questions W e now directly answer o ur resea rch questions laid out in Section 1 . 5.3.1 Q1: Do es trainin g wi th real data or syn thetic data pro duce be tter forecast p er- formance? There is fairly s trong evidence that training with synthetic data a lone pr o duces better forecast p er- formance than tra ining with real da ta alone. That evidence is presented in Fig ure 9 , w hich s hows the pa ired diﬀerences o f r MAE and rWIS acros s 5,000 b o o tstrapp ed samples. The mo del co mparisons were M(r,t) vs M(st,t) a nd M(r,v) vs M(sv,v). W e see that in over 75% of bo o tstrapp ed sa mples, mo del M(st,t) outp erfor med M(r,t) in b o th r MAE and r WIS (i.e., the lower edge of the b ox of the b ox plot is at or above 0 ), while M(sv ,v) outpe rformed M(r,v) in nearly 10 0% o f b o otstra pped s amples. 15 rMAE rWIS M(r ,t) vs M(st,t) M(r ,v) vs M(sv ,v) M(r ,t) vs M(st,t) M(r ,v) vs M(sv ,v) −0.15 −0.10 −0.05 0.00 0.05 0.10 Model A − Model B P aired Differences Figure 9: Distribution of 5,0 00 bo otstr app ed mo del diﬀerence s for rMAE and rWIS. T riangles indicate the 2.5 a nd 9 7.5 p ercentiles. Pr esented results s how Mo del A - Mo del B (on the x-axis it is dis play ed as M(A) vs M(B)). F or instance, M(r,t) vs M(st,t) in the rMAE panel pr esents the results of M(r,t)’s rMAE - M(st,t)’s rMAE, a cross a ll b o otstrap samples. Positiv e num b ers indicate Mo del B p er formed better than Mo del A. F airly strong evidence is shown that mo dels trained with s ynt hetic data alone per formed b etter tha n mo dels trained with real data alo ne. 5.3.2 Q2: Do es jo in t training with real and syn thetic data im pro v e CO VID-19 case forecasts relativ e to training with either so urce individual ly? Y es, ther e is s trong evidence tha t tr aining with r eal and synthetic data is b etter than tr aining with either source individually . Se e Figure 10 . In all compar isons, at least 75 % of b o otstr app ed samples resulted in model M( a,.) having b etter perfor mance (rMAE or rWIS) than t he analogous model trained with either r eal data a lone or s ynth etic data alone. In all compar isons presented in Figure 10 , there is no evidence that training with rea l and synthetic data is harmful relative to training with either source individually , mak ing this joint training the safe choice (that is , using all av aila ble training data yielded b etter res ults than a subset). rMAE rWIS M(r ,t) vs M(a,t) M(st,t) vs M(a,t) M(r ,v) vs M(a,v) M(sv ,v) vs M(a,v) M(r,t) vs M(a,t) M(st,t) vs M(a,t) M(r,v) vs M(a,v) M(sv ,v) vs M(a,v) −0.05 0.00 0.05 0.10 Model A − Model B P aired Differences Figure 10: Distr ibution of 5,000 bo otstra pp ed mo del diﬀerences for rMAE and r WIS. T riangles indicate the 2.5 a nd 9 7.5 p ercentiles. Pr esented results s how Mo del A - Mo del B (on the x-axis it is dis play ed as M(A) vs M(B)). F or ins tance, M(r,t) vs M(a,t) in the rMAE panel pres ents the results of M(r,t)’s rMAE - M(a,t)’s rMAE, across all b o otstr ap samples . Positiv e num be rs indica te Mo de l B p erfor med better than Mo del A. Clear evidence is shown that models trained with real a nd syn thetic data (M(a,t) and M(a,v)), on bala nce, p erformed be tter than mo dels trained with only-r eal or only-s ynthetic data. Before mo ving on to question Q3, it’s worth tak ing a moment to contemplate why tr aining on synthetic data a lone o utper formed training on re al da ta alone (Q1) a nd wh y training on b oth sources outp erformed either one individually (Q2). A s imple hypothesis is related to T able 4 : there is more synthetic tra ining data ( ∼ 9 million ex amples) than there is real tra ining data ( ∼ 2 million examples), and there is (by deﬁnition) mo re real plus synthetic tra ining data ( ∼ 20 million ex amples) than there is of either so urce individually and mor e training data equates to b etter mo del p erfor mance. That is, 16 maybe when compar ing M(i,.) to M(j,.) for tw o diﬀerent training data types i a nd j, the p erformance can b e predicted simply by knowing the amount of tra ining da ta av ailable. T o inv estigate this hypothesis, we r etrained the mo dels on training data s izes of { 2.5k, 25 k, 25 0k, 1M, 2.5M, 5M, 10 M } b y taking subsets of the av ailable training data of each type for all training data sizes le ss than o r equal to all the av ailable da ta. So , for c larity , M(r ,.) is tra ined on tr aining da ta se ts of s izes 2.5k, 25k, 250 k, 1M (all training data sizes less than the av ailable 2.1M — recall T able 4 ) as well as 2 .1M (a ll the av ailable training data ). Similarly for mo dels M(st,.), M(sv,.) a nd M(a,.). F or each mode l/training data size, we calculate rMAE and rWIS averaged across all states, for ecast dates, and forecas t horizo ns. If o ur hypothesis is corre ct and the amount of training data is the dr iving force behind foreca st improvemen t, we would ex pec t to see similar p erfor mance for a ll mo dels when trained on the same a mount of tra ining data, holding the input da ta type (TC o r V AC) constant. If, how ever, for the same amount o f training data we see s ome mo dels co nsistently o utper form other mo dels (e.g ., if M(a,.) > M(r,.)), that w ould b e e vidence against o ur hypo thesis and would sugg est the amo un t o f training da ta do es not solely expla in why some mo dels outpe rform o thers. Res ults fr om this exer cise are shown in Fig ure 11 . 0.8 1.0 1.2 1.4 1.6 1.8 2.5K 25.0K 250.0K 1.0M 2.5M 5.0M 10.0M All Number of T raining Data Examples M(r ,t) M(st,t) M(a,t) M(r,v) M(sv,v) M(a,v) rMAE 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.5K 25.0K 250.0K 1.0M 2.5M 5.0M 10.0M All Number of T raining Data Examples M(r ,t) M(st,t) M(a,t) M(r,v) M(sv,v) M(a,v) rWIS Figure 11: The relative mean absolute er ror (rMAE) and relative weigh ted in terv al score (rWIS) v ersus the n umber of training data exa mples for all models (sans M(st,v) a nd M(sv,t) — the synthetic models where the tra ining data and the input data t yp e are mis matched). Mo del p erformance improv es as training data size increases un til ab out one million training examples are used. “All” means the results when all av ailable tra ining examples for e ach mo del are used (num bers av ailable in T able 4 ). Results presented are based o n training r uns with 25k mo del w eight up dates. Figure 11 s hows rMAE and r WIS for incr easing amounts o f training examples . W e see that as the a mount of training data increas es from 2.5 k examples to 1M examples, the p erfor mance o f each mo del gener ally improv es. In this regime, there is evidence that more tr aining data correla tes with better model perfor mance. After mo dels ar e trained with 1 M training examples , ho wev er, for ecast per formance a ppe ars to level oﬀ. Mor e imp or tantly , there do app ear to b e sy stematic diﬀerences in mo del p erformance, even when holding the amount of training data constant. This can more clea rly be seen in Figur e 12 . 17 M(r,t) vs M(st,t) M(r,t) vs M(a,t) M(st,t) vs M(a,t) M(r,v) vs M(sv ,v) M(r,v) vs M(a,v) M(sv ,v) vs M(a,v) rMAE rWIS 25k 250k 1M All 25k 250k 1M All 25k 250k 1M 2M 5M All 25k 250k 1M All 25k 250k 1M All 25k 250k 1M 2M 5M All −0.1 0.0 0.1 −0.1 0.0 0.1 Number of T raining Data Examples Model A − Model B P aired Differences Figure 1 2: The 95% conﬁdence interv als (bars) and averages (p oints) in relative mean absolute error (rMAE) and relative weigh ted interv al score (rWIS) for paired diﬀerences of mo dels (columns) based on 5,000 b o otstr ap samples. T ra ining data sizes a re shown on the x-axis. “All” mea ns the res ults when all av ailable tra ining examples for e ach mo del are used (num bers av ailable in T able 4 ). Results are presented as Mo del A - Mo del B, and p ositive v alues mean tha t Mo del B p erformed b etter than Mo del A. Results o nly shown for 2 5k or more tra ining examples. Figure 12 shows t he 9 5% conﬁdence interv als for pair ed diﬀerences in rMAE a nd rWIS acros s 5,000 bo otstrapp ed samples. While not universal, there is clea r evidence shown in Fig ure 12 tha t the mo del trained with r eal a nd synthetic tra ining da ta outp er forms the mo dels trained with o nly rea l or only synthetic training data even after holding the num be r of training data examples cons tant (2nd, 3r d, 5th, and 6 th co lumns of Figure 12 ). F urthermor e, there is evidence shown that the mo dels trained on synthetic data only outp erfor m the mo dels tra ined on real data only even after holding the num ber of tra ining data examples constant (1st a nd 4th columns of Figure 12 ). Figur e 12 pr esents evidenc e against our hyp othesis that the amount o f training data is the expla nation for M(a,.) outp erfor ming all other mo dels (Q2) and M(st,t)/M(sv,v) outperfo rming M(r,t)/M(r,v) (Q1). That is, something more than training da ta quantit y is needed to explain the ﬁndings in Q1 and Q2 . Our other hypothesis for why training on rea l and synthetic data outp erforms e ither source indi- vidually and why training with synthetic data only outpe rforms tra ining with r eal data o nly centers around c ovariate shift [ 45 ]. Cov ariate shift in supervised machine learning (ML) refer s to the change in the distribution o f the input examples to a ML mo del b etw een the training data (i.e., non-COVID-19, real respir atory data or synthetic data ) and the testing data (i.e., COVID-19 data). Most s upe rvised ML mo dels assume that the tra ining and testing da ta inputs (i.e., in this work, the last 2 0 obser v a- tions of a time s eries) come from a co mmon distribution. When there is a distributional mismatch, predictive p erfor mance can s uﬀer. It co uld b e the cas e that COVID -19 data are b etter repr esented by synthetic data than historical non-COVID-19, respira tory data (i.e., it co uld b e synthetic data a nd COVID-19 data app ear more a like than no n-COVID-19, real respir atory data). T o inv estigate this pos sibility , we ﬁt a gradient b o os ted mo del to per form binary classiﬁca tion trained o n a 150k non-COVID-19, rea l respirator y tr aining exa mples and 150k sy n thetic TC training examples. The inputs to this mo del were the last 2 0 o bserv ations of a time series (mean cent ered a nd standard deviation scaled) and the lab els “Synthetic TC” or “Real” w ere the output. No COVID-19 data were used to train this classiﬁer. W e then ran three diﬀeren t hold-out data sets through the classiﬁer: 1 ) 8,0 00 instances of non- COVID -19 , rea l respira tory data, 2) 8,000 insta nces o f synthetic TC data, a nd 3) all 6,885 instances of COVID-19 data. The non- COVID-19, real re spiratory data and synthetic TC data sets were used to ev aluate the classiﬁe rs ca pabilities. The COVID-19 data were used to see if COVID-19 data b etter resemble non-COVID-19, real respira tory da ta or synthetic TC data. If the COVID-19 data a re c lassiﬁed as “Synthetic TC” data, that would c onstitute evidence that COVID-19 data c ome fro m a distribution mor e similar to synthetic data than non-COVID-19, real respirato ry da ta. Results are shown in Figur e 13 . 18 Non−CO VID−19 Real Respiratory CO VID−19 Synthetic T otal Cases 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0 3 6 9 12 Probability Input is Synthetic TC Density Probability Synthetic T otal Cases 0.00 0.25 0.50 0.75 1.00 Non−CO VID−19 Real Respiratory CO VID−19 Synthetic T otal Cases −4 0 4 −4 0 4 −4 0 4 −5.0 −2.5 0.0 2.5 5.0 UMAP1 UMAP2 Probability Synthetic T otal Cases 0.00 0.25 0.50 0.75 1.00 Figure 13 : (top) The distributio n of predicted pr obabilities that input data a re synthetic (total ca ses) from a hold-o ut se t. 8,0 00 ho ld-out exa mples from the non-COVID-19 r eal re spiratory and sy nt hetic total ca ses, r esp ectively are b eing summarized along with 6,885 COVID-19 examples. COVID-19 data are o verwhelmingly classiﬁed as synthetic data rather than non-CO VID-19, real respiratory da ta, while the cla ssiﬁer correctly clas siﬁes sy nt hetic data as sy n thetic data and non-COVID-19, real respirator y data as non-COVID-19, real r espirator y data. (b ottom) UMAP arr angement of hold-out data, color ed by the predicted pro bability sy nt hetic. W e can see the non- COVID-19, rea l respira tory data o ccupies a diﬀerent par t of UMAP spac e that the COVID-19 data, while the COVID-19 da ta and synthetic data have more ov erlap in UMAP s pace, shedding lig ht on why the classiﬁer cla ssiﬁes COVID-19 data as synthetic. Overwhelmingly , COVID -19 data were classiﬁed a s sy nt hetic da ta ra ther than no n-COVID-19, real res piratory data. This is seen in the top of Figur e 13 . Over 90% of COVID-19 instances were assigned a probability over 0.5 that the instance was synthetic TC data. The non- COVID -19 , real respirato ry da ta and the synthetic TC data were, with high pro bability , cor rectly classiﬁed as w ell. Over 90% of sy nt hetic TC instances fro m the hold-out set were correctly assig ned a pr obability ov er 0.5 of being synthetic TC, while just under 80% of the non-COVID-19, real r espirator y data were assigned a probability ov er 0 .5 of b eing non-COVID-19, r eal respirato ry data . These results indicate that the tra ined classiﬁer can discr iminate betw een real and synthetic TC data. The bo ttom of Figure 13 provides a UMAP [ 41 ] (Uniform Manifold Approximation a nd Pro jection) plot — a lo wer dimensional repre sentation of the 20-dimensional inputs. UMAP ﬁnds a low-dimensional embedding of high- dimensional da ta that pre serves lo c al neighborho o d relationships. Points that are close t oge ther in th e 20-dimensional spa ce ar e close together in the 2-dimensional plot in F igure 13 (and po ints that ar e far aw ay rema in far aw ay). This UMAP plot helps explain why the clas siﬁer w as a ble to discriminate b etw een rea l and sy n thetic data a nd explain why COVID-19 da ta was ov erwhelming ly classiﬁed as synthetic data. The non-COVID-19, rea l res piratory data la rgely o ccupies the center o f the UMAP plot. The s ynthet ic data larg ely covers the whole space, but is particularly conce nt rated around the outer ring . The COVID-19 data a lso lar gely o ccupies the o uter r ing and, notably , do es not o ccupy the center of the UMAP shap e. Th us, there is evidence that the COVID-19 data inputs b etter match the distribution of inputs pr o duced b y synthetic data. Or, said another wa y , the cov a riate shift b etw een non-COVID-19, rea l respira tory data (training ) and COVID-19 data (testing ) is muc h more pr onounced tha n b etw een synthetic data (training) and COV ID-19 data (testing). Because the real and sy nt hetic da ta, how ever, o ccupy diﬀeren t r egions o f input spac e, combining them results in improved input space cov erag e, providing ins ight into why M(a,t) a nd M(a,v) outp erfo rmed their mo deling counterparts. 19 5.3.3 Q3a: Does training with s yn thetic V A C data improv e CO VID-19 forecasts relative to training with syn thetic TC data? Y es, training with synthetic V AC data r esulted in b etter for ecasts than training with synthetic TC data. The e vidence for this is shown in the M(st,t) vs M(sv,v) c omparison in Figure 14 . W e see the 95% conﬁdence in terv als do not cover 0, indicating a statistica lly signiﬁcant improvemen t in for ecast per formance for M(sv,v) relative to M(st,t) for as measured by r MAE and rWIS. rMAE rWIS M(st,t) vs M(sv ,v) M(sv ,t) vs M(st,t) M(st,v) vs M(sv ,v) M(st,t) vs M(sv ,v) M(sv ,t) vs M(st,t) M(st,v) vs M(sv ,v) −0.1 0.0 0.1 0.2 Model A − Model B P aired Differences Figure 14: Distr ibution of 5,000 bo otstra pp ed mo del diﬀerences for rMAE and r WIS. T riangles indicate the 2.5 and 97.5 p ercentiles. Presented res ults sho w Model A - Model B (on the x-axis it is display ed as M(A) vs M(B)). F or instance, M(st,t) vs M(sv,v) in the rMAE panel pres ent s the results of M(st,t)’s rMAE - M(sv,v)’s rMAE, acr oss a ll b o otstrap samples . Positive n umbers indica te Mo del B pe rformed better than Mo del A. 5.3.4 Q3b: Do m o dels wi th matc hed training data and input data o utp erform m o dels with mism atc hed training data and input data? Y es, there is c lear evidence that mo dels with matched training data and input data outp erfor m mo dels with mismatched training data and input data. This can b e seen in the M(sv,t) vs M(st,t) and M(st,v) vs M(sv,v) c omparisons in Figur e 14 . In ea ch comparison, the mo del with matched training and input data (M(st,t) and M(sv,v)) outperformed their coun terpart with the same input data t ype but diﬀ erent training data type in a t least 75% of b o otstra pped s amples. This makes in tuitive sense. If the mo del is being tra ined on o ne data type but is then b eing tested o n a diﬀerent input type, there is the p otential for b o th cov a riate shift [ 45 ] (when the distribution of inputs diﬀers b etw een tr aining a nd testing) and concept shift [ 28 ] (when the r elationship betw een inputs and o utcomes diﬀers b etw een training and testing). 5.3.5 Q4: Do es i ncluding SARS-CoV-2 v ariant informatio n im pro v e CO VID-19 case forecasts? Y es, there is clear evidence that for ecasting v a riant-attributable cases directly a nd summing to recover a total cases foreca st improv es foreca sts relative to forecas ting total cases directly . See Figure 15 for results. T o answ er this question, w e compared all pairs of M(.,t) vs M(.,v). These comparisons hold the training data consta n t, a nd only diﬀer in what da ta ar e passed in as inputs (total cases versus v ar iant- attributable ca ses). The 95% conﬁdence interv als fall ab ov e 0 for all comparis ons for bo th metr ics, indicating the mo de ls that take V ACs as inputs and sum to r ecov er total cases forecasts outp erfor m the mo dels that take TCs as inputs dir ectly . Our ﬁnding that foreca sting v ariant-attributable ca ses and summing pr o duces b etter fo recasts than for ecasting total cas es mirrors ﬁndings from inﬂuenza forecasting [ 30 , 54 ]. W e conjecture that V ACs have simpler and mor e predictable dynamics, making them easier to foreca st, than the to tal cases time serie s whic h is a mixture of multiple co-circulating v ar iants. 20 rMAE rWIS M(r ,t) vs M(r,v) M(st,t) vs M(st,v) M(sv ,t) vs M(sv ,v) M(a,t) vs M(a,v) M(r ,t) vs M(r,v) M(st,t) vs M(st,v) M(sv ,t) vs M(sv ,v) M(a,t) vs M(a,v) 0.0 0.1 0.2 Model A − Model B P aired Differences Figure 15: Distr ibution of 5,000 bo otstra pp ed mo del diﬀerences for rMAE and r WIS. T riangles indicate the 2.5 a nd 9 7.5 p ercentiles. Pr esented results s how Mo del A - Mo del B (on the x-axis it is dis play ed as M(A) vs M(B)). F or ins tance, M(r,t) vs M(r ,v) in the rMAE pa nel presents the results o f M(r,t)’s rMAE - M(r,v)’s rMAE, a cross all b o otstr ap samples . Positive num ber s indicate Mo del B per formed better than Mo del A. Clear evidence is presented that mo dels M(.,v) outp erformed mo dels M(.,t). Figure 16 inv estigates during what phas es of the COVID-19 pandemic mo dels M(.,v) outp erfor m mo dels M(.,t). W e partition each t otal cases time series into four phases: imp ending r ise, rise, imp end- ing fall, and fall. Those phases ar e illustra ted in the top of Figur e 16 . W e then co mpute the diﬀere nce in rMAE a nd rWIS by phase for each pair of mo dels. W e see clear evidence that the M(.,v) mo dels outp erform the M(.,t) mo dels during the imp ending fa ll and fall phas es of the pandemic. V A C a nd TC trained mo dels p erfor m more s imilarly to one another during the imp ending rise and ris e phases, with TC mo dels tending to outp er form V AC mo dels when performa nce is measured by MAE. The degree to which the TC mo dels o utper form the V AC mo dels during the r ising phase(s), how ever, is small r elative to the degr ee the V A C mo dels outp erfor m the TC mo dels dur ing the imp ending fall and fall phases. 21 0 50000 100000 150000 2021−01 2021−07 2022−01 2022−07 2023−01 T otal Cases Impending Rise Rise Impending F all F all Alabama rMAE rWIS Impending Rise Rise Impending F all F all Impending Rise Rise Impending F all Fall −0.2 0.0 0.2 0.4 0.6 0.8 T otal Cases − V ar iant Attributable Cases M(r ,t) vs M(r ,v) M(st,t) vs M(st,v) M(sv ,t) vs M(sv ,v) M(a,t) vs M(a,v) Figure 16: (top) An example (Alabama) of how the tota l cases time ser ies is partitioned into four phases: imp ending rise, rise, imp ending fall, and fall. (bottom) The diﬀerence in rMAE and r WIS betw een models, br oken down by phase for a ll states (not just Alabama). Points are a verage diﬀerences and err or bars a re 95% conﬁdence interv als based on 5 ,000 b o o tstrap samples. V alues grea ter than 0 mean the V AC mo dels (M(.,v)) outp er formed their corre sp onding TC mo del (M(.,t)). W e see that the V AC mo dels mea ningfully outp erfor m their TC counterparts during the imp ending fall and fall phases of the outbreak and p er form more similarly to the TC mo dels dur ing the imp ending rise and rise phases. 5.3.6 Q5: How do these forecasts compare to real-time COVID -19 case forecasts? T o answer this que stion, we refer to the results presented in [ 38 ]. E v a luation in [ 38 ] included r elative WIS and cov erage of 1 through 4 week ahea d forecasts fro m July 28th, 20 20 throug h Dece m b er 21st, 2021. See Figure 17 for WIS and cov erage results for all models r estricted to July , 2020 through Decem b er, 2 021. All eig ht models outp erfor med a p ersistence (baseline) mo del (a feat o nly 7 out o f 22 mo dels acc omplished in real-time). F our models outperformed the COVIDHu b-4 week ensemble mo del with a r elative WIS o f 0.81 [ 38 ]. These mo dels were the mo dels trained on synthetic data only where the tra ining data type matched the input data t yp e (M(st,t) and M(sv,v)) a s w ell a s the tw o mo dels trained with all the av ailable training data (M(a,t) and M(a,v)). The mo dels that did not outp erform the COVIDHub-4 week ensemble mo del were the mo dels trained exclusively on real training data (M(r,t) a nd M(r,v)) and the mo dels trained exclusively on synthetic da ta with a misma tch b etw een training data type and mo del input type (M(sv,t) a nd M(st,v)). How ever, even the worst p erfor ming mo del conside red in this pap er, M(r,t), would have b een the 5th b est p erforming mo del among the real-time forecasting mo dels summarize d in [ 38 ]. F urthermore, the COVIDHub-4 week ensemble had a 95% cov erage o f 0.8. All eight mo dels had a n empirical cov erage betw een 0.84 and 0.88, closer to nominal than did the COVIDHub-4 week ensemble. Additionally , although the V A C data input mo dels (M(.,v)) could not hav e b een for ecast in real-time due to v a riant repor ting delays, the TC data input mode ls (M(.,t)) could have b een fo recast in rea l-time. It is also worth a r eminder that none of the eig ht mo dels in this pap er used a ny COV ID-19 da ta for training; they only used data (real a nd/or synthetic) that would hav e b een av ailable o n or b efore January 1 st, 202 0. 22 0.889 0.767 0.848 0.775 0.854 0.832 0.775 0.765 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) Relative WIS Relative WIS b y Model: 2020−07−28 − 2021−12−21 0.86 0.85 0.87 0.84 0.88 0.84 0.87 0.86 0.00 0.25 0.50 0.75 1.00 M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) 95% Cover age Cov erage by Model: 2020−07−28 − 2021−12−21 Figure 17: (left) Relative weight ed interv al score (rWIS) co rresp onding to the e v alua tion per io d of July 28 th, 2020 through Decem b er 21 st, 2021. Low er is b etter. The COVIDHub-4 week ensemble had a relative WIS of 0.81 [ 38 ] (the ho rizontal das hed line). Mo dels with rWIS b elow the horizo ntal, dashed line o utper formed the COVIDHub-4 week ensemble a s mea sured by rWIS. (right) Empirical cov erages for a 95% no minal predictio n interv al (so lid line). The COVID Hub-4 week ens emble’s 9 5% prediction interv als had an empirical coverage of 0.8 (dashed line). All mo dels ha d empirical cov erag es closer to nominal than that. 6 Discussion In this pa p er, we soug h t to answer tw o high-level q uestions: (1) Ca n s ynt hetic data b e used to train deep lear ning mo dels that achieve state-of-the-a rt infectious disease forecasting per formance? and (2) Can genetic information b e used to improve infectious disease foreca sting mo dels? W e c onducted an exercise to a nswer these questions with a n eye tow ards the next pandemic by limiting ourselves to only using training data a v a ilable prio r to January 1st, 2 020 fo r fo recasting CO VID-19 ca ses. W e aﬃrmatively answered b oth questions . Syn thetic da ta was a v aluable tr aining data so urce, as models trained on only synthetic data provided impressive forecas ting perfo rmance. Real his torical da ta for non-COVID-19 pathogens also pr ov ed to b e a v alua ble tra ining data source, thoug h less v a luable than synthetic on its own (Q1). Combining r eal and synthetic data was found to b e a prudent pa th forward, as mo dels tra ined with all a v a ilable tra ining data outp erfor med mo dels trained on rea l or synthetic data alone (Q2). F urthermore, foreca sting v ariant-attributable case s directly and summing of foreca sts pro duces demonstrably better forec asts than forecasting to tal cases directly (Q4). Mo del M(a,v), the mo del that made use of all av a ilable training data and used v ariant infor mation, was the bes t p erforming mo del. The deep lea rning models tr ained and used in this exer cise require d training onc e, but required no re training. While deep learning mo dels can take s everal hours to tra in (recall T able 4 ), they take seconds to pre dict with a nd c an sca le to a lar ge num ber of forecast s ettings. While we trained our mo dels o nce, we could hav e either r etrained every w eek as new data bec ame av aila ble, or, more commonly , p erfor med ﬁne-tuning on our pr e-trained mo del with the new COVID-19 data a s it b ecame av ailable [e .g., 25 ]. Thus sca ling a for ecasting mo del fro m, say , US states to all US co un ties would b e straightforward. This is in co ntrast to a s patial or hierarchical mo del that b orrows informa tion acro ss geogr aphies, where retraining may b e r equired e very time new data are made av ailable and s caling up from tens to hundreds o r thousands of geog raphies is not trivial [e.g., 46 ]. Again, these deep le arning mo dels require adequa te a mounts of training data to b e eﬀective, tra ining data that are not av ailable in an emerging dise ase s etting. Historical outbreak data of rela ted dis ease and/o r s ynthet ic da ta prov ed to b e useful training data. This pa pe r adds to the growing b o dy of evidence that sy n thetic da ta is a v aluable a nd under -tapp ed resource for infectious disease forecasting [ 18 , 4 4 , 43 ], as it provides the necessary ingredient to unlock the p otential of deep lear ning mo dels . There are many opp ortunities to push this dee p learning plus s ynthetic data idea further. Going back to the input/output dee p learning framework, man y future fo recasting opp ortunities for improve- men t a mount to augmenting the inputs. F o recasting geogra phic loc ations jointly ra ther than one at a time is a n exciting path forward (recent rea l-time inﬂuenza forec asting success has employ ed a hi- 23 erarchical mo deling appro ach that bo rrows information acr oss geogr aphies [ 46 , 10 ]). This amounts to developing training examples wher e, say , r ecent obser v a tions fro m all US states ar e the input and the next H time s teps fo r each US state is the output, allowing the deep lear ning mo del to lear n (p oten- tially) complex inter-state r elationships. T o lear n these relatio nships, how ever, will re quire s uitable training data. MutAn tiGen has m ulti-lo cation capabilities thus could b e a suita ble sy n thetic data generator . Another application is jointly for ecasting cases, hospitalizations, and deaths (as was do ne in [ 18 ]). Rather than for ecasting v ar iant-attributable cases , one could co nsider computing popula tion genetic summaries of th e viral popula tion as time se ries and pro viding those as inputs pair ed wit h total cases. This approach would requir e a synth etic data s im ulator with a ppropria te evolutionary biology ﬁdelit y . In gener al, for this deep le arning plus syn thetic data idea to scale will r equire sim ulators sophisticated enough to make tr aining data with suﬃcient r ealism, which will likely r equire contin ued simulator developmen t. While synthetic data is a promising resource to use, it is not a panacea . How to generate it requires thoughtful contemplation of the problem a t hand (recall Se ction S3 ). F or instance, for muc h of the COVID-19 pandemic, da ta was collec ted a nd diss eminated in the US at daily resolution, intro ducing day-of-w eek eﬀects. Such eﬀects are learnable and exploitable, but for a deep learning mo del to incorp ora te them, it would need to b e presented with training data re ﬂecting those patterns. W e did not cre ate any syn thetic da ta having day-of-w eek eﬀects, so we anticipate our appro ach might fail if applied to daily data. That said, to foreca st at the daily scale, we c ould add a day-of-week routine to Section S3 . Similarly , to use this appr oach t o foreca st seasona l diseases, we would wan t to b oth ensure we ar e synthetically ge nerating seasona l outbreaks as well to increase the co ntext window (up to, say , C = 52 or 104), ca pturing at lea st one entire p erio d of the outbrea k. Finally , while we conducted this retr osp ective for ecasting study with an eye tow ards real-time forecasting, we ackno wledge that rep orting delays e xist with b oth e pidemiological and genetic data. Our results r epresent b est-case scenar ios with r esp ect to r ep orting delays and more work is needed to bridge the r eal-time r ep orting delay gap. 7 Ac kno wledgmen ts W e gra tefully ackno wledge all data c ontributors, i.e., the Author s and their Or iginating lab ora tories resp onsible for obtaining the spe cimens, and their Submitting lab or atories fo r generating the genetic sequence and metadata a nd shar ing via the GISAID Initiative, on which some of this research is based. Research pre sented in this article was supp o rted by the La b orator y Directed Resear ch and Developmen t pro gram of Los Alamos National La bo ratory under pro ject num b er 2 0240 066DR. Los Alamos Na tional Lab or atory is o p erated by T riad Nationa l Security , LLC, for the Na tional Nuclear Security Administra tion o f U.S. Departmen t o f Ene rgy (Contract No. 8 92332 18CNA000 001). The authors ackno wledge the us e of Cha tGPT for text editing a nd assistance with s elected co ding tasks. The author s retain full resp o nsibility for the manuscript’s conten t. This ar ticle approved for unlimited release (LA-UR-26-2 2310 ). 24 References [1] Badr , H. S., Zaitchik, B. F., Ke rr, G. H., Nguyen, N.-L. H., Chen, Y.-T., Hinson, P ., Co lston, J. M., Kosek, M. N., Dong, E., Du, H., Marshall, M., Nixon, K., Mohegh, A., Goldber g, D. L., Anen b erg, S. C., and Gar dner, L. M. (2023 ). Uniﬁed re al-time environmen tal-epidemiolo gical data for multiscale mo deling of the covid-19 pa ndemic. Scientiﬁc D ata , 10(1):3 67. [2] Bedfor d, T ., Rambaut, A., a nd Pascual, M. (2 012). Ca nalization of the ev olutiona ry tr a jectory o f the human inﬂuenza virus. BMC Biolo gy , 10 (1):38. [3] Beesley , L. J ., Moran, K. R., W agh, K ., Castro, L. A., Theiler, J., Y o on, H., Fischer, W., Hen- gartner, N. W., K orb er, B., a nd Del V alle, S. Y. (2023). SARS-CoV-2 v a riant tr ansition dyna mics are ass o ciated with v accination ra tes, num ber of co-cir culating v ar iants, and conv alescent immunit y . EBioMe dicine , 91. [4] Beesley , L. J., O sthus, D., and Del V alle, S. Y. (202 2). Addres sing delay ed case rep or ting in infectious disea se forecast mo deling. PL oS Computational Biolo gy , 1 8(6):e101 0115. [5] Bona be au, E. (2002). Age n t-based mo deling: methods and techniques for s imu lating human s ys- tems. Pr o c Natl A c ad Sci U S A , 99 Suppl 3(Suppl 3):728 0–728 7. [6] Bra cher, J., Ray , E. L ., Gneiting , T., and Reich, N. G. (20 21). Ev a luating e pidemic forec asts in a n int erv al format. PL oS Computational Biolo gy , 1 7(2):e100 8618 . [7] Bra uer, F. (2008). Compar tment al mo dels in epidemiolog y. Mathematic al Epidemiolo gy , page s 19–79 . [8] Brito , A. F., Semenov a, E ., Dudas, G., Has sler, G. W., K alinich, C. C., Kr aemer, M. U. G., Ho, J., T eg ally , H., Githinji, G., Agoti, C. N., Matkin, L. E., Whittaker, C., sequencing gr oup, B. S.-C.-., (Australia, C. D . G. N., Zea land), N., Pro ject, C.-. I., Consortium, D. C.-. G., Net work, F. C.-. G. S., core cur ation team, G., for Genomic Surveillance in South Africa (NGS-SA), N., Consortium, S. S.- C.-. S., Howden, B . P ., Sintc henko, V., Zuck erman, N. S., Mo r, O., Bla nkenship, H. M., de O liveira, T., Lin, R. T. P ., Mendon¸ ca Siqueira , M., Resende, P . C., R. V asconcelos , A. T., Spilki, F. R., Santana Aguiar, R., Alexiev, I., Iv anov, I. N., Philip ov a, I., Carr ington, C. V. F., S. D. Sa hadeo, N., Branda, B., Gurr y , C., Maur er-Stroh, S., Naido o, D., von Eije, K . J., Perkins, M. D., v an Ker khov e, M., Hill, S. C., Sa bino, E. C., Pybus, O. G., Dye, C., Bhatt, S., Flaxman, S., Suchard, M. A., Grubaugh, N. D., Ba ele, G., and F aria, N. R. (20 22). Global dispar ities in SARS-CoV-2 genomic surveillance. Natu r e Communic ations , 13(1):70 03. [9] Bro oks, L. C., F arr ow, D. C., Hyun, S., Tibshirani, R. J., and Rosenfeld, R. (2015). Flexible mo del- ing o f epidemics with a n empirica l Bay es fr amework. PL oS Computational Biolo gy , 11 (8):e1004 382. [10] Case, B., Sa lcedo, M. V., and F ox, S. J. (2 025). An accurate hierarchical model to for ecast diverse seasona l infectious diseases. me dRxiv , pa ges 202 5–03 . [11] Castro , L . A., Bedford, T., and Ancel Meyers, L. (202 0). E arly prediction o f a nt igenic transitio ns for inﬂuenza a /h3n2. PLOS Computational Biolo gy , 1 6(2):1–23 . [12] Cramer , E. Y. et al. (202 2a). Ev aluation of individual and ens emble proba bilistic for ecasts of CO VID-19 mortalit y in the United States. Pr o c e e dings of the National A c ademy of Scienc es , 119(15 ):e21135 6111 9. [13] Cramer , E. Y., Huang, Y., W ang, Y., Ray , E. L., Cornell, M., Bracher, J., Brennen, A., Ri- v adeneir a, A. J . C., Gerding , A., Ho use, K., Jaya wardena, D., Ka nji, A. H., K handelwal, A., Le, K., Mo dy , V., Mo dy , V., Niemi, J ., Stark , A., Shah, A., W attanchit, N., Zorn, Ma rtha W amd Reich, N. G., and Conso rtium, U. C.-. F. H. (2022b). The United States COVID-19 forecas t h ub data set. Scientiﬁc Data , 9(1):4 62. [14] Devroy e, L. (2006). Nonuniform r andom v ar iate g eneration. Handb o oks in op er ations r ese ar ch and management scienc e , 13:83 –121 . 25 [15] Dong, E., Du, H., and Gar dner, L. (20 20). An interactiv e w eb-bas ed dashboard to track CO VID-19 in real time. The L anc et Infe ctious Dise ases , 20(5):533 –534 . [16] Drake, J. W. and Holla nd, J . J. (19 99). Mutation rates among rna viruses . Pr o c e e dings of t he National A c ademy of Scienc es , 96(2 4):1391 0–139 13. [17] Du, H., Dong, E ., Ba dr, H. S., Petrone, M. E., Grubaugh, N. D., and Gardner , L. M. (202 3). Incorp ora ting v aria nt frequencies data into short-ter m for ecasting for COVID-19 cases and deaths in the USA: a deep learning a pproach. eBioMe dicine , 89. [18] Dudley , C., Magdaleno, R., Harding, C., Shar ma, A., M artin, E., and E isenberg, M. (2025). Mantis: A simulation-grounded foundation mo del for disease forecas ting. arXiv pr eprint arXiv:250 8.12260 . [19] Efron, B. and Tibshirani, R. J. (19 94). An intr o duction to the b o otstr ap . Cha pman and Hall/ CRC. [20] F eatherstone, L. A., Zhang, J. M., V aughan, T. G., and Duchene, S. (2022). E pidemiologica l infer- ence from patho gen g enomes: a r eview o f phylodynamic mo dels and a pplications. V irus Evolution , 8(1):veac045. [21] Gilb ert, N. (2019). A gent- b ase d mo dels . Sage P ublications. [22] Gneiting, T., Bala bda oui, F., and Raftery , A. E. (200 7). Proba bilistic forec asts, calibration and sharpness. Journal of the R oyal Statistic al So ciety Series B: Statistic al Metho dolo gy , 69(2 ):243–2 68. [23] Holmes, E . C. (200 9). T he evolutionary genetics of emerging viruse s. Annual R eview of Ec olo gy, Evolution, and Systematics , 4 0(V o lume 40, 2009):3 53–37 2. [24] Hornik, K., Stinc hcombe, M., and White, H. (1989). Multilay er feedforward net works are universal approximators. Neur al networks , 2(5):35 9–36 6. [25] Hu, E. J ., Shen, Y., W allis , P ., Allen-Zhu, Z., Li, Y., W a ng, S., W ang, L., a nd Chen, W. (2022). Lora: Low-rank a daptation of larg e languag e mo dels. Iclr , 1(2):3. [26] Hyndman, R. J. and Athanasop oulos, G. (201 8). F or e c asting: Principles and Pr actic e . OT exts. [27] Ioannidis, J . P ., Cr ipps, S., and T anner, M. A. (2022). F or ecasting for COVID-19 has failed. International Journal of F or e c asting , 38 (2):423– 438. [28] Iwashita, A. S. and Papa, J . P . (2018). An overview on concept dr ift lear ning. IEEE ac c ess , 7:1532 –154 7. [29] Johns Hopkins Univ ersity (2023). Jo hns Hopkins COVID-19 data h ub ends after three y ears. https: //hub .jhu. edu/2023/03/10/coronavirus- resou rce- center- data- hub- ends/ . Accessed 2026- 02-04 . [30] Kandula, S., Y ang, W., and Sha man, J. (2017). Type-and subt yp e-sp eciﬁc inﬂuenza for ecast. Americ an Journal of Epidemi olo gy , 18 5(5):395 –402 . [31] Kapitsinis, N. (2020). The under lying fa ctors of the COVID-19 spatially uneven spr ead. Initial evidence fro m regions in nine EU c ountries. R e gional Scienc e Policy & Pr actic e , 1 2(6):1027 –104 6. [32] Ko elle, K . a nd Rasmussen, D. A. (201 5). T he eﬀects of a deleterious mutation lo ad on pa tterns of inﬂuenza a/ h3n2’s antigenic evolution in humans. eLife , 4:e 07361 . [33] Kolman, Shannon (2023). Dise ase F orecasting T oo ls Can Sup- po rt Policymaking During Epidemics. Accessed: 2025- 07-13 , https: //www .ncsl .org/health/disease- forecasting- tools- can- support- policymaking- during- epidemics . [34] Kor ber , B., Fisch er, W., and Theiler, J. (202 5). Real-time monitoring o f SARS-CoV-2 evolution during the COVID-19 pandemic. Cel l Host & Micr ob e , 33(11):18 02–1 806. 26 [35] Kor ber , B., Fischer, W. M., G nanak aran, S., Y o on, H., Theiler, J., Abfalterer, W., Hengartner, N., Giorgi, E . E., B hattachary a, T., F oley , B., Hastie, K. M., Parker, M. D., Partridge, D. G., Ev ans, C. M., F ree man, T. M., de Silv a, T. I., McDanal, C., P erez, L. G., T ang, H., Mo on- W a lker, A., Whelan, S. P ., La Branche, C. C., Saphire , E. O ., and Monteﬁori, D. C. (2 020). T racking changes in SARS-CoV-2 spike: evidence that D61 4G increases infectivity of the COVID-19 virus. Cel l , 182(4):81 2–82 7. [36] LeCun, Y., Bengio, Y., and Hinton, G. (201 5). Deep lea rning. natur e , 5 21(755 3):436– 444. [37] Ling-Hu, T., Rios-Guzman, E ., Lor enzo-Redondo , R., O zer, E. A., and Hultquist, J . F. (2022). Challenges a nd opp ortunities for glo bal genomic surveillance strategie s in the COV ID-19 era . Viruses , 14 (11):253 2. [38] Lop ez, V. K ., Cramer , E. Y., Pagano, R., Drake, J. M., O’Dea, E. B., Adee, M., Ayer, T., Chhatw al, J., Da lgic, O. O ., Ladd, M. A., et al. (20 24). Challenges of COVID -19 Case F orecasting in the US, 2 020– 2021 . PL oS Computational Biolo gy , 20(5):e101 1200 . [39] Lyu, H., Im tiaz, A., Zhao, Y., and Luo , J. (2023). Human behavior in the time of COVID-19: Learning fro m big data. F r ontiers in Big Data , 6 :10991 82. [40] McGow an, C. J., Bigge rstaﬀ, M., Johansso n, M., Apfeldorf, K . M., Ben-Nun, M., Bro o ks, L., Conv ertino, M., Errag unt la, M., F arrow, D. C., F ree ze, J., Ghosh, S., Hyun, S., Kandula, S., Lega, J., Liu, Y., Mic haud, N., Morita, H., Niemi, J., Ra makrishnan, N., Ray , E. L., Reich, N. G., Riley , P ., Sha man, J., Tibshirani, R., V espignani, A., Zhang, Q., Reed, C., and The Inﬂuenza F oreca sting W o rking Group (2019). Collab ora tive eﬀor ts to fo recast s easonal inﬂuenza in the united states, 2015– 2016 . Scientiﬁc R ep orts , 9 (1):683. [41] McInnes, L., Healy , J., and Melville, J. (20 18). UMAP: Uniform manifold approximation a nd pro jection for dimensio n reduction. arXiv pr eprint arXiv:180 2.03426 . [42] McKay , M. D., Beckman, R. J., and Conover, W. J . (1979). A comparis on of three metho ds for selecting v alues of input v a riables in the analysis of o utput from a co mputer co de. T e chnometrics , 21(2):239 –245 . [43] Murph, A. C., Be esley , L. J., Gibson, G. C., Cas tro, L. A., Del V alle, S. Y., and Osth us, D. (2026). B eyond equal weights: A disea se-agno stic approach to ense m ble learning for infectious disease forec asting. Manuscript under r eview a t Nature Communications. [44] Murph, A. C., Gibson, G. C., Amona, E. B., Beesley , L. J., Castro , L . A., Del V alle, S. Y., and Osthus, D. (2025 ). Synthetic method of a nalogues for emerging infectious disea se forecas ting. PLOS Computational Biolo gy , 21(6 ):e10132 03. [45] Nair, N. G., Satpa th y , P ., Christo pher, J., et al. (2019 ). Cov ariate shift: A review a nd analy sis on classiﬁers . In 2019 Glob al Confer enc e for A dvanc ement in T e chnolo gy (GCA T) , pag es 1 –6. IEEE. [46] Osthus, D. and Moran, K. R. (20 21). Multiscale inﬂuenza forecasting. Natur e Communic ations , 12(1):299 1. [47] O’T o ole, ´ A., Sc her, E., Underwoo d, A., Jackson, B., Hill, V., McCrone, J. T., Colq uhoun, R., Ruis, C., Abu-Dahab, K., T aylor, B., Y ea ts, C., Du Plessis, L., Maloney , D., Medd, N., A ttw oo d, S. W., Aanensen, D. M., Holmes, E. C., P ybus, O. G., and Rambaut, A. (20 21). Assignment of epidemio logical lineages in an emerg ing pandemic using the pangolin to ol. Virus evolution , 7(2):veab064. [48] Ray , E. L., W ang, Y., W o lﬁnger, R. D., and Reich, N. G. (2025 ). Flusio n: Integrating m ultiple data source s for accura te inﬂuenza pr edictions. Epidemi cs , 50:1 0081 0. [49] Roster, K ., Connaug h ton, C., and Ro drig ues, F. A. (2 022). F ore casting new diseas es in low-data settings using transfer learning . Chaos, Solitons & F r actals , 161 :11230 6. [50] Shadb olt, N., Br ett, A., Chen, M., Marion, G., McKendrick, I. J., Pano vsk a-Griﬃths, J., Pellis, L., Reeve, R., and Swallow, B. (20 22). The c hallenges of da ta in future pandemics. Epide mics , 40:100 612. 27 [51] Sherra tt, K., Gruson, H., Grah, R., J ohnson, H., Niehus, R., Pra sse, B ., Sa ndmann, F., Deuschel, J., W olﬀram, D., Abb ott, S., Ullric h, A., Gibson, G., Ray , E. L., Reic h, N. G., Sheldon, D., W ang, Y., W a ttanachit, N., W a ng, L., T rnk a, J., Ob oz inski, G., Sun, T., Thanou, D., Pottier, L., Krymov a, E., Meinke, J. H., Bar barossa , M. V., Leithauser, N., Mohr ing, J., Sch neider, J ., Wlazlo, J., F uhrma nn, J., Lange, B., Ro dia h, I., Bac cam, P ., Gurung, H., Stage, S., Suc hoski, B., Budzinski, J., W alr av en, R., Villanuev a, I., T ucek, V., Smid, M., Za jicek, M., P erez Alv a rez, C., Reina, B., Bosse, N. I., Meakin, S. R., Ca stro, L., F airchild, G., Michaud, I., Osthus, D., Alaimo Di Lor o, P ., Maruotti, A., Eclerov a, V., Kraus , A., K raus, D., Pribylov a, L., Dimitris, B., Li, M. L., Sa ksham, S., Dehning, J., Mohr, S., P riesemann, V., Redlar ski, G., Bejar, B., Ardeng hi, G., Parolini, N., Zia relli, G., Bo ck, W., Heyder , S., Hotz, T., Singh, D. E., Guzman-Merino, M., Aznarte, J. L., Morina, D., Alonso, S., Alv arez, E., Lop ez, D., Prats, C., Burg ard, J. P ., Ro dlo ﬀ, A., Zimmermann, T., Kuhlmann, A., Zib ert, J., Pennoni, F., Divino, F., Catala, M., Lo vison, G., Giudici, P ., T ara ntin o, B., Barto lucci, F., Jona Lasinio, G., Ming ione, M., F arcomeni, A., Sr iv as tav a, A., Mo nt ero -Manso, P ., Adiga, A., Hurt, B., Lewis, B., Mara the, M., Porebski, P ., V enk a tramanan, S., Bartczuk, R. P ., Dreger, F., Gambin, A., Gogolewski, K ., Gruziel-Slomk a , M., Kr upa, B., Moszy ´ nski, A., Niedzielewsk i, K ., Nowosielski, J., Radwan, M., Rakowski, F., Semeniuk, M., Szczurek, E., Zielins ki, J., Kisiele wski, J., Pab jan, B., Holger, K., K heifetz, Y., Scholz, M., Biece k, P ., Bo dych, M., Filinsk i, M., Idzikowski, R., Krueg er, T., Ozans ki, T., Bracher, J., a nd F unk, S. (20 23). Pr edictive p erforma nce of m ulti-mo del e nsemble forecasts of COVID-19 across Euro p ean na tions. eLif e , 12:e8 1916 . [52] Shu, Y. and McCauley , J . (2017 ). GISAID: Global initia tive on sharing a ll inﬂuenza da ta–from vision to r eality. Eu r osurveil lanc e , 2 2(13):30 494. [53] T racy , M., Ce rd´ a, M., and Keyes, K. M. (2018). Agent-based mo deling in public health: cur rent applications and future directions. Annual R eview of Public H e alth , 39(1):77– 94. [54] T urtle, J ., Riley , P ., Ben-Nun, M., and Riley , S. (2 021). Accura te inﬂuenza forecasts using type- sp eciﬁc incidence data for small geo graphic units. PL oS Computational Biolo gy , 17 (7):e1009 230. [55] Universit y of Pittsburgh (20 26). P ro ject tycho. https: //www .tych o.pitt.edu/data/ . Acces sed 2026- 02-03 . [56] V aswani, A., Shazeer, N., Parmar, N., Uszkoreit, J ., Jo nes, L., Gomez, A. N., Kaiser , L., and Polosukhin, I. (2 017). At tention is all you need. A dvanc es in neu r al information pr o c essing systems , 30. [57] W orld Health Or ganization (2020). Pneumonia of unknown cause – China. https: //www .who. int/emergencies/disease- outbreak - news/item/2020- DON229 . Accessed: 2026- 03-04 . 28 Supplemental Material to “Lev eraging Se quence and Syn theti c Data to Impro v e Epid emic F orecasting” S1 Description of M utAn tiGen Input P arameters In the MutAnt iGen ABM [ 32 , 2 ], age n ts may b e b orn, may die, may hav e contact with o ther hosts (whic h may or may not lead to transmission), and may re cov er fr om an infection. E ach individual carries information on all viruses by which they hav e alrea dy b een infected, as well as informatio n ab out any current vir al infection. The information o n all curr ent and past viral infections includes the a nt igenic type, the cross-immunit y b etw een a nt igenic types, the antigenic distance be t ween the challenging strain and the hos t’s history of encountered strains, pr obability of infection, and a ny imm unities. Viruses may mutate in antigenic phenotype, which can create turnover and outbrea k in the po pulation when there is little cross-immunit y in a po pulation to the emer gence of a new phenotype. Viruses also mutate in a “loa d” phenot yp e that aﬀects their a bsolute ﬁtness, indep endent of immunit y in the hos t p o pulation. W e will now discuss in more d etail the parameters that w ere set ﬁxed in the MutAn tiGen ABM, the parameters ov er which w e sampled, and the justiﬁcations behind the ra nges for the sampled parameters. The list of p ossible pa rameters for the MutAntiGen ABM, their des criptions, and what they w ere set for our sim ulation runs, ar e as follows. P ara meters with an astrix were sampled accor ding to the sampling pro cedures describ ed in the main paper and accor ding to T able S1 (and therefor e are not given a set v a lue here). 1. burnin : Num b er of days to wait be fore logg ing o utput. This parameter was set to 0. 2. endDa y : T otal num ber o f days to s imu late. This para meter was set to 36 50. 3. printStep : Interv al in days at which output is written to out.ti meser ies . This par ameter was set to 7. 4. tipSamplingSt ar tDa y : Day on which sampling for the phylogenetic tree b eg ins, delay ed to av oid bas al multif urcatio n at time zer o. This par ameter was set to 56. 5. tipSamplingEndDa y : Day on whic h sa mpling for the phylogenetic tree ends. This parameter was set to 3640. 6. tipSamplingRa te : Number o f samples stor ed p er deme pe r day . This parameter was s et to 1. 7. tipSamplesPerDeme : Maximum num b er of samples allow ed pe r deme. This parameter was set to 10 0000. 8. tipSamplingPropor tional : Indicator of whether sampling is prop or tional to prev alence when m ultiple demes are pres ent . This par ameter was set to tr ue. 9. treePropor tion : Prop ortion of sampled tips used in tree r econstruction. This par ameter was set to 1. 10. diversitySamplingCount : Number of samples drawn to co mpute diversit y , Ne τ , and serial int erv al. This pa rameter was se t to 1000 . 11. net auWindo w : Size of the time window, in days, ov er whic h Ne τ is calculated. This parameter was set to 100. 12. repea tSim : Indica tor o f whether the simulation is repea ted un til endD a y is reached. This parameter was s et to false. 13. immunityReconstruction : Indicator of whether immunit y r econstruction output is written to out .immu nity . This pa rameter was set to false. 29 14. memor yProfiling : I ndicator of whether memor y proﬁling is enabled, requiring a Jav a agent. This par ameter was set to fa lse. 15. yearsFromMK : Num b er of years considered a s present when c alculating MK s tatistics. This parameter was s et to 1.0. 16. pcaSamples : Indicator of whether the v irus tree is ro tated and ﬂipped using PCA. This pa- rameter was set to false. 17. det ailedOutput : Indicator of whether detailed host and virus output ﬁles are written to enable chec kpo in ting. This parameter was set to false. 18. rest ar tFromCheckpoint : Indica tor of whether the p opulatio n is lo aded fr om a previous chec kpo in t. This parameter was s et to false. 19. hostImmuneHisto r ySampleCount : Num ber of host immune histories sa mpled when com- puting mean host imm unity . This para meter was se t to 100 00. 20. fitSampleCount : Number of vir al ﬁtness samples co llected. This pa rameter was set to 100. 21. printFitSamplesStep : Interv al in days at which v iral ﬁtness samples are print ed. This pa - rameter was set to 1000. 22. demeCount : Number of demes in the metap opulation. This para meter was set to 1 . 23. demeNames : Names ass igned to each deme. This parameter was set to "tropi cs" . 24. initialNs ∗ : Initial p opulatio n sizes o f ea ch deme. 25. bir thRa te : Per-individual daily birth ra te. This parameter was s et to 0.0 00091 . 26. dea thRa te : Per-individual daily death r ate. This parameter was set to 0 .0000 91. 27. sw apDemography : Indicator o f whether overall p opulatio n size is kept c onstant. This param- eter was set to true. 28. initialI : Initial n umber o f infected individuals. This par ameter was s et to 7 406. 29. initialDeme : Index of the deme in which infectio n is initiated. This par ameter w as se t to 1. 30. initialPrR : P rop or tion of the p opulatio n with pr ior imm unity to the initial s train. This pa - rameter was set to 0.5088 . 31. bet a ∗ ( β ): T ransmiss ion r ate in contacts p er individual p er da y . 32. nu ∗ ( ν ): Recov ery rate in recov eries per individual p er day . 33. betweenDemePr o : Relativ e transmission rate b etw een demes compared to within-deme trans- mission. This parameter was s et to 0.00 00. 34. externalMigra tion : Additional infections contributing to the fo rce o f infection. This par am- eter was set to 200.0. 35. transcendent al : Indicator o f whether a gener al recovered clas s is included. This parameter was set to false. 36. immunityLoss : Rate of immunit y loss from recov ered to susceptible p er individual per day . This par ameter was set to 0 .0. 37. initialPrT : Initial fraction of the po pulation in the general recovered class. This pa rameter was set to 0.0. 38. back gr oundImm unity : Indicator of whether the p opulation starts with ba ckground imm unity to the cur rent str ain. This parameter was set to false. 30 39. back gr oundDist ance : Antigenic distance o f background imm unit y from the initial stra in. This par ameter was set to 0 .2. 40. demeBaselines : Baseline se asonal for cing for ea ch deme. This parameter was set to 1. 0 . 41. demeAmplitudes ∗ : Amplitude of seasonal forcing for ea ch deme. 42. demeOffsets : Seasonal phas e oﬀsets fo r ea ch deme relative to the year. This parameter was set to 0. 0 . 43. phenotypeSp ace : Representation of pheno type space used in the mo del. This par ameter was set to "m utLoad " . 44. lambda ∗ ( λ ): Deleterious mut ation ra te p er ge nome p er transmissio n. 45. mutCost ∗ : Fitness cost asso ciated with deleterious mutations. 46. probLethal : P robability that a mutation is lethal. This pa rameter was set to 0 .0. 47. epsilon : Beneﬁcial mutation ra te, sca led with p opulation size. This parameter was set to 0.16. 48. epsilonSlope : Dep endence of the be neﬁcial mutation rate on mean m utational lo ad. This parameter was s et to 0.0. 49. epsilon mut ∗ : See discussion b elow. 50. initialI pr op ∗ : See discussion b elow. 51. lambdaAntigenic ∗ : An tigenic mutation r ate p er tra nsmission. 52. meanAntigenicSize ∗ : Mean siz e of antigenic m utations. 53. antigenicGammaShape : Shap e parameter of the ga mma distribution for antigenic mutation sizes. This parameter was s et to 2.0. 54. thresholdAntigenicSize : Minim um antigenic mutation size r equired for a mutation to be retained. This parameter was s et to 0.01 2. 55. antigenicEvoSt ar tD a y : Day on which a ntigenic mutations b egin to o ccur. This para meter was set to 0. 56. cleanUpDist ance : Ant igenic distance b eyond which stra ins are remov ed from the simulation. This par ameter was set to 0 .2. 57. demoNoiseScal er : Scaling fa ctor a pplied to demo graphic nois e. This parameter was set to 0.0. 58. muPhenotype : Phenotypic m utation rate p er individual p er day . This pa rameter was set to 0.0. 59. smithConversion : Multiplier conv erting a nt igenic distance in to c ross-immunit y . This par am- eter was set to 0.1. 60. homologou sImmunity : Imm unity level conferr ed by antigenically identical viruses . This pa- rameter was set to 0.95. 61. initialTraitA : Initial v alue of the ﬁrs t dimension o f host immunit y . This para meter was set to -6.0 . 62. meanStep : Mean mutation step size. This pa rameter was se t to 0.3. 63. sdStep : Standard deviation of mutation step size. This para meter was set to 0 .3. 64. mut2D : Indicator o f whether m utations o ccur in a full tw o-dimensional angular space. This parameter was s et to false. 31 T a ble S1: Initial para meter r anges of the Latin Hyper cube Sample Parameter Description Initial LHS Lo wer Bound Initial LHS Upp er Bound N Population Size 1 . 0 × 10 4 1 . 0 × 10 4 demeAmplitudes Strength of Seasonality F orcing 0 2 . 0 × 10 − 1 lambdaAntigenic Antigenic mutation rate per transmission 8 . 57 × 10 − 5 2 . 57 × 10 − 3 meanAntigenicSize Mean eﬀect size of antigenic mutations 1 . 2 × 10 − 3 1 . 2 × 10 − 1 lambda Deleterious m utation rate p er genome p er t ransmission 9 . 5 × 10 − 3 4 . 08 × 10 0 mutCost Fitness cost associated with deleterious mutations 8 . 0 × 10 − 4 8 . 0 × 10 − 2 beta T ransmission rate in contacts p er individual p er day 1 . 43 × 10 − 1 2 . 25 × 10 0 nu Recov ery rate in recov eries p er individual p er day 7 . 14 × 10 − 2 2 . 5 × 10 − 1 epsilon mut Multiplier to the baseline scaling of the fraction of beneﬁcial mutations 5 . 0 × 10 − 1 1 . 5 × 10 0 initialI prop Initial p rop ortion of p opulation size infected 1 . 0 × 10 − 4 1 . 0 × 10 − 3 The full set of bo unds for the parameters that were sa mpled via a L atin hs S2 Computational Mo diﬁcations to the MutAn tiGen Soft w are There w ere several challenges to running a massive amount of MutAn tiGen sim ulations ov er these parameter r anges. First, it is p ossible for a g iven sim ulation to “br eak” due to the current state of the system b ecoming untenable for the given co mputational resources 1 . Se cond, it w as nearly imp os sible to predict ho w long a given simulation might take: for two simulation runs that w ere very similar in terms of parameter settings , one might complete fully in 3 0 minut es, w hile the other might s tall out after r unning for 7 days. Third, not all r uns showed disease turnov er a nd b ehaviors of mult iple antigenic mutations. La stly , the o riginal co ding of MutAntiGen w as not w ell-desig ned for running thousands of simulations in para llel on an HPC environmen t. Several mo diﬁcations were co ded a nd strategies taken to extra ct robust a nd diverse synthetic data via the MutAn tiGen ABM and to ha ndle the ab ov e challenges. First, we r e-co ded the mo del to a llow for multiple simulations to be run in par allel on the same HPC system; this modiﬁed v ersion is av ailable on this pap er’s GitHub pag e ( https: //git hub.c om/lanl/precog ). This re-co ding also disabled s ome functionality of MutAntiGen that was no t necessary to these exp eriments a nd was stalling o ut some of the simulations. Second, w e created a “max wall time” that would cut simulations that did not complete within a cer tain time. While this cut oﬀ may bia s the sampling, w e did not have the computational resour ces to allow a g iven run to go for longer than 10 hours. Lastly , we p er formed a mo diﬁed spac e-ﬁlling design fo r the selection of parameter s of the rang es o f interest that inco rp orated sub ject matter exp er tise: 1. W e p er formed s im ulations ov er a spa rse L atin-Hyp ercub e s ample (LHS) [ 42 ] o f size 400 0 over the para meter r anges given ab ove. Of these 400 0 s amples, 815 completed. 2. F ollowing this initial LHS, we deter mined which s ets of input par ameters led to simulations that represented antigenic mutation and turnover in do minant antigenic type. There were 151 se ts o f such input pa rameters, which cor resp onded to ar ound 18 % of the co mpleted simulations. 3. W e p erfor med 2 0 simulations a t ea ch of these input parameter s, using a diﬀerent seed for ea ch input; since MutAn tiGen is heavily sto chastic, these replications still g av e very diﬀerent r esults. As an example, Fig ure S1 shows 10 of the 20 replic ations for a sing le input para meter set. 4. Of the simulations run in replicate, 183 2 completed successfully . Around 65% of these co mpleted simulations exhibited antigenic m utation and turnov er in domina n t antigenic t yp e. This ﬁnal set was used to train the foreca sting mo del. 1 All computations w ere p erformed on a n HPC cluster running Red Hat En terprise Linux 8.10 with Slurm, using dual-so c ket In tel Xeon E5-2695 v3 x86 64 no des (56 hardware threads, ∼ 256 GB RAM per no de) compiled wi th GCC 8.5.0. 32 6 7 8 9 10 1 2 3 4 5 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 2.5 5.0 7.5 10.0 0 50000 100000 150000 0 10000 20000 30000 40000 50000 0 50000 100000 150000 200000 250000 0 25000 50000 75000 0e+00 1e+05 2e+05 3e+05 0 20000 40000 60000 0 25000 50000 75000 100000 0 50000 100000 150000 0e+00 1e+05 2e+05 3e+05 0e+00 1e+05 2e+05 3e+05 4e+05 T otal Cases Figure S1: 10 MutAn tiGen output simulated from identical input para meters.. S3 Observ ation M o d el Details Let y 1: T be a time series of le ngth T . In this se ction, we describ e the observ ation mo del. That is, we descr ibe how we sto chastically convert y 1: T int o z 1: T ′ . F or every MutAntiGen time series y 1: T , the observ ation mo del cr eates 20 new time ser ies. The obser v a tion mo del has thre e steps: (1) scaling, (2) noise and (3) o utliers. All 20 observ ation mo del time series underg o scaling, 10 of the observ atio n mo del time ser ies underg o no ise, a nd ea ch of the 20 realiza tions has a 25% chance to undergo outliers. Below we detail the o bserv ation mo del. Scaling. W e sample a new n umber of time steps T ′ for the time series x 1: T ′ , where T ′ ∼ Uniform(52 , T ) (uniform over the integers). W e then linea rly int erp ola te the original time series y 1: T to be o f leng th T ′ , resulting in a new time s eries x 1: T ′ . In R co de, this is done via the a pprox( ) function: x[1:T’ ] = approx(x = 1 :T, y = y, xout = seq(1, T, length. out = T’)) $ y This new time series x 1: T ′ has the same dynamics as in y 1: T , but those dynamics can o ccur o n a fas ter time scale than y 1: T . The smaller T ′ , the faster the time scale. Noise. Next, we add noise to x 1: T ′ , r esulting in a new time series v 1: T ′ . The pr o cess for adding no ise is as follows: κ ∼ Uniform(1 . 5 , 3 . 5) ǫ t | κ ∼ Uniform( κ − 1 , κ ) v t = x t ∗ ǫ t Note that κ is common to the time series x 1: T ′ while ǫ t is diﬀerent for each element o f the time s eries. The no ise is multiplicativ e where ǫ t ∈ [3 . 5 − 1 , 3 . 5 ]. When κ is close to 3 .5, mor e noise is added to the time s eries. When κ is close to 1.5, les s noise is added. If x 1: T ′ is a time series that we do not add noise to, then we simply set v 1: T ′ to x 1: T ′ . Outliers. Time series of public health data can ha ve high outliers due to newly discov ered da ta dumped o nt o a single date or low outliers due to non-rep orting. It is impor tant to have training data that a ccurately reﬂe cts real-world data, other wise a machine learning mo del — having never seen outliers — may assume a large leap or drop in the time series reﬂects a meaningful change in the time series dynamics rather than a spurio us change in the da ta collection pro cess. W e r andomly select b etw een 5 and 10 time steps ∈ { 1 , 2 , . . . , T ′ } to b e outliers. Half o f those time steps represent low o utliers and the other half r epresent high outlier s. W e deﬁne multipliers λ t for all 33 times t ∈ { 1 , 2 , . . . , T ′ } as follows: λ t =      Uniform(2 , 10) if t is a high outlier Uniform(0 , 0 . 05) if t is a low outlier 1 if t is not an outlier Finally , we deﬁne the output time ser ies o f our obse rv a tion mo del, z 1: T ′ as follows: z t = v t ∗ λ t . 10 realizations of z 1: T ′ are plotted in Fig ure 4 . Application. T o crea te the synthetic to tal cases training data, we a pply the observ ation mo del to the approximately 1,800 MutAntiGen o utputs where ea ch input is r ecycled 2 0 times making 20 new z 1: T ′ time series . T o create the synthetic v ariant attributable ca ses tr aining data, we sample up to ten of the v ar iant tim e series and, fo r each one o f them, we apply the scaling plus outlier r outines (making one new z 1: T ′ time se ries) and separately a pply the s caling plus noise plus outlier routines (mak ing another new z 1: T ′ ). That is, ea ch v a riant time ser ies mak es t wo new z 1: t ′ time series. The total num ber of time s eries for training , synthetic total cases or synthetic v a riant attributable cases , results in ab out 36,000 time serie s (refer to T able 3 for sp eciﬁcs). S4 T ransformer Mo del Details Data and ob jective. W e consider a collection o f univ ar iate time series, y 1: T , ea ch pr ovided as observ ations of nonnegative counts y t (“cases” ) indexed b y time t . Our goal is probabilis tic f orec asting: given the most rece n t C = 20 observ ations, we predict the distribution of the next H = 4 observ ations. F o recast uncertaint y is represe n ted via predictive quantiles at levels τ ∈ T = { 0 . 00 05 , 0 . 005 , 0 . 01 , 0 . 025 , 0 . 05 , 0 . 1 , . . . , 0 . 9 , 0 . 9 5 , 0 . 975 , 0 . 99 , 0 . 995 , 0 . 9995 } where τ ∈ [0 , 1] and |T | = 27. T raining examples (roll ing wi ndo ws). T raining uses ra ndomly sampled rolling windows. F or a series y 1: T , we rep eatedly sample a s tart time t such that a co nt iguous blo ck of length C + H ex ists, and deﬁne the input and outputs (tar gets) as y in t = { y t − C +1 , y t − C +2 , . . . , y t } , y out t = { y t +1 , . . . , y t + H } . Mini-batches a re formed by sampling a series with pr obability pr op ortiona l to its num ber o f av a ilable windows, then sa mpling a window uniformly within that ser ies. P er-window normalization and rescaling. T o stabilize tra ining a cross se ries with diﬀer ent mag- nitudes of case co unts, each input window is nor malized by its own max im um m t = max( y in t ) , z in t = y in t /m t , z out t = y out t /m t , with the conv en tion that if m t = 0 (or is non-ﬁnite) we do not resca le that windo w. The model op erates o n z in t and outputs for ecasts on the no rmalized scale. Predictions are returned to the or iginal scale b y m ultiplying by m t . Input p erturbation (augme n tation). T o improv e robustness to observ ation no ise and related data-gener ating irregula rities, eac h sampled training example is duplica ted with a pe rturb ed input context: ˜ y t = y t ∗ u t , u t iid ∼ Uniform(0 . 85 , 1 . 15) . Thu s, ˜ y in t = { ˜ y t − C +1 , ˜ y t − C +2 , . . . , ˜ y t } is a per turb ed version of y in t . The future targe ts y out t are unch ange d. What this seeks to accomplish is a more sta ble learning represe n tation. By providing the mo del with tw o distinct inputs that map to the sa me output, the mo de l should b e b etter p ositioned to learn s table underlying representations and less lik ely to learn s purious ones. This yields an eﬀectiv e batch size that is twice the nominal ba tch size. 34 Mo del arc hitecture. W e r epresent the forecas ting function with a compa ct transformer enco der. Each sca lar input in z in t is mapp ed to a d - dimensional embedding ( d = 256) via a learned linear pro jection, and a lea rned positiona l embedding is added for the C time p o sitions. The em b edded sequence is pr o cessed by a transformer enco der with L = 2 lay ers and n head = 4 attention hea ds (feedforward dimension 5 12). The ﬁnal hidden state (at the most r ecent input time po int ) is mapp ed through a linear readout to pro duce H × |T | outputs. Quan tile parameterization. The model outputs predictive quantiles ˆ z h,τ for each horizon h = 1 , . . . , H and qua nt ile le vel τ ∈ T . T o pr even t quant ile c rossing, q uantiles are constructed us ing a ba se quantile plus no nnegative increments: ˆ z h,τ 1 = a h, 1 , ˆ z h,τ k = ˆ z h,τ 1 + k X j =2 softplus( a h,j ) , k = 2 , . . . , |T | , where a h,j are unco nstrained net work outputs and softplus( x ) = lo g(1 + e x ) ensures pos itivit y of the increments. F o recasts on the o riginal scale ar e obtained as ˆ y h,τ = m t ˆ z h,τ . Loss function (pin ball lo ss). Parameters are estimated by minimizing the mea n quantile (pin ball) loss ov er ho rizons and qua n tile levels. F or obser v atio n y t + h and predicted qua nt ile ˆ y t + h,τ , ρ τ ( y t + h − ˆ y t + h,τ ) = max  τ ( y t + h − ˆ y t + h,τ ) , ( τ − 1)( y t + h − ˆ y t + h,τ )  , and the training o b jective is the average of ρ τ across all ( h, τ ) and all examples in each mini-batch. Optimization, v alidation, and mo del selection. W e optimize the model using Adam with a cosine-decayed learning rate schedule (from 5 × 10 − 4 to 5 × 10 − 5 ) ov er the prescrib ed num ber o f weigh t upda tes (we used 97,65 6). V alidation uses a ﬁxed set of ra ndomly sampled windows (held co nstant by a ﬁxed seed). W e maintain a n e xp onential moving average (EMA) of par ameters with s mo othing co eﬃcient α EMA = 0 . 98 a nd select the ﬁnal mo del as the EMA c heckpoint with the low est v alidation loss. S5 Bo otstrapping Details In this study , nine diﬀerent mode ls made 27,5 40 fo recasts (5 1 states/ter ritories × 1 35 forecast da tes × 4 forecas ts horizons). Our go al is to deter mine which mo dels pro duce fo recasts that ar e statisti- cally s igniﬁcantly b etter than other mo dels. As a zero th o rder pass, we c an compute the metric o f int erest (e.g ., MAE o r WIS) for each mo del ov er the entire s et o f forec asts, a nd c ompare the num b ers. Whic hever mo del pro duced the be st num ber is the b est forec asting mo del. This, how ever, do es not take into account uncertaint y . T o ge t an e stimate of uncerta in ty , we turn to bo o tstrapping [ 19 ]. This requires us to resample the 2 7,540 forecas ts with repla cement. T o deal with the grouping s (e.g., s tates and forecas t dates), we ﬁrst sample states with replacement and then, for a sampled state, sample nine diﬀerent blo cks of time (where a blo ck of time is 1 5 co nsecutive weeks of forecast da tes). F or each state/time blo c k, we keep a ll hor izons and all mo dels. This r esults in a res ampling o f 2 7,540 for ecasts per mo del (5 1 resampled s tates × 9 resa mpled time blo cks × 15 dates p er blo ck × 4 horizo ns = 27,5 40 forecasts). F or each resampled data set, we compute each mo del’s MAE and WIS. W e repeat this data set resampling pro cedure 5,0 00 times. The r esults of the blo ck b o otstra p pro cedur e (blo ck) and the v a nilla bo otstrap pro cedure which ig nores groupings (iid) ar e shown in Figure S2 . T aking into account the g rouping v ariables in the blo ck bo otstrap pr o cedure results in wider uncertaint y interv a ls. The 95% conﬁdence int erv als presented in the main text were derived following this blo ck bo o tstrap pro cedure. 35 M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) iid block iid block iid block iid block iid block iid block iid block iid block 0.7 0.8 0.9 1.0 rMAE M(r ,t) M(st,t) M(sv ,t) M(a,t) M(r ,v) M(st,v) M(sv ,v) M(a,v) iid block iid block iid block iid block iid block iid block iid block iid block 0.6 0.7 0.8 0.9 1.0 rWIS Figure S2: B oxplots for 5,000 b o otstrap samples for rMAE and rWIS from the blo ck ed bo otstra p pro cedure (block) taking in to account state and forecast date groupings and the v anilla bo otstr ap pro cedure (iid) that d o es not addres s groupings . Wider uncertainties are observed when taking account of the gr ouping v ariables . The white, dotted hor izontal line is the p oint estimate for rMAE and rWIS. 36

Leveraging Synthetic and Genetic Data to Improve Epidemic Forecasting

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment