Amortized Bayesian inference for actigraph time sheet data from mobile devices


Authors: Daniel Zhou, Sudipto Banerjee

UCLA Department of Biostatistics, 650 Charles E. Young Drive, Los Angeles, CA 90095-1772.

Abstract. Mobile data technologies use "actigraphs" to furnish information on health variables as a function of a subject's movement. The advent of wearable devices and related technologies has propelled the creation of health databases consisting of human movement data to conduct research on mobility patterns and health outcomes. Statistical methods for analyzing high-resolution actigraph data depend on the specific inferential context, but the advent of Artificial Intelligence (AI) frameworks requires that the methods be congruent with transfer learning and amortization. This article devises amortized Bayesian inference for actigraph time sheets. We pursue a Bayesian approach to ensure full propagation and quantification of uncertainty using a hierarchical dynamic linear model. We build our analysis around actigraph data from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study conducted by the Fielding School of Public Health at the University of California, Los Angeles. Apart from achieving probabilistic imputation of actigraph time sheets, we are also able to statistically learn about the time-varying impact of explanatory variables on the magnitude of acceleration (MAG) for a cohort of subjects.

Date: February 25, 2026.

1. Introduction

Recent developments in wearable device technologies have led to the collection of mobile health data [James et al., 2016, Drewnowski et al., 2020] at very high resolutions (e.g., step rates, blood pressure, heartbeat, activity counts, etc.). Actigraphs are small portable devices with motion sensors, or accelerometers, that detect physical movement by measuring acceleration along different axes and are able to collect large amounts of data [Plasqui and Westerterp, 2007, Sikka et al., 2019]. They are increasingly conspicuous because of their affordability, accuracy, and availability in smartphones, smartwatches, and other wearable devices. The collected data can be quickly downloaded and are easily accessible for statistical analysis to obtain insight into their pattern and structure.

There is a rapidly emerging literature on the statistical analysis of data streaming from wearable devices [Chang and McKeague, 2022, Luo et al., 2023, Banker and Song, 2023]. Actigraph data analysis, which can be regarded as a subset of streaming data analysis, is witnessing rapid growth in interest among biomedical and health scientists seeking to understand how environmental factors interact with the personal attributes of a subject to define activity-related health outcomes. Examples include, but are not limited to, the use of data from wearable devices as biomarkers and risk factors in the study of adverse respiratory health outcomes [Kim et al., 2024]. In this article, we devise a framework to carry out temporal analysis of an original actigraph data set from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study. The scientific objectives of the PASTA-LA study are broad and diverse.
This article is concerned with devising a transfer learning framework for the PASTA-LA study using amortized Bayesian inference [Radev et al., 2022, Sainsbury-Dale et al., 2024, Zammit-Mangion et al., 2025], which constitutes a key component of producing fast and reliable inference using an AI-based interface for health providers. The outcome we model is the magnitude of acceleration (MAG). Let x, y, and z be the dynamic acceleration of the body along the three axes. The MAG at time point t is defined as

(1)    MAG_t = \sqrt{x_t^2 + y_t^2 + z_t^2},    t = 1, ..., T.

However, the instantaneous MAG evaluated at the original frequency of 30 Hz is erratic and does not adequately represent the subject's intensity of physical activity at that time. To amend this problem, we average the MAG values over 20-second time steps, using an approach similar to [Migueles et al., 2017, Doherty et al., 2017].
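To make Equation (1) and the 20-second averaging concrete, the following is a minimal sketch of that computation; the data frame layout, column names, and the 30 Hz sampling grid are illustrative assumptions rather than the PASTA-LA processing pipeline.

```python
import numpy as np
import pandas as pd

def mag_epochs(acc: pd.DataFrame, epoch_s: int = 20) -> pd.Series:
    """Instantaneous MAG (Equation 1) averaged over fixed epochs.

    `acc` is assumed to hold raw tri-axial accelerations in columns
    'x', 'y', 'z', indexed by a DatetimeIndex sampled at 30 Hz.
    """
    mag = np.sqrt(acc["x"] ** 2 + acc["y"] ** 2 + acc["z"] ** 2)
    # Average the erratic 30 Hz MAG over 20-second windows.
    return mag.resample(f"{epoch_s}s").mean()

# Toy usage with simulated 30 Hz readings for one minute.
idx = pd.date_range("2017-05-01 07:00:00", periods=30 * 60,
                    freq=pd.Timedelta(seconds=1 / 30))
rng = np.random.default_rng(0)
acc = pd.DataFrame(rng.normal(size=(len(idx), 3)), index=idx, columns=["x", "y", "z"])
print(mag_epochs(acc).head())
```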
To motivate people to exercise, one of our interests is in deploying a human-interpretable system that takes into account environmental factors in the data to preemptively construct a path or paths to exercise. Doing so would provide a numerical benchmark for people to achieve as part of an exercise goal, as well as reduce or eliminate the overhead of having to decide on a route for exercise beforehand. Thus, the domain demands a temporal regression model, which can be employed for analysis and feedback, as well as a generative model to sample new trajectories should they be deemed relevant. Our specific contribution is to train neural networks using hierarchical dynamic models [West and Harrison, 1997] within the BayesFlow learning framework [Radev et al., 2022]. In this process, we demonstrate the utility and structural flexibility of amortized inference for analyzing MAGs from wearable devices. Bayesian analysis using exact posterior sampling from the hierarchical dynamic linear model serves as our benchmark for evaluating the performance of amortized inference. While much of the statistical distribution theory used here is fairly familiar, we believe that the transfer learning framework devised here for wearable device data is a novel contribution.

The remainder of the article proceeds as follows. Section 2 presents some details on the processing and structure of our actigraph data. Section 3 introduces the actigraph time sheet and explains how its format accommodates analysis using dynamic linear models. Section 4 provides an overview of amortized Bayesian inference as executed by BayesFlow. Section 5 extracts and organizes the essential modeling ingredients for training. Section 6 illustrates the training of the network using simulated data and applies the trained network to our actigraph time sheet. Finally, Section 7 presents a brief discussion to conclude the paper.

2. Actigraph Data

Actigraphy is a noninvasive method of monitoring human rest/activity cycles. The resulting data are collected from the relevant actigraph unit to assess the cycles of activity and rest over several days to several weeks. The data used in this study come from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study, conducted on a cohort of 460 individuals monitored between May 2017 and June 2018 to assess their physical activity in the Westwood neighborhood of Los Angeles over two separate periods of one week each.

Data were collected from various sources, including online questionnaires, a GPS device (GlobalSat DG-500), and a portable actigraph unit (ActiGraph GT3X+). The resulting actigraph data are joined with biological and livelihood factors such as age, height, weight, ethnicity, sex, and BMI; geospatial coordinates, time measures, and geographical measures such as latitude, longitude, altitude, distance from home, distance from work, and distance from some other point of interest; and the MAG actigraph measurement. The primary outcome, MAG, is directly measured from the motion of the actigraph sensor and has statistical relationships with other physical activity measures such as the metabolic equivalent of task (MET) and other energy expenditure measures [Hildebrand et al., 2014, Staudenmayer et al., 2015, Sasaki et al., 2011, Aguilar-Farias et al., 2019, Mortazavi et al., 2013]. The study protocols to protect participant information received the necessary approval from the institutional review board (IRB). The data were stored on a secure computer and a redacted version was created for data sharing purposes.

The GPS and actigraph devices were deployed on a nested sample due to cost considerations, and then retrieved from 94 of these individuals. We take a slightly roundabout approach and focus instead on the trajectories taken by the subjects: we trim the dataset further to specific days and trajectories that are both sufficiently active and fall within certain time intervals, to acquire a subset of trajectories that contain enough activity to be further extended as desired [Loro et al., 2022].

2.1. Preprocessing. The primary motivation to pursue inference on actigraph data is to build a recommender system that generates trajectories and intensities for prospective patients. This is especially relevant for people who do not exercise for a significant period of time in their day-to-day lives. We also infer that the particular time of day an individual exercises has more of an impact than the particular day on which they exercise, even more so as the days the subjects were recorded fell into the fall and spring seasons in the Westwood neighborhood of Los Angeles, where temperatures were less likely to be extreme.

Each trajectory taken by a subject has its own start time and end time and rarely, if ever, intersects temporally with other subjects' trajectories. This is because subjects are typically inclined to move and exercise at their own chosen times. Approaches must be taken to circumvent the sparsity and granularity of the raw data: the sparsity exists by design of the data collection procedure, and the granularity results from the precision of the time measurements recorded by the actigraph devices. We define trajectories as distinct when the gap between the end time of one trajectory and the start time of the next on the same day is greater than 3 minutes. To focus our analysis on the trajectories most relevant to recommend to prospective patients, we filter out trajectories that are shorter than 5 minutes, to minimize the influence of short bursts of activity that generally amount to a short walk, as well as trajectories longer than 22 minutes. We also filter out time points for which the MAG is less than 0.05, as a precaution against small amounts of activity that are not considered exercise, such as driving or riding an electric scooter around the neighborhood, which will involve small amounts of motion in the wrists that can be registered by the actigraph device. Finally, we cut off the remaining trajectories of between 20 and 22 minutes at the 20-minute mark. Hence, the exercise trajectories we consider comprise 5-20 minutes of activity over the course of 20-second time steps, at a relatively consistent rate of at least minimal activity.
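The segmentation and filtering rules above can be sketched as follows; the long-format epoch table and its column names are hypothetical placeholders, and the exact order of operations in the study pipeline may differ.

```python
import pandas as pd

def segment_and_filter(epochs: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the preprocessing rules in Section 2.1.

    `epochs` is assumed to hold one row per 20-second epoch, with columns
    'subject', 'date', 'time' (a Timestamp), and 'mag'; these names are
    placeholders, not the PASTA-LA field names.
    """
    df = epochs.loc[epochs["mag"] >= 0.05].sort_values(["subject", "date", "time"]).copy()
    # A gap of more than 3 minutes starts a new trajectory within a subject-date.
    gap = df.groupby(["subject", "date"])["time"].diff() > pd.Timedelta(minutes=3)
    df["traj"] = gap.groupby([df["subject"], df["date"]]).cumsum()
    # Keep trajectories of 5-22 minutes, then truncate at the 20-minute mark.
    grp = df.groupby(["subject", "date", "traj"])["time"]
    start = grp.transform("min")
    length = grp.transform("max") - start
    keep = (
        (length >= pd.Timedelta(minutes=5))
        & (length <= pd.Timedelta(minutes=22))
        & (df["time"] - start <= pd.Timedelta(minutes=20))
    )
    return df[keep]
```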
The sparsity of the actigraph dataset is then substantially reduced, so that the data set can be more easily managed by the desired algorithms. Figure 1 illustrates the change in sparsity in the data before and after the trajectory transformation. Under the absolute time step treatment, the data span approximately 9.0% of all possible time steps over each day covered by a subject, not all of which are directly relevant to their exercise periods or regimens (most people would not need a recommender system to assist in building an exercise routine that spans 7 am - 11 pm). Under the relative time step setup, the data span 45.2% of the total time points under consideration. This setup reduces the number of subjects to 92.

Finally, we recall from Equation (1) that the MAG is nonnegative by nature of its formulation and equals zero only when all of x_t, y_t, and z_t are zero. Since the dynamic linear model that we will apply to the data is designed for positive and negative outcomes, our main focus will be log(MAG), where log is the natural logarithm, which gives us outcomes that may be negative. Removing MAGs less than 0.05 in order to build our recommender system also removes numerical issues with taking the log of the minimum MAG (i.e., log(0)), while in principle allowing negative values below log(0.05) to translate into small MAGs and, therefore, small activity levels. Our choice of transformation, which can extend to other transformations of the MAG, will be less of an issue for amortized Bayesian inference and is merely chosen to give us a starting point for the dynamic linear model.

Figure 1. Left: the data covered by each subject for each date covered in the dataset under the absolute time step setup, where time ranges from 7 am - 11 pm. Right: the same under the relative time step setup, from the start of a trajectory to up to 20 minutes later, with trajectories going beyond 20 minutes filtered out or cut off at the 22-minute mark. The color corresponds to the MAG value for the subject at a particular date for that time. Gray cells correspond to missing data for the subject-date pair at that particular epoch.

3. The Actigraph Timesheet

Preprocessing the actigraph data set removes most of the data gaps (Figure 1) and reworks the data to fit a recommender system. However, not all trajectories reach the 20-minute mark, and some trajectories contain gaps, so that a trajectory need not span every time step up to its last recorded time step. Hence, we pair subjects with the dates they were active, so that we only record subject activities for the dates on which we would expect them to be active, and we index each trajectory further by its starting time point. This (subject, date, first time step) tuple suffices to index each trajectory in the dataset. The existing data are then used to train the model. We name this data rearrangement the actigraph timesheet.
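As a small illustration of this rearrangement (see Table 1 in the next subsection), a filtered long-format table like the one above can be pivoted into a (subject, date) by relative-time-step grid; again, the column names are assumptions made for illustration.

```python
import pandas as pd

def make_timesheet(df: pd.DataFrame, step_s: int = 20) -> pd.DataFrame:
    """Pivot filtered epochs into an actigraph timesheet.

    Rows are (subject, date) pairs, columns are relative 20-second time
    steps counted from each trajectory's first time step, and missing
    cells (the 'n/a' entries of Table 1) become NaN.
    """
    out = df.copy()
    start = out.groupby(["subject", "date", "traj"])["time"].transform("min")
    out["step"] = ((out["time"] - start).dt.total_seconds() // step_s).astype(int) + 1
    return out.pivot_table(index=["subject", "date"], columns="step", values="mag")
```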
3.1. Imputation Schema. The actigraph timesheet is amenable to discrete temporal analysis by design and groups the specific dates on which an individual exercises with the subject who exercises on that day. Either dataset visualized in Figure 1 can be expressed in the structure of Table 1, a visual representation of a hypothetical dataset with missing values across time, which sets up the dataset for the timesheet to impute.

Subject-Date    Time 1        Time 2        ...   Time T
S1 D1           A(1,1,1)      A(1,1,2)      ...   A(1,1,T)
S1 D2           n/a           A(1,2,2)      ...   A(1,2,T)
...             ...           ...           ...   ...
S2 D1           n/a           n/a           ...   A(2,1,T)
S2 D2           n/a           A(2,2,2)      ...   A(2,2,T)
...             ...           ...           ...   ...
Ss Dn_s         A(s,n_s,1)    A(s,n_s,2)    ...   n/a

Table 1. A sample actigraph timesheet with s subjects and n_s dates recorded for T discrete time steps. Subjects and dates are paired (e.g., S1 D1 denotes subject 1 paired with date 1), and data are recorded at each time (e.g., A(1,1,1) denotes the data recorded for subject 1 at date 1 and time 1). n/a denotes cells where the data are missing.

Let X_{t,o} and y_{t,o} be the n_{t,o} × p observed covariate design matrix and the n_{t,o} × 1 observed outcome, respectively, at time t. X_{t,u} denotes the n_{t,u} × p design matrix of covariates that we treat as known but that may not be observed, and y_{t,u} is the n_{t,u} × 1 unobserved outcome. For full generality, suppose that we are working with the known model y_t = g_t(θ_t, X_t, V_t), which is governed by a vector or matrix of other known parameters V_t and may involve random components such as noise. The full imputation procedure for the missing time steps of the actigraph timesheet is then as follows:

(1) Compute the posterior θ_{1:T} | y_{1:T,o} by regressing on the observed data y_{1:T,o} using the observed covariates X_{1:T,o} and parameters V_{t,o}.
(2) Analytically impute the unobserved covariates X_{t,u}, generating X*_{t,u} (ideally with a procedure that accounts for time placement and geospatial locations, as we do in this paper).
(3) Sample θ^{(l)}_{1:T} ~ p(θ_{1:T} | y_{1:T,o}) from the posterior and synthesize outcomes to impute at the time steps where the outcomes are not observed: y*^{(l)}_{t,u} = g_t(θ^{(l)}_t, X*_{t,u}, V_{t,u}) for t = 1, ..., T and l = 1, ..., L desired samples.

Step (3) shows that multiple predictive samples of imputed outcomes can easily be acquired from this framework by taking multiple posterior samples of θ_{1:T} | y_{1:T,o} and then passing each θ_t | y_{1:T,o} to g_t(·, X*_{t,u}, V_{t,u}) for all t.
3.2. A Normal-Normal Example. To give a concrete example of the actigraph timesheet procedure, and to prepare the procedure for application to our use case in Section 6.4, we assume that our observations are normal at each time point t and that they share a common scale variance σ², though the variances of y_{t,o} and y_{t,u} also depend on matrices that we assume to be known. Here, let θ_t := β_t, the fixed effects vector at time t; we assume that σ² is known for simplicity. We specify g_t as follows:

(2)    g_t(β_t, X_t, σ² V_t) := X_t β_t + ν_t,    ν_t ~ N_n(0, σ² V_t),

where N_n(0, σ² V_t) is the n-dimensional multivariate normal distribution with mean 0 and variance σ² V_t. We assume in Equation (2) that V_t is known for all t. A formulation of g_t in terms of both observed and unobserved terms takes the form of Equation (3):

(3)    y†_t = g_t(β_t, X†_t, σ² V†_t) = X†_t β_t + ν†_t,

where

y†_t = [ y_{t,o} ; y_{t,u} ],   X†_t = [ X_{t,o} ; X_{t,u} ],   ν†_t = [ ν_{t,o} ; ν_{t,u} ] ~ N_{n_{t,o}+n_{t,u}}(0, σ² V†_t),   V†_t = [ V_{t,oo}  V_{t,ou} ; V_{t,uo}  V_{t,uu} ].

Conforming to the problem setting and the structure of actigraph data, n_{t,o} and n_{t,u} are allowed to vary over time t. As with V_t in Equation (2), we also assume that V†_t and its submatrices V_{t,oo}, V_{t,ou}, V_{t,uo}, and V_{t,uu} are known for all t.

Now suppose further that β_t follows a normal density, so that β_t ~ N(m_t, σ² M_t), and that the β_t are independent across time t. Conjugacy gives us the posterior β_t | y_t ~ N_p(m_{t,post}, σ² M_{t,post}), where

M_{t,post} = (X^T_{t,o} V_{t,oo}^{-1} X_{t,o} + M_t^{-1})^{-1}   and   m_{t,post} = M_{t,post} (X^T_{t,o} V_{t,oo}^{-1} y_{t,o} + M_t^{-1} m_t).

Since we have specified that the β_t are independent across different t for this example, we do not need to account for outcomes from times other than t.

To construct X*_{t,u}, covariates that depend on the subject-specific paths of traversal (e.g., GPS coordinates) can be linearly interpolated, since the time between the previous and next points is assumed to be linearly distributed. Covariates that depend on the geographical coordinates can be taken directly from the associated latitude-longitude pair if the coordinate pair exists in the dataset, and otherwise averaged among locations within a set radius around the coordinates using a Gaussian-weighted average of the points by distance. Subject-specific information, such as age, which was not assumed to change significantly during the course of the study, was taken directly from the subject's recorded information in the data set.

Finally, we sample L vectors from the posterior, β^{(l)}_t ~ N(m_{t,post}, σ² M_{t,post}), l = 1, ..., L, and generate synthetic outcomes y*^{(1:L)}_{t,u} ~ N_{n_t}(X*_{t,u} β^{(l)}_t, σ² V_{t,uu}) for t = 1, ..., T. Credible intervals for each y_{t,u} can be computed by taking quantiles of y*^{(1:L)}_{t,u} across the L samples.
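A minimal numerical sketch of the conjugate update and the imputation draws for a single time step t, assuming a known σ² and V_{t,oo} = V_{t,uu} = I; all inputs are simulated placeholders rather than actigraph quantities.

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_one_step(X_o, y_o, X_u, m, M, sigma2, n_draws=1000):
    """Conjugate posterior for beta_t and predictive draws for y_{t,u}
    under Equations (2)-(3) with V_{t,oo} = V_{t,uu} = I."""
    M_post = np.linalg.inv(X_o.T @ X_o + np.linalg.inv(M))
    m_post = M_post @ (X_o.T @ y_o + np.linalg.solve(M, m))
    L_chol = np.linalg.cholesky(sigma2 * M_post)
    betas = m_post + rng.standard_normal((n_draws, len(m))) @ L_chol.T
    # One predictive draw of the unobserved outcomes per posterior draw of beta_t.
    y_u = betas @ X_u.T + rng.standard_normal((n_draws, X_u.shape[0])) * np.sqrt(sigma2)
    return m_post, M_post, np.percentile(y_u, [2.5, 97.5], axis=0)

# Toy data: p = 3 covariates, 8 observed rows, 4 unobserved rows.
p, sigma2 = 3, 0.5
beta_true = rng.normal(size=p)
X_o, X_u = rng.normal(size=(8, p)), rng.normal(size=(4, p))
y_o = X_o @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=8)
m_post, M_post, interval = impute_one_step(X_o, y_o, X_u, np.zeros(p), np.eye(p), sigma2)
print(interval)  # 95% credible intervals for the four missing outcomes
```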
4. Amortized Bayesian Inference

In one interpretation of the transfer learning framework, the goal is to build a system that takes a data set y as input and produces posterior samples θ | y of parameter estimates as output. This system generalizes across different datasets within its application domain without overfitting to any one of them. Hence, we want to train a function f_φ with parameters φ so that, given separate data outcomes y^(d), we can obtain posterior parameter estimates θ | y^(d) = f_φ(y^(d)) for all d = 1, ..., D. (We momentarily dispense with the time suffix t in this section to address the framework in full generality.) This particular approach to learning f_φ is called amortized Bayesian inference (ABI). The defining approach of ABI is to train f_φ on a sequence of artificially generated synthetic datasets, and then produce posterior samples for each individual dataset that is fed into the trained network f_φ̂. "Amortization" refers to the high time cost of training the neural network system, which is then averaged over passing multiple datasets through the trained system to yield a low average runtime. Thus, evaluating f_φ(y^(d)) no longer involves the high time cost of the training step in a standard machine learning framework, where we would train on each data set y^(d) separately and from scratch; the average time cost of training and running the system reduces ("amortizes") over the number of times or different settings d = 1, ..., D on which the trained system is run.

To support the interpretability of the samples from our trained ABI system, and to give ourselves a first step toward modeling our data, we begin with a baseline model defined by a prior distribution p(θ) and a data generating distribution p(y | θ) (which we may combine into the joint distribution p(y, θ)) to remove the dependence of f_φ̂ on any particular dataset, and to simultaneously provide a model that is structurally suited to the common application domain of the datasets (such as a time-dependent model, in the context of actigraph data). p(y, θ) then generates our prior and synthetic data samples, both of which will be used to train f_φ to learn the posterior sampling function, which mandates that f_φ take in θ and y. The particular method by which we utilize f_φ is to transform θ | y to a random variable following a simpler, more recognizable distribution for training purposes, and back again when sampling is required. The latter approach is already widely applied to sampling from arbitrary distributions: sampling from the distribution of an arbitrary random variable X typically computes F_X^{-1}(U), where F_X is the cumulative distribution function of X and U ~ U(0, 1) is a sampled uniform random variable on the interval (0, 1); theoretical developments utilizing this framework have been advanced by [Chen and Gopinath, 2000, Tabak and Vanden-Eijnden, 2010, Tabak and Turner, 2013].

f_φ is called a normalizing flow on the basis of its application to bidirectional variable transformation. Our use of normalizing flows is inspired by and derives from the success of existing methods [Rezende and Mohamed, 2015, Dinh et al., 2015, Radev et al., 2022]; for a thorough treatment of normalizing flows, see Papamakarios et al. [2021] and the references therein. The method pursues a variational approximation of the true posterior p(θ | y) by the reverse flow p(f_φ^{-1}(z; y)), where z follows the simple distribution (in our applications, a multivariate standard normal N(0, I)), by finding the parameters φ that minimize the Kullback-Leibler (KL) divergence between the two quantities [Ren et al., 2011, Blei et al., 2017]. The expectation across all datasets y is taken in the KL divergence to account for the underlying application of the ABI model to diverse datasets, and this further makes the expression more tractable:

(4)    argmin_φ E_{p(y)} KL( p(θ | y) || p(f_φ^{-1}(z; y)) )
       = argmin_φ E_{p(y)} E_{p(θ|y)} [ log p(θ | y) − log p(f_φ^{-1}(z; y)) ]
       = argmin_φ { − E_{p(y)} E_{p(θ|y)} log p(f_φ^{-1}(z; y)) }
       = argmin_φ { − ∬ p(y, θ) log p(f_φ^{-1}(z; y)) dy dθ }.

We finish with a change of variables z = f_φ(θ; y) in the final expression to remove the dependence of the objective on z, which is not sampled in the training stage:

(5)    argmin_φ { − ∬ p(y, θ) log p(f_φ^{-1}(z; y)) dy dθ } = argmin_φ { − ∬ p(y, θ) [ log p(f_φ(θ; y)) + log |det J_{f_φ}| ] dy dθ },

where J_{f_φ} = ∂f_φ(θ; y)/∂θ is the Jacobian of the invertible network in the forward direction.
The final integral is then approximated with a Monte Carlo sum over M prior draws θ^(m), m = 1, ..., M, each with a corresponding dataset y^(m) generated from it:

φ̂ = argmin_φ (1/M) Σ_{m=1}^{M} [ − log p(f_φ(θ^(m); y^(m))) − log |det J^(m)_{f_φ}| ].

Since f_φ transforms θ into a standard normal with additional input from y, we can treat f_φ(θ^(m); y^(m)) as a standard normal random variable and simplify log p(f_φ(θ^(m); y^(m))) to −(D/2) log(2π) − (1/2) ||f_φ(θ^(m); y^(m))||², where D is the dimension of θ. Because the argmin is unaffected by constant terms, we may drop them from the sum, giving us our objective function:

(6)    φ̂ = argmin_φ (1/M) Σ_{m=1}^{M} [ (1/2) ||f_φ(θ^(m); y^(m))||² − log |det J^(m)_{f_φ}| ].

Algorithm 1 explicates the BayesFlow procedure in terms of the two functions TRAIN_BAYESFLOW() and SAMPLE_BAYESFLOW(): at each training iteration, M priors and synthetic datasets are generated and passed to the inference network f_φ to produce the Monte Carlo terms for Equation (6). The parameters of the invertible neural network, φ, are updated at each training cycle using backpropagation (discussed further in Algorithm 7 and Algorithm 8).

Algorithm 1. Amortized Bayesian inference with the BayesFlow method.

Input: prior generating process p(θ); synthetic outcome generating process p(y | θ); INN f_φ(θ; y) to be trained; batch size M; number of training iterations N_ITER.
function TRAIN_BAYESFLOW(p(θ), p(y | θ), f_φ(θ; y), M, N_ITER):
  for j = 1, ..., N_ITER:
    for m = 1, ..., M:
      Sample model parameters from the prior: θ^(m,j) ~ p(θ).
      Generate synthetic outcomes: y^(m,j) ~ p(y | θ^(m,j)).
      Pass (θ^(m,j), y^(m,j)) through the inference network in the forward direction: z^(m,j) := f_φ(θ^(m,j); y^(m,j)).
    Compute the loss according to Equation (6).
    Update the neural network parameters φ via backpropagation.
  (Convergence to φ̂ is ideally achieved.)

Input: observed data y^(obs); trained INN f_φ̂(θ; y) from TRAIN_BAYESFLOW; number of samples L.
Output: L posterior samples from p(θ | y^(obs)).
function SAMPLE_BAYESFLOW(y^(obs), f_φ̂(θ; y), L):
  for l = 1, ..., L:
    Sample a latent variable instance z^(l) ~ N_D(0, I).
    Pass (y^(obs), z^(l)) through the inference network in the inverse direction: θ^(l) = f_φ̂^{-1}(z^(l); y^(obs)).
  return θ^(1:L).
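To make Equation (6) and Algorithm 1 concrete, the sketch below trains a deliberately simplified amortizer (an elementwise affine conditional flow rather than the coupling-flow INN used in the paper) on a toy conjugate Gaussian model. It is written with PyTorch purely for illustration, since the paper itself uses the BayesFlow package, and the toy model, summary statistics, and hyperparameters are assumptions.

```python
import torch
from torch import nn

# Toy conjugate model: theta ~ N(0, 1); y_i | theta ~ N(theta, 1), i = 1..n.
# f_phi(theta; y) = (theta - mu_phi(s(y))) / sigma_phi(s(y)) is an affine
# conditional flow, so log|det J| = -log sigma_phi and Equation (6) is explicit.
n_obs, batch = 10, 32

class AffineFlow(nn.Module):
    def __init__(self):
        super().__init__()
        # Summary input: (mean, log sd) of each simulated dataset.
        self.net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, theta, y):
        summ = torch.stack([y.mean(dim=1), y.std(dim=1).log()], dim=1)
        mu, log_sigma = self.net(summ).chunk(2, dim=1)
        z = (theta - mu) / log_sigma.exp()
        log_det = -log_sigma.squeeze(1)  # d z / d theta = 1 / sigma
        return z, log_det

flow = AffineFlow()
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(2000):                                 # TRAIN_BAYESFLOW loop
    theta = torch.randn(batch, 1)                        # draws from the prior
    y = theta + torch.randn(batch, n_obs)                # synthetic outcomes
    z, log_det = flow(theta, y)
    loss = (0.5 * z.pow(2).sum(dim=1) - log_det).mean()  # Equation (6)
    opt.zero_grad()
    loss.backward()
    opt.step()

# SAMPLE_BAYESFLOW: invert the affine map for a new dataset.
y_obs = 0.7 + torch.randn(1, n_obs)
with torch.no_grad():
    summ = torch.stack([y_obs.mean(dim=1), y_obs.std(dim=1).log()], dim=1)
    mu, log_sigma = flow.net(summ).chunk(2, dim=1)
    draws = mu + log_sigma.exp() * torch.randn(5000, 1)
print(draws.mean().item(), draws.std().item())  # approximate posterior mean and sd
```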
4.1. Approximating Posteriors with Invertible Neural Networks. The transformed variable f_φ̂(θ; y) is probabilistically independent of y despite taking it as an argument. This is vital for the sampling scheme, which functions by back-transforming standard normal random variables. It suffices to note that f_φ̂(θ; y) = f_φ̂(θ; y) | y and hence f_φ̂^{-1}(z; y) ≡_d θ | y (i.e., f_φ̂^{-1}(z; y) has the same distribution as θ | y) by the setup in Equation (4) under perfect convergence. Since z ⊥ y by design, we also have f_φ̂(θ; y) ⊥ y. While perfect convergence is unrealistic, as there are always errors in practice, algorithms can converge closely enough that the errors become negligible; [Radev et al., 2022] specifically names Monte Carlo error and error from an invertible network that does not "accurately transform the true posterior into the prescribed Gaussian latent space", both relevant to our use case. While various universal approximation theorems exist for neural networks, which ensure that the family of neural networks is dense in the function space and suffices to approximate any function, the specifications needed are not known offhand and must be experimented with.

4.2. A Normal-Gamma Example. The result of the previous section may seem counterintuitive, since f_φ(θ; y) | y manifestly depends on y, and yet we have found that it does not. Define θ := {β, σ²}. Then y | θ = y | β, σ² ~ N_n(Xβ, σ² V), and β, σ^{-2} ~ NG_p(m, M, a_0, b_0), where X is a known covariate design matrix, m and M are defined as in Algorithm 3, and a_0 and b_0 are the shape and rate parameters. Then β, σ^{-2} | y ~ NG_p(m_post, M_post, a_post, b_post), where

(7)    M_post = (X^T V^{-1} X + M^{-1})^{-1},    m_post = M_post (X^T V^{-1} y + M^{-1} m),
       a_post = a_0 + n/2,    b_post = b_0 + (1/2) ( y^T V^{-1} y + m^T M^{-1} m − m_post^T M_post^{-1} m_post ).

Rescaling β to a standard normal can be readily calculated outside the ABI framework. Let f̃(β, σ²; y) be the true transformation of {β, σ²} to z ~ N_{p+1}(0, I). The elements of f̃ corresponding to β are straightforward to compute: f̃(β, ·; y) = L_post^{-1}(β − m_post), where L_post is the Cholesky factor of the posterior covariance matrix M_post. For the last element of f̃, corresponding to σ², we have Φ^{-1}(Γ(σ^{-2}; a_post, b_post)), where Γ(·; a, b) is the cumulative distribution function (CDF) of a gamma random variable with shape a and rate b, and Φ is the CDF of a standard normal random variable.
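As a check on the exact transform f̃ described above, the sketch below writes one version of it with SciPy; note that it includes the conditional σ scaling of β so that the output is exactly standard normal under the conjugate posterior in Equation (7), and all inputs are simulated placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def exact_transform(beta, sigma2, m_post, M_post, a_post, b_post):
    """Map a draw of (beta, sigma^2) | y to independent standard normals.

    beta | sigma^2, y ~ N(m_post, sigma^2 * M_post) is whitened with the
    Cholesky factor of M_post (and divided by sigma), while sigma^{-2} | y ~
    Gamma(a_post, rate=b_post) goes through its CDF and the normal quantile.
    """
    L = np.linalg.cholesky(M_post)
    z_beta = np.linalg.solve(L, beta - m_post) / np.sqrt(sigma2)
    u = stats.gamma.cdf(1.0 / sigma2, a=a_post, scale=1.0 / b_post)
    return np.append(z_beta, stats.norm.ppf(u))

# Quick Monte Carlo check: posterior draws map to approximately N(0, I).
p, a_post, b_post = 3, 5.0, 2.0
m_post, M_post = rng.normal(size=p), np.eye(p) * 0.2
sig2_inv = rng.gamma(shape=a_post, scale=1.0 / b_post, size=4000)
zs = np.array([
    exact_transform(rng.multivariate_normal(m_post, M_post / s), 1.0 / s,
                    m_post, M_post, a_post, b_post)
    for s in sig2_inv
])
print(zs.mean(axis=0), zs.std(axis=0))  # each entry close to 0 and 1
```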
Notice that a nonlinear function is required to transform σ² into a standard normal. Here, a neural network with sufficient nodes, layers, and an activation function between layers (for example, a ⌈log₂(p + 1)⌉-layer coupling flow in which each entry of β is provided, to facilitate computation of the posterior parameters) is able to approximate even nonlinear functions. Empirically testing these neural networks would then facilitate the specific choice of invertible network to obtain posterior samples from the normal-gamma.

Training via ABI aims to solve for the parameters φ so that the inference network f_φ best approximates f̃. Begin by generating synthetic data samples from the model, i.e., y^(m) | β^(m), σ^{2,(m)} ~ N_n(X β^(m), σ^{2,(m)} V) and β^(m), σ^{-2,(m)} ~ NG_p(m, M, a_0, b_0), where V is taken to be correctly specified. At training, for each Monte Carlo iteration m = 1, ..., M, β^(m) and σ^{-2,(m)} are sampled from the normal-gamma prior. Next, β^(m), σ^{-2,(m)}, and y^(m) are passed to the individual summand terms in Equation (6) to supply the integral expression with individual Monte Carlo terms. At the end of the training cycle, φ is updated via backpropagation. The next training cycle then repeats with the sampling of new priors and synthetic data.

By design, E[f_φ(β, σ²; y)] = 0 and Var[f_φ(β, σ²; y)] = I. Since the distribution of the scaled variable f_φ(β, σ²; y) is also normal, the mean and variance are sufficient to fully parametrize the distribution, and despite the scaled quantity f_φ(β, σ²; y) taking the data y as an argument, the variable transformed by the normalizing flow does not ultimately depend on the data. Consequently, to produce a posterior draw from a randomly sampled standard normal z ~ N_p(0, I), f_φ^{-1}(z; y) takes y as an argument, samples z, and then transforms z into a posterior sample of β, σ² | y. Extensions across time, which will be relevant from the next section onward, can be achieved by appending the β_t's and y_t's vertically and applying the procedure in this section in the same way.

5. Dynamic Linear Model

Having covered the theory behind amortized inference, we return to the time-varying model needed for the actigraph data in Section 2. Specifically, we utilize a time-varying geospatial model:

(8)    y_t = X_t β_t + ν_t,       ν_t ~ind N_{n_t}(0, σ² V_t),
       β_t = G_t β_{t−1} + ω_t,   ω_t ~ind N_p(0, σ² W_t),
       β_0 ~ N_p(m_0, σ² M_0),    σ^{-2} ~ G(a_0, b_0),

where G(a, b) is the gamma distribution with parameters a and b, and a_0, b_0, m_0, and M_0 are hyperparameters for their respective distributions. β_t and σ² retain their definitions from the actigraph timesheet setup in Equation (3), though not the structure of the normal-normal example.

Equation (8) is called the dynamic linear model (DLM), owing to its linear terms and properties, with temporal dependence being a key application of the model's hierarchical structure [West and Harrison, 1997]. The linear property makes the DLM relatively easy to understand and implement, and the Markovian property enables inference to resume from the latest time point instead of the entire history of the fit. The DLM has the Forward Filter Backward Sampling (FFBS) algorithm as a well-established solution for acquiring the posterior distribution of β_{1:T}, σ² | y_{1:T} (Algorithm 2). The distributions in Equation (8) also set up the distributions of the individual terms at each level of the model with the desired property of conjugacy: each term in the sequence of distributions remains in the same family of distributions across time, differing only in its particular parameters [Carter and Kohn, 1994]. Our ensuing development makes use of accessible distribution theory for implementing FFBS for Gaussian models [see, for example, West and Harrison, 1997, Petris et al., 2009, Banerjee et al., 2025, for technical details].

Algorithm 2. Forward Filter Backward Sampling (FFBS) algorithm.

Input: data y_{1:T} and Kalman filter starting values a_0, b_0, m_0, M_0; observation and state transition matrices X_{1:T} and G_{1:T}; correlation matrices V_{1:T} and W_{1:T}.
Output: a sample from the posterior p(β_{1:T}, σ² | y_{1:T}).
function FFBS(a_0, b_0, y_{1:T}, X_{1:T}, G_{1:T}, V_{1:T}, W_{1:T}):
  {a_t, b_t, c_t, C_t, m_t, M_t}_{t=1}^{T} ← FILTER(y_{1:T}, a_0, b_0, m_0, M_0, G_{1:T}, X_{1:T}, V_{1:T}, W_{1:T})
  {β_{1:T}, σ²} ← BACKWARD_SAMPLE(a_T, b_T, {c_t, C_t, m_t, M_t, G_t}_{t=1}^{T})
  return {β_{1:T}, σ²}    (a sample from p(β_{1:T}, σ² | y_{1:T}))

5.1. Forward Filter. The FFBS algorithm (Algorithm 2) consists of two steps: the forward filter (FF) and backward sampling (BS) algorithms.
The forward filter derives its name from the Kalman filter, a Markovian model that takes in the data sequentially and updates the parameters at the next time step according to the data and the existing parameter estimates. The density we wish to sample from is:

(9)    p(β_{1:T}, σ² | y_{1:T}) ∝ p(σ² | a_0, b_0, y_{1:T}) p(β_T | y_{1:T}, σ²) ∏_{t=0}^{T−1} p(β_t | y_{1:t}, β_{t+1}, σ²).

In practice, we opt to sample from p(β_t | y_{1:T}, σ²) to acquire full posteriors from the product series, marginalizing out β_{t+1} from each term in the product of Equation (9) and bringing y_{(t+1):T} into the parameter calculations, for computational convenience and to lower the storage requirements for the samples. We proceed to compute the parameters at each time step using the procedure outlined in Algorithm 3.

Algorithm 3. Kalman (forward) filter.

Input: data y_{1:T}; hyperparameters a_0, b_0, m_0, M_0; observation and state transition matrices X_{1:T} and G_{1:T}; correlation matrices V_{1:T}, W_{1:T}.
Output: filtering distribution parameters at times t = 1, ..., T.
function FILTER(y_{1:T}, a_0, b_0, m_0, M_0, G_{1:T}, X_{1:T}, V_{1:T}, W_{1:T}):
  for t = 1 to T:
    # Compute the prior distribution p(β_t, σ^{-2} | y_{1:t−1}) ~ NG(c_t, C_t, a*_t, b*_t):
    c_t ← G_t m_{t−1};  C_t ← G_t M_{t−1} G_t^T + W_t
    a*_t ← a_{t−1};  b*_t ← b_{t−1}
    # Compute the one-step-ahead forecast p(y_t | y_{1:t−1}) ~ T_{2a*_t}(q_t, (b*_t / a*_t) Q_t):
    q_t ← X_t c_t;  Q_t ← X_t C_t X_t^T + V_t
    # Compute the filtering distribution p(β_t, σ^{-2} | y_{1:t}) ~ NG(m_t, M_t, a_t, b_t):
    m_t ← c_t + C_t X_t^T Q_t^{-1} (y_t − q_t);  M_t ← C_t − C_t X_t^T Q_t^{-1} X_t C_t^T
    a_t ← a*_t + n_t / 2;  b_t ← b*_t + (1/2) (y_t − q_t)^T Q_t^{-1} (y_t − q_t)
  return {a_t, b_t, c_t, C_t, m_t, M_t}_{t=1}^{T}

Due to the sequential nature of the updates of a_t and b_t, a_T and b_T may be written in compact form:

(10)    a_T = a_0 + (1/2) Σ_{t=1}^{T} n_t,    b_T = b_0 + (1/2) Σ_{t=1}^{T} (y_t − q_t)^T Q_t^{-1} (y_t − q_t).

Ordinarily, the parameters from the FF are not used for sampling; the FF's main purpose is to compute the parameters for the BS to enable full posterior sampling. However, it is sometimes desirable to sample from the FF as a benchmark for comparison with other algorithms.

Algorithm 4. Backward sampler.

Input: filtering parameters and inputs from Algorithm 3.
Output: a posterior sample from p(β_{1:T}, σ² | y_{1:T}).
function BACKWARD_SAMPLE(a_T, b_T, {c_t, C_t, m_t, M_t, G_t}_{t=1}^{T}):
  Draw σ^{-2} ~ G(a_T, b_T)
  Draw β_T ~ N(m_T, σ² M_T)
  for t = T − 1 down to 1:
    s_t ← m_t + M_t G_{t+1}^T C_{t+1}^{-1} (s_{t+1} − c_{t+1})
    S_t ← M_t − M_t G_{t+1}^T C_{t+1}^{-1} (C_{t+1} − S_{t+1}) C_{t+1}^{-1} G_{t+1} M_t
    Draw β_t ~ N(s_t, σ² S_t)
  return {β_{1:T}, σ²}

5.2. Backward Sampling. Backward sampling obtains samples of each β_t | y_{1:T} across time t. The posterior mean and variance for β_t are further refined using the data across all time points; the coefficients from the FF are used to compute the parameters for smoothing. As with the FF, BS also takes advantage of the conjugacy of the underlying distributions to efficiently compute the parameters for the full posterior samples. The procedure is outlined in Algorithm 4, with s_T = m_T and S_T = M_T, since the parameters for p(β_T | σ², y_{1:T}) are given to us by the FF. The normal density of β_t | σ², y_{1:T} and the gamma density of σ² | y_{1:T} can be combined into a single normal-gamma density:

(11)    β_t, σ^{-2} | y_{1:T} ~ NG(s_t, S_t, a_T, b_T).

Each sample from the normal-gamma involves sampling σ^{-2} ~ G(a_T, b_T) and then β_t ~ N_p(s_t, σ² S_t). Furthermore, if we integrate out σ² | y_{1:T} from Equation (11), we obtain:

(12)    β_t | y_{1:T} ~ T_{2a_T}( s_t, (b_T / a_T) S_t ),

where T_{2a_T}(s_t, (b_T / a_T) S_t) is a multivariate Student's t-distribution with 2a_T degrees of freedom, mean s_t, and scale matrix (b_T / a_T) S_t. We utilize Equation (12) to analytically compute credible intervals at each time point without sampling σ² | y_{1:T}.
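A compact numerical sketch of Algorithms 3 and 4, assuming time-invariant G and W and V_t = I for readability; the dimensions and inputs are simulated placeholders rather than the actigraph quantities.

```python
import numpy as np

rng = np.random.default_rng(3)

def ffbs(y, X, G, W, m0, M0, a0, b0):
    """One posterior draw of (beta_{1:T}, sigma^2) via forward filtering
    and backward sampling (Algorithms 3-4), with V_t = I_{n_t}."""
    T, p = len(y), len(m0)
    c, C = [None] * (T + 1), [None] * (T + 1)
    m, M = [m0] + [None] * T, [M0] + [None] * T
    a, b = a0, b0
    for t in range(1, T + 1):                              # forward filter
        c[t] = G @ m[t - 1]
        C[t] = G @ M[t - 1] @ G.T + W
        q = X[t - 1] @ c[t]
        Q = X[t - 1] @ C[t] @ X[t - 1].T + np.eye(len(y[t - 1]))
        K = C[t] @ X[t - 1].T @ np.linalg.inv(Q)
        m[t] = c[t] + K @ (y[t - 1] - q)
        M[t] = C[t] - K @ X[t - 1] @ C[t]
        a += len(y[t - 1]) / 2.0
        b += 0.5 * (y[t - 1] - q) @ np.linalg.solve(Q, y[t - 1] - q)
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)       # sigma^{-2} ~ G(a_T, b_T)
    s, S = m[T], M[T]
    beta = [rng.multivariate_normal(s, sigma2 * S)]
    for t in range(T - 1, 0, -1):                          # backward sampling
        A = M[t] @ G.T @ np.linalg.inv(C[t + 1])
        s = m[t] + A @ (s - c[t + 1])
        S = M[t] - A @ (C[t + 1] - S) @ A.T
        beta.append(rng.multivariate_normal(s, sigma2 * S))
    return np.array(beta[::-1]), sigma2

# Toy run: T = 15 time points, p = 2 coefficients, 5 observations per step.
T, p, n = 15, 2, 5
X = [rng.normal(size=(n, p)) for _ in range(T)]
y = [x @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=n) for x in X]
betas, sigma2 = ffbs(y, X, np.eye(p), 0.1 * np.eye(p), np.zeros(p), np.eye(p), 3.0, 1.0)
print(betas.shape, sigma2)
```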
Algorithm 5. Generate synthetic outcomes from the DLM.

Input: sample size L, and all parameters listed in Algorithm 3 except the data y_{1:T}.
Output: a single set of parameters from the DLM, and a sample of L synthetic outcomes y^{(1:L)}_{1:T} following the DLM given that single set of parameters.
function DLM_PRIOR(a_0, b_0, m_0, M_0, G_{1:T}, W_{1:T}):
  Draw σ^{-2} ~ G(a_0, b_0)
  Draw β_0 ~ N(m_0, σ² M_0)
  for t = 1 to T:
    Draw β_t ~ N(G_t β_{t−1}, σ² W_t)
  return {β_{1:T}, σ²}

function Y_FROM_DLM(β_{1:T}, σ², X_{1:T}, V_{1:T}, L):
  for l = 1 to L and t = 1 to T:
    Draw y^{(l)}_t ~ N(X_t β_t, σ² V_t)
  return y^{(1:L)}_{1:T}
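A small Python rendering of Algorithm 5, offered as a sketch; the time-invariant choices of G, W, and V and the toy dimensions are simplifications made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)

def dlm_prior(a0, b0, m0, M0, G, W, T):
    """Draw (beta_{1:T}, sigma^2) from the DLM prior in Equation (8)."""
    sigma2 = 1.0 / rng.gamma(shape=a0, scale=1.0 / b0)  # sigma^{-2} ~ G(a0, b0)
    beta = rng.multivariate_normal(m0, sigma2 * M0)
    betas = []
    for _ in range(T):
        beta = rng.multivariate_normal(G @ beta, sigma2 * W)
        betas.append(beta)
    return np.array(betas), sigma2

def y_from_dlm(betas, sigma2, X, V, L):
    """Draw L synthetic outcome paths y^{(l)}_{1:T} given one parameter set."""
    return [
        [rng.multivariate_normal(X[t] @ betas[t], sigma2 * V[t]) for t in range(len(X))]
        for _ in range(L)
    ]

# Ground-truth parameters and one synthetic path, as used in Section 6.1 (T = 1, L = 1).
T, p, n = 1, 4, 10
X = [rng.normal(size=(n, p)) for _ in range(T)]
V = [np.eye(n) for _ in range(T)]
betas_true, sigma2_true = dlm_prior(3.0, 1.0, np.zeros(p), np.eye(p), np.eye(p), np.eye(p), T)
y_true = y_from_dlm(betas_true, sigma2_true, X, V, L=1)
```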
6. Illustrations for ABI

6.1. Stationary Normal-Gamma. To test the theoretical underpinnings of our results, we use DLM_PRIOR and Y_FROM_DLM from Algorithm 5 to generate our ground truth parameters β_true and σ²_true and data y_true, specifying T = 1 and L = 1. We normalize the covariates in the actigraph data at t = 1 to have zero mean and unit variance to form X; scaling the covariates is essential to facilitate convergence of machine learning algorithms. We utilize the BayesFlow (version 2.0.6) [Radev et al., 2022] software package to run our simulations, because it contains the existing code needed to run the ABI simulation, and we set f_φ to be a coupling flow neural network with 4 invertible layers, with 128-unit single-layer perceptrons acting as sub-networks within each layer.

Figure 2 shows ABI's ability to learn the posteriors of the normal-gamma from running Algorithm 1. For almost all of the parameters (except σ² | y), the differences in the means of the distributions are small relative to the overall scale spanned by the individual entries of β. The model manages to condense information from y and demonstrates significant ability to capture the normal-gamma model even within N_ITER = 5,000 online training cycles and a decently small batch size of M = 32.

Figure 2. 95% credible intervals of the estimates of β_1, σ² | y_1 from BayesFlow (red) and the FFBS (cyan). The means and intervals differ slightly for all parameter estimates except σ², which differs considerably.

6.2. Actigraph Data. We next extend the approach of Section 6.1 to T = 61. To facilitate this extension, we retain Equation (8) as a prior model to generate the synthetic data to be learned by the amortizer, implemented as Algorithm 5. As in the previous subsection, we set N_ITER = 5,000 and a small batch size of M = 32.

To test our approach and run a feasible subset of our data, we subset the actigraph data to the first trajectory, for each of the 92 subjects, that contains a MAG measurement greater than 0.1, the threshold for moderate exercise defined by [Loro et al., 2022, Supplement]. We test the run on a synthetic activity measure y_{1:T} sampled using Algorithm 5 with L = 1, a_0 = 3, b_0 = 1, m_0 = 0, M_0 = I_p, G_t = I_p, W_t = I_p, and V_t = I_{n_t} for t = 1, ..., T. M_0 is set to the identity to correspond to the prior claim that the effects of the covariates on fitness level are not correlated with one another; V_t = I_{n_t} for all t is justified by the expected lack of correlation between different subjects walking or running their individual trajectories at different times, assuming that none of them went running with any of the other subjects. As with the stationary normal-gamma model, we facilitate fast convergence of the training algorithm by separately scaling each column of the covariates X_t for each t, so that each column is centered around zero with unit variance at each time point t before running Algorithm 6. (While separate scaling for each t leads to different interpretations of β_t across time t, the original common interpretation of each β_t can be recovered by rescaling X_t to its original scale.) We also divide the dataset into B = 45 sets of time intervals that span the set of time steps expressed by our dataset, to further facilitate computational efficiency and stability: 41 temporal singletons for t = 1, ..., 41, and then 5-time-step intervals up to t = 61 to group together the sparsest trajectories at the end of the analysis. The idea is that, for each separate b = 1, ..., B, we separately fit neural networks to subsets of the data corresponding to earlier time intervals, using priors encompassing the simulations in the earlier subset to supply the priors for the later subset. We set f_φ to be a coupling flow neural network with 4 invertible layers for the temporal singletons and 6 layers for the last four blocks of 5 time steps, each layer comprising 128-unit single-layer perceptrons acting as sub-networks within the invertible layers, to sufficiently capture the minimum required number of parameters. For b > 1, we sample from a prior informed by the theoretical FF parameters, where the outcome y_{1:T} is generated from the DLM with the parameters specified in this subsection. The theoretical derivation of the parameters for the appropriate priors is detailed in Section A. A total of 7.52 hours was spent training the ABI system under these settings.

The credible intervals of nine of the covariates across time are displayed in Figure 3: while generally wider, the ABI intervals manage to capture the trend of the FFBS and true parameters. Additional precision may be obtained by rerunning the network with a greater batch size M, which adds more Monte Carlo samples to approximate the loss but carries a proportionately greater time cost. Note that the model fitted using Algorithm 6 learns the posteriors of a slightly different model than that of Equation (8): the σ² generated for each block of time steps is allowed to differ across blocks rather than being strictly shared across time (though the shape and rate used to generate it remain the same throughout), and information from the output of future time steps does not make it back to the parameters of past time steps. However, the model's results are close enough to those of the DLM, and its fitted parameters are close enough to those generated by the FFBS, that it is apparent that Algorithm 6 manages to learn the DLM and the parameters of the FFBS with only slightly wider credible interval bands. This is apparent in Figure 3, as well as in the figures in further sections.
Figure 3 (panels: (a) Age, (b) Altitude, (c) BMI, (d) Distance from Home (km), (e) Distance to Parks (km), (f) Start Time in Day, (g) NDVI, (h) Sex, (i) Slope). The estimated parameter trajectories (red) due to Algorithm 6, the estimated trajectories due to the FFBS (cyan) and the FF only (green), and the true trajectories (blue) of the coefficients used to generate the synthetic outcomes at each time step. The true coverage rate of the synthetic outcomes after the FFBS processes the outcome is about 97.6%. The credible intervals for the ABI output were taken from the 2.5% and 97.5% quantiles of 10,000 samples generated from the trained networks. A horizontal line is added at β_{t,j} = 0 for non-intercept elements j to visualize the statistical significance of each parameter over time.

Algorithm 6. Hierarchical ABI on the dynamic linear model with time intervals.

Input: a set of B predefined time intervals {[T_{b−1}, T_b]}_{b=1}^{B}; all parameters listed in Algorithm 3 except the data y_{1:T}; batch size M.
for b = 1, ..., B:
  Train neural networks for block b (corresponding to [T_{b−1}, T_b]):
  TRAIN_BAYESFLOW( DLM_PRIOR(a_0, b_0, m_0, M_0 + (T_{b−1} − 1) I_p, G_{T_{b−1}:T_b}, W_{T_{b−1}:T_b}),
                   Y_FROM_DLM(β_{T_{b−1}:T_b}, σ^{-2}_b, X_{T_{b−1}:T_b}, V_{T_{b−1}:T_b}, 1),
                   f_{φ_b}(β_{T_{b−1}:T_b}, σ^{-2}_b; y_{T_{b−1}:T_b}), M )

6.3. Prediction of New Actigraph Trajectories. It is also relevant to determine whether the FFBS and ABI can accurately reproduce actigraph trajectories that were not part of the training set, in order to determine their generalizability. We randomly select a small subset of eight trajectories that were not part of the training dataset, totaling 233 observations, and attempt to predict their synthetic values with the fitted parameters from both algorithms. To check the extent of the generalizability of our algorithms, we also extend this assessment to the entire set of 2,102 trajectories held out from the training dataset, totaling 57,786 observations. The results are shown in Figure 4, where we do so successfully with both FFBS and ABI. Since the subset of trajectories we trained on yields output that was generated through identical procedures, and is thus representative of the data, the computed posteriors β_{1:T}, σ² | y_{1:T} are able to generalize well to the entire set of trajectories held out from training.

Figure 4. The credible intervals compared with their synthetic values generated from a single run of Y_FROM_DLM from Algorithm 5, for a subset of the actigraph data not present in the training data (left) and the whole of the actigraph data not present in the training data (right). The coverage rates (ranging from 0 to 1) for each setting are also depicted in each plot, with FFBS in blue with cyan text and ABI in pink with red text. The FFBS intervals serve as a reference for prediction, with ABI's intervals visibly wider and proportionately longer than the FFBS intervals depending on the latter's length.
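The held-out coverage rates reported in Figure 4 can be computed along the following lines; the variable names, array shapes, and simulated inputs are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def predictive_coverage(beta_draws, sigma2_draws, X_new, y_new, level=0.95):
    """Share of held-out outcomes falling inside pointwise predictive intervals.

    beta_draws: (L, T, p) posterior draws of beta_t; sigma2_draws: (L,) draws
    of sigma^2; X_new, y_new: lists of held-out design matrices and outcomes.
    """
    lo, hi = (1 - level) / 2 * 100, (1 + level) / 2 * 100
    covered, total = 0, 0
    for t, (X_t, y_t) in enumerate(zip(X_new, y_new)):
        mean = beta_draws[:, t, :] @ X_t.T                                 # (L, n_t)
        draws = mean + rng.standard_normal(mean.shape) * np.sqrt(sigma2_draws)[:, None]
        lower, upper = np.percentile(draws, [lo, hi], axis=0)
        covered += np.sum((y_t >= lower) & (y_t <= upper))
        total += y_t.size
    return covered / total

# Toy check with simulated posterior draws and held-out data.
L, T, p, n = 2000, 10, 3, 6
beta_draws = rng.normal(size=(L, T, p)) * 0.1 + 1.0
sigma2_draws = np.full(L, 0.25)
X_new = [rng.normal(size=(n, p)) for _ in range(T)]
y_new = [x @ np.ones(p) + rng.normal(scale=0.5, size=n) for x in X_new]
print(predictive_coverage(beta_draws, sigma2_draws, X_new, y_new))
```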
6.4. Actigraph Timesheet Imputation. We assess the suitability of the actigraph timesheet imputation strategy. We begin by imputing the covariate matrices to fill in the gaps in time up to the last recorded time step of each trajectory; going further risks biasing the outcome of the coverage. (The number of time steps to be imputed under this setting comprises about 17.4% of the combined real and imputed data, i.e., Σ_t n_{t,u} / Σ_t (n_{t,u} + n_{t,o}) ≈ 0.174.) Letting V_{t,oo} = I_{n_{t,o}} and V_{t,uu} = I_{n_{t,u}} for t = 1, ..., T (the case n_{t,u} = 0 yields a trivial covariance matrix and is not computationally relevant, since every subject's trajectory at that time step is measured, e.g., t = 1), we then run DLM_PRIOR with its specified parameters once to obtain the ground truth parameters, and compute one run each of Y_FROM_DLM(β_{1:T}, σ², X_{1:T,o}, V_{1:T,oo}, 1) and Y_FROM_DLM(β_{1:T}, σ², X*_{1:T,u}, V_{1:T,uu}, 1) to generate synthetic outcomes y_{1:T,o} and y_{1:T,u} that align with the DLM as a method of testing. Both to leverage the transferability of ABI and to demonstrate its flexibility, we allow the y_{1:T,o} generated in this setting to differ from that of Section 6.3. The analytical details of the imputation strategy for X*_{1:T,u} are contained in Section B. We then acquire posterior samples from both ABI and the DLM using only the observed outcomes and covariates X_{1:T,o}, generate predictions for the unobserved synthetic outcomes using the posterior samples from the existing data y_{1:T,o}, and compare the prediction samples y*^{(1:L)}_{1:T,u} to the synthetic outcomes y_{1:T,u}. The results are shown in Figure 5, where both sets of intervals cover a substantial portion of the imputed data.

Figure 5. The credible intervals compared with their synthetic values generated from Equation (8) for actigraph timesheet imputed data.

6.5. Case Study. We next explore the performance of ABI on the real data using log(MAG) as the outcome: we pass it through a full run of the FFBS, as well as to the trained ABI. We retain the covariates from training in Section 6.2. Note that training was completed assuming underlying knowledge of the same set of covariates X_{1:T,o}; the main difference here is that the outcome is recorded and postprocessed from a real device. The results are shown in Figure 6, with corresponding parameter estimates in Figure 7: the credible intervals obtained through ABI appear to be within a similar range as the FFBS's, as they did for Figure 3. β_{t,Start Time in Day} shows some significance for the FFBS from around t = 10 up to around t = 27, though the significance is lost for ABI. Interestingly, β_{t,BMI} demonstrates some significance in early time steps before losing it later, while β_{t,Sex} is borderline insignificant at around t = 22 or 23.

6.6. Computing Environments and Code. Computations were executed on a laptop running 64-bit Windows 11 with a 12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz processor, 32.0 GB of RAM at 4800 MHz, and a 6 GB graphics card equipped with multiple GPUs. The plots were produced using R. All computer programs required to produce the numerical results in this manuscript are available from the GitHub repository https://github.com/Daniel-Zhou-93/Amortized-Bayesian-Inference-on-Actigraph-Data.

7. Discussion

We have extended the original domain of ABI problems to learn time-varying coefficients, which increase in dimension as the number of time steps increases.
This particular scenario is additionally adaptable if data from even later time steps are desired, as it is possible to sample from the previously trained time step to train the next time step, or collection of time steps, as the domain increases. Due to the relatively small sizes of the time intervals, doing so to generalize to larger T may be done without having to rerun the entire algorithm, as long as the underlying trajectories remain the same in the existing time intervals that were part of training. The adaptability of the trained ABI model has been demonstrated when multiple different outcome variables were passed in and used to generate posterior samples of the parameters for each outcome, including the physically measured log(MAG), with results closely approximating those of the analytic approach.

Figure 6. The predictive performance of the FFBS (cyan) and the trained ABI (red) on log(MAG). Both the coverage rates and the overall ranges of the credible intervals are similar, showing how ABI is able to adapt to the behavior of the FFBS even after being fed new data to form a posterior distribution.

We have also designed the actigraph timesheet as a model-based imputation procedure and validated its performance on data conforming to the underlying DLM. Its flexibility across models is leveraged by our use of ABI, with the potential to extend it to more complex models or other machine learning implementations.

A central limitation of the present framework is its reliance on the Markovian graphical structure. Many complex systems exhibit path-dependent or long-range interactions that cannot be captured by local conditional independence assumptions. Extending the FFBS to non-Markovian graphical models, where dependence may propagate beyond immediate neighborhoods, poses both conceptual and technical challenges, particularly in developing tractable factorization procedures [Peruzzi et al., 2022]. Specific block-dependent Markovian structures, such as the season-episode model [Banerjee et al., 2025], have been shown to scale the FFBS algorithm to massive datasets [also see Presicce and Banerjee, 2026]. ABI has greater capability to generalize to graphs, and promising approaches have been demonstrated through graph neural networks and meta-learning methods [Sainsbury-Dale et al., 2025, Ortega et al., 2019].

Figure 7 (panels: (a) Age, (b) Altitude, (c) BMI, (d) Distance from Home (km), (e) Distance to Parks (km), (f) Start Time in Day, (g) NDVI, (h) Sex, (i) Slope). The estimated parameter values from regressing on log(MAG) over time due to Algorithm 6, and the estimated trajectories due to the FF only (green) and the FFBS (cyan) of the coefficients at each time point. The credible intervals for the ABI output were generated using the same procedure as for Figure 3. A horizontal line is added at β_{t,j} = 0 for non-intercept elements j to visualize the statistical significance of each parameter over time.

Future work will also investigate the specialized stochastic processes described in [Wakayama and Banerjee, 2024] to amortize Bayesian inference for spatial-temporal data from wearable devices. In addition to BayesFlow, we intend to pursue supervised training of deep learning networks. Supervision typically requires feeding posterior distributions (or summaries such as means, standard deviations, quantiles, etc.) to the network.
However, the supervision of deep networks for such complex models would require large numbers of simulated Bayesian data analyses, rendering iterative algorithms such as MCMC or INLA unsuitable. Here, recent promising developments in Bayesian predictive stacking [Zhang et al., 2025, Pan et al., 2025] for geostatistical inference can be employed to supervise deep networks [Presicce and Banerjee, 2025]. These directions collectively point toward a flexible yet rigorously grounded framework for modeling complex dependence and improving predictive performance.

8. Acknowledgements

The authors acknowledge funding from the National Science Foundation (NSF) through grant DMS-2113778 and from the National Institutes of Health (NIH/NIGMS) through grant NIGMS R01GM148761. The authors thank Xiang Chen for his help in the initial processing and visualization of the actigraph dataset.

References

Nicolas Aguilar-Farias, G. M. E. E. Peeters, Robert J. Brychta, Kong Y. Chen, and Wendy J. Brown. Comparing ActiGraph equations for estimating energy expenditure in older adults. Journal of Sports Sciences, 37(2):188-195, 2019.

Sudipto Banerjee, Xiang Chen, Ian Frankenburg, and Daniel Zhou. Dynamic Bayesian learning for spatiotemporal mechanistic models. Journal of Machine Learning Research, 26(146):1-43, 2025. URL http://jmlr.org/papers/v26/22-0896.html.

Margaret Banker and Peter X. K. Song. Supervised learning of physical activity features from functional accelerometer data. IEEE Journal of Biomedical and Health Informatics, 27(12):5710-5721, 2023.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

C. K. Carter and Robert Kohn. On Gibbs sampling for state space models. Biometrika, 81(3):541-553, 1994.

Hsin-wen Chang and Ian W. McKeague. Empirical likelihood-based inference for functional means with application to wearable device data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(5):1947-1968, 2022.

Scott Chen and Ramesh Gopinath. Gaussianization. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper_files/paper/2000/file/3c947bc2f7ff007b86a9428b74654de5-Paper.pdf.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation, 2015.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv:1605.08803v3, 2017.

Aiden Doherty, Dan Jackson, Nils Hammerla, Thomas Plötz, Patrick Olivier, Malcolm H. Granat, Tom White, Vincent T. van Hees, Michael I. Trenell, Christopher G. Owen, et al. Large scale population assessment of physical activity using wrist worn accelerometers: The UK Biobank study. PLoS ONE, 12(2), 2017.

Adam Drewnowski, James Buszkiewicz, Anju Aggarwal, Chelsea Rose, Shilpi Gupta, and Annie Bradshaw. Obesity and the built environment: A reappraisal. Obesity, 28(1):22-30, 2020. doi: 10.1002/oby.22672. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/oby.22672.

M. Hildebrand, V. T. van Hees, B. H. Hansen, and U. Ekelund. Age group comparability of raw accelerometer output from wrist- and hip-worn monitors. Medicine and Science in Sports and Exercise, 46(9):1816-1824, 2014.
M Hildebrand, VT Van Hees, Bjorge Hermann Hansen, and Ulf Ekelund. Age group comparability of raw accelerometer output from wrist- and hip-worn monitors. Medicine and Science in Sports and Exercise, 46(9):1816–1824, 2014.

Peter James, Marta Jankowska, Christine Marx, Jaime E. Hart, David Berrigan, Jacqueline Kerr, Philip M. Hurvitz, J. Aaron Hipp, and Francine Laden. "Spatial energetics": Integrating data from GPS, accelerometry, and GIS to address obesity and inactivity. American Journal of Preventive Medicine, 51(5):792–800, 2016. ISSN 0749-3797. doi: 10.1016/j.amepre.2016.06.006. URL https://www.sciencedirect.com/science/article/pii/S0749379716302276.

Dohyung Kim, Jinwoo Lee, Moo Kyun Park, and Seung Hwan Ko. Recent developments in wearable breath sensors for healthcare monitoring. Communications Materials, 5(1):41, Mar 2024. ISSN 2662-4443.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980v9, 2017.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2013.

Pierfrancesco Alaimo Di Loro, Marco Mingione, Jonah Lipsitt, Christina M. Batteate, Michael Jerrett, and Sudipto Banerjee. Bayesian hierarchical modeling and analysis for actigraph data from wearable devices. Annals of Applied Statistics, 2022.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983v5, 2017.

Lan Luo, Jingshen Wang, and Emily C Hector. Statistical inference for streamed longitudinal data. Biometrika, 110(4):841–858, 2023. ISSN 1464-3510.

Jairo H Migueles, Cristina Cadenas-Sanchez, Ulf Ekelund, Christine Delisle Nyström, Jose Mora-Gonzalez, Marie Löf, Idoia Labayen, Jonatan R Ruiz, and Francisco B Ortega. Accelerometer data collection and processing criteria to assess physical activity and other outcomes: A systematic review and practical considerations. Sports Medicine, 47(9):1821–1845, 2017.

Bobak Mortazavi, Nabil Alsharufa, Sunghoon Ivan Lee, Mars Lan, Majid Sarrafzadeh, Michael Chronley, and Christian K. Roberts. MET calculations from on-body accelerometers for exergaming movements. In 2013 IEEE International Conference on Body Sensor Networks, pages 1–6, 2013. doi: 10.1109/BSN.2013.6575520.

Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M. Jayakumar, Tom McGrath, Kevin Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero, Yee Whye Teh, Hado van Hasselt, Nando de Freitas, Matthew Botvinick, and Shane Legg. Meta-learning of sequential strategies. arXiv:1905.03030v2, 2019.

Soumyakanti Pan, Lu Zhang, Jonathan R Bradley, and Sudipto Banerjee. Bayesian inference for spatial-temporal non-Gaussian data using predictive stacking. Bayesian Analysis, 1(1):1–27, 2025.

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. URL http://jmlr.org/papers/v22/19-1028.html.

Michele Peruzzi, Sudipto Banerjee, and Andrew O. Finley. Highly scalable Bayesian geostatistical modeling via meshed Gaussian processes on partitioned domains. Journal of the American Statistical Association, 117(538):969–982, 2022. doi: 10.1080/01621459.2020.1833889. URL https://doi.org/10.1080/01621459.2020.1833889.
Giovanni Petris, Sonia Petrone, and Patrizia Campagnoli. Dynamic Linear Models with R. Springer, New York, NY, 2009.

Guy Plasqui and Klaas R Westerterp. Physical activity assessment with accelerometers: An evaluation against doubly labeled water. Obesity, 15(10):2371–2379, 2007.

Luca Presicce and Sudipto Banerjee. Bayesian geostatistics using predictive stacking. arXiv:2410.09504v3, 2025.

Luca Presicce and Sudipto Banerjee. Adaptive Markovian spatiotemporal transfer learning in multivariate Bayesian modeling, 2026. URL https://arxiv.org/abs/2602.08544.

Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(4), 2022.

Qian Ren, Sudipto Banerjee, Andrew O Finley, and James S Hodges. Variational Bayesian methods for spatial data analysis. Computational Statistics & Data Analysis, 55(12):3197–3217, 2011.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/rezende15.html.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. doi: 10.1038/323533a0. URL https://doi.org/10.1038/323533a0.

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, and Raphaël Huser. Likelihood-free parameter estimation with neural Bayes estimators. The American Statistician, 78(1), 2024.

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Jordan Richards, and Raphaël Huser. Neural Bayes estimators for irregular spatial data using graph neural networks. Journal of Computational and Graphical Statistics, 34(3):1153–1168, 2025. doi: 10.1080/10618600.2024.2433671. URL https://doi.org/10.1080/10618600.2024.2433671.

Jeffer E Sasaki, Dinesh John, and Patty S Freedson. Validation and comparison of ActiGraph activity monitors. Journal of Science and Medicine in Sport, 14(5):411–416, 2011.

Robby S Sikka, Michael Baer, Avais Raja, Michael Stuart, and Marc Tompkins. Analytics in sports medicine: Implications and responsibilities that accompany the era of big data. JBJS, 101(3):276–283, 2019.

John Staudenmayer, Shai He, Amanda Hickey, Jeffer Sasaki, and Patty Freedson. Methods to estimate aspects of physical activity and sedentary behavior from high-frequency wrist accelerometer measurements. Journal of Applied Physiology, 119(4):396–403, 2015.

Esteban G Tabak and Cristina V Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

Esteban G Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.

Tomoya Wakayama and Sudipto Banerjee. Process-based inference for spatial energetics using Bayesian predictive stacking, 2024. URL https://arxiv.org/abs/2405.09906.

Paul Werbos. Applications of advances in nonlinear sensitivity analysis. Proceedings of the 10th IFIP Conference, 38:762–770, 1982.

Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer, New York, NY, 1997.
Andrew Zammit-Mangion, Matthew Sainsbury-Dale, and Raphaël Huser. Neural methods for amortized inference. Annual Review of Statistics and Its Application, 12:311–335, 2025. ISSN 2326-831X. doi: 10.1146/annurev-statistics-112723-034123. URL https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-112723-034123.

Lu Zhang, Wenpin Tang, and Sudipto Banerjee. Bayesian geostatistics using predictive stacking. Journal of the American Statistical Association, (in press):1–13, 2025. doi: 10.1080/01621459.2025.2566449.

Appendix A. Using Priors to Bridge Time Segments for ABI

Since ABI is trained with synthetic data, and our synthetic data are generated from a model with an analytic solution, we consider the analytic properties of the parameters that generate the data. Specifically, we consider the values of E_{y_{1:t}, β_t, σ²}[β_t] = E_{y_{1:t}}[E_{β_t, σ² | y_{1:t}}[β_t]] and Var_{y_{1:t}, β_t, σ²}(β_t), where we treat y_{1:t} (as an instance of y^{(m)}_{1:T} in Section 4) as random rather than fixed:
\[
\begin{aligned}
E_{\beta_t \mid \sigma^2, y_{1:t}}[\beta_t] &= m_t = c_t + C_t X_t^{\top} Q_t^{-1}(y_t - q_t) \\
E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t] &= E_{y_{1:t}}\!\left[E_{\sigma^2 \mid y_{1:t}}\!\left[E_{\beta_t \mid \sigma^2, y_{1:t}}[\beta_t]\right]\right] = E_{y_{1:t}}[m_t] \\
&= E_{y_{1:(t-1)}}\!\left[E_{y_t \mid y_{1:(t-1)}}\!\left[c_t + C_t X_t^{\top} Q_t^{-1}(y_t - q_t)\right]\right] \\
&= E_{y_{1:(t-1)}}\!\left[G_t m_{t-1}\right]
\end{aligned}
\tag{13}
\]
We note that E_{y_t | y_{1:(t-1)}}[C_t X_t^T Q_t^{-1}(y_t - q_t)] = 0 because, from Algorithm 3, y_t | y_{1:(t-1)} ~ T(q_t, (b_{t-1}/a_{t-1}) Q_t). We expand c_t = G_t m_{t-1} for clarity. Recursing the last two lines of Equation (13) from y_{t-1} down to y_1 gives
\[
E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t] = \left(\prod_{s=1}^{t} G_s\right) m_0
\tag{14}
\]
where \(\prod_{s=1}^{t} G_s = G_t G_{t-1} \cdots G_1\) is taken with respect to left-multiplication (i.e., the product expands leftwards). We proceed to Var_{y_{1:t}, β_t, σ²}(β_t):
\[
\begin{aligned}
\mathrm{Var}_{y_{1:t}, \beta_t, \sigma^2}(\beta_t) &= E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t \beta_t^{\top}] - E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t]\, E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t]^{\top} \\
E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t \beta_t^{\top}] &= E_{\sigma^2}\!\left[E_{y_{1:t} \mid \sigma^2}\!\left[E_{\beta_t \mid \sigma^2, y_{1:t}}[\beta_t \beta_t^{\top}]\right]\right] = E_{\sigma^2}\!\left[E_{y_{1:t} \mid \sigma^2}\!\left[\sigma^2 M_t + m_t m_t^{\top}\right]\right] \\
&= \frac{b_0}{a_0 - 1} M_t + E_{\sigma^2}\!\left[E_{y_{1:t} \mid \sigma^2}[m_t m_t^{\top}]\right]
\end{aligned}
\tag{15}
\]
Next, we simplify E_{σ²}[E_{y_{1:t}|σ²}[m_t m_t^T]]. We briefly set aside the expectation with respect to σ² and proceed with the same recursive strategy as for Equation (13):
\[
\begin{aligned}
E_{y_{1:t} \mid \sigma^2}[m_t m_t^{\top}] &= E_{y_{1:(t-1)} \mid \sigma^2}\!\left[E_{y_t \mid y_{1:(t-1)}, \sigma^2}[m_t m_t^{\top}]\right] \\
E_{y_t \mid y_{1:(t-1)}, \sigma^2}[m_t m_t^{\top}] &= E_{y_t \mid y_{1:(t-1)}, \sigma^2}\!\Big[\big(G_t m_{t-1} + C_t X_t^{\top} Q_t^{-1}(y_t - q_t)\big)\big(G_t m_{t-1} + C_t X_t^{\top} Q_t^{-1}(y_t - q_t)\big)^{\top}\Big] \\
&= E_{y_t \mid y_{1:(t-1)}, \sigma^2}\!\Big[G_t m_{t-1} m_{t-1}^{\top} G_t^{\top} + G_t m_{t-1}(y_t - q_t)^{\top} Q_t^{-1} X_t C_t \\
&\qquad\qquad + C_t X_t^{\top} Q_t^{-1}(y_t - q_t) m_{t-1}^{\top} G_t^{\top} + C_t X_t^{\top} Q_t^{-1}(y_t - q_t)(y_t - q_t)^{\top} Q_t^{-1} X_t C_t\Big] \\
&= G_t m_{t-1} m_{t-1}^{\top} G_t^{\top} + C_t X_t^{\top} Q_t^{-1}\, E_{y_t \mid y_{1:(t-1)}, \sigma^2}\!\left[(y_t - q_t)(y_t - q_t)^{\top}\right] Q_t^{-1} X_t C_t \\
&= G_t m_{t-1} m_{t-1}^{\top} G_t^{\top} + \sigma^2\, C_t X_t^{\top} Q_t^{-1} X_t C_t
\end{aligned}
\tag{16}
\]
Recursively taking the expectation of G_t m_{t-1} m_{t-1}^T G_t^T and its analogous terms gives the following expression:
\[
E_{y_{1:t} \mid \sigma^2}[m_t m_t^{\top}] = \left(\prod_{s=1}^{t} G_s\right) m_0 m_0^{\top} \left(\prod_{s=1}^{t} G_s\right)^{\top} + \sigma^2 \sum_{r=1}^{t} \left(\prod_{s=r+1}^{t} G_s\right) C_r X_r^{\top} Q_r^{-1} X_r C_r \left(\prod_{s=r+1}^{t} G_s\right)^{\top}
\tag{17}
\]
The expectation over σ² is trivial.
We now have all the terms needed to compute Var_{y_{1:t}, β_t, σ²}(β_t):
\[
\begin{aligned}
\mathrm{Var}_{y_{1:t}, \beta_t, \sigma^2}(\beta_t) &= \frac{b_0}{a_0 - 1}\left(M_t + \sum_{r=1}^{t}\left(\prod_{s=r+1}^{t} G_s\right) C_r X_r^{\top} Q_r^{-1} X_r C_r \left(\prod_{s=r+1}^{t} G_s\right)^{\top}\right) \\
&\qquad + \left(\prod_{s=1}^{t} G_s\right) m_0 m_0^{\top} \left(\prod_{s=1}^{t} G_s\right)^{\top} - E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t]\, E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t]^{\top} \\
&= \frac{b_0}{a_0 - 1}\left(M_t + \sum_{r=1}^{t}\left(\prod_{s=r+1}^{t} G_s\right) C_r X_r^{\top} Q_r^{-1} X_r C_r \left(\prod_{s=r+1}^{t} G_s\right)^{\top}\right)
\end{aligned}
\tag{18}
\]

A.1. Simplification Under Chosen Parameters. In the context of our paper, we choose G_t = I_p for all t. This dramatically simplifies Equations (14) and (18) into the respective expressions
\[
E_{y_{1:t}, \beta_t, \sigma^2}[\beta_t] = m_0
\tag{19}
\]
\[
\begin{aligned}
\mathrm{Var}_{y_{1:t}, \beta_t, \sigma^2}(\beta_t) &= \frac{b_0}{a_0 - 1}\left(M_t + \sum_{r=1}^{t} C_r X_r^{\top} Q_r^{-1} X_r C_r\right) \\
&= \frac{b_0}{a_0 - 1}\left(C_t - C_t X_t^{\top} Q_t^{-1} X_t C_t + \sum_{r=1}^{t} C_r X_r^{\top} Q_r^{-1} X_r C_r\right) \\
&= \frac{b_0}{a_0 - 1}\left(G_t M_{t-1} G_t^{\top} + W_t + \sum_{r=1}^{t-1} C_r X_r^{\top} Q_r^{-1} X_r C_r\right) \\
&= \frac{b_0}{a_0 - 1}\left(M_{t-1} + W_t + \sum_{r=1}^{t-1} C_r X_r^{\top} Q_r^{-1} X_r C_r\right)
\end{aligned}
\tag{20}
\]
Recursively simplifying Equation (20) yields
\[
\mathrm{Var}_{y_{1:t}, \beta_t, \sigma^2}(\beta_t) = \frac{b_0}{a_0 - 1}\left(M_0 + \sum_{r=1}^{t} W_r\right)
\tag{21}
\]
Finally, this simplifies to Var_{y_{1:t}, β_t, σ²}(β_t) = b_0 (t+1)/(a_0 − 1) I_p, since we also set M_0 = I_p and W_t = I_p for all t. More complicated expressions can be derived for different choices of M_0 and W_t (as well as of G_t). This prior may then be used to bridge time segments, allowing training to proceed (theoretically) independently between segments. Furthermore, in settings where closed-form expressions for the prior of the next time segment are intractable, the iterated expectations may be approximated, for example by Monte Carlo methods, to obtain the desired prior parameters for the next segment.
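To make the recursions concrete, the following Python sketch evaluates Equations (14) and (18) numerically for arbitrary sequences of system matrices; the function name and inputs are hypothetical and are not part of the paper's code base.

import numpy as np

def bridging_prior_moments(G, C, X, Q, M_t, m0, a0, b0):
    """Evaluate Equations (14) and (18) at time t = len(G).

    G, C, X, Q are lists holding G_1..G_t, C_1..C_t, X_1..X_t, Q_1..Q_t;
    M_t is the posterior scale matrix at time t and m0 the prior mean.
    """
    t = len(G)
    p = m0.shape[0]
    # Equation (14): left-expanding product (G_t ... G_1) m_0.
    mean = m0.copy()
    for s in range(t):
        mean = G[s] @ mean
    # Equation (18): accumulate the propagated innovation terms.
    total = M_t.copy()
    for r in range(t):
        A = np.eye(p)
        for s in range(r + 1, t):          # product over s = r+2, ..., t (1-indexed)
            A = G[s] @ A
        inner = C[r] @ X[r].T @ np.linalg.solve(Q[r], X[r] @ C[r])
        total += A @ inner @ A.T
    var = (b0 / (a0 - 1.0)) * total
    return mean, var

# Under the choices used in the paper (G_t = I_p, M_0 = W_t = I_p), Equation (21)
# collapses to Var(beta_t) = b0 (t + 1) / (a0 - 1) I_p, which can serve directly as
# the prior for the next time segment.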
Appendix B. Analytic Imputation Specifications

We detail here the imputation procedure used to generate X*_{1:T,u}. Each entry in X_{1:T,o} is indexed by subject ID, time step t, and latitude and longitude coordinates. The first and most basic step is to account for subject- and trajectory-specific covariates, because these can simply be substituted into the corresponding columns of X*_{1:T,u} without any elaborate imputation strategy. These columns correspond to the subject's Age, BMI, and Sex, and the relative time of day (from 7 am to 11 pm) at which the subject's trajectory started. While the PASTA-LA dataset records subject trajectories over two separate two-week periods per subject, about six months apart, it only records the subject's Age and BMI at the start of the subject's entry and treats them as constant throughout the study. The last variable, the relative time of day at which the subject began to move, is constructed from the first time the trajectory is recorded and is, by definition, constant per trajectory per subject.

To set up the imputation strategy, we borrow terminology from Wakayama and Banerjee [2024] and define γ_{i,t_0}(t) as the geographical trajectory indexed by subject i and starting date and time t_0, evaluated at time t; γ_{i,t_0}(t) encodes the latitude and longitude coordinates t time steps after t_0. We also define the set of relative time points of interest as τ = {1, ..., 61}, and the sets of known and unknown time points for subject i and starting date and time t_0 as τ_{i,t_0,o} and τ_{i,t_0,u}, respectively.

γ_{i,t_0} then has its geocoordinates linearly imputed for times t ∈ τ_{i,t_0,u}, resulting in the imputed trajectory γ*_{i,t_0}(t) that returns observed geocoordinates at times in τ_{i,t_0,o} and imputed geocoordinates at unobserved times in τ_{i,t_0,u}:
\[
\gamma^*_{i,t_0}(t) = \frac{t - t_1}{t_2 - t_1}\left(\gamma_{i,t_0}(t_2) - \gamma_{i,t_0}(t_1)\right) + \gamma_{i,t_0}(t_1),
\tag{22}
\]
where \(t_1 := \max\{t^* \mid t^* < t \text{ and } t^* \in \tau_{i,t_0,o}\}\) and \(t_2 := \min\{t^* \mid t^* > t \text{ and } t^* \in \tau_{i,t_0,o}\}\).

Call \(\mathcal{I}\) the set of columns to be imputed with Gaussian radial averaging, and let
\[
G(\gamma, r_S) := \{\gamma_{j,t_0'}(t') \mid \mathrm{hav}(\gamma_{j,t_0'}(t'), \gamma) < r_S, \ \forall j, t_0', t'\}
\]
denote the set of all geocoordinates, across all subjects, trajectory start times, and times spanned by the trajectories in the observed data, within a radius r_S of the coordinates γ; let X_{t,o,I,(i,t_0)} denote the entries of X_{t,o} in columns I and rows corresponding to subject i with start time t_0. The values X*_{1:T,I,u} of columns I of X*_{1:T,u} are then imputed from all observed locations within radial distance r_S of γ*_{i,t_0}(t), for each i and t_0, by the following Gaussian radial averaging procedure:
\[
X^*_{t,u,\mathcal{I},(i,t_0)} = \frac{\sum_{t'=1}^{T}\ \sum_{\gamma_{j,t_0'}(t') \in G(\gamma^*_{i,t_0}(t),\, r_S)} w\!\left(\gamma_{j,t_0'}(t'), \gamma^*_{i,t_0}(t)\right) X_{t',o,\mathcal{I},(j,t_0')}}{\sum_{t'=1}^{T}\ \sum_{\gamma_{j,t_0'}(t') \in G(\gamma^*_{i,t_0}(t),\, r_S)} w\!\left(\gamma_{j,t_0'}(t'), \gamma^*_{i,t_0}(t)\right)},
\tag{23}
\]
where
\[
w\!\left(\gamma_{j,t_0'}(t'), \gamma^*_{i,t_0}(t)\right) = \exp\!\left(-\frac{\mathrm{hav}\!\left(\gamma_{j,t_0'}(t'), \gamma^*_{i,t_0}(t)\right)^2}{2 r_S^2}\right)
\]
and hav(·, ·) denotes the haversine function, which we use to translate distances between geocoordinates into meters. [Footnote 9: Since the factor (√(2π) r_S)^{-1} appears outside the exponent for all weights and is constant for a fixed r_S, we omit it from w(·, ·) as redundant and to minimize the risk of floating-point errors from making w(·, ·) too small.] The covariates specified for imputation by radial averaging are Altitude, Slope, Distance to Parks, NDVI, and Distance from Home. In our application we set r_S = 200, averaging measurements within a 200 m radius of the imputed geolocations.
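A minimal Python sketch of Equations (22) and (23) is given below; the function names, the flattened arrays of observed geocoordinates and covariate values, and the handling of empty neighborhoods are illustrative choices of ours rather than part of the PASTA-LA processing code.

import numpy as np

EARTH_RADIUS_M = 6_371_000.0

def haversine(latlon1, latlon2):
    """Haversine distance in meters between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = np.radians(latlon1[0]), np.radians(latlon1[1])
    lat2, lon2 = np.radians(latlon2[0]), np.radians(latlon2[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def interpolate_trajectory(times_obs, coords_obs, t):
    """Equation (22): linear interpolation of (lat, lon) at an unobserved time t."""
    t1 = max(s for s in times_obs if s < t)
    t2 = min(s for s in times_obs if s > t)
    g1, g2 = coords_obs[times_obs.index(t1)], coords_obs[times_obs.index(t2)]
    w = (t - t1) / (t2 - t1)
    return tuple(w * (b - a) + a for a, b in zip(g1, g2))

def radial_average(target_coord, obs_coords, obs_values, r_s=200.0):
    """Equation (23): Gaussian-kernel average of observed covariates within r_s meters.

    obs_coords is a list of (lat, lon) pairs pooled over all subjects, start times,
    and time steps; obs_values is the matching array of covariate rows (columns I).
    """
    d = np.array([haversine(target_coord, c) for c in obs_coords])
    mask = d < r_s
    if not np.any(mask):
        return np.full(obs_values.shape[1], np.nan)  # no neighbors within the radius
    w = np.exp(-d[mask] ** 2 / (2.0 * r_s ** 2))      # constant factor omitted (Footnote 9)
    return (w[:, None] * obs_values[mask]).sum(axis=0) / w.sum()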
Appendix C. Neural Network Backpropagation with the Coupling Flow

We have parametrized the neural network f_ϕ with the parameter vector ϕ; here we discuss specifically what ϕ entails. In our implementation and setting, we specify ϕ as the set of all appended weights of every layer comprising the coupling flow f_ϕ.

C.1. The Coupling Flow. We select the coupling flow for its structure and relative simplicity: each layer consists of an affine coupling block (ACB), a function that incorporates affine transformations, i.e., a scaling and a translation [Dinh et al., 2017, Radev et al., 2022]. The specific way to structure an ACB differs between contexts and depends on the use case. To start, we define the following quantities related to the input vector u:
\[
d := |u|, \qquad u_1 := u_{1:\lfloor d/2 \rfloor}, \qquad u_2 := u_{(\lfloor d/2 \rfloor + 1):d}.
\]
The dual coupling block (DCB) utilized by Radev et al. [2022] is then defined in terms of the following operations:
\[
\begin{aligned}
v_1 &= u_2 \odot \exp(g_1(u_1)) + r_1(u_1) \\
v_2 &= u_1 \odot \exp(g_2(v_1)) + r_2(v_1)
\end{aligned}
\tag{24}
\]
where ⊙ denotes element-wise multiplication and g_1, g_2, r_1, and r_2 are four separate fully-connected neural networks whose weights we will train. [Footnote 10: Radev et al. [2022] had u_1 and u_2 in Equation (24) swapped in their paper; Equation (24) specifies the order as utilized by their software package.] Note that the inverse of the DCB is also readily obtained in the reverse direction:
\[
\begin{aligned}
u_1 &= (v_2 - r_2(v_1)) \odot \exp(-g_2(v_1)) \\
u_2 &= (v_1 - r_1(u_1)) \odot \exp(-g_1(u_1))
\end{aligned}
\tag{25}
\]
By the design of the DCB, none of g_1, g_2, r_1, or r_2 has to be invertible itself. We call Equation (24) a "dual coupling block" for reasons that will become apparent. It is generally easier to work with the single coupling block (SCB) of Dinh et al. [2017]:
\[
\begin{aligned}
v_1 &= u_1 \\
v_2 &= u_2 \odot \exp(g_1(u_1)) + r_1(u_1)
\end{aligned}
\tag{26}
\]
Note that we can rewrite Equation (24) in terms of compositions of Equation (26):
\[
f_{\mathrm{DCB}} = f_{2,\mathrm{SCB}} \circ P \circ f_{1,\mathrm{SCB}},
\tag{27}
\]
where P denotes the permutation matrix that swaps the two halves of the input vector. Taking the Jacobian of f_{1,SCB}, which we choose without loss of generality, is simple:
\[
J_{f_{1,\mathrm{SCB}}} = \begin{bmatrix} I_{\lfloor d/2 \rfloor} & O \\[2pt] \dfrac{\partial v_2}{\partial u_1} & \mathrm{diag}\!\left(\exp(g_1(u_1))\right) \end{bmatrix}
\tag{28}
\]
Its determinant can be computed effortlessly:
\[
\det J_{f_{1,\mathrm{SCB}}} = \prod_{i=1}^{d - \lfloor d/2 \rfloor} \exp\!\left(g_1(u_1)_i\right),
\tag{29}
\]
where g_1(u_1)_i denotes the i-th entry of g_1(u_1). It is clear that we can rewrite f_ϕ in terms of compositions of DCBs and, therefore, of SCBs. Denote by f_{1,ϕ_1}, ..., f_{Λ,ϕ_Λ} our Λ SCBs, each parametrized by its own set of weights ϕ_k, k = 1, ..., Λ. Then
\[
f_{\phi} = f_{\Lambda,\phi_\Lambda} \circ P_{\Lambda-1} \circ f_{\Lambda-1,\phi_{\Lambda-1}} \circ \cdots \circ P_2 \circ f_{2,\phi_2} \circ P_1 \circ f_{1,\phi_1},
\tag{30}
\]
with the natural constraint that Λ must be even when we layer DCBs, since each DCB consists of two SCBs per Equation (27). The permutation operations are included so that different combinations of the entries of the input can be computed with one another; we add that P_k for k even is chosen to be a permutation that exchanges two parts of its input in an orthogonal manner relative to the previous P_{k-2}, P_{k-4}, ..., P_2 permutation matrices, and that for k odd, P_k = P from Equation (27).

In practice, we specify every g_k and r_k of f_{k,ϕ_k}, k = 1, ..., Λ, as a single-hidden-layer perceptron with 128 units and ReLU activation, followed by an output layer that makes the dimension of the output conform to the other half of the input vector (particularly relevant when the length of u is odd). Taking g_1 as an example,
\[
g_1(u_1) = W_{1o}\, \mathrm{ReLU}(W_1 u_1),
\tag{31}
\]
so that W_1 and W_{1o} are flattened to form part of the vector ϕ_1, which is then incorporated into the entire ϕ. W_{1o} is the final output matrix with d − ⌊d/2⌋ rows, ensuring that g_1(u_1) has the same length as u_2; the entries of W_{1o} are also trainable. [Footnote 11: W_{1o} has no activation function, by implementation and by conventional usage of output layers.] ReLU is defined as ReLU(x) = max(x, 0) and is understood to apply entrywise when its argument is a vector or matrix. Its derivative is the Heaviside step function H(x) = 1_{x>0}, i.e., 1 if its argument is positive and 0 otherwise. [Footnote 12: The value of H(0) is sometimes explicitly set depending on the user or application domain, despite it not having a proper value.] Both for simplicity and because this is how we have utilized the component neural networks of the single and dual coupling flows in practice, we assume that all g_k and r_k follow the structure of g_1 in Equation (31) for all coupling flow layers of f_ϕ.
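To make the construction concrete, the following Python (NumPy) sketch implements a single coupling block in the sense of Equation (26), together with its inverse and the logarithm of the determinant in Equation (29). The class name, weight initialization, and hidden width are illustrative, and the conditioning of f_ϕ on the data y is omitted for brevity; this is a sketch under those assumptions, not the paper's implementation.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class SingleCouplingBlock:
    """Sketch of Equation (26) with g and r structured as in Equation (31)."""
    def __init__(self, d, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.d1 = d // 2                      # length of u1
        self.d2 = d - self.d1                 # length of u2
        # Hidden and output weights for the scaling (g) and translation (r) networks.
        self.Wg  = rng.normal(scale=0.1, size=(hidden, self.d1))
        self.Wgo = rng.normal(scale=0.1, size=(self.d2, hidden))
        self.Wr  = rng.normal(scale=0.1, size=(hidden, self.d1))
        self.Wro = rng.normal(scale=0.1, size=(self.d2, hidden))

    def g(self, u1):                          # scaling network, Equation (31)
        return self.Wgo @ relu(self.Wg @ u1)

    def r(self, u1):                          # translation network
        return self.Wro @ relu(self.Wr @ u1)

    def forward(self, u):
        u1, u2 = u[: self.d1], u[self.d1 :]
        v2 = u2 * np.exp(self.g(u1)) + self.r(u1)      # Equation (26)
        log_det = np.sum(self.g(u1))                   # log of Equation (29)
        return np.concatenate([u1, v2]), log_det

    def inverse(self, v):
        v1, v2 = v[: self.d1], v[self.d1 :]
        u2 = (v2 - self.r(v1)) * np.exp(-self.g(v1))   # invert Equation (26)
        return np.concatenate([v1, u2])

# A dual coupling block is the composition of two such blocks with the halves swapped
# in between (Equation (27)); stacking Λ blocks with permutations yields f_phi (Equation (30)).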
C.2. Backpropagation. We begin with the objective function in Equation (6), where the argmin at the end of each training cycle is taken over ϕ to minimize the objective. For clarity, we write out the loss term to be minimized at the end of one training step of BayesFlow and simplify it in terms of the quantities in Equation (29):
\[
\begin{aligned}
L\!\left(f_{\phi}; \theta^{(1:M)}, y^{(1:M)}\right) &= \frac{1}{M}\sum_{m=1}^{M}\left(\frac{1}{2}\left\|f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)\right\|^2 - \log\left|\det J^{(m)}_{f_{\phi}}\right|\right) \\
&= \frac{1}{M}\sum_{m=1}^{M}\left(\frac{1}{2} f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)^{\top} f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right) - \sum_{k=1}^{\Lambda}\sum_{i} g_k(u_{k-1})_i\right)
\end{aligned}
\tag{32}
\]
where u_{k-1} denotes the output of P_{k-1} ∘ f_{k-1,ϕ_{k-1}} ∘ ⋯ ∘ f_{1,ϕ_1} and g_k the scaling neural network analogous to g_1 in Equation (26), but specific to the k-th SCB layer. We avoid specifying the number of indices over which i ranges, as it differs between layers if the length of θ^{(m)} is odd.

In its most fundamental implementation, at the end of one training cycle, ϕ ← ϕ − α ∇_ϕ L(f_ϕ; θ^{(1:M)}, y^{(1:M)}), where α controls how far to allow the gradient descent update. In practice, the gradient term ∇_ϕ L(f_ϕ; θ^{(1:M)}, y^{(1:M)}) may be replaced by a more convenient expression or algorithm. The step size α is externally specified and controlled by various parameters, many supplied by the user. As we will see, specific strategies to control the gradient descent, such as ADAM, can be used that substitute the gradient term with other related quantities before updating ϕ; we defer to Subsection C.3 for a more detailed discussion.

Backpropagation is used to compute the gradient term ∇_ϕ L(f_ϕ; θ^{(1:M)}, y^{(1:M)}) efficiently without repeating terms between layers [Werbos, 1982, Rumelhart et al., 1986]. It generalizes across different machine learning architectures and loss functions. Spelling it out for dense neural networks, the basic use case that motivates iterated layer-wise weight updates, is well known in the machine learning community [Bishop, 2006, Murphy, 2012]; we repeat the calculations specifically for the coupling flow for illustrative purposes. We begin by taking a single term in the sum of Equation (32) and differentiating it with respect to a single weight, say w_{kij}, the weight in the i-th row and j-th column of the weight matrix W_k of the hidden layer in g_k:
\[
\begin{aligned}
\frac{\partial}{\partial w_{kij}} &\left(\frac{1}{2} f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)^{\top} f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right) - \sum_{k=1}^{\Lambda}\sum_{i'} g_k(u_{k-1})_{i'}\right) \\
&= f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)^{\top} \frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial w_{kij}} - \sum_{i'} \frac{\partial g_k(u_{k-1})_{i'}}{\partial w_{kij}}
\end{aligned}
\tag{33}
\]
We note that w_{kij} affects f_ϕ only via g_k. It is sensible, then, to define a_k := W_k u_{k-1} and to compute each derivative with respect to a_{ki}, as only the i-th row of W_k is relevant to the derivatives:
\[
\begin{aligned}
\frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial w_{kij}} &= \frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{ki}}\,\frac{d a_{ki}}{d w_{kij}} = \frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{ki}}\, u_{(k-1)j} \\
\frac{\partial g_k(u_{k-1})_i}{\partial w_{kij}} &= w_{k(o)ii}\, h_k'(a_{ki})\,\frac{d a_{ki}}{d w_{kij}} = w_{k(o)ii}\, h_k'(a_{ki})\, u_{(k-1)j}
\end{aligned}
\tag{34}
\]
where h_k is the activation function corresponding to g_k, so that h_k(a_k) is the output of the hidden layer of g_k, and w_{k(o)ii} is the entry in the i-th row and column of W_{ko}, the final output matrix of g_k.
(Note that carrying out the differentiation in the second line of Equation (34) with respect to g_k(u_{k-1})_{i'} for i' ≠ i would yield zero.) We may also decompose the partial derivative ∂f_ϕ/∂a_{ki} in terms of the functions at the output layer W_{ko} for k < Λ:
\[
\frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{ki}} = \sum_{i'} \frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{k(o)i'}}\,\frac{\partial a_{k(o)i'}}{\partial a_{ki}} = \sum_{i'} \frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{k(o)i'}}\, w_{k(o)i'i}\, h_k'(a_{ki})
\tag{35}
\]
where a_{k(o)} = W_{k(o)} h_k(W_k u_{k-1}) = g_k(u_{k-1}) and w_{k(o)i'i} denotes the entry in the i'-th row and i-th column of W_{ko}. We also require derivative terms with respect to the entries of W_{ko}, because it is a neural network weight layer that needs to be trained as well. Some of its entries are processed by g_{k+1} and r_{k+1}, and others by g_{k+2} and r_{k+2}, by virtue of the permutation matrices between successive SCBs. We therefore adopt the notation of children for the indices of the weight matrices to formalize this dependence structure for k < Λ − 1:
\[
\mathrm{ch}[k] = \{k(o)\}, \qquad \mathrm{ch}[k(o)] = \{k+1,\ (k+1)(r),\ k+2,\ (k+2)(r)\}.
\tag{36}
\]
While the weights of r_k are separate (and we label the post-weight multiplications accordingly by a_{k(r)} and a_{k(ro)}), Equation (36) applies analogously to the weights of r_k: substituting k(r) and k(ro) into the respective arguments of the two relations in Equation (36) keeps their right-hand sides the same. Extending Equation (35) to the output layers of layer k gives
\[
\frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{k(o)i}} = \sum_{c \in \mathrm{ch}[k(o)]} \sum_{i'} \frac{\partial f_{\phi}\!\left(\theta^{(m)}; y^{(m)}\right)}{\partial a_{ci'}}\, w_{ci'i}
\tag{37}
\]
What we have done in Equations (35) and (37) is to rewrite the derivative of f_ϕ with respect to a_k in terms of the derivative of f_ϕ with respect to the next layer, after a_k has been passed through the activation function h_k. This means that we gather the value of the output and its derivatives in one forward pass, and then compute the gradient with respect to each of the weights by passing the derivatives backwards, from the last layer in the network down to the first. Algorithm 7 demonstrates this procedure in full.

Overall, the backpropagation procedure is not significantly different for the coupling flow than for a regular feedforward network. Apart from the more complicated objective function, by virtue of its domain, the procedure by which backpropagation is used to update the weights ϕ remains largely the same, because the underlying neural network still performs a forward pass before the relevant quantities and derivatives are propagated backwards.

C.3. Parameter Update Control. Controlling the update step is desirable to avoid exploding or vanishing gradients, although too much control may result in convergence to a local minimum instead of tending towards a global minimum. We use the ADAptive Moment estimation (ADAM) optimizer, specifying the cosine decay strategy of Loshchilov and Hutter [2017] to avoid overshooting the optimal solution, and cap the norm of the gradient at 1 to avoid numerical instability from exploding or vanishing gradients. We detail ADAM [Kingma and Ba, 2017] in Algorithm 8.
ADAM presents an alternative to standard batch gradient descent, with the advantages that it works with sparse gradients and naturally performs a form of step-size annealing, that is, reducing the learning rate over time to improve convergence. Adopting ADAM as our alternative to gradient descent would replace line 13 in Algorithm 1 with a specification that ADAM is used; backpropagation is still employed in both approaches. [Footnote 13: Other popular optimizers include SGD, RMSprop, and Adagrad. The interested reader is encouraged to consult https://keras.io/api/optimizers/ for more information.] Here, we skip the fresh generation of θ^{(1:M)} and y^{(1:M)} present in TRAIN BAYESFLOW (Algorithm 1) for the sake of describing ADAM. We write α as the function α(·) because, in practice, the step size accounts for the computed gradient of the loss function, the number of iterations, or other variables that control the parameter update. These additional arguments are passed to α(·), e.g., α(∇_ϕ f_ϕ), to enable it to control the step size of each update.

Algorithm 7 One-step Backpropagation with the Coupling Flow
1: Input: Coupling flow neural network f_ϕ
2: Loss L(f_ϕ; θ^{(1:M)}, y^{(1:M)})
3: function BackPropCF(f_ϕ, L(f_ϕ; θ^{(1:M)}, y^{(1:M)}))
4:   for m = 1, ..., M do
5:     Compute f_ϕ(θ^{(m)}; y^{(m)}), and a_k, a_{k(o)}, a_{k(r)}, and a_{k(ro)} for k = 1, ..., Λ, through the forward pass.
6:     Compute the derivatives of f_ϕ with respect to the entries of W_{Λ(o)}, W_{Λ(ro)}, a_{Λ(o)}, and a_{Λ(ro)} through Equation (37).
7:     Compute the derivatives of f_ϕ with respect to the entries of W_Λ, W_{Λ(r)}, a_Λ, and a_{Λ(r)} through Equation (35).
8:     for k = Λ − 1, ..., 1 do
9:       Compute the derivatives of f_ϕ with respect to the entries of W_{k(o)}, W_{k(ro)}, a_{k(o)}, and a_{k(ro)} through Equation (37), using the derivatives from layer k + 1.
10:      Compute the derivatives of f_{k,ϕ_k} with respect to the entries of W_k, W_{k(r)}, a_k, and a_{k(r)} through Equation (35), using the derivatives from layer k(o).
11:      Repeat line 9 for g_k and r_k, but only with W_{k(o)} and a_{k(o)} for g_k, and W_{k(ro)} and a_{k(ro)} for r_k.
12:      Repeat line 10 for g_k and r_k, but only with W_k and a_k for g_k, and W_{k(r)} and a_{k(r)} for r_k.
13:    end for
14:  end for
15:  Gather the component partial derivatives of f_ϕ, g_k, and r_k, with respect to the entries of the weight matrices for all k, to compute ∇_ϕ L(f_ϕ; θ^{(m)}, y^{(m)}).
16:  Compute ∇_ϕ L(f_ϕ; θ^{(1:M)}, y^{(1:M)}) = Σ_{m=1}^{M} ∇_ϕ L(f_ϕ; θ^{(m)}, y^{(m)}).
17:  return ∇_ϕ L(f_ϕ; θ^{(1:M)}, y^{(1:M)})
18: end function

Algorithm 8 ADAptive Moment estimation (ADAM)
1: Input: INN f_ϕ, loss L(f_ϕ; θ^{(1:M)}, y^{(1:M)}), and initial parameter vector ϕ.
2: Step size α(·). ▷ Kingma and Ba [2017] recommend α(·) = 0.001 as a good default.
3: Exponential decay rates δ_1, δ_2 ∈ [0, 1) for the moment estimates.
4: ▷ δ_1 = 0.9, δ_2 = 0.999 are considered good default settings.
5: Number of iterations N_ITER.
6: function ADAM(α(·), δ_1, δ_2, L(f_ϕ; θ^{(1:M)}, y^{(1:M)}), ϕ)
7:   Initialize m_0 ← 0, v_0 ← 0, ε ← 10^{-8}.
8:   for j = 1, ..., N_ITER do
9:     g_j ← BACKPROPCF(f_ϕ; L(f_ϕ; θ^{(1:M)}, y^{(1:M)}))
10:    m_j ← δ_1 m_{j−1} + (1 − δ_1) g_j
11:    v_j ← δ_2 v_{j−1} + (1 − δ_2) g_j² ▷ g_j is squared entrywise.
12:    m̂_j ← m_j / (1 − δ_1^j) ▷ Correcting the bias of the first and second moment estimates.
13:    v̂_j ← v_j / (1 − δ_2^j)
14:    ϕ ← ϕ − α(·) m̂_j / (√(v̂_j) + ε) ▷ Division is also entrywise.
15:  end for
16:  return ϕ
17: end function
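For readers who prefer code, the following Python sketch mirrors Algorithm 8, adding the gradient-norm cap described in Section C.3. The function grad_fn is a hypothetical placeholder standing in for BACKPROPCF, and the constant step size stands in for α(·); the cosine decay of Loshchilov and Hutter [2017] would replace it with a schedule.

import numpy as np

def adam(grad_fn, phi, n_iter=1000, alpha=1e-3, delta1=0.9, delta2=0.999,
         eps=1e-8, max_grad_norm=1.0):
    """Sketch of Algorithm 8; grad_fn(phi) returns the gradient of the loss at phi."""
    m = np.zeros_like(phi)
    v = np.zeros_like(phi)
    for j in range(1, n_iter + 1):
        g = grad_fn(phi)
        norm = np.linalg.norm(g)
        if norm > max_grad_norm:                     # gradient capping, as in Section C.3
            g = g * (max_grad_norm / norm)
        m = delta1 * m + (1.0 - delta1) * g
        v = delta2 * v + (1.0 - delta2) * g ** 2     # entrywise square
        m_hat = m / (1.0 - delta1 ** j)              # bias corrections
        v_hat = v / (1.0 - delta2 ** j)
        phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return phi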
Appendix D. BayesFlow Implementation with the Summary Network

Radev et al. [2022] implement the BayesFlow software package with the aid of another neural network h_ψ, parametrized by ψ, that condenses the data into a summary statistic of fixed length; this is called the summary network. It is further enhanced by allowing the underlying model p(y | θ) to sample multiple instances N of the synthetic outcome at once and allowing N to differ between distinct Monte Carlo batches, so that h_ψ(y_{1:N}) takes in multiple samples of the data and learns a function that summarizes the data even for different numbers of synthetic data instances. Owing to the particular structure of the actigraph data, whose missingness results in trajectories of different lengths and positions in time and would make constructing a summary network on subsamples difficult to design, only N = 1 is relevant to our particular case. Additionally, while our usage of ABI depends on the package of Radev et al. [2022], we bypass the use of h_ψ by setting h_ψ = id, the identity transform.

Still, there is practical justification for the use of the summary network. Previous generative models, such as variational autoencoders, have utilized neural networks to encode data into more compact representations [Kingma and Welling, 2013]. A less analytically tractable model, such as the Lotka-Volterra model, may be better summarized by fitted though generally noninterpretable means when the summary procedure itself is not of interest; Radev et al. demonstrate the superior parameter recovery capabilities of a neural network implementation in their supplement [Radev et al., 2022, Supplement]. Notably, in our applications of BayesFlow with a nontrivial h_ψ, we found that a relatively small speedup can be measured [Footnote 14: Approximately 1.03 hours could be saved, corresponding to an approximate 8% speedup for the first training epoch, where the corresponding run time without the summary network would have been close to 13 hours.] if a summary network is employed for coarse blocking arrangements of Algorithm 6, specifically where the blocks of y^{(m)}_{T_{b−1}:T_b} totaled 2,130 entries per sample for T_{b−1} = 1 to T_b = 33 and the number of Monte Carlo samples used was 1,024. In this setting, h_ψ was specified as a three-layer neural network with a 1024-unit dense layer with a Rectified Linear Unit (ReLU) activation function, a dropout layer to avoid overfitting, and a p(T_b − T_{b−1} + 1) + 2-length dense layer for batch b, with no activation function, to summarize the statistics into the number of parameters needed to fully express the model at each time step.
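As an illustration only, a summary network of the kind described above could be specified in Keras roughly as follows; the function name and dropout rate are our own choices, and this sketch is not the implementation used in the paper.

import tensorflow as tf

def build_summary_network(p, T_b_minus_1, T_b, dropout_rate=0.1):
    # Output length p(T_b - T_{b-1} + 1) + 2, as specified in the text for batch b.
    output_dim = p * (T_b - T_b_minus_1 + 1) + 2
    return tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(output_dim, activation=None),
    ])

# Such a network would be passed to BayesFlow as the summary network alongside the
# invertible network f_phi; in our application we instead set h_psi to the identity.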
The inclusion of the summary network turns our problem into inference on θ | h_ψ(y), and the training goal involves finding the optimal values of ψ in addition to ϕ, with the expression on the left-hand side of Equation (4) now including the summary network term:
\[
\operatorname*{argmin}_{\phi, \psi}\ E_{p(y)}\,\mathrm{KL}\!\left(p(\theta \mid y)\,\big\|\,p\!\left(f_{\phi}^{-1}(z; h_{\psi}(y))\right)\right)
\tag{38}
\]
The change carries over analogously to the objective function (Equation (6)), which now requires minimization over ψ and includes h_ψ(y^{(m)}) in place of y^{(m)}:
\[
\widehat{\phi}, \widehat{\psi} = \operatorname*{argmin}_{\phi, \psi}\ \frac{1}{M}\sum_{m=1}^{M}\left(\frac{1}{2}\left\|f_{\phi}\!\left(\theta^{(m)}; h_{\psi}(y^{(m)})\right)\right\|^2 - \log\left|\det J^{(m)}_{f_{\phi}}\right|\right)
\tag{39}
\]
Equation (39) is also incorporated analogously into the procedure of Algorithm 1, with the addition of the summary network h_ψ and its related parameters and terms.
