The Extended Parameter Filter
Authors: Yusuf Erol, Lei Li, Bharath Ramsundar, Stuart Russell
The Extended Parameter Filter

Yusuf B. Erol† (yberol@eecs.berkeley.edu), Lei Li† (leili@cs.berkeley.edu), Bharath Ramsundar (rbharath@stanford.edu, Computer Science Department, Stanford University), Stuart Russell† (russell@cs.berkeley.edu)
† EECS Department, University of California, Berkeley

Appearing in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. Copyright 2013 by the author(s)/owner(s).

Abstract

The parameters of temporal models, such as dynamic Bayesian networks, may be modelled in a Bayesian context as static or atemporal variables that influence transition probabilities at every time step. Particle filters fail for models that include such variables, while methods that use Gibbs sampling of parameter variables may incur a per-sample cost that grows linearly with the length of the observation sequence. Storvik (2002) devised a method for incremental computation of exact sufficient statistics that, for some cases, reduces the per-sample cost to a constant. In this paper, we demonstrate a connection between Storvik's filter and a Kalman filter in parameter space and establish more general conditions under which Storvik's filter works. Drawing on an analogy to the extended Kalman filter, we develop and analyze, both theoretically and experimentally, a Taylor approximation to the parameter posterior that allows Storvik's method to be applied to a broader class of models. Our experiments on both synthetic examples and real applications show improvement over existing methods.

1. Introduction

Dynamic Bayesian networks are widely used to model the processes underlying sequential data such as speech signals, financial time series, genetic sequences, and medical or physiological signals.
State estimation or filtering—computing the posterior distribution over the state of a partially observable Markov process from a sequence of observations—is one of the most widely studied problems in control theory, statistics, and AI. Exact filtering is intractable except for certain special cases (linear–Gaussian models and discrete HMMs), but approximate filtering using the particle filter (a sequential Monte Carlo method) is feasible in many real-world applications (Arulampalam et al., 2002; Doucet and Johansen, 2011). In the machine learning context, model parameters may be represented by static parameter variables that define the transition and sensor model probabilities of the Markov process, but do not themselves change over time (Figure 1).

Figure 1. A state-space model with static parameters θ. X_{1:T} are latent states and Y_{1:T} are observations.

The posterior parameter distribution (usually) converges to a delta function at the true value in the limit of infinitely many observations. Unfortunately, particle filters fail for such models: the algorithm samples parameter values for each particle at time t = 0, but these remain fixed; over time, the particle resampling process removes all but one set of values; and these are highly unlikely to be correct. The degeneracy problem is especially severe in high-dimensional parameter spaces, whether discrete or continuous. Hence, although learning requires inference, the most successful inference algorithm for temporal models is inapplicable. Kantas et al. (2009) and Carvalho et al.
(2010) describe several algorithms that have been proposed to solve this degeneracy problem, but the issue remains open because known algorithms suffer from either bias or computational inefficiency. For example, the "artificial dynamics" approach (Liu and West, 2001) introduces a stochastic transition model for the parameter variables, allowing exploration of the parameter space, but this may result in biased estimates. Online EM algorithms (Andrieu et al., 2005) provide only point estimates of static parameters, may converge to local optima, and are biased unless used with the full smoothing distribution. The particle MCMC algorithm (Andrieu et al., 2010) converges to the true posterior, but requires computation growing with T, the length of the data sequence. The resample-move algorithm (Gilks and Berzuini, 2001) includes Gibbs sampling of parameter variables—that is, in Figure 1, sampling from P(θ | X_1, ..., X_T). This method requires O(T) computation per sample, leading Gilks and Berzuini to propose a sampling rate proportional to 1/T to preserve constant-time updates. Storvik (2002) and Polson et al. (2008) observe that a fixed-dimensional sufficient statistic (if one exists) for θ can be updated in constant time.
Storvik describes an algorithm for a specific family of linear-in-parameters transition models. We show that Storvik's algorithm is a special case of the Kalman filter in parameter space and identify a more general class of separable systems to which the same approach can be applied. By analogy with the extended Kalman filter, we propose a new algorithm, the extended parameter filter (EPF), that computes a separable approximation to the parameter posterior and allows a fixed-dimensional (approximate) sufficient statistic to be maintained. The method is quite general: for example, with a polynomial approximation scheme such as Taylor expansion, any analytic posterior can be handled.

Section 2 briefly reviews particle filters and Storvik's method and introduces our notion of separable models. Section 3 describes the EPF algorithm, and Section 4 discusses the details of a polynomial approximation scheme for arbitrary densities, which Section 4.2 then applies to estimate posterior distributions of static parameters. Section 5 provides empirical results comparing the EPF to other algorithms. All details of proofs are given in the appendix of the full version (Erol et al., 2013).

2. Background

In this section, we review state-space dynamical models and the basic framework of approximate filtering algorithms.

2.1. State-space model and filtering

Let Θ be a parameter space for a partially observable Markov process {X_t}_{t≥0}, {Y_t}_{t≥0} as shown in Figure 1 and defined as follows:

X_0 ∼ p(x_0 | θ)   (1)
X_t | x_{t−1} ∼ p(x_t | x_{t−1}, θ)   (2)
Y_t | x_t ∼ p(y_t | x_t, θ)   (3)

Here the state variables X_t are unobserved, and the observations Y_t are assumed conditionally independent of other observations given X_t. We assume in this section that states X_t, observations Y_t, and parameters θ are real-valued vectors in d, m, and p dimensions respectively.
Here both the transition and sensor models are parameterized by θ. For simplicity, we will assume in the following sections that only the transition model is parameterized by θ; however, the results in this paper can be generalized to cover sensor model parameters. The filtering density p(x_t | y_{0:t}, θ) obeys the following recursion:

p(x_t | y_{0:t}, θ) = p(y_t | x_t, θ) p(x_t | y_{0:t−1}, θ) / p(y_t | y_{0:t−1}, θ)
                   = [p(y_t | x_t, θ) / p(y_t | y_{0:t−1}, θ)] ∫ p(x_{t−1} | y_{0:t−1}, θ) p(x_t | x_{t−1}, θ) dx_{t−1}   (4)

where the update steps for p(x_t | y_{0:t−1}, θ) and p(y_t | y_{0:t−1}, θ) involve the evaluation of integrals that are not in general tractable.

2.2. Particle filtering

With known parameters, particle filters can approximate the posterior distribution over the hidden state X_t by a set of samples. The canonical example is the sequential importance sampling-resampling algorithm (SIR) (Algorithm 1).

Algorithm 1: Sequential importance sampling-resampling (SIR)
Input: N: number of particles; y_0, ..., y_T: observation sequence
Output: x̄_{1:T}^{1:N}
initialize x_0^i
for t = 1, ..., T do
    for i = 1, ..., N do
        sample x_t^i ∼ p(x_t | x_{t−1}^i)
        w_t^i ← p(y_t | x_t^i)
    sample N times: x̄_t^i ← Multinomial(w_t^{1:N}, x_t^{1:N})
    x_t^i ← x̄_t^i

The SIR filter has various appealing properties. It is modular, efficient, and easy to implement. The filter takes constant time per update, regardless of time T, and as the number of particles N → ∞, the empirical filtering density converges to the true marginal posterior density under suitable assumptions.

Particle filters can accommodate unknown parameters by adding parameter variables into the state vector with an "identity function" transition model. As noted in Section 1, this approach leads to degeneracy problems—especially for high-dimensional parameter spaces.
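Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration for a one-dimensional model, not the paper's implementation; the transition sampler and log-likelihood are supplied by the caller, and the function names are ours.

```python
import numpy as np

def sir_filter(y, sample_trans, loglik, x0, rng):
    """Bootstrap SIR (Algorithm 1): propagate, weight by the likelihood,
    then multinomially resample the particle set at every step."""
    x = np.array(x0, dtype=float)              # N particles for x_0
    N = len(x)
    means = []
    for yt in y:
        x = sample_trans(x, rng)               # sample x_t^i ~ p(x_t | x_{t-1}^i)
        logw = loglik(yt, x)                   # log w_t^i = log p(y_t | x_t^i)
        w = np.exp(logw - logw.max())          # stabilized weights
        w /= w.sum()
        x = x[rng.choice(N, size=N, p=w)]      # multinomial resampling
        means.append(x.mean())
    return np.array(means)
```

On a linear–Gaussian model with small observation noise, the filtered means track the observations closely, as the theory predicts.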
To ensure that some particle has initial parameter values with bounded error, the number of particles must grow exponentially with the dimension of the parameter space.

2.3. Storvik's algorithm

To avoid the degeneracy problem, Storvik (2002) modifies the SIR algorithm by adding a Gibbs sampling step for θ conditioned on the state trajectory in each particle (see Algorithm 2). The algorithm is developed in the SIS framework and consequently inherits the theoretical guarantees of SIS. Storvik considers unknown parameters in the state evolution model and assumes a perfectly known sensor model; his analysis can be generalized to unknown sensor models.

Storvik's approach becomes efficient in an on-line setting when a fixed-dimensional sufficient statistic S_t exists for the static parameter (i.e., when p(θ | x_{0:t}) = p(θ | S_t) holds). The important property of this algorithm is that the parameter value simulated at time t does not depend on the values simulated previously. This property prevents the impoverishment of the parameter values in particles. One limitation of the algorithm is that it can only be applied to models with fixed-dimensional sufficient statistics; Storvik (2002) analyzes such statistics for a specific family.

Storvik (2002) shows how to obtain a sufficient statistic in the context of what he calls the Gaussian system process, a transition model satisfying the equation

x_t = F_t^T θ + ε_t,   ε_t ∼ N(0, Q)   (5)

where θ is the vector of unknown parameters with a prior of N(θ_0, C_0) and F_t = F(x_{t−1}) is a matrix whose elements are possibly nonlinear functions of x_{t−1}. An arbitrary but known observation model is assumed.

Algorithm 2: Storvik's filter
Input: N: number of particles; y_0, ..., y_T: observation sequence
Output: x̄_{1:T}^{1:N}, θ^{1:N}
initialize x_0^i
for t = 1, ..., T do
    for i = 1, ..., N do
        sample θ^i ∼ p(θ | x_{0:t−1}^i)
        sample x_t^i ∼ p(x_t | x_{t−1}^i, θ^i)
        w_t^i ← p(y_t | x_t^i)
    sample N times: x̄_t^i ← Multinomial(w_t^{1:N}, x_t^{1:N})
    x_t^i ← x̄_t^i

Then the standard theory states that θ | x_{0:t} ∼ N(m_t, C_t), where the recursions for the mean and the covariance matrix are as follows:

D_t = F_t^T C_{t−1} F_t + Q
C_t = C_{t−1} − C_{t−1} F_t D_t^{−1} F_t^T C_{t−1}
m_t = m_{t−1} + C_{t−1} F_t D_t^{−1} (x_t − F_t^T m_{t−1})   (6)

Thus, m_t and C_t constitute a fixed-dimensional sufficient statistic for θ.

These updates are in fact a special case of Kalman filtering applied to the parameter space. Matching terms with the standard KF update equations (Kalman, 1960), we find that the transition matrix for the KF is the identity matrix, the transition noise covariance matrix is the zero matrix, the observation matrix for the KF is F_t, and the observation noise covariance matrix is Q. This correspondence is of course what one would expect, since the true parameter values are fixed (i.e., an identity transition). See the supplementary material (Erol et al., 2013) for the derivation.

2.4. Separability

In this section, we define a condition under which there exist efficient updates to parameters. Again, we focus on the state-space model as described in Figure 1 and Equations (1)–(3). The model can also be expressed as

x_t = f_θ(x_{t−1}) + v_t
y_t = g(x_t) + w_t   (7)

for some suitable f_θ, g, v_t, and w_t.

Definition 1. A system is separable if the transition function f_θ(x_{t−1}) can be written as f_θ(x_{t−1}) = l(x_{t−1})^T h(θ) for some l(·) and h(·) and if the stochastic i.i.d. noise v_t has log-polynomial density.

Theorem 1. For a separable system, there exist fixed-dimensional sufficient statistics for the Gibbs density, p(θ | x_{0:T}).
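The recursion (6) is straightforward to implement. The sketch below, with variable names of our choosing, applies it to the scalar model x_t = θ x_{t−1} + ε_t, for which F_t = x_{t−1}; it is an illustration of the update, not the paper's code.

```python
import numpy as np

def storvik_update(m, C, F, x_t, Q):
    """One step of recursion (6): update of p(theta | x_{0:t}) = N(m, C)
    for the Gaussian system process x_t = F_t^T theta + eps_t, eps_t ~ N(0, Q)."""
    F = np.atleast_2d(F)                 # shape (p, d): x_t = F^T theta + noise
    D = F.T @ C @ F + Q                  # D_t = F_t^T C_{t-1} F_t + Q
    K = C @ F @ np.linalg.inv(D)         # gain, so C_t = C_{t-1} - K F^T C_{t-1}
    C_new = C - K @ F.T @ C
    m_new = m + K @ (x_t - F.T @ m)      # m_t = m_{t-1} + K (x_t - F^T m_{t-1})
    return m_new, C_new
```

In the scalar case this is exactly recursive least squares for θ: the posterior mean converges to the true parameter and the posterior variance shrinks as data accumulate.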
The proof of Theorem 1 is straightforward by the Fisher–Neyman factorization theorem; more details are given in the supplementary material of the full version (Erol et al., 2013).

The Gaussian system process models defined in Equation (5) are separable, since the transition function F_t^T θ is linear in θ (take l(x_{t−1}) = F(x_{t−1}) and h(θ) = θ), but the property—and therefore Storvik's algorithm—applies to a much broader class of systems. Moreover, as we now show, non-separable systems may in some cases be well-approximated by separable systems, constructed by polynomial density approximation steps applied to either the Gibbs distribution p(θ | x_{0:t}) or to the transition model.

3. The extended parameter filter

Let us consider the following model:

x_t = f_θ(x_{t−1}) + v_t,   v_t ∼ N(0, Σ)   (8)

where x ∈ R^d, θ ∈ R^p, and f_θ(·) : R^d → R^d is a vector-valued function parameterized by θ. We assume that the transition function f_θ may be non-separable. Our algorithm will create a polynomial approximation to either the transition function or the Gibbs distribution p(θ | x_{0:t}).

To illustrate, consider the transition model f_θ(x_{t−1}) = sin(θ x_{t−1}), which is clearly non-separable. If we approximate the transition function with a Taylor series in θ centered around zero,

f_θ(x_{t−1}) ≈ f̂_θ(x_{t−1}) = x_{t−1} θ − (1/3!) x_{t−1}^3 θ^3 + ...   (9)

and use f̂ as an approximate transition model, the system becomes separable. Then Storvik's filter can be applied in constant time per update. This Taylor approximation leads to a log-polynomial density of the form of Equation (12).

Our approach is analogous to that of the extended Kalman filter (EKF). The EKF linearizes nonlinear transitions around the current estimates of the mean and covariance and uses Kalman filter updates for state estimation (Welch and Bishop, 1995).
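As a quick numerical check of expansion (9), truncating the sine series in θ gives a close approximation for moderate |θ x_{t−1}|. The helper below is hypothetical, written for illustration only.

```python
from math import factorial

def taylor_sin_transition(x_prev, theta, M):
    """Order-M Taylor expansion (in theta, around 0) of f_theta(x) = sin(theta * x):
    sin(theta*x) = sum_k (-1)^k (x*theta)^(2k+1) / (2k+1)!, truncated at degree M."""
    total = 0.0
    k = 0
    while 2 * k + 1 <= M:
        total += (-1) ** k * (x_prev * theta) ** (2 * k + 1) / factorial(2 * k + 1)
        k += 1
    return total
```

For θ = 0.7 and x_{t−1} = 1.5, the order-7 truncation already matches sin(θ x_{t−1}) to roughly four decimal places, while the order-1 (EKF-style) linearization is visibly off.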
Our proposed algorithm, which we call the extended parameter filter (EPF), approximates a non-separable system with a separable one, using a polynomial approximation of some arbitrary order. This separable, approximate model is well-suited for Storvik's filter and allows for constant-time updates to the Gibbs density of the parameters.

Algorithm 3: Extended Parameter Filter
Result: Approximate the Gibbs density p(θ | x_{0:t}, y_{0:t}) with the log-polynomial density p̂(θ | x_{0:t}, y_{0:t})
Output: x̃^1, ..., x̃^N
initialize x_0^i and S_0^i ← 0
for t = 1, ..., T do
    for i = 1, ..., N do
        S_t^i ← update(S_{t−1}^i, x_{t−1})   // update statistics for the polynomial approximation log p̂(θ | x̄_{0:t−1}, y_{0:t−1})
        sample θ^i ∼ p̂(θ | x̄_{0:t−1}^i, y_{0:t−1}) = p̂(θ | S_t^i)
        sample x_t^i ∼ p(x_t | x̄_{t−1}^i, θ^i)
        w_t^i ← p(y_t | x_t^i, θ^i)
    sample N times: (x̄_t^i, S̄_t^i) ← Multinomial(w_t^{1:N}, (x_t^{1:N}, S_t^{1:N}))
    x_t^i, S_t^i ← x̄_t^i, S̄_t^i

Although we have described an analogy to the EKF, it is important to note that the EPF can effectively use higher-order approximations instead of just the first-order linearizations used by the EKF. In the EKF, higher-order approximations lead to intractable integrals. The prediction integral for the EKF,

p(x_t | y_{0:t−1}) = ∫ p(x_{t−1} | y_{0:t−1}) p(x_t | x_{t−1}) dx_{t−1},

can be calculated for linear Gaussian transitions, in which case the mean and the covariance matrix are the tracked sufficient statistics. However, in the case of quadratic (or any higher-order) transitions, the above integral is no longer analytically tractable. In the case of the EPF, the parameter's transition model is the identity, and hence the prediction step is trivial. The filtering recursion is

p(θ | x_{0:t}) ∝ p(x_t | x_{t−1}, θ) p(θ | x_{0:t−1}).   (10)

We approximate the transition p(x_t | x_{t−1}, θ) with a log-polynomial density p̂ (log-polynomial in θ), so that the Gibbs density, which satisfies the recursion in Equation (10), has a fixed log-polynomial structure at each time step. Due to the polynomial structure, the approximate Gibbs density can be tracked in terms of its sufficient statistic (i.e., in terms of the coefficients of the polynomial). The log-polynomial structure is derived in Section 4.2. Pseudo-code for the EPF is shown in Algorithm 3.

Note that the approximated Gibbs density will be a log-multivariate polynomial density of fixed order (proportional to the order of the polynomial approximation). Sampling from such a density is not straightforward but can be done by Monte Carlo sampling. We suggest slice sampling (Neal, 2003) or the Metropolis-Hastings algorithm (Robert and Casella, 2005) for this purpose. Although some approximate sampling scheme is necessary, sampling from the approximated density remains a constant-time operation when the dimension of p̂ remains constant.

It is also important to note that performing a polynomial approximation for a p-dimensional parameter space may not be an easy task. However, we can reduce the computational complexity of such approximations by exploiting locality properties. For instance, if f_θ(·) = h_{θ_1,...,θ_{p−1}}(·) + g_{θ_p}(·), where h is separable and g is non-separable, we only need to approximate g.

In Section 4, we discuss the validity of the approximation in terms of the KL-divergence between the true and approximate densities. In Section 4.1, we analyze the distance between an arbitrary density and its approximate form with respect to the order of the polynomial. We show that the distance goes to zero super-exponentially.
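The Metropolis-Hastings option mentioned above is simple to sketch for a one-dimensional log-polynomial target. The helper below is our illustration, not the paper's implementation; note that once the polynomial coefficients (the sufficient statistics) are known, the cost per draw is independent of T.

```python
import numpy as np

def mh_sample_logpoly(coeffs, n_steps=2000, step=0.3, x0=0.0, rng=None):
    """Random-walk Metropolis-Hastings targeting p(theta) proportional to
    exp(P(theta)), where P is given by polynomial coefficients in ascending
    powers (so coeffs[k] multiplies theta**k)."""
    rng = np.random.default_rng() if rng is None else rng
    logp = lambda th: np.polynomial.polynomial.polyval(th, coeffs)
    x, lp = x0, logp(x0)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.standard_normal()   # symmetric proposal
        lp_prop = logp(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        samples[i] = x
    return samples
```

With coefficients [0, 0, −1/2] the target is a standard Gaussian, so the chain's long-run mean and variance should be close to 0 and 1 respectively.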
Section 4.2 analyzes the error for the static parameter estimation problem and introduces the form of the log-polynomial approximation.

4. Approximating the conditional distribution of parameters

In this section, we construct approximate sufficient statistics for arbitrary one-dimensional state-space models. We do so by exploiting log-polynomial approximations to arbitrary probability densities. We prove that such approximations can be made arbitrarily accurate. Then, we analyze the error introduced by the log-polynomial approximation for the arbitrary one-dimensional model.

4.1. Taylor approximation to an arbitrary density

Let us assume a distribution p (known only up to a normalization constant) expressed in the form p(x) ∝ exp(S(x)), where S(x) is an analytic function on the support of the distribution. In general we need a Monte Carlo method to sample from this arbitrary density. In this section, we describe an alternative, simpler sampling method. We propose that with a polynomial approximation P(x) (Taylor, Chebyshev, etc.) of sufficient order to the function S(x), we may sample from a distribution p̂ ∝ exp(P(x)) with a simpler (i.e., log-polynomial) structure. We show that the distance between the distributions p and p̂ goes to 0 as the order of the approximation increases.

Figure 2. Approximated PDFs to the order M (true density versus M = 4, 8, 12).

The following theorem is based on Taylor approximations; however, it can be generalized to handle any polynomial approximation scheme. The proof is given in (Erol et al., 2013).

Theorem 2. Let S(x) be an M + 1 times differentiable function with bounded derivatives, and let P(x) be its M-th order Taylor approximation. Then the KL-divergence between distributions p and p̂ converges to 0 super-exponentially as the order of approximation M → ∞.
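Theorem 2 can be sanity-checked numerically by comparing p ∝ exp(S) against its log-polynomial approximations on a grid. The sketch below uses the density S(x) = −x² + 5 sin²(x) that the paper employs for validation; the helper names and hardcoded Taylor coefficients of S are ours.

```python
import numpy as np

def grid_density(logf, grid):
    """Normalized discrete probabilities exp(logf(grid)) on a uniform grid."""
    lf = logf(grid)
    p = np.exp(lf - lf.max())
    return p / p.sum()

def grid_kl(p, q):
    """Discrete KL divergence D(p || q) between two grid densities."""
    return float(np.sum(p * np.log(p / q)))

# Example: S(x) = -x^2 + 5 sin^2(x).  Using sin^2(x) = x^2 - x^4/3 + 2x^6/45 - ...,
# the Taylor coefficients of S about 0 are 4, -5/3, 2/9, ... on even powers.
grid = np.linspace(-2.0, 2.0, 4001)
pv = np.polynomial.polynomial.polyval
p_true = grid_density(lambda x: -x**2 + 5.0 * np.sin(x) ** 2, grid)
p4 = grid_density(lambda x: pv(x, [0, 0, 4.0, 0, -5.0 / 3.0]), grid)
```

Increasing the truncation order should drive the discrete KL divergence toward zero, mirroring Figure 2.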
We validate the Taylor approximation approach for the log-density S(x) = −x² + 5 sin²(x); Figure 2 shows the result for this case.

4.2. Online approximation of the Gibbs density of the parameter

In our analysis, we will assume the following model:

x_t = f_θ(x_{t−1}) + v_t,   v_t ∼ N(0, σ²)
y_t = g(x_t) + w_t,   w_t ∼ N(0, σ_o²)

The posterior distribution for the static parameter is

p(θ | x_{0:T}) ∝ p(θ) ∏_{t=1}^{T} p(x_t | x_{t−1}, θ).

The product term, which requires linear time, is the bottleneck for this computation. A polynomial approximation to the transition function f_θ(·) (the Taylor approximation around θ = 0) is

f_θ(x_{t−1}) = h(x_{t−1}, θ) = Σ_{i=0}^{M} (1/i!) [∂^i h(x_{t−1}, θ)/∂θ^i]|_{θ=0} θ^i + R_M(θ)
            = Σ_{i=0}^{M} H_i(x_{t−1}) θ^i + R_M(θ) = f̂(θ) + R_M(θ),

where H_i(x_{t−1}) denotes the i-th Taylor coefficient and R_M is the error of the M-th order Taylor approximation. We define coefficients J_i(x_{t−1}) to satisfy (Σ_{i=0}^{M} H_i(x_{t−1}) θ^i)² = J_{2M}(x_{t−1}) θ^{2M} + ... + J_0(x_{t−1}) θ^0.

Let p̂(θ | x_{0:T}) denote the approximation to p(θ | x_{0:T}) obtained by using the polynomial approximation to f_θ introduced above.

Theorem 3. p̂(θ | x_{0:T}) is in the exponential family, with the log-polynomial density

log p(θ) + T(θ)^T η(x_0, ..., x_T),   (12)

where the sufficient statistic is T(θ) = [θ, θ², ..., θ^M, θ^{M+1}, ..., θ^{2M}]^T and the natural parameters are

η_i = (1/σ²) Σ_{k=1}^{T} x_k H_i(x_{k−1}) − (1/(2σ²)) Σ_{k=1}^{T} J_i(x_{k−1})   for i = 1, ..., M,
η_i = −(1/(2σ²)) Σ_{k=1}^{T} J_i(x_{k−1})   for i = M+1, ..., 2M.

The proof is given in the supplementary material. This form has finite-dimensional sufficient statistics. Standard sampling from p(θ | x_{0:t}) requires O(t) time, whereas with the polynomial approximation we can sample from this structured density of fixed dimension in constant time (given that the sufficient statistics are tracked).
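The natural parameters η of Theorem 3 can be accumulated online, one transition at a time. The sketch below (helper names ours) exploits the fact that the J coefficients are the coefficients of the squared polynomial, i.e. the self-convolution of the H coefficients; the result can be checked against the direct O(T) evaluation of the log-posterior terms.

```python
import numpy as np

def suff_stat_update(eta, x_prev, x_t, H_coeffs, sigma2):
    """Add one transition's contribution to the natural parameters eta of the
    log-polynomial Gibbs density (Theorem 3).  H_coeffs(x_prev) returns the
    coefficients H_0..H_M of the order-M Taylor expansion of f_theta in theta;
    eta holds the coefficients of theta^1..theta^{2M}."""
    H = np.asarray(H_coeffs(x_prev))         # degree-M polynomial in theta
    J = np.convolve(H, H)                    # coefficients of (sum_i H_i theta^i)^2
    M = len(H) - 1
    eta = eta.copy()
    eta[:M] += x_t * H[1:] / sigma2          # (1/sigma^2) x_t H_i(x_prev), i <= M
    eta -= J[1:] / (2 * sigma2)              # -(1/(2 sigma^2)) J_i(x_prev), i <= 2M
    return eta
```

For the SIN transition sin(θ x_{t−1}), H_0 = 0, so the dropped constant (power-zero) terms vanish and the accumulated log-polynomial agrees exactly with the direct per-transition sum.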
We can furthermore prove that sampling from this exponential-family approximation is asymptotically correct.

Theorem 4. Let p_T(θ | x_{0:T}) denote the Gibbs distribution and p̂_T(θ | x_{0:T}) its order-M exponential-family approximation. Assume that the parameter θ has support S_θ and finite variance. Then as M → ∞ and T → ∞, the KL-divergence between p_T and p̂_T goes to zero:

lim_{M,T→∞} D_KL(p_T ‖ p̂_T) = 0.

The proof is given in the supplementary material (Erol et al., 2013).

Note that the analysis above can be generalized to higher-dimensional parameters; the one-dimensional case is discussed for ease of exposition. In the general case, an order-M Taylor expansion for a p-dimensional parameter vector θ will have M^p terms. Then each update of the sufficient statistics will cost O(M^p) per particle per time step, yielding the total complexity O(N T M^p). However, as noted before, we can often exploit the local structure of f_θ to speed up the update step. Notice that in either case, the update cost per time step is fixed (independent of T).

5. Experiments

The algorithm is implemented for three specific cases. Note that the models discussed do not satisfy the Gaussian system process assumption of Storvik (2002).

Figure 3. Sinusoidal dynamical model (SIN). Shrinkage of the Gibbs density p(θ | x_{0:T}) with respect to time duration T. Note that as T grows, the Gibbs density converges to the true parameter value.

Figure 4. Sinusoidal dynamical model (SIN).
(a) Convergence of the approximate densities to the Gibbs density p(θ | x_{0:1024}) with respect to the approximation order M; (b) KL-divergence D_KL(p ‖ p̂) with respect to duration T and approximation order M.

5.1. Single-parameter nonlinear model

Consider the following model with sinusoidal transition dynamics (SIN):

x_t = sin(θ x_{t−1}) + v_t,   v_t ∼ N(0, σ²)
y_t = x_t + w_t,   w_t ∼ N(0, σ_obs²)   (13)

where σ = 1, σ_obs = 0.1, and the Gaussian prior for parameter θ is N(0, 0.2²). The observation sequence is generated by sampling from SIN with true parameter value θ = 0.7.

Figure 3 shows how the Gibbs density p(θ | x_{0:t}) shrinks with respect to time, hence verifying identifiability for this model. Notice that as T grows, the densities concentrate around the true parameter value. A Taylor approximation around θ = 0 has been applied to the transition function sin(θ x_t). Figure 4(a) shows the approximate densities for different polynomial orders for T = 1024. Notice that as the polynomial order increases, the approximate densities converge to the true density p(θ | x_{0:1024}). The KL-divergence D_KL(p ‖ p̂) for different polynomial orders M and different data lengths T is illustrated in Figure 4(b). The results are consistent with the theory developed in Section 4.1.

Figure 5. Sinusoidal dynamical model (SIN). (a): Particle filter (SIR) with N = 50000 particles; note the failure to converge to the true value of parameter θ (0.7, shown as the blue line). (b): Liu–West filter with N = 50000 particles. (c): EPF with N = 1000 particles and a 7th-order approximation. Note that both SIR and Liu–West fail to converge, while the EPF converges quickly even with orders of magnitude fewer particles.

The degeneracy of a bootstrap filter with N = 50000 particles can be seen in Figure 5(a). The Liu–West approach with N = 50000 particles is shown in Figure 5(b). The perturbation is θ_t = ρ θ_{t−1} + (1 − ρ) θ̄_{t−1} + √(1 − ρ²) std(θ_{t−1}) N(0, 1), where ρ = 0.9. Notice that even with N = 50000 particles and large perturbations, the Liu–West approach converges slowly compared to our method. Furthermore, for high-dimensional spaces, tuning the perturbation parameter ρ for Liu–West becomes difficult.

The EPF has been implemented on this model with N = 1000 particles and a 7th-order Taylor approximation to the posterior. The time complexity is O(NT). The mean and the standard deviation of the particles are shown in Figure 5(c).

5.2. Cauchy dynamical system

We consider the following model:

x_t = a x_{t−1} + Cauchy(0, γ)   (14)
y_t = x_t + N(0, σ_obs)   (15)

Here Cauchy(0, γ) is the Cauchy distribution centered at 0 with shape parameter γ = 1. We use a = 0.7 and σ_obs = 10, and the prior for the AR(1) parameter a is N(0, 0.2²). This model represents autoregressive time evolution with heavy-tailed noise; such heavy-tailed noise is observed in network traffic data and clickstream data. The standard Cauchy density we use is

f_v(v; 0, 1) = 1 / (π(1 + v²)) = exp(−log(π) − log(1 + v²)).

We approximate log(1 + v²) by its Taylor expansion at 0, v² − v⁴/2 + v⁶/3 − v⁸/4 + ...

Figure 6(a) shows the simulated hidden state and the observations (σ_obs = 10). Notice that the simulated process differs substantially from a standard AR(1) process due to the heavy-tailed noise.
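The expansion of log(1 + v²) used above is the log(1 + u) series with u = v²; note that it converges only for |v| < 1, so it is a local approximation around the center of the noise density. A hypothetical helper for checking the truncation error:

```python
import math

def log1p_sq_taylor(v, M):
    """Order-M truncation of the Taylor expansion of log(1 + v^2) at 0:
    sum_{k=1}^{M} (-1)^(k+1) * v^(2k) / k.  Converges only for |v| < 1."""
    return sum((-1) ** (k + 1) * v ** (2 * k) / k for k in range(1, M + 1))
```

Because the series is alternating for |v| < 1, the truncation error is bounded by the first omitted term, so a few terms already give high accuracy near the origin.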
Storvik's filter cannot handle this model, since the necessary sufficient statistics do not exist. Figure 6(b) displays the mean value estimated by a bootstrap filter with N = 50000 particles; as before, the bootstrap filter is unable to perform meaningful inference. Figure 6(c) shows the performance of the Liu–West filter with both N = 100 and N = 10000 particles: it does not converge for N = 100 particles and converges slowly for N = 10000 particles. Figure 6(d) demonstrates the rapid convergence of the EPF with only N = 100 particles and a 10th-order approximation. The time complexity is O(NT). Our empirical results confirm that the EPF proves useful for models with heavy-tailed stochastic perturbations.

5.3. Smooth transition AR model

The smooth transition AR (STAR) model is a smooth generalization of the self-exciting threshold autoregressive (SETAR) model (van Dijk et al., 2002). It is generally expressed in the following form:

    x_t = (a_1 x_{t-1} + a_2 x_{t-2} + ... + a_p x_{t-p}) [1 − G(x_{t-d}; γ, c)]
        + (b_1 x_{t-1} + b_2 x_{t-2} + ... + b_p x_{t-p}) [G(x_{t-d}; γ, c)] + ε_t

where ε_t is i.i.d. Gaussian with mean zero and variance σ^2, and G(·) is a nonlinear function of x_{t-d}, with delay d > 0.
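The general STAR recursion just defined can be sketched with the regime weight G(·) supplied as a function argument. This is our own minimal illustration (the name `star_step` is ours, not the authors'); it assumes d ≤ p and that `x_hist` holds the p most recent states, newest first:

```python
import numpy as np

def star_step(x_hist, a, b, G, sigma=1.0, d=1, rng=None):
    """One STAR transition: two AR(p) regimes mixed by the weight
    G(x_{t-d}) in [0, 1], plus Gaussian noise with std sigma."""
    rng = rng or np.random.default_rng(0)
    w = G(x_hist[d - 1])             # regime weight from the delayed state
    ar_a = float(np.dot(a, x_hist))  # first-regime AR part
    ar_b = float(np.dot(b, x_hist))  # second-regime AR part
    return (1 - w) * ar_a + w * ar_b + sigma * rng.standard_normal()
```

With G held constant at 0 or 1 (and sigma = 0 for a deterministic check), the step reduces to a pure AR(p) update in the corresponding regime.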
We will use the logistic function

    G(x_{t-d}; γ, c) = 1 / (1 + exp(−γ (x_{t-d} − c)))        (16)

For high γ values, the logistic function converges to the indicator function I(x_{t-d} > c), forcing STAR to converge to SETAR (SETAR corresponds to a switching linear–Gaussian system).

Figure 6. Cauchy dynamical system. (a): Example sequences for hidden states and observations. (b): Particle filter estimate with 50000 particles. (c): Liu–West filter with 100 and 10000 particles. (d): EPF using only 100 particles and a 10th-order approximation. Note that the EPF converges to the actual value of the parameter a (0.7, blue line), while SIR does not even with orders of magnitude more particles, nor does Liu–West with the same number of particles.

Figure 7. STAR model. (a): Shrinkage of the Gibbs density p(γ, c | x_{0:t}) with respect to time. (b): Liu–West filter using 50000 particles. (c): EPF using 100 particles and a 9th-order approximation. Note that the EPF's estimates for both parameters converge quickly to the actual values even with only 100 particles, while Liu–West does not converge at all.
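The STAR-to-SETAR limit is easy to verify numerically: as γ grows, the logistic G of Eq. (16) approaches the indicator I(x > c). A small sketch (our own illustration, not the authors' code):

```python
import numpy as np

def G(x, gamma, c):
    """Logistic regime weight from Eq. (16)."""
    return 1.0 / (1.0 + np.exp(-gamma * (x - c)))

# With c = 3 and a large gamma, G is ~0 just below the threshold and ~1
# just above it, so the STAR mixture degenerates to a hard regime switch.
below, above = G(2.9, 100.0, 3.0), G(3.1, 100.0, 3.0)
```

At the threshold itself G(c; γ, c) = 1/2 for every γ, which is why the smooth model interpolates between the two AR regimes rather than jumping.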
We will use p = d = 1, with a_1 = 0.9, b_1 = 0.1, and σ = 1 (corresponding to two different AR(1) processes with high and low memory). We attempt to estimate the parameters γ and c of the logistic function, which have true values γ = 1 and c = 3. Data of length T = 1000 is generated from the model under fixed parameter values, with observation model y_t = x_t + w_t, where w_t is additive Gaussian noise with mean zero and standard deviation σ_obs = 0.1. Figure 7(a) shows the shrinkage of the Gibbs density p(γ, c | x_{0:T}), verifying identifiability. The non-separable logistic term is approximated as

    1 / (1 + exp(−γ (x_{t-1} − c))) ≈ 1/2 − (1/4) γ (c − x_{t-1}) + (1/48) γ^3 (c − x_{t-1})^3 + ...

Figure 7(b) displays the failure of the Liu–West filter with N = 50000 particles. Figure 7(c) shows the mean values of γ and c from the EPF with only N = 100 particles and a 9th-order Taylor approximation. Sampling from the log-polynomial approximate density is done through the random-walk Metropolis–Hastings algorithm. For each particle path, at each time step t, the Metropolis–Hastings sampler is initialized from the parameter values at t − 1. The burn-in period is set to 0, so only one MH step is taken per time step (i.e., a proposed sample is accepted if it is more likely, and otherwise accepted with the usual Metropolis probability). The whole filter has time complexity O(NT).

6. Conclusion

Learning the parameters of temporal probability models remains a significant open problem for practical applications. We have proposed the extended parameter filter (EPF), a novel approximate inference algorithm that combines Gibbs sampling of parameters with computation of approximate sufficient statistics. The update time for the EPF is independent of the length of the observation sequence. Moreover, the algorithm has provable error bounds and handles a wide variety of models.
Our experiments confirm these properties and illustrate difficult cases on which the EPF works well. One limitation of our algorithm is the complexity of the Taylor approximation for high-dimensional parameter vectors. We noted that, in some cases, the process can be decomposed into lower-dimensional subproblems; automating this step would be beneficial.

References

C. Andrieu, A. Doucet, and V. Tadic. On-line parameter estimation in general state-space models. In Proceedings of the 44th Conference on Decision and Control, pages 332–337, 2005.

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

Carlos M. Carvalho, Michael S. Johannes, Hedibert F. Lopes, and Nicholas G. Polson. Particle learning and smoothing. Statistical Science, 25:88–106, 2010. doi: 10.1214/10-STS325.

Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. The Oxford Handbook of Nonlinear Filtering, pages 4–6, December 2011.

Yusuf Erol, Lei Li, Bharath Ramsundar, and Stuart J. Russell. The extended parameter filter. Technical Report UCB/EECS-2013-48, EECS Department, University of California, Berkeley, May 2013. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-48.html.

Walter R. Gilks and Carlo Berzuini. Following a moving target – Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127–146, 2001.

Rudolf E. Kalman. A new approach to linear filtering and prediction problems.
Transactions of the ASME – Journal of Basic Engineering, 82 (Series D):35–45, 1960.

Nicholas Kantas, Arnaud Doucet, Sumeetpal Sindhu Singh, and Jan Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In 15th IFAC Symposium on System Identification, volume 15, pages 774–785, 2009.

Jane Liu and Mike West. Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice. 2001.

Radford M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

Nicholas G. Polson, Jonathan R. Stroud, and Peter Müller. Practical filtering with sequential parameter learning. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):413–428, 2008.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.

Geir Storvik. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50(2):281–289, 2002.

Dick van Dijk, Timo Teräsvirta, and Philip Hans Franses. Smooth transition autoregressive models – a survey of recent developments. Econometric Reviews, 21:1–47, 2002.

Greg Welch and Gary Bishop. An introduction to the Kalman filter, 1995.

Appendices

A. Storvik's filter as a Kalman filter

Let us consider the following model:

    x_t = A x_{t-1} + v_t,    v_t ∼ N(0, Q)
    y_t = H x_t + w_t,        w_t ∼ N(0, R)        (17)

We will call the MMSE estimate that the Kalman filter returns x_{t|t} = E[x_t | y_{0:t}], with variance P_{t|t} = cov(x_t | y_{0:t}). The update for the conditional mean estimate is then as follows.
    x_{t|t} = A x_{t-1|t-1} + K_t (y_t − H A x_{t-1|t-1}),
    K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1},

while the estimation covariance satisfies

    P_{t|t-1} = A P_{t-1|t-1} A^T + Q
    P_{t|t} = (I − K_t H) P_{t|t-1}        (18)

Matching the terms above to the updates in equation 6, one obtains a linear model in which the transition matrix is A = I, the observation matrix is H = F_t, the state noise covariance matrix is Q = 0, and the observation noise covariance matrix is R = Q.

B. Proof of theorem 1

Let us assume that x ∈ R^d, θ ∈ R^p, and f_θ(·): R^d → R^d is a vector-valued function parameterized by θ. Moreover, by the assumption of separability, f_θ(x_{t-1}) = l(x_{t-1})^T h(θ), where l(·): R^d → R^{m×d}, h(·): R^p → R^m, and m is an arbitrary constant. The stochastic perturbation has the log-polynomial density

    p(v_t) ∝ exp(Λ_1 v_t + v_t^T Λ_2 v_t + ...).

For mathematical simplicity, let us analyze the case p(v_t) ∝ exp(Λ_1 v_t + v_t^T Λ_2 v_t).

Proof.

    log p(θ | x_{0:T}) ∝ log p(θ) + Σ_{t=1}^T log p(x_t | x_{t-1}, θ)
    ∝ log p(θ) + Σ_{t=1}^T [ Λ_1 (x_t − l(x_{t-1})^T h(θ)) + (x_t − l(x_{t-1})^T h(θ))^T Λ_2 (x_t − l(x_{t-1})^T h(θ)) ]
    ∝ log p(θ) + ( Σ_{t=1}^T −(Λ_1 + 2 x_t^T Λ_2) l(x_{t-1})^T ) h(θ)
      + h(θ)^T ( Σ_{t=1}^T l(x_{t-1}) Λ_2 l(x_{t-1})^T ) h(θ) + constants,

where the first parenthesized sum is S_1 and the second is S_2. Therefore, sufficient statistics S_1 ∈ R^{1×m} and S_2 ∈ R^{m×m} exist. The analysis can be generalized to higher-order terms in v_t in similar fashion. ∎

C. Proof of theorem 2

Proposition 1. Let S(x) be an (M + 1)-times differentiable function and P(x) its order-M Taylor approximation. Let I = (x − a, x + a) be an open interval around x. Let R(x) be the remainder function, so that S(x) = P(x) + R(x).
Suppose there exists a constant U such that |S^{(M+1)}(y)| ≤ U for all y ∈ I. We may then bound

    |R(y)| ≤ U a^{M+1} / (M + 1)!    for all y ∈ I.

We define the following terms:

    ε = U a^{M+1} / (M + 1)!
    Z = ∫_I exp(S(x)) dx
    Ẑ = ∫_I exp(P(x)) dx

Since exp(·) is monotone increasing and |S(x) − P(x)| ≤ ε, we can derive tight bounds relating Z and Ẑ:

    Z = ∫_I exp(S(x)) dx ≤ ∫_I exp(P(x) + ε) dx = Ẑ exp(ε)
    Z = ∫_I exp(S(x)) dx ≥ ∫_I exp(P(x) − ε) dx = Ẑ exp(−ε)

Proof.

    D_KL(p ‖ p̂) = ∫_I ln(p(x)/p̂(x)) p(x) dx
    = ∫_I [ S(x) − P(x) + ln(Ẑ) − ln(Z) ] p(x) dx
    ≤ ∫_I |S(x) − P(x)| p(x) dx + ∫_I |ln(Ẑ) − ln(Z)| p(x) dx
    ≤ 2ε ∝ a^{M+1} / (M + 1)! ≈ (1 / √(2π(M+1))) (a e / (M + 1))^{M+1}

where the last approximation follows from Stirling's approximation. Therefore, D_KL(p ‖ p̂) → 0 as M → ∞. ∎

D. Proof of theorem 3

Proof.

    log p̂(θ | x_{0:T}) = log ( p(θ) Π_{k=0}^T p̂(x_k | x_{k-1}, θ) )
    = log p(θ) + Σ_{k=0}^T log p̂(x_k | x_{k-1}, θ)

We can calculate the form of log p̂(x_k | x_{k-1}, θ) explicitly:

    log p̂(x_k | x_{k-1}, θ) = log N(f̂(x_{k-1}, θ), σ^2)
    = −log(σ √(2π)) − (x_k − f̂(x_{k-1}, θ))^2 / (2σ^2)
    = −log(σ √(2π)) − (x_k^2 − 2 x_k f̂(x_{k-1}, θ) + f̂(x_{k-1}, θ)^2) / (2σ^2)
    = −log(σ √(2π)) − x_k^2/(2σ^2) + ( Σ_{i=0}^M x_k H_i(x_{k-1}) θ^i ) / σ^2 − ( Σ_{i=0}^{2M} J_i(x_{k-1}) θ^i ) / (2σ^2)

where H_i(·) are the coefficients of the order-M approximation f̂(x, θ) = Σ_i H_i(x) θ^i and J_i(·) those of f̂(x, θ)^2. Using this expansion, we calculate

    log p̂(θ | x_{0:T}) = log p(θ) + Σ_{k=0}^T log p̂(x_k | x_{k-1}, θ)
    = log p(θ) − (T + 1) log(σ √(2π)) − (1/(2σ^2)) ( Σ_{k=0}^T x_k^2 ) − T(θ)^T η(x_0, ..., x_T)

where we expand T(θ)^T η(x_0, ..., x_T) as in equation 3. This form for log p̂(θ | x_{0:T}) is in the exponential family. ∎

E. Proof of theorem 4

Proof. Assume that the function f has bounded derivatives and bounded support I. Then the maximum error satisfies |f_θ(x_{k-1}) − f̂_θ(x_{k-1})| ≤ ε_k.
It follows that

    f̂_θ(x_{k-1})^2 − f_θ(x_{k-1})^2 = −ε_k^2 − 2 f̂_θ(x_{k-1}) ε_k ≈ −2 f̂_θ(x_{k-1}) ε_k.

Then the KL-divergence between the real posterior and the approximated posterior satisfies

    D_KL(p_T ‖ p̂_T) = ∫_{S_θ} (1/σ^2) ( Σ_{k=1}^T ε_k (x_k − f̂_θ(x_{k-1})) ) p_T(θ | x_{0:T}) dθ.        (19)

Moreover, recall that as T → ∞ the posterior shrinks to δ(θ − θ*) by the assumption of identifiability. Then we can rewrite the KL-divergence as (assuming the Taylor approximation is centered around θ_c)

    lim_{T→∞} D_KL(p_T ‖ p̂_T)
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k ∫_{S_θ} (x_k − f̂_θ(x_{k-1})) p_T(θ | x_{0:T}) dθ        (20)
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k ( x_k − Σ_{i=0}^M H_i(x_{k-1}) ∫_{S_θ} (θ − θ_c)^i p(θ | x_{0:T}) dθ )        (21)
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k ( x_k − Σ_{i=0}^M H_i(x_{k-1}) (θ* − θ_c)^i )

If the center of the Taylor approximation θ_c is the true parameter value θ*, we can show that

    lim_{T→∞} D_KL(p_T ‖ p̂_T) = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k (x_k − f_{θ*}(x_{k-1}))
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k v_k = 0        (22)

where the final statement follows from the law of large numbers. Thus, as T → ∞, the Taylor approximation of any order will converge to the true posterior, given that θ_c = θ*. For an arbitrary center value θ_c,

    D_KL(p_T ‖ p̂_T) = (1/σ^2) Σ_{k=1}^T ε_k ( x_k − Σ_{i=0}^M H_i(x_{k-1}) (θ* − θ_c)^i )        (23)

Notice that ε_k ∝ 1/(M + 1)! (by our assumptions that f has bounded derivatives and is supported on the interval I) and H_i(·) ∝ 1/M!. The inner summation is bounded, since M! > a^M for any fixed a ∈ R as M → ∞. Therefore, as M → ∞, D_KL(p ‖ p̂) → 0. ∎
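As a sanity check on the remainder bound ε = U a^{M+1}/(M + 1)! used throughout these proofs, the following sketch (our own illustration, not the authors' code) compares the bound against the actual truncation error, using sin as an example function whose derivatives are all bounded by U = 1:

```python
import math

def remainder_bound(U, a, M):
    """Lagrange bound on the order-M Taylor remainder over (x - a, x + a)
    when the (M+1)-th derivative is bounded in magnitude by U."""
    return U * a ** (M + 1) / math.factorial(M + 1)

def sin_taylor(y, M):
    """Order-M Maclaurin polynomial of sin evaluated at y."""
    return sum((-1) ** k * y ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range((M + 1) // 2))

# For sin with U = 1 on the interval of radius a = 1, the actual error of
# the 7th-order polynomial should lie below the bound 1/8!.
err = abs(sin_taylor(1.0, 7) - math.sin(1.0))
bound = remainder_bound(1.0, 1.0, 7)
```

The factorial in the denominator dominates the a^{M+1} growth, so the bound (and hence the 2ε KL bound above) vanishes super-exponentially in M.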