The Extended Parameter Filter
Authors: Yusuf Erol, Lei Li, Bharath Ramsundar, Stuart Russell
The Extended Parameter Filter

Yusuf B. Erol† (yberol@eecs.berkeley.edu), Lei Li† (leili@cs.berkeley.edu), Bharath Ramsundar (rbharath@stanford.edu, Computer Science Department, Stanford University), Stuart Russell† (russell@cs.berkeley.edu)
† EECS Department, University of California, Berkeley

Appearing in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. Copyright 2013 by the author(s)/owner(s).

Abstract

The parameters of temporal models, such as dynamic Bayesian networks, may be modelled in a Bayesian context as static or atemporal variables that influence transition probabilities at every time step. Particle filters fail for models that include such variables, while methods that use Gibbs sampling of parameter variables may incur a per-sample cost that grows linearly with the length of the observation sequence. Storvik (2002) devised a method for incremental computation of exact sufficient statistics that, for some cases, reduces the per-sample cost to a constant. In this paper, we demonstrate a connection between Storvik's filter and a Kalman filter in parameter space and establish more general conditions under which Storvik's filter works. Drawing on an analogy to the extended Kalman filter, we develop and analyze, both theoretically and experimentally, a Taylor approximation to the parameter posterior that allows Storvik's method to be applied to a broader class of models. Our experiments on both synthetic examples and real applications show improvement over existing methods.

1. Introduction

Dynamic Bayesian networks are widely used to model the processes underlying sequential data such as speech signals, financial time series, genetic sequences, and medical or physiological signals.
State estimation or filtering—computing the posterior distribution over the state of a partially observable Markov process from a sequence of observations—is one of the most widely studied problems in control theory, statistics, and AI. Exact filtering is intractable except for certain special cases (linear–Gaussian models and discrete HMMs), but approximate filtering using the particle filter (a sequential Monte Carlo method) is feasible in many real-world applications (Arulampalam et al., 2002; Doucet and Johansen, 2011). In the machine learning context, model parameters may be represented by static parameter variables that define the transition and sensor model probabilities of the Markov process, but do not themselves change over time (Figure 1).

Figure 1. A state-space model with static parameters θ. X_{1:T} are latent states and Y_{1:T} are observations.

The posterior parameter distribution (usually) converges to a delta function at the true value in the limit of infinitely many observations. Unfortunately, particle filters fail for such models: the algorithm samples parameter values for each particle at time t = 0, but these remain fixed; over time, the particle resampling process removes all but one set of values; and these are highly unlikely to be correct. The degeneracy problem is especially severe in high-dimensional parameter spaces, whether discrete or continuous. Hence, although learning requires inference, the most successful inference algorithm for temporal models is inapplicable. Kantas et al. (2009) and Carvalho et al.
(2010) describe several algorithms that have been proposed to solve this degeneracy problem, but the issue remains open because known algorithms suffer from either bias or computational inefficiency. For example, the "artificial dynamics" approach (Liu and West, 2001) introduces a stochastic transition model for the parameter variables, allowing exploration of the parameter space, but this may result in biased estimates. Online EM algorithms (Andrieu et al., 2005) provide only point estimates of static parameters, may converge to local optima, and are biased unless used with the full smoothing distribution. The particle MCMC algorithm (Andrieu et al., 2010) converges to the true posterior, but requires computation growing with T, the length of the data sequence. The resample-move algorithm (Gilks and Berzuini, 2001) includes Gibbs sampling of parameter variables—that is, in Figure 1, sampling from P(θ | X_1, ..., X_T). This method requires O(T) computation per sample, leading Gilks and Berzuini to propose a sampling rate proportional to 1/T to preserve constant-time updates. Storvik (2002) and Polson et al. (2008) observe that a fixed-dimensional sufficient statistic (if one exists) for θ can be updated in constant time.
Storvik describes an algorithm for a specific family of linear-in-parameters transition models. We show that Storvik's algorithm is a special case of the Kalman filter in parameter space and identify a more general class of separable systems to which the same approach can be applied. By analogy with the extended Kalman filter, we propose a new algorithm, the extended parameter filter (EPF), that computes a separable approximation to the parameter posterior and allows a fixed-dimensional (approximate) sufficient statistic to be maintained. The method is quite general: for example, with a polynomial approximation scheme such as Taylor expansion, any analytic posterior can be handled.

Section 2 briefly reviews particle filters and Storvik's method and introduces our notion of separable models. Section 3 describes the EPF algorithm, and Section 4 discusses the details of a polynomial approximation scheme for arbitrary densities, which Section 4.2 then applies to estimate posterior distributions of static parameters. Section 5 provides empirical results comparing the EPF to other algorithms. All details of proofs are given in the appendix of the full version (Erol et al., 2013).

2. Background

In this section, we review state-space dynamical models and the basic framework of approximate filtering algorithms.

2.1. State-space model and filtering

Let Θ be a parameter space for a partially observable Markov process {X_t}_{t≥0}, {Y_t}_{t≥0} as shown in Figure 1 and defined as follows:

X_0 ∼ p(x_0 | θ)   (1)
X_t | x_{t−1} ∼ p(x_t | x_{t−1}, θ)   (2)
Y_t | x_t ∼ p(y_t | x_t, θ)   (3)

Here the state variables X_t are unobserved, and the observations Y_t are assumed conditionally independent of other observations given X_t. We assume in this section that states X_t, observations Y_t, and parameters θ are real-valued vectors in d, m, and p dimensions respectively.
Here both the transition and sensor models are parameterized by θ. For simplicity, we will assume in the following sections that only the transition model is parameterized by θ; however, the results in this paper can be generalized to cover sensor model parameters. The filtering density p(x_t | y_{0:t}, θ) obeys the following recursion:

p(x_t | y_{0:t}, θ) = p(y_t | x_t, θ) p(x_t | y_{0:t−1}, θ) / p(y_t | y_{0:t−1}, θ)
                   = [p(y_t | x_t, θ) / p(y_t | y_{0:t−1}, θ)] ∫ p(x_{t−1} | y_{0:t−1}, θ) p(x_t | x_{t−1}, θ) dx_{t−1}   (4)

where the update steps for p(x_t | y_{0:t−1}, θ) and p(y_t | y_{0:t−1}, θ) involve the evaluation of integrals that are not in general tractable.

2.2. Particle filtering

With known parameters, particle filters can approximate the posterior distribution over the hidden state X_t by a set of samples. The canonical example is the sequential importance sampling-resampling algorithm (SIR) (Algorithm 1).

Algorithm 1: Sequential importance sampling-resampling (SIR)
Input: N: number of particles; y_0, ..., y_T: observation sequence
Output: x̄_{1:T}^{1:N}
initialize x_0^i
for t = 1, ..., T do
    for i = 1, ..., N do
        sample x_t^i ∼ p(x_t | x_{t−1}^i)
        w_t^i ← p(y_t | x_t^i)
    sample N times: x̄_t^i ← Multinomial(w_t^{1:N}, x_t^{1:N})
    x_t^i ← x̄_t^i

The SIR filter has various appealing properties. It is modular, efficient, and easy to implement. The filter takes constant time per update, regardless of time T, and as the number of particles N → ∞, the empirical filtering density converges to the true marginal posterior density under suitable assumptions.

Particle filters can accommodate unknown parameters by adding parameter variables into the state vector with an "identity function" transition model. As noted in Section 1, this approach leads to degeneracy problems—especially for high-dimensional parameter spaces.
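Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration for a one-dimensional model, not the paper's implementation; the transition sampler and log-likelihood are supplied by the caller, and the function names are ours.

```python
import numpy as np

def sir_filter(y, sample_trans, loglik, x0, rng):
    """Bootstrap SIR (Algorithm 1): propagate, weight by the likelihood,
    then multinomially resample the particle set at every step."""
    x = np.array(x0, dtype=float)              # N particles for x_0
    N = len(x)
    means = []
    for yt in y:
        x = sample_trans(x, rng)               # sample x_t^i ~ p(x_t | x_{t-1}^i)
        logw = loglik(yt, x)                   # log w_t^i = log p(y_t | x_t^i)
        w = np.exp(logw - logw.max())          # stabilized weights
        w /= w.sum()
        x = x[rng.choice(N, size=N, p=w)]      # multinomial resampling
        means.append(x.mean())
    return np.array(means)
```

On a linear–Gaussian model with small observation noise, the filtered means track the observations closely, as the theory predicts.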
To ensure that some particle has initial parameter values with bounded error, the number of particles must grow exponentially with the dimension of the parameter space.

2.3. Storvik's algorithm

To avoid the degeneracy problem, Storvik (2002) modifies the SIR algorithm by adding a Gibbs sampling step for θ conditioned on the state trajectory in each particle (see Algorithm 2). The algorithm is developed in the SIS framework and consequently inherits the theoretical guarantees of SIS. Storvik considers unknown parameters in the state evolution model and assumes a perfectly known sensor model; his analysis can be generalized to unknown sensor models.

Storvik's approach becomes efficient in an on-line setting when a fixed-dimensional sufficient statistic S_t exists for the static parameter (i.e., when p(θ | x_{0:t}) = p(θ | S_t) holds). The important property of this algorithm is that the parameter value simulated at time t does not depend on the values simulated previously. This property prevents the impoverishment of the parameter values in particles. One limitation of the algorithm is that it can only be applied to models with fixed-dimensional sufficient statistics; Storvik (2002) analyzes such statistics for a specific family.

Storvik (2002) shows how to obtain a sufficient statistic in the context of what he calls the Gaussian system process, a transition model satisfying the equation

x_t = F_t^T θ + ε_t,   ε_t ∼ N(0, Q)   (5)

where θ is the vector of unknown parameters with a prior of N(θ_0, C_0) and F_t = F(x_{t−1}) is a matrix whose elements are possibly nonlinear functions of x_{t−1}. An arbitrary but known observation model is assumed.

Algorithm 2: Storvik's filter
Input: N: number of particles; y_0, ..., y_T: observation sequence
Output: x̄_{1:T}^{1:N}, θ^{1:N}
initialize x_0^i
for t = 1, ..., T do
    for i = 1, ..., N do
        sample θ^i ∼ p(θ | x_{0:t−1}^i)
        sample x_t^i ∼ p(x_t | x_{t−1}^i, θ^i)
        w_t^i ← p(y_t | x_t^i)
    sample N times: x̄_t^i ← Multinomial(w_t^{1:N}, x_t^{1:N})
    x_t^i ← x̄_t^i

Then the standard theory states that θ | x_{0:t} ∼ N(m_t, C_t), where the recursions for the mean and the covariance matrix are as follows:

D_t = F_t^T C_{t−1} F_t + Q
C_t = C_{t−1} − C_{t−1} F_t D_t^{−1} F_t^T C_{t−1}
m_t = m_{t−1} + C_{t−1} F_t D_t^{−1} (x_t − F_t^T m_{t−1})   (6)

Thus, m_t and C_t constitute a fixed-dimensional sufficient statistic for θ.

These updates are in fact a special case of Kalman filtering applied to the parameter space. Matching terms with the standard KF update equations (Kalman, 1960), we find that the transition matrix for the KF is the identity matrix, the transition noise covariance matrix is the zero matrix, the observation matrix for the KF is F_t, and the observation noise covariance matrix is Q. This correspondence is of course what one would expect, since the true parameter values are fixed (i.e., an identity transition). See the supplementary material (Erol et al., 2013) for the derivation.

2.4. Separability

In this section, we define a condition under which there exist efficient updates to parameters. Again, we focus on the state-space model as described in Figure 1 and Equations (1)–(3). The model can also be expressed as

x_t = f_θ(x_{t−1}) + v_t
y_t = g(x_t) + w_t   (7)

for some suitable f_θ, g, v_t, and w_t.

Definition 1. A system is separable if the transition function f_θ(x_{t−1}) can be written as f_θ(x_{t−1}) = l(x_{t−1})^T h(θ) for some l(·) and h(·) and if the stochastic i.i.d. noise v_t has log-polynomial density.

Theorem 1. For a separable system, there exist fixed-dimensional sufficient statistics for the Gibbs density, p(θ | x_{0:T}).
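The recursion (6) is straightforward to implement. The sketch below, with variable names of our choosing, applies it to the scalar model x_t = θ x_{t−1} + ε_t, for which F_t = x_{t−1}; it is an illustration of the update, not the paper's code.

```python
import numpy as np

def storvik_update(m, C, F, x_t, Q):
    """One step of recursion (6): update of p(theta | x_{0:t}) = N(m, C)
    for the Gaussian system process x_t = F_t^T theta + eps_t, eps_t ~ N(0, Q)."""
    F = np.atleast_2d(F)                 # shape (p, d): x_t = F^T theta + noise
    D = F.T @ C @ F + Q                  # D_t = F_t^T C_{t-1} F_t + Q
    K = C @ F @ np.linalg.inv(D)         # gain, so C_t = C_{t-1} - K F^T C_{t-1}
    C_new = C - K @ F.T @ C
    m_new = m + K @ (x_t - F.T @ m)      # m_t = m_{t-1} + K (x_t - F^T m_{t-1})
    return m_new, C_new
```

In the scalar case this is exactly recursive least squares for θ: the posterior mean converges to the true parameter and the posterior variance shrinks as data accumulate.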
The proof of Theorem 1 is straightforward by the Fisher–Neyman factorization theorem; more details are given in the supplementary material of the full version (Erol et al., 2013).

The Gaussian system process models defined in Equation (5) are separable, since the transition function F_t^T θ is linear in θ (take l(x_{t−1}) = F(x_{t−1}) and h(θ) = θ), but the property—and therefore Storvik's algorithm—applies to a much broader class of systems. Moreover, as we now show, non-separable systems may in some cases be well-approximated by separable systems, constructed by polynomial density approximation steps applied to either the Gibbs distribution p(θ | x_{0:t}) or to the transition model.

3. The extended parameter filter

Let us consider the following model:

x_t = f_θ(x_{t−1}) + v_t,   v_t ∼ N(0, Σ)   (8)

where x ∈ R^d, θ ∈ R^p, and f_θ(·) : R^d → R^d is a vector-valued function parameterized by θ. We assume that the transition function f_θ may be non-separable. Our algorithm will create a polynomial approximation to either the transition function or the Gibbs distribution p(θ | x_{0:t}).

To illustrate, consider the transition model f_θ(x_{t−1}) = sin(θ x_{t−1}), which is clearly non-separable. If we approximate the transition function with a Taylor series in θ centered around zero,

f_θ(x_{t−1}) ≈ f̂_θ(x_{t−1}) = x_{t−1} θ − (1/3!) x_{t−1}^3 θ^3 + ...   (9)

and use f̂ as an approximate transition model, the system becomes separable. Then Storvik's filter can be applied in constant time per update. This Taylor approximation leads to a log-polynomial density of the form of Equation (12).

Our approach is analogous to that of the extended Kalman filter (EKF). The EKF linearizes nonlinear transitions around the current estimates of the mean and covariance and uses Kalman filter updates for state estimation (Welch and Bishop, 1995).
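As a quick numerical check of expansion (9), truncating the sine series in θ gives a close approximation for moderate |θ x_{t−1}|. The helper below is hypothetical, written for illustration only.

```python
from math import factorial

def taylor_sin_transition(x_prev, theta, M):
    """Order-M Taylor expansion (in theta, around 0) of f_theta(x) = sin(theta * x):
    sin(theta*x) = sum_k (-1)^k (x*theta)^(2k+1) / (2k+1)!, truncated at degree M."""
    total = 0.0
    k = 0
    while 2 * k + 1 <= M:
        total += (-1) ** k * (x_prev * theta) ** (2 * k + 1) / factorial(2 * k + 1)
        k += 1
    return total
```

For θ = 0.7 and x_{t−1} = 1.5, the order-7 truncation already matches sin(θ x_{t−1}) to roughly four decimal places, while the order-1 (EKF-style) linearization is visibly off.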
Our proposed algorithm, which we call the extended parameter filter (EPF), approximates a non-separable system with a separable one, using a polynomial approximation of some arbitrary order. This separable, approximate model is well-suited for Storvik's filter and allows for constant-time updates to the Gibbs density of the parameters.

Algorithm 3: Extended Parameter Filter
Result: Approximate the Gibbs density p(θ | x_{0:t}, y_{0:t}) with the log-polynomial density p̂(θ | x_{0:t}, y_{0:t})
Output: x̃^1, ..., x̃^N
initialize x_0^i and S_0^i ← 0
for t = 1, ..., T do
    for i = 1, ..., N do
        S_t^i ← update(S_{t−1}^i, x_{t−1})   // update statistics for the polynomial approximation log p̂(θ | x̄_{0:t−1}, y_{0:t−1})
        sample θ^i ∼ p̂(θ | x̄_{0:t−1}^i, y_{0:t−1}) = p̂(θ | S_t^i)
        sample x_t^i ∼ p(x_t | x̄_{t−1}^i, θ^i)
        w_t^i ← p(y_t | x_t^i, θ^i)
    sample N times: (x̄_t^i, S̄_t^i) ← Multinomial(w_t^{1:N}, (x_t^{1:N}, S_t^{1:N}))
    x_t^i, S_t^i ← x̄_t^i, S̄_t^i

Although we have described an analogy to the EKF, it is important to note that the EPF can effectively use higher-order approximations instead of just the first-order linearizations used by the EKF. In the EKF, higher-order approximations lead to intractable integrals. The prediction integral for the EKF,

p(x_t | y_{0:t−1}) = ∫ p(x_{t−1} | y_{0:t−1}) p(x_t | x_{t−1}) dx_{t−1},

can be calculated for linear Gaussian transitions, in which case the mean and the covariance matrix are the tracked sufficient statistics. However, in the case of quadratic (or any higher-order) transitions, the above integral is no longer analytically tractable. In the case of the EPF, the parameter's transition model is the identity, and hence the prediction step is trivial. The filtering recursion is

p(θ | x_{0:t}) ∝ p(x_t | x_{t−1}, θ) p(θ | x_{0:t−1}).   (10)

We approximate the transition p(x_t | x_{t−1}, θ) with a log-polynomial density p̂ (log-polynomial in θ), so that the Gibbs density, which satisfies the recursion in Equation (10), has a fixed log-polynomial structure at each time step. Due to the polynomial structure, the approximate Gibbs density can be tracked in terms of its sufficient statistic (i.e., in terms of the coefficients of the polynomial). The log-polynomial structure is derived in Section 4.2. Pseudo-code for the EPF is shown in Algorithm 3.

Note that the approximated Gibbs density will be a log-multivariate polynomial density of fixed order (proportional to the order of the polynomial approximation). Sampling from such a density is not straightforward but can be done by Monte Carlo sampling. We suggest slice sampling (Neal, 2003) or the Metropolis-Hastings algorithm (Robert and Casella, 2005) for this purpose. Although some approximate sampling scheme is necessary, sampling from the approximated density remains a constant-time operation when the dimension of p̂ remains constant.

It is also important to note that performing a polynomial approximation for a p-dimensional parameter space may not be an easy task. However, we can reduce the computational complexity of such approximations by exploiting locality properties. For instance, if f_θ(·) = h_{θ_1,...,θ_{p−1}}(·) + g_{θ_p}(·), where h is separable and g is non-separable, we only need to approximate g.

In Section 4, we discuss the validity of the approximation in terms of the KL-divergence between the true and approximate densities. In Section 4.1, we analyze the distance between an arbitrary density and its approximate form with respect to the order of the polynomial. We show that the distance goes to zero super-exponentially.
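The Metropolis-Hastings option mentioned above is simple to sketch for a one-dimensional log-polynomial target. The helper below is our illustration, not the paper's implementation; note that once the polynomial coefficients (the sufficient statistics) are known, the cost per draw is independent of T.

```python
import numpy as np

def mh_sample_logpoly(coeffs, n_steps=2000, step=0.3, x0=0.0, rng=None):
    """Random-walk Metropolis-Hastings targeting p(theta) proportional to
    exp(P(theta)), where P is given by polynomial coefficients in ascending
    powers (so coeffs[k] multiplies theta**k)."""
    rng = np.random.default_rng() if rng is None else rng
    logp = lambda th: np.polynomial.polynomial.polyval(th, coeffs)
    x, lp = x0, logp(x0)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.standard_normal()   # symmetric proposal
        lp_prop = logp(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        samples[i] = x
    return samples
```

With coefficients [0, 0, −1/2] the target is a standard Gaussian, so the chain's long-run mean and variance should be close to 0 and 1 respectively.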
Section 4.2 analyzes the error for the static parameter estimation problem and introduces the form of the log-polynomial approximation.

4. Approximating the conditional distribution of parameters

In this section, we construct approximate sufficient statistics for arbitrary one-dimensional state-space models. We do so by exploiting log-polynomial approximations to arbitrary probability densities. We prove that such approximations can be made arbitrarily accurate. Then, we analyze the error introduced by the log-polynomial approximation for the arbitrary one-dimensional model.

4.1. Taylor approximation to an arbitrary density

Let us assume a distribution p (known only up to a normalization constant) expressed in the form p(x) ∝ exp(S(x)), where S(x) is an analytic function on the support of the distribution. In general we need a Monte Carlo method to sample from this arbitrary density. In this section, we describe an alternative, simpler sampling method. We propose that with a polynomial approximation P(x) (Taylor, Chebyshev, etc.) of sufficient order to the function S(x), we may sample from a distribution p̂ ∝ exp(P(x)) with a simpler (i.e., log-polynomial) structure. We show that the distance between the distributions p and p̂ goes to 0 as the order of the approximation increases.

Figure 2. Approximated PDFs to the order M (true density versus M = 4, 8, 12).

The following theorem is based on Taylor approximations; however, it can be generalized to handle any polynomial approximation scheme. The proof is given in (Erol et al., 2013).

Theorem 2. Let S(x) be an M + 1 times differentiable function with bounded derivatives, and let P(x) be its M-th order Taylor approximation. Then the KL-divergence between distributions p and p̂ converges to 0 super-exponentially as the order of approximation M → ∞.
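Theorem 2 can be sanity-checked numerically by comparing p ∝ exp(S) against its log-polynomial approximations on a grid. The sketch below uses the density S(x) = −x² + 5 sin²(x) that the paper employs for validation; the helper names and hardcoded Taylor coefficients of S are ours.

```python
import numpy as np

def grid_density(logf, grid):
    """Normalized discrete probabilities exp(logf(grid)) on a uniform grid."""
    lf = logf(grid)
    p = np.exp(lf - lf.max())
    return p / p.sum()

def grid_kl(p, q):
    """Discrete KL divergence D(p || q) between two grid densities."""
    return float(np.sum(p * np.log(p / q)))

# Example: S(x) = -x^2 + 5 sin^2(x).  Using sin^2(x) = x^2 - x^4/3 + 2x^6/45 - ...,
# the Taylor coefficients of S about 0 are 4, -5/3, 2/9, ... on even powers.
grid = np.linspace(-2.0, 2.0, 4001)
pv = np.polynomial.polynomial.polyval
p_true = grid_density(lambda x: -x**2 + 5.0 * np.sin(x) ** 2, grid)
p4 = grid_density(lambda x: pv(x, [0, 0, 4.0, 0, -5.0 / 3.0]), grid)
```

Increasing the truncation order should drive the discrete KL divergence toward zero, mirroring Figure 2.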
We validate the Taylor approximation approach for the log-density S(x) = −x² + 5 sin²(x); Figure 2 shows the result for this case.

4.2. Online approximation of the Gibbs density of the parameter

In our analysis, we will assume the following model:

x_t = f_θ(x_{t−1}) + v_t,   v_t ∼ N(0, σ²)
y_t = g(x_t) + w_t,   w_t ∼ N(0, σ_o²)

The posterior distribution for the static parameter is

p(θ | x_{0:T}) ∝ p(θ) ∏_{t=1}^{T} p(x_t | x_{t−1}, θ).

The product term, which requires linear time, is the bottleneck for this computation. A polynomial approximation to the transition function f_θ(·) (the Taylor approximation around θ = 0) is

f_θ(x_{t−1}) = h(x_{t−1}, θ) = Σ_{i=0}^{M} (1/i!) [∂^i h(x_{t−1}, θ)/∂θ^i]|_{θ=0} θ^i + R_M(θ)
            = Σ_{i=0}^{M} H_i(x_{t−1}) θ^i + R_M(θ) = f̂(θ) + R_M(θ),

where H_i(x_{t−1}) denotes the i-th Taylor coefficient and R_M is the error of the M-th order Taylor approximation. We define coefficients J_i(x_{t−1}) to satisfy (Σ_{i=0}^{M} H_i(x_{t−1}) θ^i)² = J_{2M}(x_{t−1}) θ^{2M} + ... + J_0(x_{t−1}) θ^0.

Let p̂(θ | x_{0:T}) denote the approximation to p(θ | x_{0:T}) obtained by using the polynomial approximation to f_θ introduced above.

Theorem 3. p̂(θ | x_{0:T}) is in the exponential family, with the log-polynomial density

log p(θ) + T(θ)^T η(x_0, ..., x_T),   (12)

where the sufficient statistic is T(θ) = [θ, θ², ..., θ^M, θ^{M+1}, ..., θ^{2M}]^T and the natural parameters are

η_i = (1/σ²) Σ_{k=1}^{T} x_k H_i(x_{k−1}) − (1/(2σ²)) Σ_{k=1}^{T} J_i(x_{k−1})   for i = 1, ..., M,
η_i = −(1/(2σ²)) Σ_{k=1}^{T} J_i(x_{k−1})   for i = M+1, ..., 2M.

The proof is given in the supplementary material. This form has finite-dimensional sufficient statistics. Standard sampling from p(θ | x_{0:t}) requires O(t) time, whereas with the polynomial approximation we can sample from this structured density of fixed dimension in constant time (given that the sufficient statistics are tracked).
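The natural parameters η of Theorem 3 can be accumulated online, one transition at a time. The sketch below (helper names ours) exploits the fact that the J coefficients are the coefficients of the squared polynomial, i.e. the self-convolution of the H coefficients; the result can be checked against the direct O(T) evaluation of the log-posterior terms.

```python
import numpy as np

def suff_stat_update(eta, x_prev, x_t, H_coeffs, sigma2):
    """Add one transition's contribution to the natural parameters eta of the
    log-polynomial Gibbs density (Theorem 3).  H_coeffs(x_prev) returns the
    coefficients H_0..H_M of the order-M Taylor expansion of f_theta in theta;
    eta holds the coefficients of theta^1..theta^{2M}."""
    H = np.asarray(H_coeffs(x_prev))         # degree-M polynomial in theta
    J = np.convolve(H, H)                    # coefficients of (sum_i H_i theta^i)^2
    M = len(H) - 1
    eta = eta.copy()
    eta[:M] += x_t * H[1:] / sigma2          # (1/sigma^2) x_t H_i(x_prev), i <= M
    eta -= J[1:] / (2 * sigma2)              # -(1/(2 sigma^2)) J_i(x_prev), i <= 2M
    return eta
```

For the SIN transition sin(θ x_{t−1}), H_0 = 0, so the dropped constant (power-zero) terms vanish and the accumulated log-polynomial agrees exactly with the direct per-transition sum.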
We can furthermore prove that sampling from this exponential-family approximation is asymptotically correct.

Theorem 4. Let p_T(θ | x_{0:T}) denote the Gibbs distribution and p̂_T(θ | x_{0:T}) its order-M exponential-family approximation. Assume that the parameter θ has support S_θ and finite variance. Then as M → ∞ and T → ∞, the KL-divergence between p_T and p̂_T goes to zero:

lim_{M,T→∞} D_KL(p_T ‖ p̂_T) = 0.

The proof is given in the supplementary material (Erol et al., 2013).

Note that the analysis above can be generalized to higher-dimensional parameters; the one-dimensional case is discussed for ease of exposition. In the general case, an order-M Taylor expansion for a p-dimensional parameter vector θ will have M^p terms. Then each update of the sufficient statistics will cost O(M^p) per particle per time step, yielding the total complexity O(N T M^p). However, as noted before, we can often exploit the local structure of f_θ to speed up the update step. Notice that in either case, the update cost per time step is fixed (independent of T).

5. Experiments

The algorithm is implemented for three specific cases. Note that the models discussed do not satisfy the Gaussian system process assumption of Storvik (2002).

Figure 3. Sinusoidal dynamical model (SIN). Shrinkage of the Gibbs density p(θ | x_{0:T}) with respect to time duration T. Note that as T grows, the Gibbs density converges to the true parameter value.

Figure 4. Sinusoidal dynamical model (SIN).
(a) Convergence of the approximate densities to the Gibbs density p(θ | x_{0:1024}) with respect to the approximation order M; (b) KL-divergence D_KL(p ‖ p̂) with respect to duration T and approximation order M.

5.1. Single-parameter nonlinear model

Consider the following model with sinusoidal transition dynamics (SIN):

x_t = sin(θ x_{t−1}) + v_t,   v_t ∼ N(0, σ²)
y_t = x_t + w_t,   w_t ∼ N(0, σ_obs²)   (13)

where σ = 1, σ_obs = 0.1, and the Gaussian prior for parameter θ is N(0, 0.2²). The observation sequence is generated by sampling from SIN with true parameter value θ = 0.7.

Figure 3 shows how the Gibbs density p(θ | x_{0:t}) shrinks with respect to time, hence verifying identifiability for this model. Notice that as T grows, the densities concentrate around the true parameter value. A Taylor approximation around θ = 0 has been applied to the transition function sin(θ x_t). Figure 4(a) shows the approximate densities for different polynomial orders for T = 1024. Notice that as the polynomial order increases, the approximate densities converge to the true density p(θ | x_{0:1024}). The KL-divergence D_KL(p ‖ p̂) for different polynomial orders M and different data lengths T is illustrated in Figure 4(b). The results are consistent with the theory developed in Section 4.1.

Figure 5. Sinusoidal dynamical model (SIN). (a): Particle filter (SIR) with N = 50000 particles; note the failure to converge to the true value of parameter θ (0.7, shown as the blue line). (b): Liu–West filter with N = 50000 particles. (c): EPF with N = 1000 particles and a 7th-order approximation. Note that both SIR and Liu–West fail to converge, while the EPF converges quickly even with orders of magnitude fewer particles.

The degeneracy of a bootstrap filter with N = 50000 particles can be seen in Figure 5(a). The Liu–West approach with N = 50000 particles is shown in Figure 5(b). The perturbation is θ_t = ρ θ_{t−1} + (1 − ρ) θ̄_{t−1} + √(1 − ρ²) std(θ_{t−1}) N(0, 1), where ρ = 0.9. Notice that even with N = 50000 particles and large perturbations, the Liu–West approach converges slowly compared to our method. Furthermore, for high-dimensional spaces, tuning the perturbation parameter ρ for Liu–West becomes difficult.

The EPF has been implemented on this model with N = 1000 particles and a 7th-order Taylor approximation to the posterior. The time complexity is O(NT). The mean and the standard deviation of the particles are shown in Figure 5(c).

5.2. Cauchy dynamical system

We consider the following model:

x_t = a x_{t−1} + Cauchy(0, γ)   (14)
y_t = x_t + N(0, σ_obs)   (15)

Here Cauchy(0, γ) is the Cauchy distribution centered at 0 with shape parameter γ = 1. We use a = 0.7 and σ_obs = 10, and the prior for the AR(1) parameter a is N(0, 0.2²). This model represents autoregressive time evolution with heavy-tailed noise; such heavy-tailed noise is observed in network traffic data and clickstream data. The standard Cauchy density we use is

f_v(v; 0, 1) = 1 / (π(1 + v²)) = exp(−log(π) − log(1 + v²)).

We approximate log(1 + v²) by its Taylor expansion at 0, v² − v⁴/2 + v⁶/3 − v⁸/4 + ...

Figure 6(a) shows the simulated hidden state and the observations (σ_obs = 10). Notice that the simulated process differs substantially from a standard AR(1) process due to the heavy-tailed noise.
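The expansion of log(1 + v²) used above is the log(1 + u) series with u = v²; note that it converges only for |v| < 1, so it is a local approximation around the center of the noise density. A hypothetical helper for checking the truncation error:

```python
import math

def log1p_sq_taylor(v, M):
    """Order-M truncation of the Taylor expansion of log(1 + v^2) at 0:
    sum_{k=1}^{M} (-1)^(k+1) * v^(2k) / k.  Converges only for |v| < 1."""
    return sum((-1) ** (k + 1) * v ** (2 * k) / k for k in range(1, M + 1))
```

Because the series is alternating for |v| < 1, the truncation error is bounded by the first omitted term, so a few terms already give high accuracy near the origin.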
Storvik's filter cannot handle this model, since the necessary sufficient statistics do not exist. Figure 6(b) displays the mean value estimated by a bootstrap filter with N = 50000 particles; as before, the bootstrap filter is unable to perform meaningful inference. Figure 6(c) shows the performance of the Liu–West filter with both N = 100 and N = 10000 particles: it does not converge for N = 100 particles and converges slowly for N = 10000 particles. Figure 6(d) demonstrates the rapid convergence of the EPF with only N = 100 particles and a 10th-order approximation. The time complexity is O(NT). Our empirical results confirm that the EPF proves useful for models with heavy-tailed stochastic perturbations.

5.3. Smooth transition AR model

The smooth transition AR (STAR) model is a smooth generalization of the self-exciting threshold autoregressive (SETAR) model (van Dijk et al., 2002). It is generally expressed in the following form:

    x_t = (a_1 x_{t-1} + a_2 x_{t-2} + ... + a_p x_{t-p}) [1 − G(x_{t-d}; γ, c)]
        + (b_1 x_{t-1} + b_2 x_{t-2} + ... + b_p x_{t-p}) [G(x_{t-d}; γ, c)] + ε_t

where ε_t is i.i.d. Gaussian with mean zero and variance σ^2, and G(·) is a nonlinear function of x_{t-d}, with delay d > 0.
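The general STAR recursion just defined can be sketched with the regime weight G(·) supplied as a function argument. This is our own minimal illustration (the name `star_step` is ours, not the authors'); it assumes d ≤ p and that `x_hist` holds the p most recent states, newest first:

```python
import numpy as np

def star_step(x_hist, a, b, G, sigma=1.0, d=1, rng=None):
    """One STAR transition: two AR(p) regimes mixed by the weight
    G(x_{t-d}) in [0, 1], plus Gaussian noise with std sigma."""
    rng = rng or np.random.default_rng(0)
    w = G(x_hist[d - 1])             # regime weight from the delayed state
    ar_a = float(np.dot(a, x_hist))  # first-regime AR part
    ar_b = float(np.dot(b, x_hist))  # second-regime AR part
    return (1 - w) * ar_a + w * ar_b + sigma * rng.standard_normal()
```

With G held constant at 0 or 1 (and sigma = 0 for a deterministic check), the step reduces to a pure AR(p) update in the corresponding regime.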
We will use the logistic function

    G(x_{t-d}; γ, c) = 1 / (1 + exp(−γ (x_{t-d} − c)))        (16)

For high γ values, the logistic function converges to the indicator function I(x_{t-d} > c), forcing STAR to converge to SETAR (SETAR corresponds to a switching linear–Gaussian system).

Figure 6. Cauchy dynamical system. (a): Example sequences for hidden states and observations. (b): Particle filter estimate with 50000 particles. (c): Liu–West filter with 100 and 10000 particles. (d): EPF using only 100 particles and a 10th-order approximation. Note that the EPF converges to the actual value of the parameter a (0.7, blue line), while SIR does not even with orders of magnitude more particles, nor does Liu–West with the same number of particles.

Figure 7. STAR model. (a): Shrinkage of the Gibbs density p(γ, c | x_{0:t}) with respect to time. (b): Liu–West filter using 50000 particles. (c): EPF using 100 particles and a 9th-order approximation. Note that the EPF's estimates for both parameters converge quickly to the actual values even with only 100 particles, while Liu–West does not converge at all.
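The STAR-to-SETAR limit is easy to verify numerically: as γ grows, the logistic G of Eq. (16) approaches the indicator I(x > c). A small sketch (our own illustration, not the authors' code):

```python
import numpy as np

def G(x, gamma, c):
    """Logistic regime weight from Eq. (16)."""
    return 1.0 / (1.0 + np.exp(-gamma * (x - c)))

# With c = 3 and a large gamma, G is ~0 just below the threshold and ~1
# just above it, so the STAR mixture degenerates to a hard regime switch.
below, above = G(2.9, 100.0, 3.0), G(3.1, 100.0, 3.0)
```

At the threshold itself G(c; γ, c) = 1/2 for every γ, which is why the smooth model interpolates between the two AR regimes rather than jumping.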
We will use p = d = 1, with a_1 = 0.9, b_1 = 0.1, and σ = 1 (corresponding to two different AR(1) processes with high and low memory). We attempt to estimate the parameters γ and c of the logistic function, which have true values γ = 1 and c = 3. Data of length T = 1000 is generated from the model under fixed parameter values, with observation model y_t = x_t + w_t, where w_t is additive Gaussian noise with mean zero and standard deviation σ_obs = 0.1. Figure 7(a) shows the shrinkage of the Gibbs density p(γ, c | x_{0:T}), verifying identifiability. The non-separable logistic term is approximated as

    1 / (1 + exp(−γ (x_{t-1} − c))) ≈ 1/2 − (1/4) γ (c − x_{t-1}) + (1/48) γ^3 (c − x_{t-1})^3 + ...

Figure 7(b) displays the failure of the Liu–West filter with N = 50000 particles. Figure 7(c) shows the mean values of γ and c from the EPF with only N = 100 particles and a 9th-order Taylor approximation. Sampling from the log-polynomial approximate density is done through the random-walk Metropolis–Hastings algorithm. For each particle path, at each time step t, the Metropolis–Hastings sampler is initialized from the parameter values at t − 1. The burn-in period is set to 0, so only one MH step is taken per time step (i.e., a proposed sample is accepted if it is more likely, and otherwise accepted with the usual Metropolis probability). The whole filter has time complexity O(NT).

6. Conclusion

Learning the parameters of temporal probability models remains a significant open problem for practical applications. We have proposed the extended parameter filter (EPF), a novel approximate inference algorithm that combines Gibbs sampling of parameters with computation of approximate sufficient statistics. The update time for the EPF is independent of the length of the observation sequence. Moreover, the algorithm has provable error bounds and handles a wide variety of models.
Our experiments confirm these properties and illustrate difficult cases on which the EPF works well. One limitation of our algorithm is the complexity of the Taylor approximation for high-dimensional parameter vectors. We noted that, in some cases, the process can be decomposed into lower-dimensional subproblems; automating this step would be beneficial.

References

C. Andrieu, A. Doucet, and V. Tadic. On-line parameter estimation in general state-space models. In Proceedings of the 44th Conference on Decision and Control, pages 332–337, 2005.

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

Carlos M. Carvalho, Michael S. Johannes, Hedibert F. Lopes, and Nicholas G. Polson. Particle learning and smoothing. Statistical Science, 25:88–106, 2010. doi: 10.1214/10-STS325.

Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. The Oxford Handbook of Nonlinear Filtering, pages 4–6, December 2011.

Yusuf Erol, Lei Li, Bharath Ramsundar, and Stuart J. Russell. The extended parameter filter. Technical Report UCB/EECS-2013-48, EECS Department, University of California, Berkeley, May 2013. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-48.html.

Walter R. Gilks and Carlo Berzuini. Following a moving target – Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127–146, 2001.

Rudolf E. Kalman. A new approach to linear filtering and prediction problems.
Transactions of the ASME – Journal of Basic Engineering, 82 (Series D):35–45, 1960.

Nicholas Kantas, Arnaud Doucet, Sumeetpal Sindhu Singh, and Jan Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In 15th IFAC Symposium on System Identification, volume 15, pages 774–785, 2009.

Jane Liu and Mike West. Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice. 2001.

Radford M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

Nicholas G. Polson, Jonathan R. Stroud, and Peter Müller. Practical filtering with sequential parameter learning. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):413–428, 2008.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.

Geir Storvik. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50(2):281–289, 2002.

Dick van Dijk, Timo Teräsvirta, and Philip Hans Franses. Smooth transition autoregressive models – a survey of recent developments. Econometric Reviews, 21:1–47, 2002.

Greg Welch and Gary Bishop. An introduction to the Kalman filter, 1995.

Appendices

A. Storvik's filter as a Kalman filter

Let us consider the following model:

    x_t = A x_{t-1} + v_t,    v_t ∼ N(0, Q)
    y_t = H x_t + w_t,        w_t ∼ N(0, R)        (17)

We will call the MMSE estimate that the Kalman filter returns x_{t|t} = E[x_t | y_{0:t}], with variance P_{t|t} = cov(x_t | y_{0:t}). The update for the conditional mean estimate is then as follows.
    x_{t|t} = A x_{t-1|t-1} + K_t (y_t − H A x_{t-1|t-1}),
    K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1},

while the estimation covariance satisfies

    P_{t|t-1} = A P_{t-1|t-1} A^T + Q
    P_{t|t} = (I − K_t H) P_{t|t-1}        (18)

Matching the terms above to the updates in equation 6, one obtains a linear model in which the transition matrix is A = I, the observation matrix is H = F_t, the state noise covariance matrix is Q = 0, and the observation noise covariance matrix is R = Q.

B. Proof of theorem 1

Let us assume that x ∈ R^d, θ ∈ R^p, and f_θ(·): R^d → R^d is a vector-valued function parameterized by θ. Moreover, by the assumption of separability, f_θ(x_{t-1}) = l(x_{t-1})^T h(θ), where l(·): R^d → R^{m×d}, h(·): R^p → R^m, and m is an arbitrary constant. The stochastic perturbation has the log-polynomial density

    p(v_t) ∝ exp(Λ_1 v_t + v_t^T Λ_2 v_t + ...).

For mathematical simplicity, let us analyze the case p(v_t) ∝ exp(Λ_1 v_t + v_t^T Λ_2 v_t).

Proof.

    log p(θ | x_{0:T}) ∝ log p(θ) + Σ_{t=1}^T log p(x_t | x_{t-1}, θ)
    ∝ log p(θ) + Σ_{t=1}^T [ Λ_1 (x_t − l(x_{t-1})^T h(θ)) + (x_t − l(x_{t-1})^T h(θ))^T Λ_2 (x_t − l(x_{t-1})^T h(θ)) ]
    ∝ log p(θ) + ( Σ_{t=1}^T −(Λ_1 + 2 x_t^T Λ_2) l(x_{t-1})^T ) h(θ)
      + h(θ)^T ( Σ_{t=1}^T l(x_{t-1}) Λ_2 l(x_{t-1})^T ) h(θ) + constants,

where the first parenthesized sum is S_1 and the second is S_2. Therefore, sufficient statistics S_1 ∈ R^{1×m} and S_2 ∈ R^{m×m} exist. The analysis can be generalized to higher-order terms in v_t in similar fashion. ∎

C. Proof of theorem 2

Proposition 1. Let S(x) be an (M + 1)-times differentiable function and P(x) its order-M Taylor approximation. Let I = (x − a, x + a) be an open interval around x. Let R(x) be the remainder function, so that S(x) = P(x) + R(x).
Suppose there exists a constant U such that |S^{(M+1)}(y)| ≤ U for all y ∈ I. We may then bound

    |R(y)| ≤ U a^{M+1} / (M + 1)!    for all y ∈ I.

We define the following terms:

    ε = U a^{M+1} / (M + 1)!
    Z = ∫_I exp(S(x)) dx
    Ẑ = ∫_I exp(P(x)) dx

Since exp(·) is monotone increasing and |S(x) − P(x)| ≤ ε, we can derive tight bounds relating Z and Ẑ:

    Z = ∫_I exp(S(x)) dx ≤ ∫_I exp(P(x) + ε) dx = Ẑ exp(ε)
    Z = ∫_I exp(S(x)) dx ≥ ∫_I exp(P(x) − ε) dx = Ẑ exp(−ε)

Proof.

    D_KL(p ‖ p̂) = ∫_I ln(p(x)/p̂(x)) p(x) dx
    = ∫_I [ S(x) − P(x) + ln(Ẑ) − ln(Z) ] p(x) dx
    ≤ ∫_I |S(x) − P(x)| p(x) dx + ∫_I |ln(Ẑ) − ln(Z)| p(x) dx
    ≤ 2ε ∝ a^{M+1} / (M + 1)! ≈ (1 / √(2π(M+1))) (a e / (M + 1))^{M+1}

where the last approximation follows from Stirling's approximation. Therefore, D_KL(p ‖ p̂) → 0 as M → ∞. ∎

D. Proof of theorem 3

Proof.

    log p̂(θ | x_{0:T}) = log ( p(θ) Π_{k=0}^T p̂(x_k | x_{k-1}, θ) )
    = log p(θ) + Σ_{k=0}^T log p̂(x_k | x_{k-1}, θ)

We can calculate the form of log p̂(x_k | x_{k-1}, θ) explicitly:

    log p̂(x_k | x_{k-1}, θ) = log N(f̂(x_{k-1}, θ), σ^2)
    = −log(σ √(2π)) − (x_k − f̂(x_{k-1}, θ))^2 / (2σ^2)
    = −log(σ √(2π)) − (x_k^2 − 2 x_k f̂(x_{k-1}, θ) + f̂(x_{k-1}, θ)^2) / (2σ^2)
    = −log(σ √(2π)) − x_k^2/(2σ^2) + ( Σ_{i=0}^M x_k H_i(x_{k-1}) θ^i ) / σ^2 − ( Σ_{i=0}^{2M} J_i(x_{k-1}) θ^i ) / (2σ^2)

where H_i(·) are the coefficients of the order-M approximation f̂(x, θ) = Σ_i H_i(x) θ^i and J_i(·) those of f̂(x, θ)^2. Using this expansion, we calculate

    log p̂(θ | x_{0:T}) = log p(θ) + Σ_{k=0}^T log p̂(x_k | x_{k-1}, θ)
    = log p(θ) − (T + 1) log(σ √(2π)) − (1/(2σ^2)) ( Σ_{k=0}^T x_k^2 ) − T(θ)^T η(x_0, ..., x_T)

where we expand T(θ)^T η(x_0, ..., x_T) as in equation 3. This form for log p̂(θ | x_{0:T}) is in the exponential family. ∎

E. Proof of theorem 4

Proof. Assume that the function f has bounded derivatives and bounded support I. Then the maximum error satisfies |f_θ(x_{k-1}) − f̂_θ(x_{k-1})| ≤ ε_k.
It follows that

    f̂_θ(x_{k-1})^2 − f_θ(x_{k-1})^2 = −ε_k^2 − 2 f̂_θ(x_{k-1}) ε_k ≈ −2 f̂_θ(x_{k-1}) ε_k.

Then the KL-divergence between the real posterior and the approximated posterior satisfies

    D_KL(p_T ‖ p̂_T) = ∫_{S_θ} (1/σ^2) ( Σ_{k=1}^T ε_k (x_k − f̂_θ(x_{k-1})) ) p_T(θ | x_{0:T}) dθ.        (19)

Moreover, recall that as T → ∞ the posterior shrinks to δ(θ − θ*) by the assumption of identifiability. Then we can rewrite the KL-divergence as (assuming the Taylor approximation is centered around θ_c)

    lim_{T→∞} D_KL(p_T ‖ p̂_T)
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k ∫_{S_θ} (x_k − f̂_θ(x_{k-1})) p_T(θ | x_{0:T}) dθ        (20)
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k ( x_k − Σ_{i=0}^M H_i(x_{k-1}) ∫_{S_θ} (θ − θ_c)^i p(θ | x_{0:T}) dθ )        (21)
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k ( x_k − Σ_{i=0}^M H_i(x_{k-1}) (θ* − θ_c)^i )

If the center of the Taylor approximation θ_c is the true parameter value θ*, we can show that

    lim_{T→∞} D_KL(p_T ‖ p̂_T) = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k (x_k − f_{θ*}(x_{k-1}))
    = (1/σ^2) lim_{T→∞} Σ_{k=1}^T ε_k v_k = 0        (22)

where the final statement follows from the law of large numbers. Thus, as T → ∞, the Taylor approximation of any order will converge to the true posterior, given that θ_c = θ*. For an arbitrary center value θ_c,

    D_KL(p_T ‖ p̂_T) = (1/σ^2) Σ_{k=1}^T ε_k ( x_k − Σ_{i=0}^M H_i(x_{k-1}) (θ* − θ_c)^i )        (23)

Notice that ε_k ∝ 1/(M + 1)! (by our assumptions that f has bounded derivatives and is supported on the interval I) and H_i(·) ∝ 1/M!. The inner summation is bounded, since M! > a^M for any fixed a ∈ R as M → ∞. Therefore, as M → ∞, D_KL(p ‖ p̂) → 0. ∎
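As a sanity check on the remainder bound ε = U a^{M+1}/(M + 1)! used throughout these proofs, the following sketch (our own illustration, not the authors' code) compares the bound against the actual truncation error, using sin as an example function whose derivatives are all bounded by U = 1:

```python
import math

def remainder_bound(U, a, M):
    """Lagrange bound on the order-M Taylor remainder over (x - a, x + a)
    when the (M+1)-th derivative is bounded in magnitude by U."""
    return U * a ** (M + 1) / math.factorial(M + 1)

def sin_taylor(y, M):
    """Order-M Maclaurin polynomial of sin evaluated at y."""
    return sum((-1) ** k * y ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range((M + 1) // 2))

# For sin with U = 1 on the interval of radius a = 1, the actual error of
# the 7th-order polynomial should lie below the bound 1/8!.
err = abs(sin_taylor(1.0, 7) - math.sin(1.0))
bound = remainder_bound(1.0, 1.0, 7)
```

The factorial in the denominator dominates the a^{M+1} growth, so the bound (and hence the 2ε KL bound above) vanishes super-exponentially in M.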