Bayesian Nonparametric Inference of Switching Linear Dynamical Systems
MIT LIDS TECHNICAL REPORT #2830

Emily Fox, Erik Sudderth, Michael Jordan, and Alan Willsky

Abstract: Many complex dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our Bayesian nonparametric approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We additionally employ automatic relevance determination to infer a sparse set of dynamic dependencies, allowing us to learn SLDS with varying state dimension or switching VAR processes with varying autoregressive order. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, the IBOVESPA stock index, and a maneuvering target tracking application.

Index Terms: Bayesian nonparametric methods, hidden Markov model, Markov jump linear system, time series.

I. INTRODUCTION

Linear dynamical systems (LDSs) are useful in describing dynamical phenomena as diverse as human motion [3], [4], financial time-series [5]-[7], maneuvering targets [8], [9], and the dance of honey bees [10]. However, such phenomena often exhibit structural changes over time, and the LDS models which describe them must also change. For example, a ballistic missile makes an evasive maneuver; a country experiences a recession, a central bank intervention, or some national or global event; a honey bee changes from a waggle to a turn-right dance. Some of these changes will appear frequently, while others are only rarely observed. In addition, there is always the possibility of a new, previously unseen dynamical behavior. These considerations motivate us to develop a Bayesian nonparametric approach for learning switching LDS (SLDS) models. We also consider a special case of the SLDS, the switching vector autoregressive (VAR) model, in which direct observations of the underlying dynamical process are assumed available.

One can view the SLDS, and the simpler switching VAR process, as an extension of hidden Markov models (HMMs) in which each HMM state, or mode, is associated with a linear dynamical process. While the HMM makes a strong Markovian assumption that observations are conditionally independent given the mode, the SLDS and switching VAR processes are able to capture more complex temporal dependencies often present in real data. Most existing methods for learning SLDS and switching VAR processes rely on either fixing the number of HMM modes, such as in [10], or considering a change-point detection formulation where each inferred change is to a new, previously unseen dynamical mode, such as in [11]. In this paper we show how one can remain agnostic about the number of dynamical modes while still allowing for returns to previously exhibited dynamical behaviors. Hierarchical Dirichlet processes (HDPs) can be used as a prior on the parameters of HMMs with unknown mode space cardinality [12], [13].
In this paper we use a variant of the HDP-HMM, the sticky HDP-HMM of [14], that provides improved control over the number of modes inferred; such control is crucial for the problems we examine. Our Bayesian nonparametric approach for learning switching dynamical processes extends the sticky HDP-HMM formulation to learn an unknown number of persistent dynamical modes and thereby capture a wider range of temporal dependencies. We then explore a method for learning which components of the underlying state vector contribute to the dynamics of each mode by employing automatic relevance determination (ARD) [15]-[17]. The resulting model allows for learning realizations of SLDS that switch between an unknown number of dynamical modes with possibly varying state dimensions, or switching VAR processes with varying autoregressive orders.

(E. Fox is with the Department of Statistical Science, Duke University, Durham, NC 27708 USA, e-mail: fox@stat.duke.edu. E. Sudderth is with the Department of Computer Science, Brown University, Providence, RI 02912 USA, e-mail: sudderth@cs.brown.edu. M. Jordan is with the Department of Electrical Engineering and Computer Science, and Department of Statistics, University of California, Berkeley, CA 94720 USA, e-mail: jordan@eecs.berkeley.edu. A. Willsky is with the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139 USA, e-mail: willsky@mit.edu. This work was supported in part by MURIs funded through AFOSR Grant FA9550-06-1-0324 and ARO Grant W911NF-06-1-0076. Preliminary versions, without detailed development or analysis, of this work have been presented at two conferences [1], [2].)

A. Previous System Identification Techniques

Paoletti et al. [18] provide a survey of recent approaches to identification of switching dynamical models. The most general formulation of the problem involves learning: (i) the number of dynamical modes, (ii) the model order, and (iii) the associated dynamic parameters. For noiseless switching VAR processes, Vidal et al. [19] present an exact algebraic approach, though it relies on fixing a maximal mode space cardinality and autoregressive order. Psaradakis and Spagnolo [20] alternatively consider a penalized likelihood approach to identification of stochastic switching VAR processes. For SLDS, identification is significantly more challenging, and methods typically rely on simplifying assumptions such as deterministic dynamics or knowledge of the mode space. Huang et al. [21] present an approach that assumes deterministic dynamics, embeds the input/output data in a higher-dimensional space, and finds the switching times by segmenting the data into distinct subspaces [22]. Kotsalis et al. [23] develop a balanced truncation algorithm for SLDS assuming the mode switches are i.i.d. within a fixed, finite set; the authors also present a method for model-order reduction of HMMs (Footnote 1: The problem of identification of HMMs is thoroughly analyzed in [24].). In [25], a realization theory is presented for generalized jump-Markov linear systems (GJMLS), in which the dynamic matrix depends both on the previous mode and the current mode. Finally, when the number of dynamical modes is assumed known, Ghahramani and Hinton [26] present a variational approach to segmenting the data into linear dynamical regimes and learning the associated dynamic parameters (Footnote 2: This formulation uses a mixture of experts SLDS in which M different continuous-valued state sequences evolve independently with linear dynamics, and the Markovian dynamical mode selects which state sequence is observed at a given time.).
For questions of observability and identifiability of SLDS in the absence of noise, see [27]. In the Bayesian approach that we adopt, we coherently incorporate noisy dynamics and uncertainty in the mode space cardinality. Our choice of prior penalizes more complicated models, both in terms of the number of modes and the state dimension describing each mode, allowing us to distinguish between the set of equivalent models described in [27]. Thus, instead of placing hard constraints on the model, we simply increase the posterior probability of simpler explanations of the data. As opposed to a penalized likelihood approach using Akaike's information criterion (AIC) [28] or the Bayesian information criterion (BIC) [29], our approach provides a model complexity penalty in a purely Bayesian manner.

In Sec. II, we provide background on the switching linear dynamical systems we consider herein, and on previous Bayesian nonparametric methods of learning HMMs. Our Bayesian nonparametric switching linear dynamical systems are described in Sec. III. We proceed by analyzing a conjugate prior on the dynamic parameters, and a sparsity-inducing prior that allows for variable-order switching processes. The section concludes by outlining a Gibbs sampler for the proposed models. In Sec. IV we present results on synthetic and real datasets, and in Sec. V we analyze a set of alternative formulations that are commonly found in the maneuvering target tracking and econometrics literature.

II. BACKGROUND

A. Switching Linear Dynamic Systems

A state space (SS) model consists of an underlying state, $x_t \in \mathbb{R}^n$, with dynamics observed via $y_t \in \mathbb{R}^d$. A linear time-invariant (LTI) SS model is given by

$$x_t = A x_{t-1} + e_t, \qquad y_t = C x_t + w_t, \qquad (1)$$

where $e_t$ and $w_t$ are independent Gaussian noise processes with covariances $\Sigma$ and $R$, respectively. An order-$r$ VAR process, denoted by VAR($r$), with observations $y_t \in \mathbb{R}^d$, can be defined as

$$y_t = \sum_{i=1}^{r} A_i y_{t-i} + e_t, \qquad e_t \sim \mathcal{N}(0, \Sigma). \qquad (2)$$

Every VAR($r$) process can be described in SS form, though not every SS model may be expressed as a VAR($r$) process for finite $r$ [30].

The dynamical phenomena we examine in this paper exhibit behaviors better modeled as switches between a set of linear dynamical models. We define a switching linear dynamical system (SLDS) by

$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad x_t = A^{(z_t)} x_{t-1} + e_t(z_t), \qquad y_t = C x_t + w_t. \qquad (3)$$

The first-order Markov process $z_t$ with transition distributions $\{\pi_j\}$ indexes the mode-specific LDS at time $t$, which is driven by Gaussian noise $e_t(z_t) \sim \mathcal{N}(0, \Sigma^{(z_t)})$. One can view the SLDS as an extension of the classical hidden Markov model (HMM) [31], which has the same mode evolution, but conditionally independent observations:

$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad y_t \mid z_t \sim F(\theta_{z_t}) \qquad (4)$$

for an indexed family of distributions $F(\cdot)$, where $\theta_i$ are the emission parameters for mode $i$. We similarly define a switching VAR($r$) process by

$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t). \qquad (5)$$
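To make the generative processes of Eqs. (3) and (5) concrete, the sketch below simulates a switching VAR process. It is a minimal illustration, not the authors' code; the function and variable names (simulate_switching_var, As, Sigmas, pi) are ours, and the two-mode VAR(1) parameters at the end are arbitrary examples.

```python
import numpy as np

def simulate_switching_var(As, Sigmas, pi, T, rng):
    """Simulate Eq. (5). As[k] is the d x (d*r) matrix [A_1^{(k)} ... A_r^{(k)}]."""
    K = len(As)
    d = As[0].shape[0]
    r = As[0].shape[1] // d
    z = np.zeros(T, dtype=int)
    y = np.zeros((T, d))
    for t in range(T):
        # Markov mode evolution z_t | z_{t-1} ~ pi_{z_{t-1}}; uniform start.
        z[t] = rng.choice(K, p=pi[z[t - 1]]) if t > 0 else rng.integers(K)
        # Stack the r most recent observations (zeros before time 0).
        lags = np.concatenate([y[t - i] if t - i >= 0 else np.zeros(d)
                               for i in range(1, r + 1)])
        e = rng.multivariate_normal(np.zeros(d), Sigmas[z[t]])
        y[t] = As[z[t]] @ lags + e
    return y, z

# Example: two-mode VAR(1) with sticky (0.98 self-transition) mode dynamics.
rng = np.random.default_rng(0)
As = [0.9 * np.eye(2), np.array([[0.0, -0.8], [0.8, 0.0]])]
Sigmas = [0.1 * np.eye(2)] * 2
pi = np.array([[0.98, 0.02], [0.02, 0.98]])
y, z = simulate_switching_var(As, Sigmas, pi, T=500, rng=rng)
```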
B. Dirichlet Processes and the Sticky HDP-HMM

To examine a Bayesian nonparametric SLDS, and thus relax the assumption that the number of dynamical modes is known and fixed, it is useful to first analyze such methods for the simpler HMM. One can equivalently represent the finite HMM of Eq. (4) via a set of transition probability measures $G_j = \sum_{k=1}^{K} \pi_{jk} \delta_{\theta_k}$, where $\delta_\theta$ is a mass concentrated at $\theta$. We then operate directly in the parameter space $\Theta$ and transition between emission parameters with probabilities given by $\{G_j\}$. That is,

$$\theta'_t \mid \theta'_{t-1} \sim G_j, \;\; \theta'_{t-1} = \theta_j, \qquad y_t \mid \theta'_t \sim F(\theta'_t). \qquad (6)$$

Here, $\theta'_t \in \{\theta_1, \dots, \theta_K\}$ and is equivalent to $\theta_{z_t}$ of Eq. (4). A Bayesian nonparametric HMM takes $G_j$ to be random (Footnote 3: Formally, a random measure on a measurable space $\Theta$ with sigma algebra $\mathcal{A}$ is defined as a stochastic process whose index set is $\mathcal{A}$. That is, $G(A)$ is a random variable for each $A \in \mathcal{A}$.) with an infinite collection of atoms corresponding to the infinite HMM mode space.

The Dirichlet process (DP), denoted by DP($\gamma$, $H$), provides a distribution over discrete probability measures with an infinite collection of atoms

$$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}, \qquad \theta_k \sim H, \qquad (7)$$

on a parameter space $\Theta$. The weights are sampled via a stick-breaking construction [32]:

$$\beta_k = \nu_k \prod_{\ell=1}^{k-1} (1 - \nu_\ell), \qquad \nu_k \sim \text{Beta}(1, \gamma). \qquad (8)$$

In effect, we have divided a unit-length stick into lengths given by the weights $\beta_k$: the $k$th weight is a random proportion $\nu_k$ of the remaining stick after the previous $(k-1)$ weights have been defined. We denote this distribution by $\beta \sim \text{GEM}(\gamma)$.

The Dirichlet process has proven useful in many applications due to its clustering properties, which are clearly seen by examining the predictive distribution of draws $\theta'_i \sim G_0$. Because probability measures drawn from a Dirichlet process are discrete, there is a strictly positive probability of multiple observations $\theta'_i$ taking identical values within the set $\{\theta_k\}$, with $\theta_k$ defined as in Eq. (7). For each value $\theta'_i$, let $z_i$ be an indicator random variable that picks out the unique value $\theta_k$ such that $\theta'_i = \theta_{z_i}$. Blackwell and MacQueen [33] introduced a Polya urn representation of the $\theta'_i$:

$$\theta'_i \mid \theta'_1, \dots, \theta'_{i-1} \sim \frac{\gamma}{\gamma + i - 1} H + \sum_{j=1}^{i-1} \frac{1}{\gamma + i - 1} \delta_{\theta'_j} = \frac{\gamma}{\gamma + i - 1} H + \sum_{k=1}^{K} \frac{n_k}{\gamma + i - 1} \delta_{\theta_k}. \qquad (9)$$

Here, $n_k$ is the number of observations $\theta'_i$ taking the value $\theta_k$. From Eq. (9), and the discrete nature of $G_0$, we see a reinforcement property of the Dirichlet process that induces sparsity in the number of inferred mixture components.

Fig. 1. Sticky HDP-HMM prior on (a) switching VAR(2) and (b) SLDS processes, with the mode evolving as $z_{t+1} \mid \{\pi_k\}_{k=1}^{\infty}, z_t \sim \pi_{z_t}$ for $\pi_k \mid \alpha, \kappa, \beta \sim \text{DP}(\alpha + \kappa, (\alpha\beta + \kappa\delta_k)/(\alpha + \kappa))$. Here, $\beta \mid \gamma \sim \text{GEM}(\gamma)$ and $\theta_k \mid H, \lambda \sim H(\lambda)$. The dynamical processes are as in Table I.
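The stick-breaking construction of Eq. (8) is straightforward to sample. Below is a minimal sketch, assuming a finite truncation at L atoms for computation (the truncated weights are renormalized); the function name stick_breaking is ours.

```python
import numpy as np

def stick_breaking(gamma, L, rng):
    nu = rng.beta(1.0, gamma, size=L)                      # nu_k ~ Beta(1, gamma)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - nu[:-1])))
    beta = nu * remaining                                  # beta_k = nu_k * prod_{l<k}(1 - nu_l)
    return beta / beta.sum()                               # renormalize the truncation

rng = np.random.default_rng(1)
beta = stick_breaking(gamma=1.0, L=20, rng=rng)            # approximate beta ~ GEM(gamma)
```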
A hierarchical extension of the Dirichlet process, the hierarchical Dirichlet process (HDP) [12], has proven useful in defining a prior on the set of HMM transition probability measures $G_j$. The HDP defines a collection of probability measures $\{G_j\}$ on the same support points $\{\theta_1, \theta_2, \dots\}$ by assuming that each discrete measure $G_j$ is a variation on a global discrete measure $G_0$. Specifically, the Bayesian hierarchical specification takes $G_j \sim \text{DP}(\alpha, G_0)$, with $G_0$ itself a draw from a Dirichlet process DP($\gamma$, $H$). Through this construction, one can show that the probability measures are described as

$$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}, \quad \beta \mid \gamma \sim \text{GEM}(\gamma); \qquad G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_k}, \quad \pi_j \mid \alpha, \beta \sim \text{DP}(\alpha, \beta); \qquad \theta_k \mid H \sim H. \qquad (10)$$

Applying the HDP prior to the HMM, we obtain the HDP-HMM of Teh et al. [12]. This corresponds to the model in Fig. 1(a), but without the edges between the observations. By defining $\pi_j \sim \text{DP}(\alpha, \beta)$, the HDP prior encourages modes to have similar transition distributions. Namely, the mode-specific transition distributions are identical in expectation:

$$E[\pi_{jk} \mid \beta] = \beta_k. \qquad (11)$$

However, it does not differentiate self-transitions from moves between modes. When modeling dynamical processes with mode persistence, the flexible nature of the HDP-HMM prior allows mode sequences with unrealistically fast dynamics to have large posterior probability. Recently, it has been shown [14] that one may mitigate this problem by instead considering a sticky HDP-HMM where $\pi_j$ is distributed as follows:

$$\pi_j \mid \beta, \alpha, \kappa \sim \text{DP}\left(\alpha + \kappa, \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right). \qquad (12)$$

Here, $(\alpha\beta + \kappa\delta_j)$ indicates that an amount $\kappa > 0$ is added to the $j$th component of $\alpha\beta$. This construction increases the expected probability of self-transition by an amount proportional to $\kappa$. Specifically, the expected set of weights for transition distribution $\pi_j$ is a convex combination of those defined by $\beta$ and a mode-specific weight defined by $\kappa$:

$$E[\pi_{jk} \mid \beta, \alpha, \kappa] = \frac{\alpha}{\alpha + \kappa} \beta_k + \frac{\kappa}{\alpha + \kappa} \delta(j, k). \qquad (13)$$

When $\kappa = 0$ the original HDP-HMM of Teh et al. [12] is recovered. We place a prior on $\kappa$ and learn the self-transition bias from the data.
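Under an order-L weak-limit approximation to the HDP (used later for inference), each row of Eq. (12) reduces to a finite Dirichlet draw with extra mass kappa on the self-transition entry. A minimal sketch, with illustrative names and a stand-in for beta ~ GEM(gamma):

```python
import numpy as np

def sample_sticky_rows(beta, alpha, kappa, rng):
    """Sample pi_j ~ Dir(alpha*beta + kappa*delta_j) for j = 1..L (Eq. (12))."""
    L = len(beta)
    pi = np.empty((L, L))
    for j in range(L):
        conc = alpha * beta.copy()
        conc[j] += kappa                     # extra mass on the self-transition
        pi[j] = rng.dirichlet(conc)
    return pi

rng = np.random.default_rng(2)
beta = rng.dirichlet(np.ones(20))            # stand-in for a truncated GEM draw
pi = sample_sticky_rows(beta, alpha=1.0, kappa=10.0, rng=rng)
```

Larger kappa concentrates each row on its diagonal entry, encouraging the persistent modes that Eq. (13) describes in expectation.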
TABLE I: Dynamic equations for the HDP-AR-HMM and HDP-SLDS. Here, $\pi_j$ is as defined in Eq. (12) for the sticky HDP-HMM. The additive noise processes are distributed as $e_t(k) \sim \mathcal{N}(0, \Sigma^{(k)})$ and $w_t \sim \mathcal{N}(0, R)$.

                       HDP-AR-HMM                                              HDP-SLDS
Mode dynamics          $z_t \mid z_{t-1} \sim \pi_{z_{t-1}}$                    $z_t \mid z_{t-1} \sim \pi_{z_{t-1}}$
Observation dynamics   $y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t)$    $x_t = A^{(z_t)} x_{t-1} + e_t(z_t)$, $y_t = C x_t + w_t$

TABLE II: Notational conveniences used in describing the Gibbs sampler for the HDP-AR-HMM and HDP-SLDS.

                          HDP-AR-HMM                                                               HDP-SLDS
Dynamic matrix            $A^{(k)} = [A_1^{(k)} \cdots A_r^{(k)}] \in \mathbb{R}^{d \times (d r)}$  $A^{(k)} \in \mathbb{R}^{n \times n}$
Pseudo-observations       $\psi_t = y_t$                                                            $\psi_t = x_t$
Lag pseudo-observations   $\bar{\psi}_t = [y_{t-1}^T \cdots y_{t-r}^T]^T$                           $\bar{\psi}_t = x_{t-1}$

III. THE HDP-SLDS AND HDP-AR-HMM

We now consider a significant extension of the sticky HDP-HMM to both SLDS and VAR modeling. These models capture the dynamic structure underlying the observations by switching among an unknown number of modes with unknown dynamics; the Bayesian nonparametric framework captures these uncertainties, allowing us both to learn the number of modes and to estimate the system state. Fig. 1(b) illustrates the HDP-SLDS model, while Fig. 1(a) illustrates the HDP-AR-HMM model (for the case of VAR(2)). The generative processes for these two models are summarized in Table I.

For the HDP-SLDS, we place priors on the dynamic parameters $\{A^{(k)}, \Sigma^{(k)}\}$ and on the measurement noise $R$, and infer their posterior from the data. However, without loss of generality (Footnote 4: This is, in essence, an issue of choosing a similarity transformation for the state of a minimal system, exploiting the fact that the measurement matrix is shared by all modes of the HDP-SLDS so that the same transformation can be used for all modes.), we fix the measurement matrix to $C = [I_d \; 0]$, implying that it is the first $d$ components of the state that are measured. Our choice of the state dimension $n$ is, in essence, a choice of model order, an issue we address in Sec. III-A2. For the HDP-AR-HMM, we similarly place a prior on the dynamic parameters, which in this case consist of $\{A_1^{(k)}, \dots, A_r^{(k)}, \Sigma^{(k)}\}$.

In Sec. III-B we derive a Gibbs sampling inference scheme for our models. There is, of course, a difference between the steps required for the SLDS-based model (in which there is an unobserved continuous-valued state $x_t$) and the AR-based model. In particular, for the HDP-SLDS the algorithm iterates among the following steps:
1) Sample the state sequence $x_{1:T}$ given the mode sequence $z_{1:T}$ and SLDS parameters $\{A^{(k)}, \Sigma^{(k)}, R\}$.
2) Sample the mode sequence $z_{1:T}$ given the state sequence $x_{1:T}$, HMM parameters $\{\pi_k\}$, and dynamic parameters $\{A^{(k)}, \Sigma^{(k)}\}$.
3) Sample the HMM parameters $\{\pi_k\}$ and SLDS parameters $\{A^{(k)}, \Sigma^{(k)}, R\}$ given the sequences $z_{1:T}$, $x_{1:T}$, and $y_{1:T}$.
For the HDP-AR-HMM, step (1) does not exist. Step (2) then involves sampling the mode sequence $z_{1:T}$ given the observations $y_{1:T}$ (rather than $x_{1:T}$), and step (3) involves conditioning solely on the sequences $z_{1:T}$ and $y_{1:T}$ (not $x_{1:T}$). Also, we note that step (2) involves a fairly straightforward extension of the sampling method developed in [14] for the simpler HDP-HMM model; the other steps, however, involve new constructs, as they involve capturing and dealing with the temporal dynamics of the underlying continuous state models. Sec. III-A provides the necessary priors and structure of the posteriors needed to develop these steps.

A. Priors and Posteriors of Dynamic Parameters

We begin by developing a prior to regularize the learning of the dynamic parameters (and measurement noise) conditioned on a fixed mode assignment $z_{1:T}$. To make the connections between the samplers for the HDP-SLDS and HDP-AR-HMM explicit, we introduce the concept of pseudo-observations $\psi_{1:T}$ and rewrite the dynamic equation for both the HDP-SLDS and HDP-AR-HMM generically as

$$\psi_t = A^{(k)} \bar{\psi}_{t-1} + e_t, \qquad (14)$$

where we utilize the definitions outlined in Table II. For the HDP-AR-HMM, we have simply written the dynamic equation in Table I in matrix form by concatenating the lag matrices into a single matrix $A^{(k)}$ and forming a lag observation vector $\bar{\psi}_t$ comprised of a series of previous observation vectors. For this section (for the HDP-SLDS), we assume such a sample of the state sequence $x_{1:T}$ (and hence $\{\psi_t, \bar{\psi}_t\}$) is available so that Eq. (14) applies equally well to both the HDP-SLDS and the HDP-AR-HMM. Methods for resampling this state sequence are discussed in Sec. III-B.
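A minimal sketch of the pseudo-observation convention of Eq. (14) and Table II; the helper names are ours. For the HDP-AR-HMM, psi_t = y_t and the lag vector stacks the r past observations; for the HDP-SLDS, psi_t = x_t and the lag vector is simply x_{t-1}.

```python
import numpy as np

def make_pseudo_obs_ar(y, r):
    """HDP-AR-HMM: rows of psi are psi_t = y_t for t = r..T-1; rows of psibar
    are the corresponding stacked lags [y_{t-1}, ..., y_{t-r}]."""
    T, d = y.shape
    psi = y[r:]
    psibar = np.hstack([y[r - i:T - i] for i in range(1, r + 1)])
    return psi, psibar                       # shapes (T-r, d) and (T-r, d*r)

def make_pseudo_obs_slds(x):
    """HDP-SLDS: psi_t = x_t, psibar_t = x_{t-1}."""
    return x[1:], x[:-1]
```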
Conditioned on the mode sequence, one may partition this dynamic sequence into $K$ different linear regression problems, where $K = |\{z_1, \dots, z_T\}|$. That is, for each mode $k$, we may form a matrix $\Psi^{(k)}$ with $n_k$ columns consisting of the $\psi_t$ with $z_t = k$. Then,

$$\Psi^{(k)} = A^{(k)} \bar{\Psi}^{(k)} + E^{(k)}, \qquad (15)$$

where $\bar{\Psi}^{(k)}$ is a matrix of the associated $\bar{\psi}_{t-1}$, and $E^{(k)}$ the associated noise vectors.

1) Conjugate Prior on $\{A^{(k)}, \Sigma^{(k)}\}$: The matrix-normal inverse-Wishart (MNIW) prior [34] is conjugate to the likelihood model defined in Eq. (15) for the parameter set $\{A^{(k)}, \Sigma^{(k)}\}$. Although this prior is typically used for inferring the parameters of a single linear regression problem, it is equally applicable to our scenario since the linear regression problems of Eq. (15) are independent conditioned on the mode sequence $z_{1:T}$. We note that although the MNIW prior does not enforce stability constraints on each mode, this prior is still a reasonable choice since each mode need not have stable dynamics for the SLDS to be stable [35], and conditioned on data from a stable mode, the posterior distribution will likely be sharply peaked around stable dynamic matrices.

Let $D^{(k)} = \{\Psi^{(k)}, \bar{\Psi}^{(k)}\}$. The posterior distribution of the dynamic parameters for the $k$th mode decomposes as

$$p(A^{(k)}, \Sigma^{(k)} \mid D^{(k)}) = p(A^{(k)} \mid \Sigma^{(k)}, D^{(k)})\, p(\Sigma^{(k)} \mid D^{(k)}). \qquad (16)$$

The resulting posterior of $A^{(k)}$ is straightforwardly derived to be (see [36])

$$p(A^{(k)} \mid \Sigma^{(k)}, D^{(k)}) = \mathcal{MN}\left(A^{(k)};\; S_{\psi\bar{\psi}}^{(k)} S_{\bar{\psi}\bar{\psi}}^{-(k)},\; \Sigma^{(k)},\; S_{\bar{\psi}\bar{\psi}}^{(k)}\right), \qquad (17)$$

with $B^{-(k)}$ denoting $(B^{(k)})^{-1}$ for a given matrix $B$, $\mathcal{MN}(A; M, K, V)$ denoting a matrix-normal distribution with mean matrix $M$ and left and right covariances $K$ and $V$, and

$$S_{\bar{\psi}\bar{\psi}}^{(k)} = \bar{\Psi}^{(k)} \bar{\Psi}^{(k)T} + K, \qquad S_{\psi\bar{\psi}}^{(k)} = \Psi^{(k)} \bar{\Psi}^{(k)T} + MK, \qquad S_{\psi\psi}^{(k)} = \Psi^{(k)} \Psi^{(k)T} + MKM^T. \qquad (18)$$

The marginal posterior of $\Sigma^{(k)}$ is

$$p(\Sigma^{(k)} \mid D^{(k)}) = \text{IW}\left(n_k + n_0,\; S_{\psi|\bar{\psi}}^{(k)} + S_0\right), \qquad (19)$$

where $\text{IW}(n_0, S_0)$ denotes an inverse-Wishart prior with $n_0$ degrees of freedom and scale matrix $S_0$, updated by the data terms $S_{\psi|\bar{\psi}}^{(k)} = S_{\psi\psi}^{(k)} - S_{\psi\bar{\psi}}^{(k)} S_{\bar{\psi}\bar{\psi}}^{-(k)} S_{\psi\bar{\psi}}^{(k)T}$ and $n_k = |\{t \mid z_t = k,\; t = 1, \dots, T\}|$.
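A minimal sketch of sampling from the MNIW posterior of Eqs. (17)-(19) for one mode. It is our illustration, not the authors' code: M and Kmat are the matrix-normal prior mean and right-precision hyperparameters, and we draw the matrix-normal variate via its vectorized covariance $S_{\bar{\psi}\bar{\psi}}^{-1} \otimes \Sigma$ (assuming $n_0$ is large enough for scipy's inverse-Wishart, which requires the degrees of freedom to be at least the dimension).

```python
import numpy as np
from scipy.stats import invwishart

def sample_mniw_posterior(Psi, Psibar, M, Kmat, n0, S0, rng):
    """Psi: d x n_k matrix of psi_t with z_t = k; Psibar: p x n_k lag matrix."""
    Sbb = Psibar @ Psibar.T + Kmat                 # Eq. (18)
    Spb = Psi @ Psibar.T + M @ Kmat
    Spp = Psi @ Psi.T + M @ Kmat @ M.T
    Sbb_inv = np.linalg.inv(Sbb)
    S_cond = Spp - Spb @ Sbb_inv @ Spb.T           # S_{psi|psibar}
    n_k = Psi.shape[1]
    Sigma = invwishart(df=n_k + n0, scale=S_cond + S0).rvs()   # Eq. (19)
    # Eq. (17): A | Sigma is matrix-normal with mean Spb Sbb^{-1};
    # cov(vec(A)) = Sbb^{-1} kron Sigma, sampled via Cholesky factors.
    G = rng.standard_normal((Psi.shape[0], Psibar.shape[0]))
    A = (Spb @ Sbb_inv
         + np.linalg.cholesky(Sigma) @ G @ np.linalg.cholesky(Sbb_inv).T)
    return A, Sigma
```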
2) Alternative Prior (Automatic Relevance Determination): The MNIW prior leads to full $A^{(k)}$ matrices, which (i) becomes problematic as the model order grows in the presence of limited data; and (ii) does not provide a method for identifying irrelevant model components (i.e., state components in the case of the HDP-SLDS or lag components in the case of the HDP-AR-HMM). To jointly address these issues, we alternatively consider automatic relevance determination (ARD) [15]-[17], which encourages driving components of the model parameters to zero if their presence is not supported by the data. For the HDP-SLDS, we harness the concepts of ARD by placing independent, zero-mean, spherically symmetric Gaussian priors on the columns of the dynamic matrix $A^{(k)}$:

$$p(A^{(k)} \mid \alpha^{(k)}) = \prod_{j=1}^{n} \mathcal{N}\left(a_j^{(k)}; 0, \alpha_j^{-(k)} I_n\right). \qquad (20)$$

Each precision parameter $\alpha_j^{(k)}$ is given a Gamma($a$, $b$) prior. The zero-mean Gaussian prior penalizes non-zero columns of the dynamic matrix by an amount determined by the precision parameters. Iterative estimation of these hyperparameters $\alpha_j^{(k)}$ and the dynamic matrix $A^{(k)}$ leads to $\alpha_j^{(k)}$ becoming large for columns whose evidence in the data is insufficient for overcoming the penalty induced by the prior. Having $\alpha_j^{(k)} \to \infty$ drives $a_j^{(k)} \to 0$, implying that the $j$th state component does not contribute to the dynamics of the $k$th mode. Thus, examining the set of large $\alpha_j^{(k)}$ provides insight into the order of that mode. Looking at the $k$th dynamical mode alone, having $a_j^{(k)} = 0$ implies that the realization of that mode is not minimal, since the associated Hankel matrix

$$H = \begin{bmatrix} C^T & (CA)^T & \cdots & (CA^{d-1})^T \end{bmatrix}^T \begin{bmatrix} G & AG & \cdots & A^{d-1}G \end{bmatrix} \equiv \mathcal{O}\mathcal{R} \qquad (21)$$

has reduced rank. However, the overall SLDS realization may still be minimal. For our use of the ARD prior, we restrict attention to models satisfying the property that the state components that are observed are relevant to all modes of the dynamics:

Criterion 3.1: If for some realization $\mathcal{R}$ a mode $k$ has $a_j^{(k)} = 0$, then that realization must have $c_j = 0$, where $c_j$ is the $j$th column of $C$.

Here we assume, without loss of generality, that the observed states are the first components of the state vector. This assumption implies that our choice of $C = [I_d \; 0]$ does not interfere with learning a sparse realization (Footnote 5: If there does not exist a realization $\mathcal{R}$ satisfying Criterion 3.1, we may instead consider a more general model where the measurement equation is mode-specific and we place a prior on $C^{(k)}$ instead of fixing this matrix. However, this model leads to identifiability issues that are considerably less pronounced in the above case.).

The ARD prior may also be used to learn variable-order switching VAR processes. Here, the goal is to "turn off" entire lag blocks $A_i^{(k)}$ (whereas in the HDP-SLDS we were interested in eliminating columns of the dynamic matrix). Instead of placing independent Gaussian priors on each column of $A^{(k)}$ as we did in Eq. (20), we decompose the prior over the lag blocks $A_i^{(k)}$:

$$p(A^{(k)} \mid \alpha^{(k)}) = \prod_{i=1}^{r} \mathcal{N}\left(\text{vec}(A_i^{(k)}); 0, \alpha_i^{-(k)} I_{d^2}\right). \qquad (22)$$

Since each element of a given lag block $A_i^{(k)}$ is distributed according to the same precision parameter $\alpha_i^{(k)}$, if that parameter becomes large the entire lag block will tend to zero.

In order to examine the posterior distribution of the dynamic matrix $A^{(k)}$, it is useful to consider the Gaussian induced by Eq. (20) and Eq. (22) on a vectorization of $A^{(k)}$. Our ARD prior on $A^{(k)}$ is equivalent to a $\mathcal{N}(0, \Sigma_0^{(k)})$ prior on $\text{vec}(A^{(k)})$, where

$$\Sigma_0^{(k)} = \text{diag}\left(\alpha_1^{(k)}, \dots, \alpha_1^{(k)}, \dots, \alpha_m^{(k)}, \dots, \alpha_m^{(k)}\right)^{-1}. \qquad (23)$$

Here, $m = n$ for the HDP-SLDS with $n$ replicates of each $\alpha_i^{(k)}$, and $m = r$ for the HDP-AR-HMM with $d^2$ replicates of each $\alpha_i^{(k)}$. (Recall that $n$ is the dimension of the HDP-SLDS state vector $x_t$, $r$ the autoregressive order of the HDP-AR-HMM, and $d$ the dimension of the observations $y_t$.) To examine the posterior distribution of $A^{(k)}$, we note that we may rewrite the state equation as

$$\psi_{t+1} = \begin{bmatrix} \bar{\psi}_{t,1} I_\ell & \bar{\psi}_{t,2} I_\ell & \cdots & \bar{\psi}_{t,\ell r} I_\ell \end{bmatrix} \text{vec}(A^{(k)}) + e_{t+1}(k) \triangleq \tilde{\Psi}_t\, \text{vec}(A^{(k)}) + e_{t+1}(k) \quad \forall\, t \mid z_t = k, \qquad (24)$$

where $\ell = n$ for the HDP-SLDS and $\ell = d$ for the HDP-AR-HMM. Using Eq. (24), we derive the posterior distribution as

$$p(\text{vec}(A^{(k)}) \mid D^{(k)}, \Sigma^{(k)}, \alpha^{(k)}) = \mathcal{N}^{-1}\left(\sum_{t \mid z_t = k} \tilde{\Psi}_{t-1}^T \Sigma^{-(k)} \psi_t,\; \Sigma_0^{-(k)} + \sum_{t \mid z_t = k} \tilde{\Psi}_{t-1}^T \Sigma^{-(k)} \tilde{\Psi}_{t-1}\right). \qquad (25)$$

See [36] for a detailed derivation. Here, $\mathcal{N}^{-1}(\vartheta, \Lambda)$ represents a Gaussian $\mathcal{N}(\mu, \Sigma)$ with information parameters $\vartheta = \Sigma^{-1}\mu$ and $\Lambda = \Sigma^{-1}$.
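A minimal sketch of drawing $\text{vec}(A^{(k)})$ from the information-form Gaussian of Eq. (25); the function and argument names are ours. Psit_list holds the $\tilde{\Psi}_t$ matrices of Eq. (24) for the time steps with $z_t = k$, paired with the corresponding $\psi_t$ vectors.

```python
import numpy as np

def sample_vec_A_ard(psi_list, Psit_list, Sigma_inv, Sigma0_inv, rng):
    p = Sigma0_inv.shape[0]
    Lam = Sigma0_inv.copy()                    # prior precision from ARD, Eq. (23)
    theta = np.zeros(p)
    for psi_t, Psit in zip(psi_list, Psit_list):
        Lam += Psit.T @ Sigma_inv @ Psit       # accumulate data precision
        theta += Psit.T @ Sigma_inv @ psi_t    # accumulate information vector
    cov = np.linalg.inv(Lam)                   # convert information form to moments
    return rng.multivariate_normal(cov @ theta, cov)
```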
Given $A^{(k)}$, and recalling that each precision parameter is gamma distributed, the posterior of $\alpha_\ell^{(k)}$ is given by

$$p(\alpha_\ell^{(k)} \mid A^{(k)}) = \text{Gamma}\left(a + \frac{|\mathcal{S}_\ell|}{2},\; b + \frac{1}{2}\sum_{(i,j) \in \mathcal{S}_\ell} a_{ij}^{(k)2}\right). \qquad (26)$$

The set $\mathcal{S}_\ell$ contains the indices for which $a_{ij}^{(k)}$ has prior precision $\alpha_\ell^{(k)}$. Note that in this model, regardless of the number of observations $y_t$, the size of $\mathcal{S}_\ell$ (i.e., the number of $a_{ij}^{(k)}$ used to inform the posterior distribution) remains the same. Thus, the gamma prior is an informative prior, and the choice of $a$ and $b$ should depend upon the cardinality of $\mathcal{S}_\ell$. For the HDP-SLDS, this cardinality is given by the maximal state dimension $n$, and for the HDP-AR-HMM, by the square of the observation dimensionality, $d^2$.

We then place an inverse-Wishart prior IW($n_0$, $S_0$) on $\Sigma^{(k)}$ and look at the posterior given $A^{(k)}$:

$$p(\Sigma^{(k)} \mid D^{(k)}, A^{(k)}) = \text{IW}\left(n_k + n_0,\; S_{\psi|\bar{\psi}}^{(k)} + S_0\right), \qquad (27)$$

where here, as opposed to in Eq. (19), we define

$$S_{\psi|\bar{\psi}}^{(k)} = \sum_{t \mid z_t = k} (\psi_t - A^{(k)}\bar{\psi}_{t-1})(\psi_t - A^{(k)}\bar{\psi}_{t-1})^T. \qquad (28)$$

3) Measurement Noise Posterior: For the HDP-SLDS, we additionally place an IW($r_0$, $R_0$) prior on the measurement noise covariance $R$. The posterior distribution is given by

$$p(R \mid y_{1:T}, x_{1:T}) = \text{IW}(T + r_0,\; S_R + R_0), \qquad (29)$$

where $S_R = \sum_{t=1}^{T} (y_t - Cx_t)(y_t - Cx_t)^T$. Here, we assume that $R$ is shared between modes. The extension to mode-specific measurement noise is straightforward.

B. Gibbs Sampler

For inference in the HDP-AR-HMM, we use a Gibbs sampler that iterates between sampling the mode sequence $z_{1:T}$ and the set of dynamic and sticky HDP-HMM parameters. The sampler for the HDP-SLDS is identical, with the additional step of sampling the state sequence $x_{1:T}$ and conditioning on this sequence when resampling dynamic parameters and the mode sequence. Periodically, we interleave a step that sequentially samples the mode sequence $z_{1:T}$ marginalizing over the state sequence $x_{1:T}$, in a similar vein to that of Carter and Kohn [37]. We describe the sampler in terms of the pseudo-observations $\psi_t$, as defined by Eq. (14), in order to clearly specify the sections of the sampler shared by both the HDP-AR-HMM and HDP-SLDS.

1) Sampling Dynamic Parameters $\{A^{(k)}, \Sigma^{(k)}\}$: Conditioned on the mode sequence $z_{1:T}$ and the pseudo-observations $\psi_{1:T}$, we can sample the dynamic parameters $\theta = \{A^{(k)}, \Sigma^{(k)}\}$ from the posterior densities of Sec. III-A. For the ARD prior, we then sample $\alpha^{(k)}$ given $A^{(k)}$. In practice we iterate multiple times between sampling $\alpha^{(k)}$ given $A^{(k)}$ and $A^{(k)}$ given $\alpha^{(k)}$ before moving to the next sampling stage.

2) Sampling Measurement Noise $R$ (HDP-SLDS only): For the HDP-SLDS, we additionally sample the measurement noise covariance $R$ conditioned on the sampled state sequence $x_{1:T}$.
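Sampling stage 1 under the ARD prior alternates the conditionals of Eqs. (25)-(28). A minimal sketch of the two remaining updates, the Gamma draw for each precision (Eq. (26)) and the inverse-Wishart draw for $\Sigma^{(k)}$ (Eqs. (27)-(28)); names are ours, and NumPy's gamma sampler is parameterized by shape and scale, so the rate of Eq. (26) is inverted.

```python
import numpy as np
from scipy.stats import invwishart

def sample_ard_precisions(A, groups, a, b, rng):
    """groups[l] is the index set S_l of entries of A sharing precision alpha_l."""
    alphas = []
    for idx in groups:
        ss = sum(A[i, j] ** 2 for (i, j) in idx)
        alphas.append(rng.gamma(a + len(idx) / 2.0, 1.0 / (b + ss / 2.0)))
    return np.array(alphas)

def sample_sigma(psi, psibar, A, n0, S0):
    """psi, psibar stacked row-wise over the t with z_t = k (Eq. (28))."""
    resid = psi - psibar @ A.T
    S_cond = resid.T @ resid
    return invwishart(df=len(psi) + n0, scale=S_cond + S0).rvs()
```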
3) Block Sampling $z_{1:T}$: As shown in [14], the mixing rate of the Gibbs sampler for the HDP-HMM can be dramatically improved by using a truncated approximation to the HDP and jointly sampling the mode sequence using a variant of the forward-backward algorithm. In the case of our switching dynamical systems, we must account for the direct correlations in the observations in our likelihood computation. The variant of the forward-backward algorithm we use here then involves computing backward messages $m_{t+1,t}(z_t) \propto p(\psi_{t+1:T} \mid z_t, \bar{\psi}_t, \pi, \theta)$ for each $z_t \in \{1, \dots, L\}$, with $L$ the chosen truncation level, followed by recursively sampling each $z_t$ conditioned on $z_{t-1}$ from

$$p(z_t \mid z_{t-1}, \psi_{1:T}, \pi, \theta) \propto p(z_t \mid \pi_{z_{t-1}})\, p(\psi_t \mid \bar{\psi}_{t-1}, A^{(z_t)}, \Sigma^{(z_t)})\, m_{t+1,t}(z_t). \qquad (30)$$

Joint sampling of the mode sequence is especially important when the observations are directly correlated via a dynamical process, since this correlation further slows the mixing rate of the sequential sampler of Teh et al. [12]. Note that using an order-$L$ weak limit approximation to the HDP still encourages the use of a sparse subset of the $L$ possible dynamical modes.

4) Block Sampling $x_{1:T}$ (HDP-SLDS only): Conditioned on the mode sequence $z_{1:T}$ and the set of SLDS parameters $\theta = \{A^{(k)}, \Sigma^{(k)}, R\}$, our dynamical process simplifies to a time-varying linear dynamical system. We can then block sample $x_{1:T}$ by first running a backward Kalman filter to compute $m_{t+1,t}(x_t) \propto p(y_{t+1:T} \mid x_t, z_{t+1:T}, \theta)$ and then recursively sampling each $x_t$ conditioned on $x_{t-1}$ from

$$p(x_t \mid x_{t-1}, y_{1:T}, z_{1:T}, \theta) \propto p(x_t \mid x_{t-1}, A^{(z_t)}, \Sigma^{(z_t)})\, p(y_t \mid x_t, R)\, m_{t+1,t}(x_t). \qquad (31)$$

The messages are given in information form by $m_{t,t-1}(x_{t-1}) \propto \mathcal{N}^{-1}(x_{t-1}; \vartheta_{t,t-1}, \Lambda_{t,t-1})$, where the information parameters are recursively defined as

$$\vartheta_{t,t-1} = A^{(z_t)T} \Sigma^{-(z_t)} \tilde{\Lambda}_t (C^T R^{-1} y_t + \vartheta_{t+1,t}), \qquad \Lambda_{t,t-1} = A^{(z_t)T} \Sigma^{-(z_t)} A^{(z_t)} - A^{(z_t)T} \Sigma^{-(z_t)} \tilde{\Lambda}_t \Sigma^{-(z_t)} A^{(z_t)}, \qquad (32)$$

with $\tilde{\Lambda}_t = (\Sigma^{-(z_t)} + C^T R^{-1} C + \Lambda_{t+1,t})^{-1}$. The standard updated information parameters $\vartheta_{t|t}^b$ and $\Lambda_{t|t}^b$ for a backward running Kalman filter are given by

$$\Lambda_{t|t}^b = C^T R^{-1} C + \Lambda_{t+1,t}, \qquad \vartheta_{t|t}^b = C^T R^{-1} y_t + \vartheta_{t+1,t}. \qquad (33)$$

See [36] for a derivation and for a more numerically stable version of this recursion.
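The paper implements this step with the backward information filter of Eqs. (32)-(33) followed by forward sampling. As a hedged illustration only, the sketch below implements the mathematically equivalent forward-filter backward-sampler in covariance form for the time-varying LDS obtained by conditioning on $z_{1:T}$; the function name ffbs and all argument names are ours, with As[t] and Sigmas[t] holding the mode-indexed parameters $A^{(z_t)}$, $\Sigma^{(z_t)}$.

```python
import numpy as np

def ffbs(y, As, Sigmas, C, R, mu0, P0, rng):
    T, n = len(y), mu0.shape[0]
    ms, Ps = np.zeros((T, n)), np.zeros((T, n, n))
    m, P = mu0, P0                              # initial prior on x_0
    for t in range(T):                          # forward Kalman filter
        if t > 0:                               # predict through the t-th mode
            m, P = As[t] @ m, As[t] @ P @ As[t].T + Sigmas[t]
        S = C @ P @ C.T + R                     # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)
        m, P = m + K @ (y[t] - C @ m), P - K @ C @ P
        ms[t], Ps[t] = m, P
    x = np.zeros((T, n))                        # backward sampling pass
    x[T - 1] = rng.multivariate_normal(ms[T - 1], Ps[T - 1])
    for t in range(T - 2, -1, -1):
        A, Q = As[t + 1], Sigmas[t + 1]
        J = Ps[t] @ A.T @ np.linalg.inv(A @ Ps[t] @ A.T + Q)
        mean = ms[t] + J @ (x[t + 1] - A @ ms[t])
        cov = Ps[t] - J @ A @ Ps[t]
        x[t] = rng.multivariate_normal(mean, cov)
    return x
```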
5) Sequentially Sampling $z_{1:T}$ (HDP-SLDS only): For the HDP-SLDS, iterating between the previous sampling stages can lead to slow mixing rates, since the mode sequence is sampled conditioned on a sample of the state sequence. For high-dimensional state spaces $\mathbb{R}^n$, this problem is exacerbated. Instead, one can analytically marginalize the state sequence and sequentially sample the mode sequence from $p(z_t \mid z_{\setminus t}, y_{1:T}, \pi, \theta)$. This marginalization is accomplished by once again harnessing the fact that, conditioned on the mode sequence, our model reduces to a time-varying linear dynamical system. When sampling $z_t$ and conditioning on the mode sequence at all other time steps, we can run a forward Kalman filter to marginalize the state sequence $x_{1:t-2}$, producing $p(x_{t-1} \mid y_{1:t-1}, z_{1:t-1}, \theta)$, and a backward filter to marginalize $x_{t+1:T}$, producing $p(y_{t+1:T} \mid x_t, z_{t+1:T}, \theta)$. Then, for each possible value of $z_t$, we combine these forward and backward messages with the local likelihood $p(y_t \mid x_t)$ and local dynamic $p(x_t \mid x_{t-1}, \theta, z_t = k)$ and marginalize over $x_t$ and $x_{t-1}$, resulting in the likelihood of the observation sequence $y_{1:T}$ as a function of $z_t$. This likelihood is combined with the prior probability of transitioning from $z_{t-1}$ to $z_t = k$ and from $z_t = k$ to $z_{t+1}$. The resulting distribution is given by

$$p(z_t = k \mid z_{\setminus t}, y_{1:T}, \pi, \theta) \propto \pi_{z_{t-1}}(k)\, \pi_k(z_{t+1})\, \frac{|\Lambda_t^{(k)}|^{1/2}}{|\Lambda_t^{(k)} + \Lambda_{t|t}^b|^{1/2}} \exp\left\{-\tfrac{1}{2}\vartheta_t^{(k)T} \Lambda_t^{-(k)} \vartheta_t^{(k)} + \tfrac{1}{2}(\vartheta_t^{(k)} + \vartheta_{t|t}^b)^T (\Lambda_t^{(k)} + \Lambda_{t|t}^b)^{-1} (\vartheta_t^{(k)} + \vartheta_{t|t}^b)\right\}, \qquad (34)$$

with

$$\Lambda_t^{(k)} = (\Sigma^{(k)} + A^{(k)} \Lambda_{t-1|t-1}^{-f} A^{(k)T})^{-1}, \qquad \vartheta_t^{(k)} = (\Sigma^{(k)} + A^{(k)} \Lambda_{t-1|t-1}^{-f} A^{(k)T})^{-1} A^{(k)} \Lambda_{t-1|t-1}^{-f} \vartheta_{t-1|t-1}^f. \qquad (35)$$

See [36] for full derivations. Here, $\vartheta_{t|t}^f$ and $\Lambda_{t|t}^f$ are the updated information parameters for a forward running Kalman filter, defined recursively as

$$\Lambda_{t|t}^f = C^T R^{-1} C + \Sigma^{-(z_t)} - \Sigma^{-(z_t)} A^{(z_t)} (A^{(z_t)T} \Sigma^{-(z_t)} A^{(z_t)} + \Lambda_{t-1|t-1}^f)^{-1} A^{(z_t)T} \Sigma^{-(z_t)},$$
$$\vartheta_{t|t}^f = C^T R^{-1} y_t + \Sigma^{-(z_t)} A^{(z_t)} (A^{(z_t)T} \Sigma^{-(z_t)} A^{(z_t)} + \Lambda_{t-1|t-1}^f)^{-1} \vartheta_{t-1|t-1}^f. \qquad (36)$$

Note that a sequential node ordering for this sampling step allows for efficient updates to the recursively defined filter parameters. However, this sequential sampling is still computationally intensive, so our Gibbs sampler iterates between blocked sampling of the state and mode sequences many times before interleaving a sequential mode sequence sampling step. The resulting Gibbs sampler is outlined in Algorithm 1.

IV. RESULTS

A. MNIW prior

We begin by examining a set of three synthetic datasets displayed in Fig. 2(a) in order to analyze the relative modeling power of the HDP-VAR(1)-HMM (Footnote 6: We use the notation HDP-VAR($r$)-HMM to refer to an order-$r$ HDP-AR-HMM with vector observations.), HDP-VAR(2)-HMM, and HDP-SLDS using the MNIW prior. We compare to a baseline sticky HDP-HMM using first-difference observations, imitating an HDP-VAR(1)-HMM with $A^{(k)} = I$ for all $k$.

Given a previous set of mode-specific transition probabilities $\pi^{(n-1)}$, the global transition distribution $\beta^{(n-1)}$, and dynamic parameters $\theta^{(n-1)}$:
1) Set $\pi = \pi^{(n-1)}$, $\beta = \beta^{(n-1)}$, and $\theta = \theta^{(n-1)}$.
2) If HDP-SLDS:
   a) For each $t \in \{1, \dots, T\}$, compute $\{\vartheta_{t|t}^f, \Lambda_{t|t}^f\}$ as in Eq. (36).
   b) For each $t \in \{T, \dots, 1\}$:
      i) Compute $\{\vartheta_{t|t}^b, \Lambda_{t|t}^b\}$ as in Eq. (33).
      ii) For each $k \in \{1, \dots, L\}$, compute $\{\vartheta_t^{(k)}, \Lambda_t^{(k)}\}$ as in Eq. (35) and set
         $$f_k(y_{1:T}) = |\Lambda_t^{(k)}|^{1/2} |\Lambda_t^{(k)} + \Lambda_{t|t}^b|^{-1/2} \exp\left\{-\tfrac{1}{2}\vartheta_t^{(k)T} \Lambda_t^{-(k)} \vartheta_t^{(k)} + \tfrac{1}{2}(\vartheta_t^{(k)} + \vartheta_{t|t}^b)^T (\Lambda_t^{(k)} + \Lambda_{t|t}^b)^{-1} (\vartheta_t^{(k)} + \vartheta_{t|t}^b)\right\}.$$
      iii) Sample a mode assignment $z_t \sim \sum_{k=1}^{L} \pi_{z_{t-1}}(k)\, \pi_k(z_{t+1})\, f_k(y_{1:T})\, \delta(z_t, k)$.
   c) Working sequentially forward in time, sample
      $$x_t \sim \mathcal{N}\left(x_t;\; (\Sigma^{-(z_t)} + \Lambda_{t|t}^b)^{-1}(\Sigma^{-(z_t)} A^{(z_t)} x_{t-1} + \vartheta_{t|t}^b),\; (\Sigma^{-(z_t)} + \Lambda_{t|t}^b)^{-1}\right).$$
   d) Set pseudo-observations $\psi_{1:T} = x_{1:T}$.
3) If HDP-AR-HMM, set pseudo-observations $\psi_{1:T} = y_{1:T}$.
4) Block sample $z_{1:T}$ given the transition distributions $\pi$, dynamic parameters $\theta$, and pseudo-observations $\psi_{1:T}$ as in Algorithm 2.
5) Update the global transition distribution $\beta$ (utilizing auxiliary variables $m$, $w$, and $\bar{m}$), mode-specific transition distributions $\pi_k$, and hyperparameters $\alpha$, $\gamma$, and $\kappa$ as in [14].
6) For each $k \in \{1, \dots, L\}$, sample dynamic parameters $(A^{(k)}, \Sigma^{(k)})$ given the pseudo-observations $\psi_{1:T}$ and mode sequence $z_{1:T}$, as in Algorithm 3 for the MNIW prior and Algorithm 4 for the ARD prior.
7) If HDP-SLDS, also sample the measurement noise covariance
   $$R \sim \text{IW}\left(T + r_0,\; \sum_{t=1}^{T} (y_t - Cx_t)(y_t - Cx_t)^T + R_0\right).$$
8) Fix $\pi^{(n)} = \pi$, $\beta^{(n)} = \beta$, and $\theta^{(n)} = \theta$.

Algorithm 1: HDP-SLDS and HDP-AR-HMM Gibbs sampler.

In Fig. 2(b)-(e) we display Hamming distance errors, which are calculated by choosing the optimal mapping of indices maximizing overlap between the true and estimated mode sequences. We place a Gamma($a$, $b$) prior on the sticky HDP-HMM concentration parameters $\alpha + \kappa$ and $\gamma$, and a Beta($c$, $d$) prior on the self-transition proportion parameter $\rho = \kappa/(\alpha + \kappa)$. We choose the weakly informative setting of $a = 1$, $b = 0.01$, $c = 10$, and $d = 1$. The details of setting the MNIW hyperparameters from statistics of the data are discussed in the Appendix.

For the first scenario (Fig. 2 (top)), the data were generated from a five-mode switching VAR(1) process with a 0.98 probability of self-transition and equally likely transitions to the other modes. The same mode-transition structure was used in the subsequent two scenarios as well. The three switching linear dynamical models provide comparable performance, since both the HDP-VAR(2)-HMM and the HDP-SLDS with $C = I_3$ contain the class of HDP-VAR(1)-HMMs. In the second scenario (Fig. 2 (middle)), the data were generated from a 3-mode switching AR(2) process. The HDP-AR(2)-HMM has significantly better performance than the HDP-AR(1)-HMM, while the HDP-SLDS with $C = [1\; 0]$ performs similarly but has greater posterior variability because the HDP-AR(2)-HMM model family is smaller. Note that the HDP-SLDS sampler is slower to mix since the hidden, continuous state is also sampled.

Given mode-specific transition probabilities $\pi$, dynamic parameters $\theta$, and pseudo-observations $\psi_{1:T}$:
1) Calculate the messages $m_{t,t-1}(k)$, initialized to $m_{T+1,T}(k) = 1$, and then sample the mode sequence $z_{1:T}$:
   a) For each $t \in \{T, \dots, 1\}$ and $k \in \{1, \dots, L\}$, compute
      $$m_{t,t-1}(k) = \sum_{j=1}^{L} \pi_k(j)\, \mathcal{N}\left(\psi_t;\; \sum_{i=1}^{r} A_i^{(j)} \psi_{t-i},\; \Sigma^{(j)}\right) m_{t+1,t}(j).$$
   b) Working sequentially forward in time, starting with transition counts $n_{jk} = 0$:
      i) For each $k \in \{1, \dots, L\}$, compute the probability
         $$f_k(\psi_t) = \mathcal{N}\left(\psi_t;\; \sum_{i=1}^{r} A_i^{(k)} \psi_{t-i},\; \Sigma^{(k)}\right) m_{t+1,t}(k).$$
      ii) Sample a mode assignment $z_t \sim \sum_{k=1}^{L} \pi_{z_{t-1}}(k)\, f_k(\psi_t)\, \delta(z_t, k)$ and increment $n_{z_{t-1} z_t}$.
Note that the likelihoods can be precomputed for each $k \in \{1, \dots, L\}$.

Algorithm 2: Blocked mode-sequence sampler for the HDP-AR-HMM or HDP-SLDS.
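A minimal sketch of Algorithm 2's backward message-passing and forward sampling, assuming the Gaussian likelihoods liks[t, k] (the $\mathcal{N}(\psi_t; \sum_i A_i^{(k)}\psi_{t-i}, \Sigma^{(k)})$ terms) have been precomputed; the function name and a uniform initial-mode convention are ours.

```python
import numpy as np

def block_sample_modes(liks, pi, rng):
    """liks: (T, L) precomputed likelihoods; pi: (L, L) transition matrix."""
    T, L = liks.shape
    m = np.ones((T + 1, L))                    # m[t, k] = m_{t,t-1}(k); m[T] = 1
    for t in range(T - 1, 0, -1):              # backward pass
        m[t] = pi @ (liks[t] * m[t + 1])
        m[t] /= m[t].sum()                     # normalize for numerical stability
    z = np.zeros(T, dtype=int)
    probs = liks[0] * m[1]                     # uniform prior on the initial mode
    z[0] = rng.choice(L, p=probs / probs.sum())
    for t in range(1, T):                      # forward sampling pass
        probs = pi[z[t - 1]] * liks[t] * m[t + 1]
        z[t] = rng.choice(L, p=probs / probs.sum())
    return z
```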
Given pseudo-observations $\psi_{1:T}$ and mode sequence $z_{1:T}$, for each $k \in \{1, \dots, K\}$:
1) Construct $\Psi^{(k)}$ and $\bar{\Psi}^{(k)}$ as in Eq. (15).
2) Compute sufficient statistics using the pseudo-observations $\psi_t$ associated with $z_t = k$:
   $$S_{\bar{\psi}\bar{\psi}}^{(k)} = \bar{\Psi}^{(k)}\bar{\Psi}^{(k)T} + K, \qquad S_{\psi\bar{\psi}}^{(k)} = \Psi^{(k)}\bar{\Psi}^{(k)T} + MK, \qquad S_{\psi\psi}^{(k)} = \Psi^{(k)}\Psi^{(k)T} + MKM^T.$$
3) Sample dynamic parameters:
   $$\Sigma^{(k)} \sim \text{IW}\left(n_k + n_0,\; S_{\psi|\bar{\psi}}^{(k)} + S_0\right), \qquad A^{(k)} \mid \Sigma^{(k)} \sim \mathcal{MN}\left(A^{(k)};\; S_{\psi\bar{\psi}}^{(k)} S_{\bar{\psi}\bar{\psi}}^{-(k)},\; \Sigma^{(k)},\; S_{\bar{\psi}\bar{\psi}}^{(k)}\right).$$

Algorithm 3: Parameter sampling using the MNIW prior.

The data in the third scenario (Fig. 2 (bottom)) were generated from a three-mode SLDS model with $C = I_3$. Here, we clearly see that neither the HDP-VAR(1)-HMM nor the HDP-VAR(2)-HMM is equivalent to the HDP-SLDS. Note that all of the switching models yielded significant improvements relative to the baseline sticky HDP-HMM. The first-difference input representation is more effective than using raw observations for HDP-HMM learning, but still much less effective than the richer models which switch among learned LDS. Together, these results demonstrate both the differences between our models as well as the models' ability to learn switching processes with varying numbers of modes.

B. ARD prior

We now compare the utility of the ARD prior to the MNIW prior using the HDP-SLDS model when the true underlying dynamical modes have sparse dependencies relative to the assumed model order (Footnote 7: That is, the HDP-SLDS may have dynamical regimes reliant on lower state dimensions, or the HDP-AR-HMM may have modes described by lower-order VAR processes.). We generated data from a two-mode SLDS with 0.98 probability of self-transition and

$$A^{(1)} = \begin{bmatrix} 0.8 & -0.2 & 0 \\ -0.2 & 0.8 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad A^{(2)} = \begin{bmatrix} -0.2 & 0 & 0.8 \\ 0.8 & 0 & -0.2 \\ 0 & 0 & 0 \end{bmatrix},$$

with $C = [I_2 \; 0]$, $\Sigma^{(1)} = \Sigma^{(2)} = I_3$, and $R = I_2$.

Given pseudo-observations $\psi_{1:T}$, mode sequence $z_{1:T}$, and a previous set of dynamic parameters $(A^{(k)}, \Sigma^{(k)}, \alpha^{(k)})$, for each $k \in \{1, \dots, K\}$:
1) Construct $\tilde{\Psi}_t$ as in Eq. (24).
2) Iterate multiple times between the following steps:
   a) Construct $\Sigma_0^{(k)}$ given $\alpha^{(k)}$ as in Eq. (23) and sample the dynamic matrix:
      $$\text{vec}(A^{(k)}) \mid \Sigma^{(k)}, \alpha^{(k)} \sim \mathcal{N}^{-1}\left(\sum_{t \mid z_t = k} \tilde{\Psi}_{t-1}^T \Sigma^{-(k)} \psi_t,\; \Sigma_0^{-(k)} + \sum_{t \mid z_t = k} \tilde{\Psi}_{t-1}^T \Sigma^{-(k)} \tilde{\Psi}_{t-1}\right).$$
   b) For each $\ell \in \{1, \dots, m\}$, with $m = n$ for the SLDS and $m = r$ for the switching VAR, sample the ARD precision parameters:
      $$\alpha_\ell^{(k)} \mid A^{(k)} \sim \text{Gamma}\left(a + \frac{|\mathcal{S}_\ell|}{2},\; b + \frac{1}{2}\sum_{(i,j) \in \mathcal{S}_\ell} a_{ij}^{(k)2}\right).$$
   c) Compute the sufficient statistic $S_{\psi|\bar{\psi}}^{(k)} = \sum_{t \mid z_t = k} (\psi_t - A^{(k)}\bar{\psi}_{t-1})(\psi_t - A^{(k)}\bar{\psi}_{t-1})^T$ and sample the process noise covariance $\Sigma^{(k)} \mid A^{(k)} \sim \text{IW}(n_k + n_0,\; S_{\psi|\bar{\psi}}^{(k)} + S_0)$.

Algorithm 4: Parameter sampling using the ARD prior.
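The Hamming distance errors reported in Figs. 2 and 3 require matching arbitrary inferred mode labels to the ground truth. A minimal sketch of one way to compute this, using the Hungarian algorithm from SciPy to find the label mapping maximizing overlap; the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalized_hamming(z_true, z_est, L):
    """Normalized Hamming distance under the best one-to-one label mapping."""
    overlap = np.zeros((L, L))
    for i, j in zip(z_true, z_est):
        overlap[i, j] += 1
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    matched = overlap[rows, cols].sum()
    return 1.0 - matched / len(z_true)
```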
Fig. 2. (a) Observation sequence (blue, green, red) and associated mode sequence (magenta) for a 5-mode switching VAR(1) process (top), 3-mode switching AR(2) process (middle), and 3-mode SLDS (bottom). The associated 10th, 50th, and 90th Hamming distance quantiles over 100 trials are shown for the (b) HDP-VAR(1)-HMM, (c) HDP-VAR(2)-HMM, (d) HDP-SLDS with $C = I$ (top and bottom) and $C = [1\;0]$ (middle), and (e) sticky HDP-HMM using first-difference observations.

The first dynamical process can be equivalently described by just the first and second state components, since the third component is simply white noise that does not contribute to the state dynamics and is not directly (or indirectly) observed.

Fig. 3. (a) Observation sequence (green, blue) and mode sequence (magenta) of a 2-mode SLDS, where the first mode can be realized by the first two state components and the second mode solely by the first. The associated 10th, 50th, and 90th Hamming distance quantiles over 100 trials are shown for the (b) MNIW and (c) ARD prior. (d)-(e) Histograms of inferred ARD precisions associated with the first and second dynamical modes, respectively, at the 5000th Gibbs iteration. Larger values correspond to non-dynamical components.

For the second dynamical process, the third state component is once again a white noise process, but does contribute to the dynamics of the first and second state components. However, we can equivalently represent the dynamics of this mode as

$$x_{1,t} = -0.2\, x_{1,t-1} + \tilde{e}_{1,t}, \qquad x_{2,t} = 0.8\, x_{1,t-1} + \tilde{e}_{2,t}, \qquad \tilde{A}^{(2)} = \begin{bmatrix} -0.2 & 0 & 0 \\ 0.8 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$
where $\tilde{e}_t$ is a white noise term defined by the original process noise combined with $x_{3,t}$, and $\tilde{A}^{(2)}$ is the dynamical matrix associated with this equivalent representation of the second dynamical mode. Notice that this SLDS does not satisfy Criterion 3.1, since the second column of $A^{(2)}$ is zero while the second column of $C$ is not. Nevertheless, because the realization is in our canonical form with $C = [I_2\; 0]$, we still expect to recover the $a_2^{(2)} = a_3^{(2)} = 0$ sparsity structure.

We set the parameters of the Gamma($a$, $b$) prior on the ARD precisions as $a = |\mathcal{S}_\ell|$ and $b = a/1000$, where we recall the definition of $\mathcal{S}_\ell$ from Eq. (26). This specification fixes the mean of the prior to 1000 while aiming to provide a prior that is equally informative for various choices of model order (i.e., sizes $|\mathcal{S}_\ell|$). In Fig. 3, we see that even in this low-dimensional example, the ARD provides superior mode-sequence estimates, as well as a mechanism for identifying non-dynamical state components. The histograms of the inferred $\alpha^{(k)}$ are shown in Fig. 3(d)-(e). From the clear separation between the sampled dynamic range of $\alpha_3^{(1)}$ and ($\alpha_1^{(1)}$, $\alpha_2^{(1)}$), and between that of ($\alpha_2^{(2)}$, $\alpha_3^{(2)}$) and $\alpha_1^{(2)}$, we see that we are able to correctly identify dynamical systems with $a_3^{(1)} = 0$ and $a_2^{(2)} = a_3^{(2)} = 0$.

C. Dancing Honey Bees

Honey bees perform a set of dances within the beehive in order to communicate the location of food sources. Specifically, they switch between a set of waggle, turn-right, and turn-left dances. During the waggle dance, the bee walks roughly in a straight line while rapidly shaking its body from left to right. The turning dances simply involve the bee turning in a clockwise or counterclockwise direction. We display six such sequences of honey bee dances in Fig. 4. The data consist of measurements $y_t = [\cos(\theta_t)\;\; \sin(\theta_t)\;\; x_t\;\; y_t]^T$, where $(x_t, y_t)$ denotes the 2D coordinates of the bee's body and $\theta_t$ its head angle (Footnote 8: The data are available at http://www.cc.gatech.edu/~borg/ijcv_psslds/.). Both Oh et al. [10] and Xuan and Murphy [11] used switching dynamical models to analyze these honey bee dances. We wish to analyze the performance of our Bayesian nonparametric variants of these models in segmenting the six sequences into the dance labels displayed in Fig. 4.

MNIW Prior, Unsupervised: We start by testing the HDP-VAR(1)-HMM using an MNIW prior. (Note that we did not see performance gains by considering the HDP-SLDS, so we omit results for that architecture.) We set the prior distributions on the dynamic parameters and hyperparameters as in Sec. IV-A for the synthetic data examples, with the MNIW prior based on a pre-processed observation sequence. The pre-processing involves centering the position observations around 0 and scaling each component of $y_t$ to be within the same dynamic range. We compare our results to those of Xuan and Murphy [11], who used a change-point detection technique for inference on this dataset.
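A minimal sketch of the pre-processing just described, constructing $y_t = [\cos\theta_t, \sin\theta_t, x_t, y_t]$, centering the positions, and scaling each component to a common dynamic range; the helper is a hypothetical illustration, not the authors' code.

```python
import numpy as np

def preprocess_bee_sequence(theta, pos):
    """theta: (T,) head angles; pos: (T, 2) body coordinates."""
    y = np.column_stack([np.cos(theta), np.sin(theta), pos[:, 0], pos[:, 1]])
    y[:, 2:] -= y[:, 2:].mean(axis=0)            # center the 2D positions at 0
    ranges = y.max(axis=0) - y.min(axis=0)
    ranges[ranges == 0] = 1.0                    # guard against constant channels
    return y / ranges                            # put all components on one scale
```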
Fig. 4. Top: trajectories of the dancing honey bees for sequences 1 to 6, colored by waggle (red), turn-right (blue), and turn-left (green) dances. Bottom: sine of the bee's head angle measurements, colored by ground truth labels.

Fig. 5. (a)-(c) The 10th, 50th, and 90th Hamming distance quantiles over 100 trials are shown for sequences 4, 5, and 6, respectively. (e)-(g) Estimated mode sequences representing the median error for sequences 4, 5, and 6 at the 200th Gibbs iteration, with errors indicated in red. (d) and (h) ROC curves for the unsupervised HDP-VAR-HMM, partially supervised HDP-VAR-HMM, and change-point formulation of [11] using the Viterbi sequence for segmenting datasets 1-3 and 4-6, respectively.

As shown in Fig. 5(d) and (h), our model achieves a superior segmentation compared to the change-point formulation in almost all cases, while also identifying modes which reoccur over time. Oh et al. [10] also presented an analysis of the honey bee data, using an SLDS with a fixed number of modes. Unfortunately, that analysis is not directly comparable to ours, because Oh et al. [10] used their SLDS in a supervised formulation in which the ground truth labels for all but one of the sequences are employed in the inference of the labels for the remaining held-out sequence, and in which the kernels used in the MCMC procedure depend on the ground truth labels. (The authors also considered a "parameterized segmental SLDS (PS-SLDS)," which makes use of domain knowledge specific to honey bee dancing and requires additional supervision during the learning process.) Nonetheless, in Table III we report the performance of these methods as well as the median performance (over 100 trials) of the unsupervised HDP-VAR(1)-HMM in order to provide a sense of the level of performance achievable without detailed, manual supervision. As seen in Table III, the HDP-VAR(1)-HMM yields very good performance on sequences 4 to 6 in terms of the learned segmentation and number of modes (see Fig. 5); the performance approaches that of the supervised method. For sequences 1 to 3, which are much less regular than sequences 4 to 6, the performance of the unsupervised procedure is substantially worse.
In Fig. 4, we see the extreme variation in head angle during the waggle dances of sequences 1 to 3 (Footnote 9: From Fig. 4, we also see that even in sequences 4 to 6, the ground truth labeling appears to be inaccurate at times. Specifically, certain time steps are labeled as waggle dances (red) that look more typical of a turning dance (green, blue).). As noted by Oh, the tracking results based on the vision-based tracker are noisier for these sequences, and the patterns of switching between dance modes are more irregular. This dramatically affects our performance, since we do not use domain-specific information. Indeed, our learned segmentations consistently identify turn-right and turn-left modes, but often create a new, sequence-specific waggle dance mode. Many of our errors can be attributed to creating multiple waggle dance modes within a sequence. Overall, however, we are able to achieve reasonably good segmentations without having to manually input domain-specific knowledge.

TABLE III: Median label accuracy of the HDP-VAR(1)-HMM using unsupervised and partially supervised Gibbs sampling, compared to the accuracy of the supervised PS-SLDS and SLDS procedures, where the latter algorithms were based on a supervised MCMC procedure (DD-MCMC) [10].

Sequence                               1     2     3     4     5     6
HDP-VAR(1)-HMM unsupervised            45.0  42.7  47.3  88.1  92.5  88.2
HDP-VAR(1)-HMM partially supervised    55.0  86.3  81.7  89.0  92.4  89.6
SLDS DD-MCMC                           74.0  86.1  81.3  93.4  90.2  90.4
PS-SLDS DD-MCMC                        75.9  92.4  83.1  93.4  90.4  91.0

Fig. 6. For an order 1, 2, and 7 HDP-AR-HMM with an MNIW prior and an order 7 HDP-AR-HMM with an ARD prior, we plot the shortest intervals containing 95% of the held-out log-likelihoods calculated based on a set of Gibbs samples taken at iteration 1000 from 100 chains. (a) Log-likelihood of the second half of honey bee dance sequence 4 based on model parameters inferred from the first half of the sequence. (b)-(c) Similarly for sequences 5 and 6, respectively.

MNIW Prior, Partially Supervised: The discrepancy in performance between our results and the supervised approach of Oh et al. [10] motivated us to also consider a partially supervised variant of the HDP-VAR(1)-HMM in which we fix the ground truth mode sequences for five out of six of the sequences, and jointly infer both a combined set of dynamic parameters and the left-out mode sequence. This is equivalent to informing the prior distributions with the data from the five fixed sequences, and using these updated posterior distributions as the prior distributions for the held-out sequence. As we see in Table III, this partially supervised approach considerably improves performance for sequences 1 to 3, especially sequences 2 and 3. Here, we hand-aligned sequences so that the waggle dances tended to have head angle measurements centered about $\pi/2$ radians. Aligning the waggle dances is possible by looking at the high-frequency portions of the head angle measurements. Additionally, the pre-processing of the unsupervised approach is not appropriate here, as the scalings and shiftings are dance-specific, and such transformations modify the associated switching VAR(1) model.
Instead, to account for the varying frames of reference (i.e., the point of origin for each bee body), we allowed for a mean µ^{(k)} on the process noise, and placed an independent N(0, Σ_0) prior on this parameter. See the Appendix for details on how the hyperparameters of these prior distributions are set.

Fig. 6. For order 1, 2, and 7 HDP-AR-HMMs with a MNIW prior, and an order 7 HDP-AR-HMM with an ARD prior, we plot the shortest intervals containing 95% of the held-out log-likelihoods, calculated from Gibbs samples taken at iteration 1000 from 100 chains. (a) Log-likelihood of the second half of honey bee dance sequence 4 based on model parameters inferred from the first half of the sequence. (b)-(c) Similarly for sequences 5 and 6, respectively.

ARD Prior: Using the cleaner sequences 4 to 6, we investigate the effects of the sparsity-inducing ARD prior by assuming a higher-order switching VAR model and computing the likelihood of the second half of each dance sequence based on parameters inferred from Gibbs sampling using the data from the first half of each sequence. In Fig. 6, we specifically compare the performance of an HDP-VAR(r)-HMM with a conjugate MNIW prior, for r = 1, 2, 7, to that of an HDP-VAR(7)-HMM with an ARD prior. We use the same approach to setting the hyperparameters as in Sec. IV-B. We see that assuming a higher-order model improves the predictive likelihood performance, but only when combined with a regularizing prior (e.g., the ARD) that avoids over-fitting in the presence of limited data. Although not depicted here (see instead [36]), the ARD prior also informs us of the variable-order nature of this switching dynamical process. When considering an HDP-VAR(2)-HMM with an ARD prior, the posterior distribution of the ARD hyperparameters for the first- and second-order lag components associated with each of the three dominant inferred dances clearly indicates that two of the turning dances rely only on the first lag component, while the other dance relies on both lag components. To verify these results, we provided the data and ground truth labels to MATLAB's lpc implementation of Levinson's algorithm, which indicated that the turning dances are well approximated by an order 1 process, while the waggle dance relies on an order 2 model. Thus, our learned orders for the three dances match what is indicated by Levinson's algorithm on ground-truth segmented data.
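To give a flavor of this order check (with simulated data standing in for the bee measurements; this sketch is illustrative and not the MATLAB pipeline used above), the Levinson-Durbin recursion underlying lpc can be implemented directly, and the effective AR order read off from where the prediction-error variance stops decreasing:

import numpy as np

def levinson_durbin(y, max_order):
    """Levinson-Durbin recursion: AR coefficients and prediction-error
    variance for orders 1..max_order, from the sample autocorrelation."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    r = np.array([np.dot(y[:len(y) - k], y[k:]) for k in range(max_order + 1)])
    r /= len(y)
    a = np.zeros(max_order + 1)   # a[1:p+1] hold the current AR coefficients
    err = r[0]
    errs = []
    for p in range(1, max_order + 1):
        k = (r[p] - np.dot(a[1:p], r[p - 1:0:-1])) / err  # reflection coefficient
        a_new = a.copy()
        a_new[p] = k
        a_new[1:p] = a[1:p] - k * a[p - 1:0:-1]
        a = a_new
        err *= (1.0 - k ** 2)
        errs.append(err)
    return a[1:], np.array(errs)

# Simulate an AR(2) series; the error variance should stop decreasing at order 2.
rng = np.random.default_rng(0)
y = np.zeros(2000)
for t in range(2, 2000):
    y[t] = 1.5 * y[t - 1] - 0.7 * y[t - 2] + rng.normal()
_, errs = levinson_durbin(y, 5)
print(np.round(errs, 3))   # large drop from order 1 to 2, then roughly flat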
V. MODEL VARIANTS

There are many variants of the general SLDS and switching VAR models that are pervasive in the literature. One important example is when the dynamic matrix is shared between modes; here, the dynamics are instead distinguished by a switching mean, as in the Markov switching stochastic volatility (MSSV) model. In the maneuvering target tracking community, it is often further assumed that the dynamic matrix is shared and known (due to the understood physics of the target). We explore both of these variants in the following sections.

A. Shared Dynamic Matrix, Switching Driving Noise

In many applications, the dynamics of the switching process can be described by a shared linear dynamical system matrix A; the dynamics within a given mode are then determined by some external force acting upon this LDS, and it is how this force is exerted that is mode-specific. The general form for such an SLDS is given by

    z_t | z_{t-1} ~ π_{z_{t-1}}
    x_t = A x_{t-1} + e_t(z_t)
    y_t = C x_t + w_t,                                    (37)

with process and measurement noise e_t(k) ~ N(µ^{(k)}, Σ^{(k)}) and w_t ~ N(0, R), respectively. In this scenario, the data are generated from one dynamic matrix A and multiple process noise covariance matrices Σ^{(k)}. Thus, one cannot place a MNIW prior jointly on these parameters (conditioned on µ^{(k)}), due to the coupling of the parameters in this prior. We instead consider independent priors on A, Σ^{(k)}, and µ^{(k)}. We will refer to the choice of a normal prior on A, an inverse-Wishart prior on Σ^{(k)}, and a normal prior on µ^{(k)} as the N-IW-N prior. See [36] for details on deriving the resulting posterior distributions given these independent priors.

Stochastic Volatility: An example of an SLDS in a form similar to that of Eq. (37) is the Markov switching stochastic volatility (MSSV) model [5], [6], [38]. The MSSV assumes that the log-volatilities follow an AR(1) process with a Markov switching mean. This underlying process is observed via conditionally independent and normally distributed daily returns. Specifically, let y_t represent, for example, the daily returns of a stock index. The state x_t is then given the interpretation of log-volatilities, and the resulting state space is given by [7]

    z_t | z_{t-1} ~ π_{z_{t-1}}
    x_t = a x_{t-1} + e_t(z_t)
    y_t = u_t(x_t),                                       (38)

with e_t(k) ~ N(µ^{(k)}, σ²) and u_t(x_t) ~ N(0, exp(x_t)). Here, only the mean of the process noise is mode-specific. Note, however, that the measurement equation is non-linear in the state x_t. Carvalho and Lopes [7] employ a particle filtering approach to cope with these non-linearities. In [6], the MSSV is instead modeled in the log-squared-daily-returns domain, such that

    log(y_t²) = x_t + w_t,                                (39)

where w_t is additive, non-Gaussian noise. This noise is sometimes approximated by a moment-matched Gaussian [39], while So et al. [6] use a mixture-of-Gaussians approximation. The MSSV is then typically bestowed a fixed set of two or three regimes of volatility.

We examine the IBOVESPA stock index (Sao Paulo Stock Exchange) over the period of 01/03/1997 to 01/16/2001, during which ten key world events are cited in [7] as affecting the emerging Brazilian market. The key world events are summarized in Table IV and shown in the plots of Fig. 7.

TABLE IV. Ten key world events affecting the IBOVESPA stock index (Sao Paulo Stock Exchange) over the period of 01/03/1997 to 01/16/2001, as cited by Carvalho and Lopes [7].

    07/02/1997  Thailand devalues the Baht by as much as 20%
    08/11/1997  IMF and Thailand set a rescue agreement
    10/23/1997  Hong Kong's stock index falls 10.4%; the South Korean won starts to weaken
    12/02/1997  IMF and South Korea set a bailout agreement
    06/01/1998  Russia's stock market crashes
    06/20/1998  IMF gives final approval to a loan package to Russia
    08/19/1998  Russia officially falls into default
    10/09/1998  IMF and World Bank joint meeting to discuss the global economic crisis; the Fed cuts interest rates
    01/15/1999  The Brazilian government allows its currency, the Real, to float freely by lifting exchange controls
    02/02/1999  Arminio Fraga is named President of Brazil's Central Bank

Use of this dataset was motivated by the work of Carvalho and Lopes [7], in which a two-mode MSSV model is assumed.
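For concreteness, the following sketch simulates a two-mode MSSV process as in Eq. (38) and forms the log-squared returns of Eq. (39); the parameter values are illustrative, not estimates from the IBOVESPA data. The noise log(ε_t²) in Eq. (39) has mean ψ(1/2) + log 2 ≈ -1.27 and variance π²/2, the values matched by the Gaussian approximation of [39]:

import numpy as np

rng = np.random.default_rng(1)
T = 1000
# Two-mode MSSV, Eq. (38): sticky mode transitions, switching AR(1) mean
pi = np.array([[0.99, 0.01],
               [0.01, 0.99]])          # mode transition probabilities
mu = np.array([-0.2, 0.4])             # mode-specific process noise means
a, sigma = 0.95, 0.15                  # AR(1) coefficient, noise std

z = np.zeros(T, dtype=int)
x = np.zeros(T)                        # log-volatilities
y = np.zeros(T)                        # daily returns
for t in range(1, T):
    z[t] = rng.choice(2, p=pi[z[t - 1]])
    x[t] = a * x[t - 1] + mu[z[t]] + sigma * rng.normal()
    y[t] = np.exp(x[t] / 2) * rng.normal()   # y_t ~ N(0, exp(x_t))

# Eq. (39): log-squared returns give a linear observation of x_t ...
log_y2 = np.log(y[1:] ** 2 + 1e-12)
# ... with additive noise log(eps_t^2); moment-matched Gaussian approximation:
w_mean, w_var = -1.2704, np.pi ** 2 / 2
print(np.mean(log_y2 - x[1:]), w_mean)   # empirical noise mean vs. theory
print(np.var(log_y2 - x[1:]), w_var)     # empirical noise variance vs. theory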
We consider a variant of the HDP-SLDS to match the MSSV model of Eq. (38). Specifically, we examine log-squared daily returns, as in Eq. (39), and use a DP mixture of Gaussians to model the measurement noise:

    e_t(k) ~ N(µ^{(k)}, Σ^{(k)})
    w_t ~ Σ_{ℓ=1}^{∞} ω_ℓ N(0, R_ℓ),    ω ~ GEM(σ_r),    R_ℓ ~ IW(n_r, S_r).        (40)

We truncate the measurement noise DP mixture to 10 components. For the HDP concentration hyperparameters α, γ, and κ, we use the same prior distributions as in Secs. IV-A to IV-C. For the dynamic parameters, we rely on the N-IW-N prior described in Sec. V-A, and once again set the hyperparameters of this prior from statistics of the data, as described in the Appendix. Since we allow for a mean on the process noise and examine log-squared daily returns, we do not preprocess the data.

The posterior probability of an HDP-SLDS-inferred change point is shown in Fig. 7(a), and in Fig. 7(b) we display the corresponding plot for a non-sticky variant (i.e., with κ = 0, so that there is no bias towards mode self-transitions). The HDP-SLDS is able to infer change points very similar to those presented in [7]. Without the sticky extension, the non-sticky model variant over-segments the data and rapidly switches between redundant states, leading to many inferred change points that do not align with any world event. In Fig. 7(c), the overall change-point detection performance of the HDP-SLDS is compared to that of the HDP-AR(1)-HMM, HDP-AR(2)-HMM, and non-sticky HDP-SLDS. The ROC curves shown are calculated by windowing the time axis and taking the maximum probability of a change point in each window; these probabilities are then used as the confidence of a change point in that window. From this plot, we clearly see the advantage of using an SLDS model combined with the sticky HDP-HMM prior on the mode sequence.
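A sketch of this windowing procedure (illustrative; the window length and the threshold grid here are arbitrary choices, not the values used for Fig. 7(c)):

import numpy as np

def windowed_roc(p_change, event_days, win=30, thresholds=np.linspace(0, 1, 101)):
    """ROC curve from per-day change-point probabilities: window the time
    axis, score each window by its maximum probability, and call a window
    positive if it contains a true event day."""
    T = len(p_change)
    starts = np.arange(0, T, win)
    scores = np.array([p_change[s:s + win].max() for s in starts])
    labels = np.array([np.any((event_days >= s) & (event_days < s + win))
                       for s in starts])
    det, fa = [], []
    for th in thresholds:
        flag = scores >= th
        det.append(np.mean(flag[labels]))        # detection rate
        fa.append(np.mean(flag[~labels]))        # false alarm rate
    return np.array(fa), np.array(det)

# Toy usage: probabilities peaked near two "events"
p = np.zeros(1000)
p[[200, 650]] = 0.9
p += 0.05 * np.random.default_rng(2).random(1000)
fa, det = windowed_roc(p, event_days=np.array([205, 655]))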
We also analyzed the performance of an HDP-SLDS as defined in Table I. We used raw daily-return observations, and first pre-processed the data in the same manner as the honey bee data, by centering the observations around 0 and scaling the data to lie roughly within a [-10, 10] dynamic range. We then took a MNIW prior on the dynamic parameters, as outlined in the Appendix. Overall, although the state of this HDP-SLDS does not have the interpretation of log-volatilities, we are still able to capture regime changes in the dynamics of this stock index, and we find change points that align better with the true world events than those of the MSSV HDP-SLDS model.

Fig. 7. (a) Plot of the estimated probability of a change point on each day, using 3,000 Gibbs samples, for an MSSV variant of the HDP-SLDS using a shared dynamic matrix, a mean on the mode-specific process noise, and a mixture-of-Gaussians measurement noise model. The observations are log-squared daily return measurements, and the 10 key events are indicated with red lines. (b) Similar plot for the non-sticky HDP-SLDS with no bias towards self-transitions. (c) ROC curves for the HDP-SLDS, non-sticky HDP-SLDS, HDP-AR(1)-HMM, and HDP-AR(2)-HMM. (d)-(f) Analogous plots for the HDP-SLDS of Table I using raw daily return measurements.

B. Fixed Dynamic Matrix, Switching Driving Noise

There are some cases in which the dynamical model is well defined through knowledge of the physics of the system being observed, such as simple kinematic motion. More complicated motions can typically be modeled using the same fixed dynamical model, but with a more complex description of the driving force. A generic LDS driven by an unknown control input u_t can be represented as

    x_t = A x_{t-1} + B u_t + v_t
    y_t = C x_t + D u_t + w_t,                            (41)

where v_t ~ N(0, Q) and w_t ~ N(0, R). It is often appropriate to assume D = 0, as we do herein.

Maneuvering Target Tracking: Target tracking provides an application domain in which one often assumes that the dynamical model is known. One method of describing a maneuvering target is to consider the control input as a random process [40]. For example, a jump-mean Markov process [41] yields dynamics described as

    z_t | z_{t-1} ~ π_{z_{t-1}}
    x_t = A x_{t-1} + B u_t(z_t) + v_t
    y_t = C x_t + w_t,                                    (42)

with u_t(k) ~ N(µ^{(k)}, Σ^{(k)}), v_t ~ N(0, Q), and w_t ~ N(0, R). Classical approaches rely on defining a fixed set of dynamical modes and associated transition distributions. The state dynamics of Eq. (42) can be equivalently described as

    x_t = A x_{t-1} + e_t(z_t),                           (43)
    e_t(k) ~ N(B µ^{(k)}, B Σ^{(k)} B^T + Q).             (44)

This model can be captured by our HDP-SLDS formulation of Eq. (37) with a fixed dynamic matrix (e.g., constant velocity or constant acceleration models [40]) and mode-specific, non-zero-mean process noise. Such a formulation was explored in [9], along with experiments that compare the performance to that of standard multiple-model techniques, demonstrating the flexibility of the Bayesian nonparametric approach. Fox et al. [9] also present an alternative sampling scheme that harnesses the fact that the control input may be of much lower dimension than the state: it sequentially block-samples (z_t, u_t) while analytically marginalizing over the state sequence x_{1:T}. Note that this variant of the HDP-SLDS can be viewed as an extension of the work by Caron et al. [42], in which the exogenous input is modeled as an independent noise process (i.e., with no Markov structure on z_t) generated from a DP mixture model.
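The equivalence of Eqs. (42) and (43)-(44) is easily verified numerically; the following sketch (with arbitrary illustrative values for B, µ^{(k)}, Σ^{(k)}, and Q) checks that samples of B u_t(k) + v_t have the moments stated in Eq. (44):

import numpy as np

rng = np.random.default_rng(3)
# Arbitrary dimensions and parameters for one mode k (illustrative values only)
n, m = 4, 2                                  # state and control dimensions
B = rng.normal(size=(n, m))
mu_k = np.array([1.0, -0.5])                 # control input mean, mode k
Sig_k = np.array([[0.5, 0.1], [0.1, 0.3]])   # control input covariance
Q = 0.2 * np.eye(n)                          # process noise covariance

# Sample e_t(k) = B u_t(k) + v_t as in Eq. (42)
N = 200_000
u = rng.multivariate_normal(mu_k, Sig_k, size=N)
v = rng.multivariate_normal(np.zeros(n), Q, size=N)
e = u @ B.T + v

# Compare empirical moments with Eq. (44): N(B mu, B Sig B^T + Q)
print(np.allclose(e.mean(axis=0), B @ mu_k, atol=0.02))
print(np.allclose(np.cov(e.T), B @ Sig_k @ B.T + Q, atol=0.02))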
VI. CONCLUSION

In this paper, we have addressed the problem of learning switching linear dynamical models with an unknown number of modes for describing complex dynamical phenomena. We presented a Bayesian nonparametric approach and demonstrated both the utility and versatility of the developed HDP-SLDS and HDP-AR-HMM on real applications. Using the same parameter settings, although different model choices, in one case we are able to learn changes in the volatility of the IBOVESPA stock exchange, while in another case we learn segmentations of data into waggle, turn-right, and turn-left honey bee dances. We also described a method of applying automatic relevance determination (ARD) as a sparsity-inducing prior, leading to flexible and scalable dynamical models that allow for identification of variable-order structure. We concluded by considering adaptations of the HDP-SLDS to specific forms often examined in the literature, such as the Markov switching stochastic volatility model and a standard multiple-model target tracking formulation.

The batch processing of the Gibbs samplers derived herein may be impractical, and offline-training online-tracking infeasible, for certain applications. Due both to the nonlinear dynamics and to uncertainty in model parameters, exact recursive estimation is infeasible. One could leverage the conditionally linear dynamics and use Rao-Blackwellized particle filtering (RBPF) [43]; however, one challenge is that such particle filters can suffer from a progressively impoverished particle representation. Overall, the formulation we developed herein represents a flexible, Bayesian nonparametric model for describing complex dynamical phenomena and discovering simple underlying temporal structures.

APPENDIX

a) MNIW General Method: For the experiments of Sec. IV-A, we set M = 0 and K = I_m. This choice centers the mass of the prior around stable dynamic matrices while allowing for considerable variability. The inverse-Wishart portion is given n_0 = m + 2 degrees of freedom. For the HDP-AR-HMM, the scale matrix is S_0 = 0.75 Σ̄, where Σ̄ = (1/T) Σ_t (y_t - ȳ)(y_t - ȳ)^T. Setting the prior directly from the data can help move the mass of the distribution to reasonable values of the parameter space. For an HDP-SLDS with x_t ∈ R^n and y_t ∈ R^d and n = d, we set S_0 = 0.675 Σ̄. We then set the inverse-Wishart prior on the measurement noise R to have r_0 = d + 2 and R_0 = 0.075 Σ̄. For n > d, see [36].
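As a sketch of item (a) in code (illustrative; the constants are those quoted above, and reading m as the number of columns of A^{(k)}, i.e., m = d·r for a VAR(r) model, is our interpretation):

import numpy as np

def mniw_hyperparams(Y, r):
    """Hyperparameters for the HDP-AR-HMM MNIW prior per item (a):
    M = 0, K = I_m, n0 = m + 2, S0 = 0.75 * empirical covariance."""
    Y = np.asarray(Y)                  # T x d observations
    T, d = Y.shape
    m = d * r                          # columns of A^(k) (our reading of m)
    Yc = Y - Y.mean(axis=0)
    Sigma_emp = Yc.T @ Yc / T          # (1/T) sum (y_t - ybar)(y_t - ybar)^T
    return dict(M=np.zeros((d, m)), K=np.eye(m), n0=m + 2,
                S0=0.75 * Sigma_emp)

hp = mniw_hyperparams(np.random.default_rng(0).standard_normal((500, 2)), r=1)
print(hp["n0"], hp["S0"].shape)        # 4, (2, 2)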
b) Partially Supervised Honey Bee Experiments: For the partially supervised experiments of Sec. IV-C, we set Σ_0 = 0.75 S_0. Since we are not shifting and scaling the observations, we set S_0 to 0.75 times the empirical covariance of the first-difference observations. We also use n_0 = 10, making the distribution tighter than in the unsupervised case. Examining first differences is appropriate since the bee's dynamics are better approximated as a random walk than as i.i.d. observations. Using raw observations in the unsupervised approach creates a larger expected covariance matrix, making the prior on the dynamic matrix less informative, which is useful in the absence of other labeled data.

c) IBOVESPA Stock Index Experiments: For the HDP-SLDS variant of the MSSV model of Eq. (38), we rely on the N-IW-N prior described in Sec. V-A. For the dynamic parameter a and process noise mean µ^{(k)}, we use N(0, 0.75 Σ̄) priors. The IW prior on Σ^{(k)} was given 3 degrees of freedom and an expected value of 0.75 Σ̄. Finally, each component of the mixture-of-Gaussians measurement noise was given an IW prior with 3 degrees of freedom and an expected value of π²/2, which matches the moment-matching technique of Harvey et al. [39]. For the HDP-AR(r)-HMMs to which we compare in Fig. 7, we place a zero-mean normal prior on the dynamic parameter a, with covariance set to the expected noise covariance, which in this case is equal to 0.75 times the empirical covariance plus π²/2. The mean parameter µ^{(k)} is defined as in the HDP-SLDS. For the HDP-SLDS comparison using the model of Table I, we use a MNIW prior with M = 0, K = 1, n_0 = 3, and S_0 = 0.75 Σ̄. The IW prior on R was given r_0 = 100 and an expected covariance of 25. Our sampler initializes parameters from the prior, and we found it useful to center the prior around large values of R in order to avoid initial samples chattering between dynamical regimes, caused by the state sequence having to account for the noise in the observations. After accounting for the residuals of the data in the posterior distribution, we typically learned R ≈ 10.

REFERENCES

[1] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "Nonparametric Bayesian learning of switching dynamical systems," in Advances in Neural Information Processing Systems, vol. 21, 2009, pp. 457-464.
[2] ——, "Nonparametric Bayesian identification of jump systems with sparse dependencies," in Proc. 15th IFAC Symposium on System Identification, July 2009.
[3] V. Pavlovic, J. Rehg, and J. MacCormick, "Learning switching linear models of human motion," in Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 981-987.
[4] L. Ren, A. Patrick, A. Efros, J. Hodgins, and J. Rehg, "A data-driven approach to quantifying natural human motion," in SIGGRAPH, August 2005.
[5] C.-J. Kim, "Dynamic linear models with Markov-switching," Journal of Econometrics, vol. 60, pp. 1-22, 1994.
[6] M. So, K. Lam, and W. Li, "A stochastic volatility model with Markov switching," Journal of Business & Economic Statistics, vol. 16, no. 2, pp. 244-253, 1998.
[7] C. Carvalho and H. Lopes, "Simulation-based sequential analysis of Markov switching stochastic volatility models," Computational Statistics & Data Analysis, vol. 51, pp. 4526-4542, 2007.
[8] X. Rong Li and V. Jilkov, "Survey of maneuvering target tracking. Part V: Multiple-model methods," IEEE Transactions on Aerospace and Electronic Systems, vol. 41, no. 4, pp. 1255-1321, 2005.
[9] E. Fox, E. Sudderth, and A. Willsky, "Hierarchical Dirichlet processes for tracking maneuvering targets," in Proc. International Conference on Information Fusion, July 2007.
[10] S. Oh, J. Rehg, T. Balch, and F. Dellaert, "Learning and inferring motion patterns using parametric segmental switching linear dynamic systems," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 103-124, 2008.
[11] X. Xuan and K. Murphy, "Modeling changing dependency structure in multivariate time series," in Proc. International Conference on Machine Learning, June 2007.
[12] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.
[13] M. Beal, Z. Ghahramani, and C. Rasmussen, "The infinite hidden Markov model," in Advances in Neural Information Processing Systems, vol. 14, 2002, pp. 577-584.
[14] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "An HDP-HMM for systems with state persistence," in Proc. International Conference on Machine Learning, July 2008.
[15] D. MacKay, Bayesian Methods for Backprop Networks, ser. Models of Neural Networks, III. Springer, 1994, ch. 6, pp. 211-254.
[16] R. Neal, Ed., Bayesian Learning for Neural Networks, ser. Lecture Notes in Statistics. Springer, 1996, vol. 118.
[17] M. Beal, "Variational algorithms for approximate Bayesian inference," Ph.D. thesis, University College London, London, UK, 2003.
[18] S. Paoletti, A. Juloski, G. Ferrari-Trecate, and R. Vidal, "Identification of hybrid systems: A tutorial," European Journal of Control, vol. 2-3, pp. 242-260, 2007.
[19] R. Vidal, S. Soatto, Y. Ma, and S. Sastry, "An algebraic geometric approach to the identification of a class of linear hybrid systems," in Proc. IEEE Conference on Decision and Control, December 2003.
[20] Z. Psaradakis and N. Spagnolo, "Joint determination of the state dimension and autoregressive order for models with Markov regime switching," Journal of Time Series Analysis, vol. 27, pp. 753-766, 2006.
[21] K. Huang, A. Wagner, and Y. Ma, "Identification of hybrid linear time-invariant systems via subspace embedding and segmentation (SES)," in Proc. IEEE Conference on Decision and Control, December 2004.
[22] R. Vidal, Y. Ma, and S. Sastry, "Generalized principal component analysis (GPCA): Subspace clustering by polynomial factorization, differentiation, and division," UC Berkeley, Technical Report UCB/ERL, August 2003.
[23] G. Kotsalis, A. Megretski, and M. Dahleh, "Model reduction of discrete-time Markov jump linear systems," in Proc. American Control Conference, June 2006.
[24] B. Anderson, "The realization problem for hidden Markov models," Mathematics of Control, Signals, and Systems, vol. 12, pp. 80-120, 1999.
[25] M. Petreczky and R. Vidal, "Realization theory of stochastic jump-Markov linear systems," in Proc. IEEE Conference on Decision and Control, December 2007.
[26] Z. Ghahramani and G. Hinton, "Variational learning for switching state-space models," Neural Computation, vol. 12, no. 4, pp. 831-864, 2000.
[27] R. Vidal, A. Chiuso, and S. Soatto, "Observability and identifiability of jump linear systems," in Proc. IEEE Conference on Decision and Control, December 2002.
[28] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716-723, 1974.
[29] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, pp. 461-464, 1978.
[30] M. Aoki and A. Havenner, "State space modeling of multiple time series," Econometric Reviews, vol. 10, no. 1, pp. 1-59, 1991.
[31] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[32] J. Sethuraman, "A constructive definition of Dirichlet priors," Statistica Sinica, vol. 4, pp. 639-650, 1994.
[33] D. Blackwell and J. MacQueen, "Ferguson distributions via Polya urn schemes," The Annals of Statistics, vol. 1, no. 2, pp. 353-355, 1973.
[34] M. West and J. Harrison, Bayesian Forecasting and Dynamic Models. Springer, 1997.
[35] O. Costa, M. Fragoso, and R. Marques, Discrete-Time Markov Jump Linear Systems. Springer, 2005.
[36] E. Fox, "Bayesian nonparametric learning of complex dynamical phenomena," Ph.D. dissertation, MIT, July 2009.
[37] C. Carter and R. Kohn, "Markov chain Monte Carlo in conditionally Gaussian state space models," Biometrika, vol. 83, pp. 589-601, 1996.
[38] J. Hamilton, "A new approach to the economic analysis of nonstationary time series and the business cycle," Econometrica, vol. 57, no. 2, pp. 357-384, 1989.
[39] A. Harvey, E. Ruiz, and N. Shephard, "Multivariate stochastic variance models," Review of Economic Studies, vol. 61, pp. 247-264, 1994.
[40] X. Rong Li and V. Jilkov, "Survey of maneuvering target tracking. Part I: Dynamic models," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1333-1364, 2003.
[41] R. Moose, H. VanLandingham, and D. McCabe, "Modeling and estimation of tracking maneuvering targets," IEEE Transactions on Aerospace and Electronic Systems, vol. 15, no. 3, pp. 448-456, 1979.
[42] F. Caron, M. Davy, A. Doucet, E. Duflos, and P. Vanheeghe, "Bayesian inference for dynamic models with Dirichlet process mixtures," in Proc. International Conference on Information Fusion, July 2006.
[43] A. Doucet, N. de Freitas, K. Murphy, and S. Russell, "Rao-Blackwellised particle filtering for dynamic Bayesian networks," in Proc. Conference on Uncertainty in Artificial Intelligence, 2000.

APPENDIX A: DYNAMIC PARAMETER POSTERIORS

In this appendix, we derive the posterior distribution over the dynamic parameters of a switching VAR(r) process defined as follows:

    y_t = Σ_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t),    e_t(k) ~ N(µ^{(k)}, Σ^{(k)}),        (45)

where z_t indexes the mode-specific VAR(r) process at time t. Assume that the mode sequence {z_1, ..., z_T} is known and that we wish to compute the posterior distribution of the kth mode's VAR(r) parameters A_i^{(k)}, i = 1, ..., r, and Σ^{(k)}. Let {t_1, ..., t_{n_k}} = {t | z_t = k}. Then we may write

    [y_{t_1}  y_{t_2}  ...  y_{t_{n_k}}] = [A_1^{(k)}  A_2^{(k)}  ...  A_r^{(k)}] [ȳ_{t_1}  ȳ_{t_2}  ...  ȳ_{t_{n_k}}] + [e_{t_1}  e_{t_2}  ...  e_{t_{n_k}}],        (46)

where the column ȳ_t = [y_{t-1}^T  y_{t-2}^T  ...  y_{t-r}^T]^T stacks the r lagged observations. We define the following notation for Eq. (46):

    Y^{(k)} = A^{(k)} Ȳ^{(k)} + E^{(k)},        (47)

and let D^{(k)} = {Y^{(k)}, Ȳ^{(k)}}. In the following sections, we consider two possible priors on the dynamic parameters. In Appendix A-A, we assume that µ^{(k)} is 0 for all k and consider the conjugate matrix-normal inverse-Wishart (MNIW) prior for {A^{(k)}, Σ^{(k)}}. In Appendix A-B, we consider the more general form of Eq. (45) and take independent priors on A^{(k)}, Σ^{(k)}, and µ^{(k)}.
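The stacked matrices of Eqs. (46)-(47) are straightforward to assemble in code; a sketch (illustrative only):

import numpy as np

def design_matrices(Y, z, k, r):
    """Assemble Y^(k) (d x n_k) and Ybar^(k) (d*r x n_k) of Eqs. (46)-(47)."""
    Y = np.asarray(Y)                                   # T x d time series
    ts = [t for t in range(r, len(Y)) if z[t] == k]     # {t_1, ..., t_{n_k}}
    Yk = np.column_stack([Y[t] for t in ts])
    # column for time t stacks y_{t-1}, ..., y_{t-r}
    Ybark = np.column_stack(
        [np.concatenate([Y[t - i] for i in range(1, r + 1)]) for t in ts])
    return Yk, Ybark

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 2))
z = rng.integers(0, 3, size=100)
Yk, Ybark = design_matrices(Y, z, k=0, r=2)
print(Yk.shape, Ybark.shape)                            # (2, n_k), (4, n_k)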
A. Conjugate Prior (MNIW)

To show conjugacy, we place a MNIW prior on the dynamic parameters {A^{(k)}, Σ^{(k)}} and show that the posterior remains MNIW given a set of data from the model of Eq. (45) (assuming µ^{(k)} = 0). The MNIW prior is given by placing a matrix-normal prior MN(A^{(k)}; M, Σ^{(k)}, K) on A^{(k)} given Σ^{(k)}:

    p(A^{(k)} | Σ^{(k)}) = ( |K|^{d/2} / |2πΣ^{(k)}|^{m/2} ) exp{ -1/2 tr( (A^{(k)} - M)^T Σ^{-(k)} (A^{(k)} - M) K ) },        (48)

and an inverse-Wishart prior IW(n_0, S_0) on Σ^{(k)}:

    p(Σ^{(k)}) = ( |S_0|^{n_0/2} |Σ^{(k)}|^{-(d+n_0+1)/2} / (2^{n_0 d/2} Γ_d(n_0/2)) ) exp{ -1/2 tr( Σ^{-(k)} S_0 ) },        (49)

where Γ_d(·) is the multivariate gamma function and B^{-(k)} denotes (B^{(k)})^{-1} for a matrix B.

We first analyze the likelihood of the data D^{(k)} given the kth mode's dynamic parameters {A^{(k)}, Σ^{(k)}}. Starting from the fact that each observation vector y_t is conditionally Gaussian given the lag observations ȳ_t = [y_{t-1}^T ... y_{t-r}^T]^T, we have

    p(D^{(k)} | A^{(k)}, Σ^{(k)}) = ( 1 / |2πΣ^{(k)}|^{n_k/2} ) exp{ -1/2 Σ_i (y_{t_i} - A^{(k)} ȳ_{t_i})^T Σ^{-(k)} (y_{t_i} - A^{(k)} ȳ_{t_i}) }
                                  = ( 1 / |2πΣ^{(k)}|^{n_k/2} ) exp{ -1/2 tr( (Y^{(k)} - A^{(k)} Ȳ^{(k)})^T Σ^{-(k)} (Y^{(k)} - A^{(k)} Ȳ^{(k)}) I ) }
                                  = MN(Y^{(k)}; A^{(k)} Ȳ^{(k)}, Σ^{(k)}, I).        (50)

To derive the posterior of the dynamic parameters, it is useful to first compute

    p(D^{(k)}, A^{(k)} | Σ^{(k)}) = p(D^{(k)} | A^{(k)}, Σ^{(k)}) p(A^{(k)} | Σ^{(k)}).        (51)

Using the fact that both the likelihood p(D^{(k)} | A^{(k)}, Σ^{(k)}) and the prior p(A^{(k)} | Σ^{(k)}) are matrix-normally distributed sharing a common parameter Σ^{(k)}, we have

    log p(D^{(k)}, A^{(k)} | Σ^{(k)}) + C
        = -1/2 tr( (Y^{(k)} - A^{(k)} Ȳ^{(k)})^T Σ^{-(k)} (Y^{(k)} - A^{(k)} Ȳ^{(k)}) + (A^{(k)} - M)^T Σ^{-(k)} (A^{(k)} - M) K )
        = -1/2 tr( Σ^{-(k)} { A^{(k)} S_{ȳȳ}^{(k)} A^{(k)T} - 2 S_{yȳ}^{(k)} A^{(k)T} + S_{yy}^{(k)} } )
        = -1/2 tr( Σ^{-(k)} { (A^{(k)} - S_{yȳ}^{(k)} S_{ȳȳ}^{-(k)}) S_{ȳȳ}^{(k)} (A^{(k)} - S_{yȳ}^{(k)} S_{ȳȳ}^{-(k)})^T + S_{y|ȳ}^{(k)} } ),        (52)

where we have used the definitions

    C = -log( |K|^{d/2} / ( |2πΣ^{(k)}|^{n_k/2} |2πΣ^{(k)}|^{m/2} ) ),
    S_{ȳȳ}^{(k)} = Ȳ^{(k)} Ȳ^{(k)T} + K,    S_{yȳ}^{(k)} = Y^{(k)} Ȳ^{(k)T} + M K,    S_{yy}^{(k)} = Y^{(k)} Y^{(k)T} + M K M^T,
    S_{y|ȳ}^{(k)} = S_{yy}^{(k)} - S_{yȳ}^{(k)} S_{ȳȳ}^{-(k)} S_{yȳ}^{(k)T}.

Conditioning on the noise covariance Σ^{(k)}, we see that the dynamic matrix posterior is given by

    p(A^{(k)} | D^{(k)}, Σ^{(k)}) ∝ exp{ -1/2 tr( (A^{(k)} - S_{yȳ}^{(k)} S_{ȳȳ}^{-(k)})^T Σ^{-(k)} (A^{(k)} - S_{yȳ}^{(k)} S_{ȳȳ}^{-(k)}) S_{ȳȳ}^{(k)} ) }
                                  = MN(A^{(k)}; S_{yȳ}^{(k)} S_{ȳȳ}^{-(k)}, Σ^{(k)}, S_{ȳȳ}^{(k)}).        (53)

Marginalizing Eq. (52) over the dynamic matrix A^{(k)}, we derive

    p(D^{(k)} | Σ^{(k)}) = ∫ p(D^{(k)}, A^{(k)} | Σ^{(k)}) dA^{(k)}
                         = ( |K|^{d/2} / ( |2πΣ^{(k)}|^{n_k/2} |S_{ȳȳ}^{(k)}|^{d/2} ) ) exp{ -1/2 tr( Σ^{-(k)} S_{y|ȳ}^{(k)} ) },        (54)

which leads us to our final result of the covariance having an inverse-Wishart marginal posterior distribution:

    p(Σ^{(k)} | D^{(k)}) ∝ p(D^{(k)} | Σ^{(k)}) p(Σ^{(k)})
                         ∝ |Σ^{(k)}|^{-(d+n_k+n_0+1)/2} exp{ -1/2 tr( Σ^{-(k)} (S_{y|ȳ}^{(k)} + S_0) ) }
                         = IW(n_k + n_0, S_{y|ȳ}^{(k)} + S_0).        (55)
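The posterior updates of Eqs. (53) and (55) translate directly into a sampling routine; the following sketch is illustrative (it uses scipy's invwishart and the standard Cholesky construction for matrix-normal draws):

import numpy as np
from scipy.stats import invwishart

def sample_mniw_posterior(Yk, Ybark, M, K, n0, S0, rng):
    """Draw {A^(k), Sigma^(k)} from the MNIW posterior, Eqs. (53) and (55)."""
    d, n_k = Yk.shape
    S_bb = Ybark @ Ybark.T + K                  # S^(k)_{ybar,ybar}
    S_yb = Yk @ Ybark.T + M @ K                 # S^(k)_{y,ybar}
    S_yy = Yk @ Yk.T + M @ K @ M.T              # S^(k)_{y,y}
    S_bb_inv = np.linalg.inv(S_bb)
    S_cond = S_yy - S_yb @ S_bb_inv @ S_yb.T    # S^(k)_{y|ybar}
    Sigma = np.atleast_2d(invwishart.rvs(df=n_k + n0, scale=S_cond + S0,
                                         random_state=rng))   # Eq. (55)
    mean = S_yb @ S_bb_inv                      # Eq. (53): MN(mean, Sigma, S_bb)
    Z = rng.standard_normal(mean.shape)
    A = mean + np.linalg.cholesky(Sigma) @ Z @ np.linalg.cholesky(S_bb_inv).T
    return A, Sigma

rng = np.random.default_rng(0)
Yk, Ybark = rng.standard_normal((2, 50)), rng.standard_normal((4, 50))
A, Sigma = sample_mniw_posterior(Yk, Ybark, M=np.zeros((2, 4)), K=np.eye(4),
                                 n0=4, S0=0.75 * np.eye(2), rng=rng)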
B. Non-Conjugate Independent Priors on A^{(k)}, Σ^{(k)}, and µ^{(k)}

In this section, we provide the derivations for the posterior distributions of A^{(k)}, Σ^{(k)}, and µ^{(k)} when each of these parameters is given an independent prior. One example of a non-conjugate prior is our proposed ARD sparsity-inducing prior.

a) Normal Prior on A^{(k)}: Assume we place a Gaussian prior N(m_A, Σ_A) on the vectorization of the matrix A^{(k)}, which we denote by vec(A^{(k)}). To examine the posterior distribution, we first aim to write the data as a linear function of vec(A^{(k)}). We may rewrite Eq. (45) as

    y_t = A^{(k)} [y_{t-1}^T  y_{t-2}^T  ...  y_{t-r}^T]^T + e_t(k) ≜ A^{(k)} ȳ_t + e_t(k),    for all t such that z_t = k.        (56)

Recalling that r is the autoregressive order and d the dimension of the observation vector y_t, so that ȳ_t ∈ R^{d·r}, we can equivalently represent the above as

    y_t = [ȳ_{t,1} I_d  ȳ_{t,2} I_d  ...  ȳ_{t,d·r} I_d] vec(A^{(k)}) + e_t(k) ≜ Ȳ_t vec(A^{(k)}) + e_t(k).        (57)

Here, Ȳ_t = ȳ_t^T ⊗ I_d; its columns are a permutation of those of the block matrix obtained by writing each row of A^{(k)} separately, arranged so that y_t is a linear function of vec(A^{(k)}).
Noting that e_t(k) ~ N(µ^{(k)}, Σ^{(k)}), the joint log density is

    log p(D^{(k)}, A^{(k)} | Σ^{(k)}, µ^{(k)}) = C - 1/2 Σ_{t | z_t = k} (y_t - µ^{(k)} - Ȳ_t vec(A^{(k)}))^T Σ^{-(k)} (y_t - µ^{(k)} - Ȳ_t vec(A^{(k)}))
        - 1/2 (vec(A^{(k)}) - m_A)^T Σ_A^{-1} (vec(A^{(k)}) - m_A),        (58)

which can be rewritten as

    log p(D^{(k)}, A^{(k)} | Σ^{(k)}, µ^{(k)}) = C - 1/2 vec(A^{(k)})^T ( Σ_A^{-1} + Σ_{t | z_t = k} Ȳ_t^T Σ^{-(k)} Ȳ_t ) vec(A^{(k)})
        + vec(A^{(k)})^T ( Σ_A^{-1} m_A + Σ_{t | z_t = k} Ȳ_t^T Σ^{-(k)} (y_t - µ^{(k)}) )
        - 1/2 m_A^T Σ_A^{-1} m_A - 1/2 Σ_{t | z_t = k} (y_t - µ^{(k)})^T Σ^{-(k)} (y_t - µ^{(k)}).        (59)

Conditioning on the data, we arrive at the desired posterior distribution in information form:

    p(vec(A^{(k)}) | D^{(k)}, Σ^{(k)}, µ^{(k)}) = N^{-1}( Σ_A^{-1} m_A + Σ_{t | z_t = k} Ȳ_t^T Σ^{-(k)} (y_t - µ^{(k)}),
                                                          Σ_A^{-1} + Σ_{t | z_t = k} Ȳ_t^T Σ^{-(k)} Ȳ_t ).        (60)

b) Inverse-Wishart Prior on Σ^{(k)}: We place an inverse-Wishart prior IW(n_0, S_0) on Σ^{(k)}. Let n_k = |{t | z_t = k, t = 1, ..., T}|. Conditioned on A^{(k)} and µ^{(k)}, standard conjugacy results imply that the posterior of Σ^{(k)} is

    p(Σ^{(k)} | D^{(k)}, A^{(k)}, µ^{(k)}) = IW( n_k + n_0, S_0 + Σ_{t | z_t = k} (y_t - A^{(k)} ȳ_t - µ^{(k)})(y_t - A^{(k)} ȳ_t - µ^{(k)})^T ).        (61)

c) Normal Prior on µ^{(k)}: Finally, we place a Gaussian prior N(µ_0, Σ_0) on µ^{(k)}. Conditioned on A^{(k)} and Σ^{(k)}, the posterior of µ^{(k)} is

    p(µ^{(k)} | D^{(k)}, A^{(k)}, Σ^{(k)}) = N^{-1}( µ^{(k)}; Σ_0^{-1} µ_0 + Σ^{-(k)} Σ_{t | z_t = k} (y_t - A^{(k)} ȳ_t), Σ_0^{-1} + n_k Σ^{-(k)} ).        (62)

We iterate between sampling A^{(k)}, Σ^{(k)}, and µ^{(k)} several times before moving on to the next step of the Gibbs sampler.
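One sweep of these three conditionals, Eqs. (60)-(62), can be sketched as follows (illustrative; here Ybar_t = ȳ_t^T ⊗ I_d as in Eq. (57), and vec(·) stacks columns):

import numpy as np
from scipy.stats import invwishart

def gibbs_sweep(ys, ybars, A, Sigma, mu, m_A, Sigma_A, n0, S0, mu0, Sigma0, rng):
    """One pass of the conditionals (60)-(62) for a single mode k.
    ys: (n_k, d) observations; ybars: (n_k, d*r) lag vectors."""
    n_k, d = ys.shape
    Si = np.linalg.inv(Sigma)
    # Eq. (60): Gaussian posterior on vec(A), information form
    Lam = np.linalg.inv(Sigma_A)
    th = np.linalg.solve(Sigma_A, m_A)
    for yt, yb in zip(ys, ybars):
        Yb = np.kron(yb, np.eye(d))          # Ybar_t = ybar_t^T (x) I_d
        Lam += Yb.T @ Si @ Yb
        th += Yb.T @ Si @ (yt - mu)
    cov = np.linalg.inv(Lam)
    vecA = rng.multivariate_normal(cov @ th, cov)
    A = vecA.reshape((d, ybars.shape[1]), order='F')   # undo column-stacking
    # Eq. (61): inverse-Wishart posterior on Sigma
    resid = ys - ybars @ A.T - mu
    Sigma = np.atleast_2d(invwishart.rvs(df=n0 + n_k, scale=S0 + resid.T @ resid,
                                         random_state=rng))
    # Eq. (62): Gaussian posterior on mu, information form
    Si = np.linalg.inv(Sigma)
    cov_mu = np.linalg.inv(np.linalg.inv(Sigma0) + n_k * Si)
    th_mu = np.linalg.solve(Sigma0, mu0) + Si @ (ys - ybars @ A.T).sum(axis=0)
    mu = rng.multivariate_normal(cov_mu @ th_mu, cov_mu)
    return A, Sigma, mu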
APPENDIX B: SPARSITY-INDUCING PRIORS FOR INFERRING VARIABLE ORDER MODELS

Recall Sec. III-A2 and the proposed automatic relevance determination (ARD) prior for inferring non-dynamical components of the state vector in the case of the HDP-SLDS, or lag components in the HDP-AR-HMM, by shrinking components of the model parameters to zero. If we would like to ensure that our choice of C = [I_d 0] does not interfere with learning a sparse realization when one exists, we must restrict ourselves to a constrained class of dynamical phenomena. For example, imagine a realization of an LDS with

    Ã = [0.8  0;  0.2  0],    C̃ = [1  1].

Then the transformation to C = [1 0] leads to

    A = T^{-1} Ã T = [0.5  1;  0.15  0.3],    for T = [0.5  1;  0.5  -1].

So, for this example, fixing C = [1 0] would not lead to learning a sparse dynamic matrix A. Criterion 3.1 provides a set of sufficient, though not necessary, conditions for maintaining the sparsity within each A^{(k)} when transforming to the realization with C = [I_d 0]. That is, given that there exists a realization R_1 of our dynamical phenomena that satisfies Criterion 3.1, the transformation T to an equivalent realization R_2 with C = [I_d 0] will maintain the sparsity structure seen in R_1, which we aim to infer with the ARD prior.

Criterion 3.1, which states that the observed state vector components are a subset of those relevant to all modes, is reasonable for many applications: we often have observations only of components of the state vector that are essential to all modes, while some modes may have additional components that affect the dynamics but are not directly observed. To clarify the conditions of Criterion 3.1, consider a 3-mode SLDS realization R with

    A^{(1)} = [a_1^{(1)}  a_2^{(1)}  a_3^{(1)}  0  0],
    A^{(2)} = [a_1^{(2)}  a_2^{(2)}  0  a_4^{(2)}  0],
    A^{(3)} = [a_1^{(3)}  a_2^{(3)}  a_3^{(3)}  0  a_5^{(3)}];        (63)

then the observation matrix must be of the form C = [c_1  c_2  0  0  0] to satisfy Criterion 3.1.

APPENDIX C: HDP-SLDS AND HDP-AR-HMM MESSAGE PASSING

In this appendix, we describe the backward message passing and forward sampling scheme used for generating samples of the mode sequence z_{1:T} and state sequence x_{1:T}.

A. Mode Sequence Message Passing for Blocked Sampling

Consider a switching VAR(r) process. To derive the forward-backward procedure for jointly sampling the mode sequence z_{1:T} given observations y_{1:T}, plus r initial observations y_{1-r:0}, we first note that the chain rule and the Markov structure allow us to decompose the joint distribution as follows:

    p(z_{1:T} | y_{1-r:T}, π, θ) = p(z_T | z_{T-1}, y_{1-r:T}, π, θ) p(z_{T-1} | z_{T-2}, y_{1-r:T}, π, θ) ... p(z_2 | z_1, y_{1-r:T}, π, θ) p(z_1 | y_{1-r:T}, π, θ).        (64)

Thus, we may first sample z_1 from p(z_1 | y_{1-r:T}, π, θ), then condition on this value to sample z_2 from p(z_2 | z_1, y_{1-r:T}, π, θ), and so on. The conditional distribution of z_1 is derived as

    p(z_1 | y_{1-r:T}, π, θ) ∝ p(z_1) p(y_1 | θ_{z_1}, y_{1-r:0}) Σ_{z_{2:T}} Π_t p(z_t | π_{z_{t-1}}) p(y_t | θ_{z_t}, y_{t-r:t-1})
                             ∝ p(z_1) p(y_1 | θ_{z_1}, y_{1-r:0}) m_{2,1}(z_1),        (65)

where m_{t,t-1}(z_{t-1}) is the backward message passed from z_t to z_{t-1}, recursively defined by

    m_{t,t-1}(z_{t-1}) ∝ Σ_{z_t} p(z_t | π_{z_{t-1}}) p(y_t | θ_{z_t}, y_{t-r:t-1}) m_{t+1,t}(z_t),  t ≤ T;    m_{T+1,T}(z_T) = 1.        (66)

The general conditional distribution of z_t is

    p(z_t | z_{t-1}, y_{1-r:T}, π, θ) ∝ p(z_t | π_{z_{t-1}}) p(y_t | θ_{z_t}, y_{t-r:t-1}) m_{t+1,t}(z_t).        (67)

For the HDP-AR-HMM, these distributions are given by

    p(z_t = k | z_{t-1}, y_{1-r:T}, π, θ) ∝ π_{z_{t-1}}(k) N( y_t; Σ_{i=1}^{r} A_i^{(k)} y_{t-i}, Σ^{(k)} ) m_{t+1,t}(k)
    m_{t+1,t}(k) = Σ_{j=1}^{L} π_k(j) N( y_{t+1}; Σ_{i=1}^{r} A_i^{(j)} y_{t+1-i}, Σ^{(j)} ) m_{t+2,t+1}(j),    m_{T+1,T}(k) = 1,    k = 1, ..., L.        (68)
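In code, Eqs. (64)-(68) reduce to a backward pass over per-step likelihoods followed by a forward sampling pass; a sketch (illustrative, with per-step normalization of the messages to avoid numerical underflow):

import numpy as np

def sample_mode_sequence(likelihood, Pi, pi0, rng):
    """Blocked sampling of z_{1:T} via Eqs. (64)-(68).
    likelihood[t, k] = p(y_t | theta_k, y_{t-r:t-1}); Pi[j, k] = pi_j(k)."""
    T, L = likelihood.shape
    b = np.ones((T, L))                      # b[t](k) = m_{t+1,t}(k); b[T-1] = 1
    for t in range(T - 1, 0, -1):            # backward recursion, Eq. (66)
        b[t - 1] = Pi @ (likelihood[t] * b[t])
        b[t - 1] /= b[t - 1].sum()           # normalize to avoid underflow
    z = np.zeros(T, dtype=int)
    w = pi0 * likelihood[0] * b[0]           # Eq. (65)
    z[0] = rng.choice(L, p=w / w.sum())
    for t in range(1, T):                    # forward sampling, Eq. (67)
        w = Pi[z[t - 1]] * likelihood[t] * b[t]
        z[t] = rng.choice(L, p=w / w.sum())
    return z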
B. State Sequence Message Passing for Blocked Sampling

A similar sampling scheme is used for generating samples of the state sequence x_{1:T}. Although we now have a continuous state space, the computation of the backward messages m_{t+1,t}(x_t) is still analytically feasible since we are working with Gaussian densities. Assume m_{t+1,t}(x_t) ∝ N^{-1}(x_t; θ_{t+1,t}, Λ_{t+1,t}), where N^{-1}(x; θ, Λ) denotes a Gaussian distribution on x in information form, with mean µ = Λ^{-1} θ and covariance Σ = Λ^{-1}. Given a fixed mode sequence z_{1:T}, we simply have a time-varying linear dynamical system. The backward messages for the HDP-SLDS can be recursively defined by

    m_{t,t-1}(x_{t-1}) ∝ ∫ p(x_t | x_{t-1}, z_t) p(y_t | x_t) m_{t+1,t}(x_t) dx_t.        (69)

For this model, the state transition density of Eq. (69) can be expressed as an exponentiated quadratic in (x_{t-1}, x_t):

    p(x_t | x_{t-1}, z_t) ∝ exp{ -1/2 [x_{t-1}; x_t]^T [ A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)},  -A^{(z_t)T} Σ^{-(z_t)};
                                                         -Σ^{-(z_t)} A^{(z_t)},  Σ^{-(z_t)} ] [x_{t-1}; x_t]
                                 + [x_{t-1}; x_t]^T [ -A^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)};  Σ^{-(z_t)} µ^{(z_t)} ] }.        (70)

We can similarly write the likelihood in exponentiated quadratic form,

    p(y_t | x_t) ∝ exp{ -1/2 x_t^T C^T R^{-1} C x_t + x_t^T C^T R^{-1} y_t },        (71)

as well as the messages,

    m_{t+1,t}(x_t) ∝ exp{ -1/2 x_t^T Λ_{t+1,t} x_t + x_t^T θ_{t+1,t} }.        (72)

The product of these quadratics combines the precision and linear terms blockwise:

    p(x_t | x_{t-1}, z_t) p(y_t | x_t) m_{t+1,t}(x_t)
        ∝ exp{ -1/2 [x_{t-1}; x_t]^T [ A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)},  -A^{(z_t)T} Σ^{-(z_t)};
                                       -Σ^{-(z_t)} A^{(z_t)},  Σ^{-(z_t)} + C^T R^{-1} C + Λ_{t+1,t} ] [x_{t-1}; x_t]
               + [x_{t-1}; x_t]^T [ -A^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)};  C^T R^{-1} y_t + Σ^{-(z_t)} µ^{(z_t)} + θ_{t+1,t} ] }.        (73)

Using standard Gaussian marginalization identities, we integrate over x_t to get

    m_{t,t-1}(x_{t-1}) ∝ N^{-1}(x_{t-1}; θ_{t,t-1}, Λ_{t,t-1}),        (74)

where

    θ_{t,t-1} = -A^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)} + A^{(z_t)T} Σ^{-(z_t)} (Σ^{-(z_t)} + C^T R^{-1} C + Λ_{t+1,t})^{-1} (C^T R^{-1} y_t + Σ^{-(z_t)} µ^{(z_t)} + θ_{t+1,t})
    Λ_{t,t-1} = A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)} - A^{(z_t)T} Σ^{-(z_t)} (Σ^{-(z_t)} + C^T R^{-1} C + Λ_{t+1,t})^{-1} Σ^{-(z_t)} A^{(z_t)}.        (75)

The backward message passing recursion is initialized with m_{T+1,T} ∝ N^{-1}(x_T; 0, 0). Let

    Λ^b_{t|t} = C^T R^{-1} C + Λ_{t+1,t},    θ^b_{t|t} = C^T R^{-1} y_t + θ_{t+1,t}.        (76)

Then we can define the following recursion, which we note is equivalent to a backwards-running Kalman filter in information form:

    Λ^b_{t-1|t-1} = C^T R^{-1} C + A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)} - A^{(z_t)T} Σ^{-(z_t)} (Σ^{-(z_t)} + Λ^b_{t|t})^{-1} Σ^{-(z_t)} A^{(z_t)}
    θ^b_{t-1|t-1} = C^T R^{-1} y_{t-1} - A^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)} + A^{(z_t)T} Σ^{-(z_t)} (Σ^{-(z_t)} + Λ^b_{t|t})^{-1} (θ^b_{t|t} + Σ^{-(z_t)} µ^{(z_t)}),        (77)

initialized at time T with Λ^b_{T|T} = C^T R^{-1} C and θ^b_{T|T} = C^T R^{-1} y_T. An equivalent but more numerically stable recursion is summarized in Algorithm 5. After computing the messages m_{t+1,t}(x_t) backwards in time, we sample the state sequence x_{1:T} working forwards in time.
As with the discrete mode sequence, one can decompose the posterior distribution of the state sequence as

    p(x_{1:T} | y_{1:T}, z_{1:T}, θ) = p(x_T | x_{T-1}, y_{1:T}, z_{1:T}, θ) p(x_{T-1} | x_{T-2}, y_{1:T}, z_{1:T}, θ) ... p(x_2 | x_1, y_{1:T}, z_{1:T}, θ) p(x_1 | y_{1:T}, z_{1:T}, θ),        (78)

where

    p(x_t | x_{t-1}, y_{1:T}, z_{1:T}, θ) ∝ p(x_t | x_{t-1}, A^{(z_t)}, Σ^{(z_t)}, µ^{(z_t)}) p(y_t | x_t, R) m_{t+1,t}(x_t).        (79)

Algorithm 5 (numerically stable form of the backward Kalman information filter):
1) Initialize the filter with Λ^b_{T|T} = C^T R^{-1} C and θ^b_{T|T} = C^T R^{-1} y_T.
2) Working backwards in time, for each t ∈ {T-1, ..., 1}:
   a) Compute
        J̃_{t+1} = Λ^b_{t+1|t+1} (Λ^b_{t+1|t+1} + Σ^{-(z_{t+1})})^{-1},    L̃_{t+1} = I - J̃_{t+1}.
   b) Predict
        Λ_{t+1,t} = A^{(z_{t+1})T} (L̃_{t+1} Λ^b_{t+1|t+1} L̃_{t+1}^T + J̃_{t+1} Σ^{-(z_{t+1})} J̃_{t+1}^T) A^{(z_{t+1})}
        θ_{t+1,t} = A^{(z_{t+1})T} L̃_{t+1} (θ^b_{t+1|t+1} - Λ^b_{t+1|t+1} µ^{(z_{t+1})})
   c) Update
        Λ^b_{t|t} = Λ_{t+1,t} + C^T R^{-1} C,    θ^b_{t|t} = θ_{t+1,t} + C^T R^{-1} y_t.
3) Set Λ^b_{0|0} = Λ_{1,0} and θ^b_{0|0} = θ_{1,0}.

For the HDP-SLDS, the product of the distributions in Eq. (79) is equivalent to

    p(x_t | x_{t-1}, y_{1:T}, z_{1:T}, θ) ∝ N(x_t; A^{(z_t)} x_{t-1} + µ^{(z_t)}, Σ^{(z_t)}) N(y_t; C x_t, R) m_{t+1,t}(x_t)
                                          ∝ N(x_t; A^{(z_t)} x_{t-1} + µ^{(z_t)}, Σ^{(z_t)}) N^{-1}(x_t; θ^b_{t|t}, Λ^b_{t|t})
                                          ∝ N^{-1}( x_t; Σ^{-(z_t)} (A^{(z_t)} x_{t-1} + µ^{(z_t)}) + θ^b_{t|t}, Σ^{-(z_t)} + Λ^b_{t|t} ),        (80)

which is a simple Gaussian distribution, so the normalization constant is easily computed. Specifically, for each t ∈ {1, ..., T} we sample x_t from

    x_t ~ N( x_t; (Σ^{-(z_t)} + Λ^b_{t|t})^{-1} (Σ^{-(z_t)} A^{(z_t)} x_{t-1} + Σ^{-(z_t)} µ^{(z_t)} + θ^b_{t|t}), (Σ^{-(z_t)} + Λ^b_{t|t})^{-1} ).        (81)
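A sketch of the backward information filter of Eqs. (76)-(77) followed by the forward draws of Eq. (81) (illustrative; it uses the direct recursion rather than the stabilized form of Algorithm 5, and fixes x_0 = 0 for simplicity):

import numpy as np

def sample_state_sequence(y, z, A, Sigma, mu, C, R, rng):
    """Backward information filter (Eqs. 76-77), then forward sampling of
    x_{1:T} from Eq. (81), for a fixed mode sequence z (0-based arrays)."""
    T, n = len(y), A[0].shape[0]
    Rinv = np.linalg.inv(R)
    CtRC = C.T @ Rinv @ C
    CtRy = y @ Rinv @ C                       # row t equals C^T R^{-1} y_t
    Lam_b = np.zeros((T, n, n))
    th_b = np.zeros((T, n))
    Lam_b[-1], th_b[-1] = CtRC, CtRy[-1]
    for t in range(T - 1, 0, -1):             # message from time t into t-1
        Si = np.linalg.inv(Sigma[z[t]])
        G = Si @ np.linalg.inv(Si + Lam_b[t])
        Lam_b[t - 1] = CtRC + A[z[t]].T @ (Si - G @ Si) @ A[z[t]]
        th_b[t - 1] = (CtRy[t - 1]
                       + A[z[t]].T @ (G @ (th_b[t] + Si @ mu[z[t]]) - Si @ mu[z[t]]))
    x = np.zeros((T, n))
    x_prev = np.zeros(n)                      # x_0 fixed to 0 for simplicity
    for t in range(T):                        # forward draws, Eq. (81)
        Si = np.linalg.inv(Sigma[z[t]])
        cov = np.linalg.inv(Si + Lam_b[t])
        cov = 0.5 * (cov + cov.T)             # symmetrize for sampling
        mean = cov @ (Si @ (A[z[t]] @ x_prev + mu[z[t]]) + th_b[t])
        x[t] = rng.multivariate_normal(mean, cov)
        x_prev = x[t]
    return x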
C. Mode Sequence Message Passing for Sequential Sampling

A sampling scheme similar to that of Carter and Kohn [37] is used for generating samples of the mode sequence z_{1:T} having marginalized over the state sequence x_{1:T}. Specifically, we sample z_t from

    p(z_t = k | z_{\t}, y_{1:T}, π, θ) ∝ p(z_t = k | z_{\t}, π) p(y_{1:T} | z_t = k, z_{\t})
                                        ∝ π_{z_{t-1}}(k) π_k(z_{t+1}) p(y_{1:T} | z_t = k, z_{\t}),        (82)

where we omit the dependency on π and θ in the likelihood for compactness. To compute the likelihood for each z_t, we combine forward and backward messages along with the local dynamics and measurements as follows:

    p(y_{1:T} | z_t = k, z_{\t}) ∝ ∫∫ m_{t-2,t-1}(x_{t-1}) p(y_{t-1} | x_{t-1}) p(x_t | x_{t-1}, z_t = k) p(y_t | x_t) m_{t+1,t}(x_t) dx_t dx_{t-1},        (83)-(84)

where the backward messages are defined as in Appendix C-B and the forward messages by

    m_{t-1,t}(x_t) ∝ ∫ p(x_t | x_{t-1}, z_t) p(y_{t-1} | x_{t-1}) m_{t-2,t-1}(x_{t-1}) dx_{t-1}.        (85)

To derive the forward message passing recursions, assume that

    m_{t-2,t-1}(x_{t-1}) ∝ N^{-1}(x_{t-1}; θ_{t-2,t-1}, Λ_{t-2,t-1})        (86)

and that z_t is known. The terms of the integrand of Eq. (85) can be written as exponentiated quadratics in (x_t, x_{t-1}):

    p(x_t | x_{t-1}, z_t) = N(x_t; A^{(z_t)} x_{t-1} + µ^{(z_t)}, Σ^{(z_t)})
        ∝ exp{ -1/2 [x_t; x_{t-1}]^T [ Σ^{-(z_t)},  -Σ^{-(z_t)} A^{(z_t)};  -A^{(z_t)T} Σ^{-(z_t)},  A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)} ] [x_t; x_{t-1}]
               + [x_t; x_{t-1}]^T [ Σ^{-(z_t)} µ^{(z_t)};  -A^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)} ] },        (87)

    m_{t-2,t-1}(x_{t-1}) p(y_{t-1} | x_{t-1}) ∝ exp{ -1/2 x_{t-1}^T Λ^f_{t-1|t-1} x_{t-1} + x_{t-1}^T θ^f_{t-1|t-1} },        (88)

where, similarly to the backward recursions, we have made the definitions

    θ^f_{t|t} = θ_{t-1,t} + C^T R^{-1} y_t,    Λ^f_{t|t} = Λ_{t-1,t} + C^T R^{-1} C.        (89)

Combining these distributions and integrating over x_{t-1}, we have

    m_{t-1,t}(x_t) ∝ N^{-1}(x_t; θ_{t-1,t}, Λ_{t-1,t}),        (90)

with

    θ_{t-1,t} = Σ^{-(z_t)} µ^{(z_t)} + Σ^{-(z_t)} A^{(z_t)} (A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)} + Λ^f_{t-1|t-1})^{-1} (θ^f_{t-1|t-1} - A^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)})
    Λ_{t-1,t} = Σ^{-(z_t)} - Σ^{-(z_t)} A^{(z_t)} (A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)} + Λ^f_{t-1|t-1})^{-1} A^{(z_t)T} Σ^{-(z_t)},

or, equivalently,

    θ_{t-1,t} = Λ_{t-1,t} (µ^{(z_t)} + A^{(z_t)} Λ^{-f}_{t-1|t-1} θ^f_{t-1|t-1}),    Λ_{t-1,t} = (Σ^{(z_t)} + A^{(z_t)} Λ^{-f}_{t-1|t-1} A^{(z_t)T})^{-1}.        (91)

Assuming x_0 ~ N(0, P_0), we initialize at time t = 0 with

    θ_{-1,0} = 0,    Λ_{-1,0} = P_0^{-1}.        (92)

An equivalent but more numerically stable recursion is summarized in Algorithm 6; note, however, that this algorithm relies on the dynamic matrix A^{(k)} being invertible.

Algorithm 6 (numerically stable form of the forward Kalman information filter):
1) Initialize the filter with Λ^f_{0|0} = P_0^{-1} and θ^f_{0|0} = 0.
2) Working forwards in time, for each t ∈ {1, ..., T}:
   a) Compute
        M_{t-1} = A^{-(z_t)T} Λ^f_{t-1|t-1} A^{-(z_t)},    J_{t-1} = M_{t-1} (M_{t-1} + Σ^{-(z_t)})^{-1},    L_{t-1} = I - J_{t-1}.
   b) Predict
        Λ_{t-1,t} = L_{t-1} M_{t-1} L_{t-1}^T + J_{t-1} Σ^{-(z_t)} J_{t-1}^T
        θ_{t-1,t} = L_{t-1} A^{-(z_t)T} (θ^f_{t-1|t-1} + Λ^f_{t-1|t-1} A^{-(z_t)} µ^{(z_t)})
   c) Update
        Λ^f_{t|t} = Λ_{t-1,t} + C^T R^{-1} C,    θ^f_{t|t} = θ_{t-1,t} + C^T R^{-1} y_t.

We now return to the computation of the likelihood of Eq. (84). We note that the integral over x_{t-1} is equivalent to computing the message m_{t-1,t}(x_t) using z_t = k. However, we must be careful that any constants previously ignored in this message passing are not a function of z_t. For the time being, let us assume that such a constant exists and denote this special message by

    m_{t-1,t}(x_t; z_t) ∝ c(z_t) N^{-1}(x_t; θ_{t-1,t}(z_t), Λ_{t-1,t}(z_t)).        (93)
Then, the likelihood can be written as

    p(y_{1:T} | z_t = k, z_{\t}) ∝ ∫ m_{t-1,t}(x_t; z_t = k) p(y_t | x_t) m_{t+1,t}(x_t) dx_t        (94)
                                 ∝ ∫ c(k) N^{-1}(x_t; θ_{t-1,t}(k), Λ_{t-1,t}(k)) N^{-1}(x_t; θ^b_{t|t}, Λ^b_{t|t}) dx_t.        (95)

Combining the information parameters, and maintaining the terms in the normalizing constant that are a function of k, this is equivalent to

    p(y_{1:T} | z_t = k, z_{\t}) ∝ c(k) |Λ_{t-1,t}(k)|^{1/2} exp{ -1/2 θ_{t-1,t}(k)^T Λ_{t-1,t}(k)^{-1} θ_{t-1,t}(k) }
        × ∫ exp{ -1/2 x_t^T (Λ_{t-1,t}(k) + Λ^b_{t|t}) x_t + x_t^T (θ_{t-1,t}(k) + θ^b_{t|t}) } dx_t.        (96)

To compute this integral, we write the integrand in terms of a Gaussian distribution times a constant; the integral is then simply that constant term:

    p(y_{1:T} | z_t = k, z_{\t}) ∝ c(k) |Λ_{t-1,t}(k)|^{1/2} |Λ_{t-1,t}(k) + Λ^b_{t|t}|^{-1/2}
        × exp{ -1/2 θ_{t-1,t}(k)^T Λ_{t-1,t}(k)^{-1} θ_{t-1,t}(k) + 1/2 (θ_{t-1,t}(k) + θ^b_{t|t})^T (Λ_{t-1,t}(k) + Λ^b_{t|t})^{-1} (θ_{t-1,t}(k) + θ^b_{t|t}) }.

Thus, writing θ^{(k)}_t = θ_{t-1,t}(k) and Λ^{(k)}_t = Λ_{t-1,t}(k),

    p(z_t = k | z_{\t}, y_{1:T}, π, θ) ∝ π_{z_{t-1}}(k) π_k(z_{t+1}) c(k) |Λ^{(k)}_t|^{1/2} |Λ^{(k)}_t + Λ^b_{t|t}|^{-1/2}
        × exp{ -1/2 θ^{(k)T}_t Λ^{-(k)}_t θ^{(k)}_t + 1/2 (θ^{(k)}_t + θ^b_{t|t})^T (Λ^{(k)}_t + Λ^b_{t|t})^{-1} (θ^{(k)}_t + θ^b_{t|t}) }.        (97)

We now show that c(z_t) is not a function of z_t. The only place where the previously ignored dependency on z_t arises is from p(x_t | x_{t-1}, z_t). Namely,

    p(x_t | x_{t-1}, z_t) = ( exp{ -1/2 µ^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)} } / |Σ^{(z_t)}|^{1/2} ) · exponential_1 ≜ c_1(z_t) · exponential_1,        (98)

where exponential_1 is the exponentiated quadratic of Eq. (87). Then, when computing the message m_{t-1,t}(x_t; z_t), we update the previous message m_{t-2,t-1}(x_{t-1}) by incorporating the local likelihood p(y_{t-1} | x_{t-1}) and then propagating the state estimate with p(x_t | x_{t-1}, z_t) and integrating over x_{t-1}. Namely, we combine the distribution of Eq. (98) with the exponentiated quadratic of Eq. (88) and integrate over x_{t-1}:

    m_{t-1,t}(x_t; z_t) ∝ c_1(z_t) ∫ exponential_1 · exponential_2 dx_{t-1},        (99)

where exponential_2 is the exponentiated quadratic of Eq. (88). Since m_{t-2,t-1}(x_{t-1}) ∝ p(x_{t-1} | y_{1:t-2}, z_{1:t-1}), and the Markov properties of the state space model dictate that

    p(x_{t-1} | y_{1:t-1}, z_{1:t-1}) ∝ p(y_{t-1} | x_{t-1}) p(x_{t-1} | y_{1:t-2}, z_{1:t-1}) ∝ p(y_{t-1} | x_{t-1}) m_{t-2,t-1}(x_{t-1}),        (100)

we have p(x_{t-1} | y_{1:t-1}, z_{1:t-1}) = c_2 · exponential_2. We note that the normalizing constant c_2 is not a function of z_t, since we have only considered z_τ for τ < t. Once again exploiting the conditional independencies induced by the Markov structure of our state space model, and plugging in Eq. (98) and Eq. (100), we obtain
    p(x_t, x_{t-1} | y_{1:t-1}, z_{1:t}) = p(x_t | x_{t-1}, z_t) p(x_{t-1} | y_{1:t-1}, z_{1:t-1})
                                         = (c_1(z_t) · exponential_1)(c_2 · exponential_2).        (101)

Plugging this result into Eq. (99), we have

    m_{t-1,t}(x_t; z_t) ∝ c_1(z_t) ∫ (1 / (c_1(z_t) c_2)) p(x_t, x_{t-1} | y_{1:t-1}, z_{1:t}) dx_{t-1} ∝ (1/c_2) p(x_t | y_{1:t-1}, z_{1:t}).        (102)

Comparing Eq. (102) to Eq. (93), and noting that p(x_t | y_{1:t-1}, z_{1:t}) = N^{-1}(x_t; θ_{t-1,t}(z_t), Λ_{t-1,t}(z_t)), we see that c(z_t) = 1/c_2, which is thus not a function of z_t.

Algebraically, we could derive this result as follows:

    m_{t-1,t}(x_t; z_t) ∝ c_1(z_t) ∫ exponential_1 · exponential_2 dx_{t-1}
                        = c_1(z_t) c_3(z_t) ∫ N^{-1}([x_t; x_{t-1}]; θ(z_t), Λ(z_t)) dx_{t-1},        (103)

where θ(z_t) and Λ(z_t) are the information parameters determined by combining the functional forms of exponential_1 and exponential_2, and

    c_1(z_t) c_3(z_t) = ( exp{ -1/2 µ^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)} } / |Σ^{(z_t)}|^{1/2} ) · ( exp{ 1/2 θ(z_t)^T Λ(z_t)^{-1} θ(z_t) } / |Λ(z_t)|^{1/2} ).        (104)

Computing these terms in parts, and using standard linear algebra properties of block matrices,

    |Λ(z_t)| = |Σ^{-(z_t)}| |(A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)} + Λ^f_{t-1|t-1}) - A^{(z_t)T} Σ^{-(z_t)} A^{(z_t)}| = |Σ^{-(z_t)}| |Λ^f_{t-1|t-1}|,        (105)

    Λ(z_t)^{-1} = [ Σ^{(z_t)} + A^{(z_t)} Λ^{-f}_{t-1|t-1} A^{(z_t)T},  A^{(z_t)} Λ^{-f}_{t-1|t-1};
                    Λ^{-f}_{t-1|t-1} A^{(z_t)T},  Λ^{-f}_{t-1|t-1} ],        (106)

where we have used the matrix inversion lemma in obtaining the last expression. Using this form of Λ(z_t)^{-1}, we readily obtain

    θ(z_t)^T Λ(z_t)^{-1} θ(z_t) = µ^{(z_t)T} Σ^{-(z_t)} µ^{(z_t)} + θ^{fT}_{t-1|t-1} Λ^{-f}_{t-1|t-1} θ^f_{t-1|t-1}.        (107)

Thus,

    c_1(z_t) c_3(z_t) = exp{ 1/2 θ^{fT}_{t-1|t-1} Λ^{-f}_{t-1|t-1} θ^f_{t-1|t-1} } / |Λ^f_{t-1|t-1}|^{1/2},        (108)

which does not depend upon the value of z_t.

APPENDIX D: DERIVATION OF THE MANEUVERING TARGET TRACKING SAMPLER

In this appendix we derive the maneuvering target tracking (MTT) sampler outlined in Sec. V-B. Recall the MTT model of Eq. (42). As described in Sec. V-B, we are interested in jointly sampling the control input and dynamical mode (u_t, z_t), marginalizing over the state sequence x_{1:T}, the transition distributions π, and the dynamic parameters θ = {µ^{(k)}, Σ^{(k)}}. The desired conditional distribution factors as

    p(u_t, z_t | z_{\t}, u_{\t}, y_{1:T}, β, α, κ, λ) = p(z_t | z_{\t}, u_{\t}, y_{1:T}, β, α, κ, λ) p(u_t | z_{1:T}, u_{\t}, y_{1:T}, λ).        (109)

The distribution in Eq. (109) is a hybrid distribution: each discrete value of the dynamical mode indicator variable z_t corresponds to a different continuous distribution on the control input u_t. We analyze each of the conditional distributions of Eq. (109) by considering the joint distribution over all of the model parameters, and then marginalizing x_{1:T}, π, and θ_k.
(Note that marginalization over θ_j for j ≠ k simply results in a constant.) For the mode we have

    p(z_t = k | z_{\t}, u_{\t}, y_{1:T}, β, α, κ, λ) ∝ ∫_π Π_j p(π_j | β, α, κ) Π_τ p(z_τ | π_{z_{τ-1}}) dπ
        × ∫_{U_t} [ ∫ p(θ_k | λ) Π_{τ | z_τ = k} p(u_τ | θ_k) dθ_k ] [ ∫_X Π_τ p(x_τ | x_{τ-1}, u_τ) p(y_τ | x_τ) dx_{1:T} ] du_t.        (110)

Similarly, we can write the conditional density of u_t for each candidate z_t as

    p(u_t | z_t = k, z_{\t}, u_{\t}, y_{1:T}, λ) ∝ [ ∫ p(θ_k | λ) Π_{τ | z_τ = k} p(u_τ | θ_k) dθ_k ] [ ∫_X Π_τ p(x_τ | x_{τ-1}, u_τ) p(y_τ | x_τ) dx_{1:T} ].        (111)

A key step in deriving these conditional distributions is the marginalization of the state sequence x_{1:T}. In performing this marginalization, we harness the fact that conditioning on the control input sequence simplifies the SLDS to an LDS with a deterministic control input u_{1:T}. Thus, conditioning on u_{1:t-1,t+1:T} allows us to marginalize the state sequence in the following manner. We run a forward Kalman filter to pass a message from t-2 to t-1, which is updated by the local likelihood at t-1. A backward filter is also run to pass a message from t+1 to t, which is updated by the local likelihood at t. These updated messages are combined with the local dynamics p(x_t | x_{t-1}, u_t, θ) and then marginalized over x_t and x_{t-1}, resulting in the likelihood of the observation sequence y_{1:T} as a function of u_t, the variable of interest. Because the sampler conditions on control inputs, the filter for this time-invariant system can be efficiently implemented by pre-computing the error covariances and then solely computing local Kalman updates at every time step. Of note is that the computational complexity is linear in the training sequence length, as well as in the number of currently instantiated maneuver modes. In the following sections, we evaluate each of the integrals of Eq. (110) and Eq. (111) in turn.

A. Chinese Restaurant Franchise

The integration over π appearing in the first line of Eq. (110) results in exactly the same predictive distribution as in the sticky HDP-HMM [14].

B. Normal-Inverse-Wishart Posterior Update

The marginalization of θ_k, appearing in both Eq. (110) and Eq. (111), can be rewritten as follows:

    ∫ p(θ_k | λ) Π_{τ | z_τ = k} p(u_τ | θ_k) dθ_k = ∫ p(u_t | θ_k) p(θ_k | λ) Π_{τ | z_τ = k, τ ≠ t} p(u_τ | θ_k) dθ_k
        ∝ ∫ p(u_t | θ_k) p(θ_k | {u_τ | z_τ = k, τ ≠ t}, λ) dθ_k = p(u_t | {u_τ | z_τ = k, τ ≠ t}, λ).        (112)

Here, the set {u_τ | z_τ = k, τ ≠ t} denotes all the observations u_τ other than u_t that were drawn from the Gaussian parameterized by θ_k.
In the following sections, we evaluate each of the integrals of Eq. (110) and Eq. (111) in turn.

A. Chinese Restaurant Franchise

The integration over \(\pi\) appearing in the first line of Eq. (110) results in exactly the same predictive distribution as for the sticky HDP-HMM [14].

B. Normal-Inverse-Wishart Posterior Update

The marginalization of \(\theta_k\), appearing in both Eq. (110) and Eq. (111), can be rewritten as follows:
\[
\int p(\theta_k \mid \lambda) \prod_{\tau \mid z_\tau = k} p(u_\tau \mid \theta_k)\, d\theta_k = \int p(u_t \mid \theta_k)\, p(\theta_k \mid \lambda) \prod_{\tau \mid z_\tau = k, \tau \neq t} p(u_\tau \mid \theta_k)\, d\theta_k \propto \int p(u_t \mid \theta_k)\, p(\theta_k \mid \{u_\tau \mid z_\tau = k, \tau \neq t\}, \lambda)\, d\theta_k = p(u_t \mid \{u_\tau \mid z_\tau = k, \tau \neq t\}, \lambda). \tag{112}
\]
Here, the set \(\{u_\tau \mid z_\tau = k, \tau \neq t\}\) denotes all of the observations \(u_\tau\) other than \(u_t\) that were drawn from the Gaussian parameterized by \(\theta_k\). When \(\theta_k\) has a normal-inverse-Wishart prior \(\mathrm{NIW}(\kappa, \vartheta, \nu, \Delta)\), standard conjugacy results imply that the predictive distribution is approximately
\[
p(u_t \mid \{u_\tau \mid z_\tau = k, \tau \neq t\}, \kappa, \vartheta, \nu, \Delta) \simeq \mathcal{N}\!\left(u_t;\, \bar{\vartheta},\, \frac{(\bar{\kappa}+1)\bar{\nu}}{\bar{\kappa}(\bar{\nu} - d - 1)}\, \bar{\Delta}\right) \triangleq \mathcal{N}(u_t; \hat{\mu}_k, \hat{\Sigma}_k), \tag{113}
\]
where
\[
\begin{aligned}
\bar{\kappa} &= \kappa + |\{u_s \mid z_s = k, s \neq t\}| & \bar{\nu} &= \nu + |\{u_s \mid z_s = k, s \neq t\}| \\
\bar{\kappa}\bar{\vartheta} &= \kappa\vartheta + \textstyle\sum_{u_s \in \{u_s \mid z_s = k, s \neq t\}} u_s & \bar{\nu}\bar{\Delta} &= \nu\Delta + \textstyle\sum_{u_s \in \{u_s \mid z_s = k, s \neq t\}} u_s u_s^T + \kappa\vartheta\vartheta^T - \bar{\kappa}\bar{\vartheta}\bar{\vartheta}^T.
\end{aligned} \tag{114}
\]
Here, we are using the moment-matched Gaussian approximation to the Student-t predictive distribution for \(u_t\) induced by marginalizing \(\theta_k\).
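The updates of Eqs. (113)-(114) translate directly into code. The sketch below is ours, with illustrative variable names: it computes the moment-matched Gaussian predictive parameters \((\hat{\mu}_k, \hat{\Sigma}_k)\) from the NIW hyperparameters and the control inputs currently assigned to mode \(k\).

```python
import numpy as np

def niw_moment_matched_predictive(kappa, vartheta, nu, Delta, U_k):
    """Moment-matched Gaussian predictive of Eqs. (113)-(114).

    U_k: (n, d) array holding {u_s : z_s = k, s != t}, the inputs
    currently assigned to mode k (excluding time t).
    Returns (mu_hat_k, Sigma_hat_k). Requires nu + n > d + 1.
    """
    n, d = U_k.shape
    kappa_bar = kappa + n
    nu_bar = nu + n
    vartheta_bar = (kappa * vartheta + U_k.sum(axis=0)) / kappa_bar
    Delta_bar = (nu * Delta
                 + U_k.T @ U_k
                 + kappa * np.outer(vartheta, vartheta)
                 - kappa_bar * np.outer(vartheta_bar, vartheta_bar)) / nu_bar
    Sigma_hat = (kappa_bar + 1) * nu_bar / (kappa_bar * (nu_bar - d - 1)) * Delta_bar
    return vartheta_bar, Sigma_hat
```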
C. Marginalization by Message Passing

When considering the control input \(u_t\) and conditioning on the values of all \(u_\tau\), \(\tau \neq t\), the marginalization over the states \(x_{1:T}\) can be equated to a message passing scheme that relies on the conditionally linear dynamical system induced by fixing \(u_\tau\), \(\tau \neq t\). Specifically,
\[
\int_X \prod_\tau p(x_\tau \mid x_{\tau-1}, u_\tau)\, p(y_\tau \mid x_\tau)\, dx \propto \int_{X_{t-1}} \int_{X_t} m_{t-2,t-1}(x_{t-1})\, p(y_{t-1} \mid x_{t-1})\, p(x_t \mid x_{t-1}, u_t)\, p(y_t \mid x_t)\, m_{t+1,t}(x_t)\, dx_t\, dx_{t-1} \propto p(y_{1:T} \mid u_t; u_{\setminus t}), \tag{115}
\]
where we recall the definitions of the forward messages \(m_{t-1,t}(x_t)\) and backward messages \(m_{t+1,t}(x_t)\) from Appendix B. For our MTT model of Eq. (42), however, instead of accounting for a process noise mean \(\mu(z_\tau)\) at time \(\tau\) in the filtering equations, we must account for the control input \(u_\tau\). Conditioning on \(u_\tau\), one can equate \(B u_\tau\) with a process noise mean, and thus we simply replace \(\mu(z_\tau)\) with \(B u_\tau\) in the filtering equations of Appendix B. Similarly, we replace the process noise covariance term \(\Sigma(z_\tau)\) with our process noise covariance \(Q\). (Note that although \(u_\tau \mid z_\tau \sim \mathcal{N}(\mu(z_\tau), \Sigma(z_\tau))\), we condition on the value \(u_\tau\), so the MTT parameters \(\{\mu(z_\tau), \Sigma(z_\tau)\}\) do not factor into the message passing equations.)

D. Combining Messages

To compute the likelihood of Eq. (115), we take the filtered estimates of \(x_{t-1}\) and \(x_t\), combine them with the local dynamics and local likelihood, and marginalize over \(x_{t-1}\) and \(x_t\). To aid in this computation, we consider the exponentiated quadratic form of each term in the integrand of Eq. (115). We then join these terms and use standard Gaussian integration formulas to arrive at the desired likelihood. The derivation of this likelihood closely parallels that of the sequential mode sequence sampler of Appendix C.

Recall the forward filter recursions of Appendix B in terms of the information parameters \(\{\theta_{t-1,t}, \Lambda_{t-1,t}, \theta^f_{t|t}, \Lambda^f_{t|t}\}\), and the backward filter recursions in terms of \(\{\theta_{t+1,t}, \Lambda_{t+1,t}, \theta^b_{t|t}, \Lambda^b_{t|t}\}\), replacing \(\mu(z_t)\) with \(B u_t\) and \(\Sigma(z_t)\) with \(Q\) where appropriate. We may then write the forward message \(m_{t-2,t-1}(x_{t-1})\) updated with the likelihood \(p(y_{t-1} \mid x_{t-1})\) in exponentiated quadratic form as
\[
m_{t-2,t-1}(x_{t-1})\, p(y_{t-1} \mid x_{t-1}) \propto \exp\left\{ -\frac{1}{2} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} C^T R^{-1} C + \Lambda_{t-2,t-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} C^T R^{-1} y_{t-1} + \theta_{t-2,t-1} \\ 0 \end{bmatrix} \right\}.
\]
The local dynamics can similarly be written as
\[
p(x_t \mid x_{t-1}, u_t) \propto \exp\left\{ -\frac{1}{2} \begin{bmatrix} u_t \\ x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} B^T Q^{-1} B & B^T Q^{-1} A & -B^T Q^{-1} \\ A^T Q^{-1} B & A^T Q^{-1} A & -A^T Q^{-1} \\ -Q^{-1} B & -Q^{-1} A & Q^{-1} \end{bmatrix} \begin{bmatrix} u_t \\ x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} u_t \\ x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \right\}.
\]
Finally, the backward message \(m_{t+1,t}(x_t)\) updated with the likelihood \(p(y_t \mid x_t)\) can be written as
\[
p(y_t \mid x_t)\, m_{t+1,t}(x_t) \propto \exp\left\{ -\frac{1}{2} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 & 0 \\ 0 & C^T R^{-1} C + \Lambda_{t+1,t} \end{bmatrix} \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 \\ C^T R^{-1} y_t + \theta_{t+1,t} \end{bmatrix} \right\}.
\]
Using the definitions
\[
\begin{aligned}
\Lambda^b_{t|t} &= C^T R^{-1} C + \Lambda_{t+1,t} & \theta^b_{t|t} &= C^T R^{-1} y_t + \theta_{t+1,t} \\
\Lambda^f_{t|t} &= C^T R^{-1} C + \Lambda_{t-1,t} & \theta^f_{t|t} &= C^T R^{-1} y_t + \theta_{t-1,t},
\end{aligned}
\]
we may express the entire integrand as
\[
m_{t-2,t-1}(x_{t-1})\, p(y_{t-1} \mid x_{t-1})\, p(x_t \mid x_{t-1}, u_t)\, p(y_t \mid x_t)\, m_{t+1,t}(x_t) \propto \exp\left\{ -\frac{1}{2} \begin{bmatrix} u_t \\ x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} B^T Q^{-1} B & B^T Q^{-1} A & -B^T Q^{-1} \\ A^T Q^{-1} B & A^T Q^{-1} A + \Lambda^f_{t-1|t-1} & -A^T Q^{-1} \\ -Q^{-1} B & -Q^{-1} A & Q^{-1} + \Lambda^b_{t|t} \end{bmatrix} \begin{bmatrix} u_t \\ x_{t-1} \\ x_t \end{bmatrix} + \begin{bmatrix} u_t \\ x_{t-1} \\ x_t \end{bmatrix}^T \begin{bmatrix} 0 \\ \theta^f_{t-1|t-1} \\ \theta^b_{t|t} \end{bmatrix} \right\}.
\]
Integrating over \(x_t\), we obtain an expression proportional to \(\mathcal{N}^{-1}\!\left( \begin{bmatrix} u_t \\ x_{t-1} \end{bmatrix}; \theta_{u_t x_{t-1}}, \Lambda_{u_t x_{t-1}} \right)\), with
\[
\Lambda_{u_t x_{t-1}} = \begin{bmatrix} B^T Q^{-1} B & B^T Q^{-1} A \\ A^T Q^{-1} B & A^T Q^{-1} A + \Lambda^f_{t-1|t-1} \end{bmatrix} - \begin{bmatrix} B^T Q^{-1} \\ A^T Q^{-1} \end{bmatrix} \big(Q^{-1} + \Lambda^b_{t|t}\big)^{-1} \begin{bmatrix} Q^{-1} B & Q^{-1} A \end{bmatrix} = \begin{bmatrix} B^T \Sigma_t^{-1} B & B^T \Sigma_t^{-1} A \\ A^T \Sigma_t^{-1} B & A^T \Sigma_t^{-1} A + \Lambda^f_{t-1|t-1} \end{bmatrix}
\]
\[
\theta_{u_t x_{t-1}} = \begin{bmatrix} 0 \\ \theta^f_{t-1|t-1} \end{bmatrix} + \begin{bmatrix} B^T Q^{-1} \\ A^T Q^{-1} \end{bmatrix} \big(Q^{-1} + \Lambda^b_{t|t}\big)^{-1} \theta^b_{t|t} = \begin{bmatrix} B^T Q^{-1} K_t^{-1} \theta^b_{t|t} \\ \theta^f_{t-1|t-1} + A^T Q^{-1} K_t^{-1} \theta^b_{t|t} \end{bmatrix}.
\]
Here, we have defined \(K_t = Q^{-1} + \Lambda^b_{t|t}\) and
\[
\Sigma_t^{-1} = Q^{-1} - Q^{-1}\big(Q^{-1} + \Lambda^b_{t|t}\big)^{-1} Q^{-1} = Q^{-1} - Q^{-1} K_t^{-1} Q^{-1}.
\]
Finally, integrating over \(x_{t-1}\) yields an expression proportional to \(\mathcal{N}^{-1}(u_t; \theta(u_t), \Lambda(u_t))\), with
\[
\begin{aligned}
\Lambda(u_t) &= B^T \Sigma_t^{-1} B - B^T \Sigma_t^{-1} A \big(A^T \Sigma_t^{-1} A + \Lambda^f_{t-1|t-1}\big)^{-1} A^T \Sigma_t^{-1} B \\
\theta(u_t) &= B^T Q^{-1} K_t^{-1} \theta^b_{t|t} - B^T \Sigma_t^{-1} A \big(A^T \Sigma_t^{-1} A + \Lambda^f_{t-1|t-1}\big)^{-1} \big(\theta^f_{t-1|t-1} + A^T Q^{-1} K_t^{-1} \theta^b_{t|t}\big).
\end{aligned}
\]
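The two successive block marginalizations above can be sanity-checked against direct marginalization of the joint Gaussian. The sketch below is our illustration, not the authors' code: `marginalize_to_ut` follows the sequential Schur-complement formulas just derived, while `marginalize_bruteforce` inverts the full joint precision over \((u_t, x_{t-1}, x_t)\) and reads off the \(u_t\) marginal; the check assumes random positive-definite stand-ins and a full-column-rank \(B\).

```python
import numpy as np

def marginalize_to_ut(A, B, Q, Lf, th_f, Lb, th_b):
    """Sequential elimination of x_t then x_{t-1}, as in the derivation above."""
    Qi = np.linalg.inv(Q)
    K = Qi + Lb                                  # K_t = Q^{-1} + Lambda^b_{t|t}
    Si = Qi - Qi @ np.linalg.solve(K, Qi)        # Sigma_t^{-1}
    G = np.linalg.inv(A.T @ Si @ A + Lf)         # inner inverse shared by both formulas
    Lam_u = B.T @ Si @ B - B.T @ Si @ A @ G @ A.T @ Si @ B
    th_u = (B.T @ Qi @ np.linalg.solve(K, th_b)
            - B.T @ Si @ A @ G @ (th_f + A.T @ Qi @ np.linalg.solve(K, th_b)))
    return th_u, Lam_u

def marginalize_bruteforce(A, B, Q, Lf, th_f, Lb, th_b):
    """Direct route: invert the joint precision over (u_t, x_{t-1}, x_t)."""
    Qi = np.linalg.inv(Q)
    P = np.block([[B.T @ Qi @ B, B.T @ Qi @ A,      -B.T @ Qi],
                  [A.T @ Qi @ B, A.T @ Qi @ A + Lf, -A.T @ Qi],
                  [-Qi @ B,      -Qi @ A,            Qi + Lb ]])
    h = np.concatenate([np.zeros(B.shape[1]), th_f, th_b])
    Cov, mean = np.linalg.inv(P), np.linalg.solve(P, h)
    du = B.shape[1]
    Lam_u = np.linalg.inv(Cov[:du, :du])         # marginal information matrix of u_t
    return Lam_u @ mean[:du], Lam_u

rng = np.random.default_rng(2)
dx, du = 4, 2
def spd(d):
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)
A, B = rng.standard_normal((dx, dx)), rng.standard_normal((dx, du))
Q, Lf, Lb = spd(dx), spd(dx), spd(dx)
th_f, th_b = rng.standard_normal(dx), rng.standard_normal(dx)
t1, L1 = marginalize_to_ut(A, B, Q, Lf, th_f, Lb, th_b)
t2, L2 = marginalize_bruteforce(A, B, Q, Lf, th_f, Lb, th_b)
assert np.allclose(L1, L2) and np.allclose(t1, t2)
```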
E. Joining Distributions that Depend on u_t

We have derived two terms which depend on \(u_t\): a prior and a likelihood. Normally, one would consider \(p(u_t \mid \theta_k)\) the prior on \(u_t\). However, marginalizing this parameter induces dependencies between the control inputs, so that all of the \(u_\tau\) drawn from a distribution parameterized by \(\theta_k\) inform the distribution over \(u_t\). We therefore treat \(p(u_t \mid \{u_\tau \mid z_\tau = k, \tau \neq t\})\) as the prior distribution on \(u_t\). The likelihood function \(p(y_{1:T} \mid u_t; u_{\setminus t})\) describes the likelihood of the observation sequence \(y_{1:T}\) given the input sequence \(u_{1:T}\), in which the random variable of interest is \(u_t\). We multiply the prior distribution by the likelihood function to get the following quadratic expression:
\[
\begin{aligned}
p(u_t \mid \{u_\tau \mid z_\tau = k, \tau \neq t\})\, p(y_{1:T} \mid u_t; u_{\setminus t}) &\propto \frac{1}{(2\pi)^{N/2} |\hat{\Sigma}_k|^{1/2}} \exp\left\{ -\frac{1}{2}(u_t - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (u_t - \hat{\mu}_k) - \frac{1}{2}\big(u_t - \Lambda(u_t)^{-1}\theta(u_t)\big)^T \Lambda(u_t) \big(u_t - \Lambda(u_t)^{-1}\theta(u_t)\big) \right\} \\
&= \frac{1}{(2\pi)^{N/2} |\hat{\Sigma}_k|^{1/2}} \exp\left\{ -\frac{1}{2}\Big[ u_t^T \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big) u_t - 2 u_t^T \big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big) + \hat{\mu}_k^T \hat{\Sigma}_k^{-1} \hat{\mu}_k + \theta(u_t)^T \Lambda(u_t)^{-1} \theta(u_t) \Big] \right\} \\
&= \frac{(2\pi)^{N/2} \big|\big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\big|^{1/2}}{(2\pi)^{N/2} |\hat{\Sigma}_k|^{1/2}} \exp\left\{ -\frac{1}{2}\Big[ \hat{\mu}_k^T \hat{\Sigma}_k^{-1} \hat{\mu}_k + \theta(u_t)^T \Lambda(u_t)^{-1} \theta(u_t) - \big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big)^T \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1} \big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big) \Big] \right\} \\
&\qquad \cdot \mathcal{N}\!\Big(u_t;\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big),\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\Big) \\
&\triangleq C_k \cdot \mathcal{N}\!\Big(u_t;\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big),\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\Big),
\end{aligned} \tag{116}
\]
where we note that the defined constant \(C_k\) is a function of \(z_t = k\), but not of \(u_t\).

F. Resulting (u_t, z_t) Sampling Distributions

We write Eq. (110) and Eq. (111) in terms of the derived distributions:
\[
p(z_t = k \mid z_{\setminus t}, u_{\setminus t}, y_{1:T}, \beta, \alpha, \kappa, \lambda) \propto p(z_t = k \mid z_{\setminus t}, \beta, \alpha, \kappa) \int_{U_t} p(u_t \mid \{u_\tau \mid z_\tau = k, \tau \neq t\})\, p(y_{1:T} \mid u_t; u_{\setminus t})\, du_t, \tag{117}
\]
\[
p(u_t \mid z_t = k, z_{\setminus t}, u_{\setminus t}, y_{1:T}, \lambda) \propto p(u_t \mid \{u_\tau \mid z_\tau = k, \tau \neq t\})\, p(y_{1:T} \mid u_t; u_{\setminus t}). \tag{118}
\]
Thus, the distribution over \(z_t\), marginalizing \(u_t\), is given by
\[
p(z_t = k \mid z_{\setminus t}, u_{\setminus t}, y_{1:T}, \beta, \alpha, \kappa, \lambda) \propto p(z_t = k \mid z_{\setminus t}, \beta, \alpha, \kappa) \int_{U_t} C_k \cdot \mathcal{N}\!\Big(u_t;\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big),\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\Big)\, du_t \propto C_k \cdot p(z_t = k \mid z_{\setminus t}, \beta, \alpha, \kappa), \tag{119}
\]
and the distribution over \(u_t\) (for \(z_t = k\) fixed) is
\[
p(u_t \mid z_t = k, z_{\setminus t}, u_{\setminus t}, y_{1:T}, \lambda) = \mathcal{N}\!\Big(u_t;\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\big(\hat{\Sigma}_k^{-1}\hat{\mu}_k + \theta(u_t)\big),\, \big(\hat{\Sigma}_k^{-1} + \Lambda(u_t)\big)^{-1}\Big). \tag{120}
\]
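Eqs. (119)-(120) suggest a compact implementation of the joint \((z_t, u_t)\) draw. The following is our sketch under stated assumptions: `mu_hat` and `Sigma_hat` collect \((\hat{\mu}_k, \hat{\Sigma}_k)\) from Eq. (113), `th_u` and `Lam_u` are \(\theta(u_t)\) and \(\Lambda(u_t)\) from Sec. D, and `log_crf_prior` is a hypothetical stand-in for the sticky HDP-HMM predictive probabilities of Sec. A. A term of \(\log C_k\) that is constant in \(k\) is dropped, since it cancels in the normalization of Eq. (119).

```python
import numpy as np

def sample_zt_ut(mu_hat, Sigma_hat, th_u, Lam_u, log_crf_prior, rng):
    """Draw (z_t, u_t) per Eqs. (119)-(120).

    mu_hat, Sigma_hat : per-mode (mu_hat_k, Sigma_hat_k) from Eq. (113)
    th_u, Lam_u       : theta(u_t), Lambda(u_t) from Sec. D
    log_crf_prior     : log p(z_t = k | z_{\\t}, beta, alpha, kappa) per mode
    """
    K = len(mu_hat)
    log_w, post = np.empty(K), []
    for k in range(K):
        Si_k = np.linalg.inv(Sigma_hat[k])
        P = Si_k + Lam_u                         # posterior precision, Eq. (120)
        b = Si_k @ mu_hat[k] + th_u
        m = np.linalg.solve(P, b)
        post.append((m, np.linalg.inv(P)))
        # log C_k up to a k-independent constant: the theta^T Lambda^{-1} theta
        # term of Eq. (116) does not depend on k and is omitted here.
        _, logdet_P = np.linalg.slogdet(P)
        _, logdet_S = np.linalg.slogdet(Sigma_hat[k])
        quad = mu_hat[k] @ Si_k @ mu_hat[k] - b @ m
        log_w[k] = log_crf_prior[k] - 0.5 * (logdet_P + logdet_S + quad)
    w = np.exp(log_w - log_w.max())              # Eq. (119), normalized below
    k = rng.choice(K, p=w / w.sum())
    mean_k, cov_k = post[k]
    return k, rng.multivariate_normal(mean_k, cov_k)
```

Note that the posterior mean and covariance for each candidate mode are computed once and reused for the \(u_t\) draw, so the hybrid step costs one matrix solve per instantiated mode.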