Mode Clustering for Markov Jump Systems
Authors: Zhe Du, Necmiye Ozay, Laura Balzano
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, USA
{zhedu,necmiye,girasole}@umich.edu

Abstract

In this work, we consider the problem of mode clustering in Markov jump models. This model class consists of multiple dynamical modes with a switching sequence that determines how the system switches between them over time. Under different active modes, the observations can have different characteristics. Given the observations only and without knowing the mode sequence, the goal is to cluster the modes based on their transition distributions in the Markov chain, so as to find a reduced-rank Markov matrix that is embedded in the original Markov chain. Our approach involves mode sequence estimation, mode clustering, and reduced-rank model estimation, where mode clustering is achieved by applying the singular value decomposition and k-means. We show that, under certain conditions, the clustering error can be bounded, and the reduced-rank Markov chain is a good approximation to the original Markov chain. Through simulations, we show the efficacy of our approach and its application to real-world scenarios.

1 Introduction

Modeling dynamic systems has been a problem of great interest in the signal processing and control communities for decades. Many real-world phenomena cannot be described with one dynamical model, and so switched models, wherein the dynamics transition between different system models, have been studied and applied widely. In human-made systems, for example, a robot may have different dynamics under different battery levels or when different modules within the robot fail.
In nature, temperature and humidity fluctuate differently under different weather conditions; brain electrical signals behave differently under different emotions of the test subject. Note that in all these examples, the modes can switch over time. To model this switching, one systematic and probabilistic way is to assume that the mode switching follows a Markov chain, where future modes do not depend on past modes given the most recent mode. This Markov jump model [1, 2] has been used in power systems, air traffic management, economics, and communication systems [3-8].

∗ N. Ozay and Z. Du were supported by ONR grant N00014-18-1-2501; L. Balzano and Z. Du were supported by AFOSR YIP award FA9550-19-1-0026; L. Balzano was additionally supported by NSF BIGDATA award IIS-1838179 and NSF CAREER award CCF-1845076.

A key challenge for such models is model compactness: how does one represent such a complicated dynamical system with as simple a model as possible? For example, modes like weather conditions and human emotions have extremely complex underlying dynamics with strong correlations over time. To satisfy the Markov property, one may concatenate underlying modes into a single Markov state, and Markov chains built in this way will have a state space that grows exponentially with the number of modes concatenated in the sequence. The same exponential growth rate applies when one models human-made systems with multiple sub-modules that each have multiple behavior modes (normal/abnormal). Allowing the Markov model to get extremely large is computationally inefficient for analysis and control. Prior work studying model reduction of Markov jump models does not consider reduction of the discrete state space, i.e.
(reduction of the number of Markovian modes), and prior work on state-space reduction of Markov chains does not further consider Markov jump models. There have been several works studying the aggregation of states for Markov chains, which mainly rely on assumptions such as strong/weak lumpability or aggregatability properties of a Markov chain [9-12]. There is therefore significant potential in applying the abundant algorithms and theory from Markov chain aggregation to Markov jump systems. This can achieve model reduction from a new perspective and will benefit the analysis and control of systems, especially large ones.

The work presented here addresses this gap. We observe that oftentimes certain modes have similar transition behaviors, and these correlations between the modes can be exploited to construct a reduced-order model. By doing so, one may gain more insight into the nature of the complex model. Moreover, we will have fewer parameters to estimate and fewer control variables to design when learning and controlling the model, which may significantly reduce the computational burden. We are interested in situations where the bottleneck is due to a large discrete state space (i.e., a large number of modes) and aim to cluster and aggregate the modes for reduction. We achieve this model aggregation by clustering modes with similar transition distributions together. We assume the dynamics for each mode are known, but we have no knowledge of the true mode sequence. In our approach, we cluster based on a reduced-dimension representation of the empirical Markov transition matrix. We then re-estimate the empirical Markov matrix using this cluster information, giving us a final low-rank estimate.
We discuss our method's computational advantage, and we show that our approach has guaranteed performance in the sense that the clustering error and the difference between the reduced model and the true model can be upper bounded. Experiments show the efficacy of our approach as well as how the performance scales with the problem complexity.

1.1 Prior Work

Previous work on Markov jump systems includes: analysis of stability and stabilization [13], analysis of systems with time delays [14], optimal control [15], robust control [16], H∞ filtering [17], etc. In the context of model reduction, prior work mainly focuses on the reduction of the continuous state space (or observation space): [18] studies H∞ model reduction and derives conditions under which a reduced-order system can be obtained via linear matrix inequalities; [19] reduces the model order with the help of generalized dissipation inequalities and storage functions; [20] proposes a balanced truncation algorithm to reduce the model order and gives an upper bound on the approximation error. To the best of our knowledge, however, the reduction of the discrete state space (number of modes) for Markov jump systems has not been considered before.

2 Problem Formulation

2.1 Notation

In this paper, boldface uppercase (lowercase) letters denote matrices (vectors); plain letters denote scalars. If A is a matrix, then A(i, j) indexes the (i, j)th element of A, and A(i, j:k) indexes the row vector corresponding to the ith row and columns j through k. A(i, :) indexes the ith row of A. Norms without subscript, i.e. ‖·‖, denote the ℓ2-norm. We let [n] := {1, 2, . . . , n} and X_{0:N} := {X_i}_{i=0}^N. For a Markov chain with state space [n] and row-stochastic transition matrix P ∈ R^{n×n}, we let π ∈ R^n denote the stationary distribution vector of P, i.e. π^T P = π^T. Furthermore, we let π_max := max_i π_i and π_min := min_i π_i.
If P is ergodic, then π is unique and π_min > 0. Let π_t ∈ R^n denote the transient state distribution of P, with π_t^T = π_{t-1}^T P. We denote by {Ω_1, . . . , Ω_r} a partition of the state space [n], where each Ω_k denotes a cluster of states. We let Ω_(i) denote the cluster with the ith largest cardinality.

2.2 Preliminaries

The Markov switched model we consider has the following form:

    y_t = Σ_{i=1}^{n_a} a_i(X_t) y_{t-i} + Σ_{j=1}^{n_c} c_j(X_t) u_{t-j} + n_t,    (1)
    X_{0:N} ∈ [n]^{N+1} ∼ MarkovChain(P),    (2)

where y_t, u_t, n_t are scalars representing the model output, input, and noise at time t, respectively, and y_t depends on {y_{t-i}}_{i=1}^{n_a}, {u_{t-j}}_{j=1}^{n_c} linearly through the parameters {a_i(X_t)}_{i=1}^{n_a}, {c_j(X_t)}_{j=1}^{n_c} of the mode X_t at time t. There are n modes in total, and the mode sequence X_{0:N} is assumed to follow a Markov chain with row-stochastic Markov matrix P ∈ R^{n×n}. The initial state distribution π_0 can be arbitrary. Note that one can omit the input u_t by taking n_c = 0, which corresponds to an autonomous model. If we let

    w_{X_t} := [a_1(X_t), . . . , a_{n_a}(X_t), c_1(X_t), . . . , c_{n_c}(X_t)]^T,    (3)
    φ_t := [y_{t-1}, . . . , y_{t-n_a}, u_{t-1}, . . . , u_{t-n_c}]^T,    (4)

then we obtain a simpler representation of the model:

    y_t = w_{X_t}^T φ_t + n_t,    (5)

where the pair {y_t, φ_t} can be viewed as the observation/data. Furthermore, we assume the Markov matrix P has the following structure:

    P = P̄ + Δ,    (6)

where P̄ is a Markov matrix that is r-aggregatable, i.e. there exists an r-cluster partition {Ω_1, Ω_2, . . . , Ω_r} of the state space [n] such that

    ∀k ∈ [r], ∀i, j ∈ Ω_k:  P̄(i, :) = P̄(j, :).    (7)

We assume rank(P̄) = r, which guarantees there are exactly r unique rows in P̄. The matrix Δ is the perturbation that accounts for the difference between the true Markov matrix P and the r-aggregatable Markov matrix P̄.
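To make the model concrete, the following is a minimal simulation sketch of (1)-(5). All parameter values and variable names here are our own arbitrary choices for illustration, not those used in the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance: n modes, AR order n_a, input order n_c.
n, n_a, n_c, N = 4, 2, 1, 200

# Mode parameters w_k = [a_1, ..., a_{n_a}, c_1, ..., c_{n_c}] (eq. (3)),
# drawn small enough that each mode is stable.
W = rng.uniform(-0.4, 0.4, size=(n, n_a + n_c))

# Row-stochastic Markov matrix P governing the mode sequence (eq. (2)).
P = rng.dirichlet(np.ones(n), size=n)

# Sample the mode sequence X_{0:N} from the Markov chain.
X = np.zeros(N + 1, dtype=int)
X[0] = rng.integers(n)
for t in range(1, N + 1):
    X[t] = rng.choice(n, p=P[X[t - 1]])

# Simulate y_t = w_{X_t}^T phi_t + n_t (eqs. (4)-(5)).
u = rng.standard_normal(N + 1)
y = np.zeros(N + 1)
for t in range(max(n_a, n_c), N + 1):
    # phi_t = [y_{t-1}, ..., y_{t-n_a}, u_{t-1}, ..., u_{t-n_c}]^T
    phi = np.concatenate([y[t - n_a:t][::-1], u[t - n_c:t][::-1]])
    y[t] = W[X[t]] @ phi + 0.01 * rng.standard_normal()
```

The pair (y_t, φ_t), together with the known dynamics W, is exactly the data the clustering problem below starts from.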
Note that so far we only assume modes are clustered based on the similarity of their transition distributions; in future work we will take the mode dynamics and group connectivity into account.

2.3 Problem Formulation

Assuming the parameters of all the modes {w_k}_{k=1}^n are known, given an observation trajectory {y_t, u_t}_{t=0}^N of length N, we want to find an r-aggregatable approximation P̃ of P such that the partition information in P̃ recovers {Ω_1, Ω_2, . . . , Ω_r} in P̄. In other words, we seek an r-aggregatable approximation of the original Markov matrix that preserves the clustering information in the underlying aggregatable Markov matrix.

Given a Markov chain, one can use the power method [21] to iteratively simulate the evolution of the state distribution or to compute the stationary distribution. One motivation for solving the aforementioned problem is that the power method requires O(n²) scalar multiplications per iteration for P but only O(rn) for the r-aggregatable P̃. Meanwhile, the loss in accuracy incurred by this reduction in computation can be upper bounded with the following theorem.

Theorem 1. The difference between two Markov matrices P and P̃ in terms of stationary distributions satisfies

    ‖π − π̃‖_1 ≤ Σ_{i=2}^n [1 / (1 − λ_i(P))] ‖P − P̃‖_∞.    (8)

Furthermore, if P and P̃ are both ergodic, their transient distributions π_t and π̃_t satisfy

    ‖π_t − π̃_t‖_1 ≤ C ρ^t + ‖π − π̃‖_1    (9)

for some C > 0 and 0 < ρ < 1.

We can see that as long as the approximation error ‖P − P̃‖_∞ is upper bounded, the differences in stationary and transient behavior between the true Markov matrix P and the r-aggregatable approximation P̃ are bounded. This justifies using P̃ as a surrogate for P in the power method. The distance ‖P − P̃‖_∞ for the P̃ obtained by our approach is bounded in Theorem 4.
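The computational saving for the power method can be seen directly: when the rows of P̃ are shared within clusters, the step π^T P̃ only needs the total probability mass per cluster plus a combination of the r unique rows. A sketch on a hypothetical instance (names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 12, 3

# Hypothetical r-aggregatable matrix: each row is one of r shared rows.
labels = rng.integers(r, size=n)          # cluster assignment for each state
C = rng.dirichlet(np.ones(n), size=r)     # the r unique rows (r x n)
P_tilde = C[labels]                       # full n x n aggregatable matrix

pi = np.full(n, 1.0 / n)

# Dense power-method step: O(n^2) scalar multiplications.
dense_step = pi @ P_tilde

# Aggregated step: sum the mass per cluster (O(n)), then mix the r unique
# rows (O(rn)); identical result at a fraction of the cost.
mass = np.bincount(labels, weights=pi, minlength=r)
fast_step = mass @ C

assert np.allclose(dense_step, fast_step)
```

The identity holds because π^T P̃ = Σ_k (Σ_{i∈Ω_k} π_i) C(k, :), which is exactly what the aggregated step computes.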
3 Our Approach

Our approach to the problem above is given in Algorithm 1. In Line 3, we estimate the active mode at time t by picking the mode whose dynamics give the smallest residual error |y_t − w_k^T φ_t|. Then, in Line 5, based on the estimated mode sequence, we estimate P with the empirical Markov matrix P̂, in which the transition probability from mode i to mode j is estimated by the frequency of the transition pair (i, j) relative to mode i. In Line 7, we take the SVD of P̂ and keep the first r singular value components. This is essentially a denoising step that reduces the influence of the perturbation Δ and the estimation error in P̂, and the resulting U_r is a dimension-reduced representation of P̂ that bears the low-rank structure of P̄. Then, we use k-means to estimate the clustering information in P̄. Finally, in Line 9, we compute P̃ by treating the modes within the same estimated cluster as a single mode and re-computing the empirical Markov matrix. Note that if a certain mode does not show up at all in the trajectory, i.e. the denominators in Line 5 and Line 9 might be 0, then we simply assign the uniform distribution to that mode, i.e. P̂(i, j) = 1/n. We show in the proof that when the trajectory is long enough, every mode shows up with high probability.

4 Theoretical Guarantees

4.1 Relevant Definitions

Before discussing theoretical guarantees of the proposed approach, we introduce some definitions that will be used later.

Definition 1 (Mixing Time of a Markov Chain). Let P ∈ R^{n×n} be a row-stochastic Markov transition matrix with stationary distribution π. Then for all ε > 0, the ε-mixing time is defined as

    τ(ε) = min{ k : max_{i∈[n]} (1/2) ‖(P^k)(i, :)^T − π‖_1 ≤ ε }.    (13)

Moreover, we let τ* = τ(1/4).

Algorithm 1: Mode Clustering for Markov Jump Models
Input: observations {y_t, u_t}_{t=0}^N, dynamics {w_k}_{k=1}^n
1 for t = 0, . . .
, N do
2     φ_t := [y_{t-1}, . . . , y_{t-n_a}, u_{t-1}, . . . , u_{t-n_c}]^T
3     X̂_t = argmin_{k∈[n]} |y_t − w_k^T φ_t|
4 end
5 Compute the empirical Markov matrix:

    P̂(i, j) = [ Σ_{t=1}^N 1{X̂_{t-1} = i, X̂_t = j} ] / [ Σ_{t=1}^N 1{X̂_{t-1} = i} ]    (10)

6 SVD: P̂ = UΣV^T
7 U_r = U(:, 1:r)
8 Solve the following k-means problem:

    Ω̂_{1:r}, ĉ_{1:r} = argmin_{Ω̂_1,...,Ω̂_r, ĉ_1,...,ĉ_r} Σ_{k=1}^r Σ_{i∈Ω̂_k} ‖U_r(i, :) − ĉ_k‖²    (11)

9 Aggregatable approximation: for i ∈ Ω̂_s,

    P̃(i, j) = [ Σ_{k∈Ω̂_s} Σ_{t=1}^N 1{X̂_{t-1} = k, X̂_t = j} ] / [ Σ_{k∈Ω̂_s} Σ_{t=1}^N 1{X̂_{t-1} = k} ]    (12)

Output: partition {Ω̂_1, . . . , Ω̂_r} and matrix P̃

Since k-means is used in Algorithm 1, we assume a (1 + ε) solution to the k-means problem can be obtained, and we later show how ε affects the overall clustering error.

Definition 2 (Approximate Solution to the k-means Clustering Problem). For the problem in (11), we say Ω̂_1, . . . , Ω̂_r, ĉ_1, . . . , ĉ_r is a (1 + ε) solution if

    Σ_{s=1}^r Σ_{i∈Ω̂_s} ‖U_r(i, :) − ĉ_s‖² ≤ (1 + ε) min_{Ω_1,...,Ω_r, c_1,...,c_r} Σ_{s=1}^r Σ_{i∈Ω_s} ‖U_r(i, :) − c_s‖².    (14)

Definition 3 (Misclustering Rate). Let {Ω_1, Ω_2, . . . , Ω_r} be the underlying true clustering partition of [n] and {Ω̂_1, Ω̂_2, . . . , Ω̂_r} be an estimate of the true partition. We define the misclustering rate of {Ω̂_1, Ω̂_2, . . . , Ω̂_r} as

    MR(Ω̂_1, Ω̂_2, . . . , Ω̂_r) = min_{k∈K} Σ_{j=1}^r |{ i : i ∈ Ω_j, i ∉ Ω̂_{k(j)} }| / |Ω_j|,    (15)

where K is the set of all bijections from [r] to [r].

Since the partition is invariant to the labels of the clusters, when we evaluate the misclustering rate we compute the error under the best label matching, which is why we need K. Note that in (15) each summand has numerator no larger than its denominator, so MR(Ω̂_1, Ω̂_2, . . . , Ω̂_r) ≤ r trivially.
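Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: all names are our own, and the k-means step is a plain Lloyd iteration rather than a certified (1 + ε)-approximate solver.

```python
import numpy as np

def cluster_modes(y, u, W, n, r, n_a, n_c, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1: estimate the mode sequence, build the
    empirical Markov matrix, denoise with a rank-r SVD, cluster the rows
    with k-means, and re-estimate an r-aggregatable transition matrix."""
    N = len(y) - 1
    # Lines 1-4: residual-based mode estimates X_hat_t.
    X_hat = np.zeros(N + 1, dtype=int)
    for t in range(max(n_a, n_c), N + 1):
        phi = np.concatenate([y[t - n_a:t][::-1], u[t - n_c:t][::-1]])
        X_hat[t] = np.argmin(np.abs(y[t] - W @ phi))
    # Line 5: empirical Markov matrix (eq. (10)); unvisited modes get the
    # uniform distribution, as in the text.
    counts = np.zeros((n, n))
    for t in range(1, N + 1):
        counts[X_hat[t - 1], X_hat[t]] += 1
    row = counts.sum(axis=1, keepdims=True)
    P_hat = np.where(row > 0, counts / np.maximum(row, 1), 1.0 / n)
    # Lines 6-7: first r left singular vectors of P_hat.
    U, _, _ = np.linalg.svd(P_hat)
    U_r = U[:, :r]
    # Line 8: plain Lloyd's k-means on the rows of U_r (a stand-in for
    # any (1 + eps)-approximate k-means solver).
    centers = U_r[rng.choice(n, r, replace=False)]
    for _ in range(50):
        d = np.linalg.norm(U_r[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(r):
            if np.any(labels == k):
                centers[k] = U_r[labels == k].mean(axis=0)
    # Line 9: r-aggregatable re-estimate (eq. (12)) — identical rows
    # within each estimated cluster.
    P_tilde = np.zeros((n, n))
    for k in range(r):
        idx = labels == k
        agg = counts[idx].sum(axis=0)
        P_tilde[idx] = agg / agg.sum() if agg.sum() > 0 else 1.0 / n
    return labels, P_tilde
```

By construction, every row of the returned P_tilde within an estimated cluster is identical, so P_tilde is aggregatable with respect to the estimated partition.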
4.2 Main Results

Let N_0 := Σ_{t=0}^{N−1} 1{X̂_t ≠ X_t} denote the number of mistakes in the estimated mode sequence and η := N_0 / N denote the mistake rate. In the following analyses, Lemma 2 gives conditions under which N_0 = 0. Theorem 3 and Theorem 4 give upper bounds on the misclustering rate and the approximation error, respectively.

Lemma 2. Assume that for all t, |n_t| < n_max, and that for all j ∈ [n] \ {X_t},

    |φ_t^T (w_{X_t} − w_j)| > 2 n_max,    (16)

then the sequence estimated in Line 3 of Algorithm 1 is correct, i.e. N_0 = 0.

When n_t = 0, the dynamics given in (5) define a hyperplane plus noise. Data points at the intersection of these hyperplanes (a set of measure zero in the noiseless case) are not useful for distinguishing the mode. Condition (16) essentially means that such data points do not exist.

Theorem 3. Assume: (i) the framework in Section 2.2 holds; (ii) P is ergodic; (iii) {Ω̂_1, . . . , Ω̂_r} is a (1 + ε_1) solution to the k-means problem; (iv) ‖Δ‖ ≤ β, where

    β := σ_r(P̄) / [ 8 √((2 + ε_1) r) ( √( r |Ω_(1)| / |Ω_(r)| ) + 1 ) ];

and (v) the mistake rate satisfies η < π_min / 2. Then for all ε_2 > 0, let

    ε̃_2 = min{ ε_2,  π_min/2 − η,  π_min (β − ‖Δ‖) / [ 4 (σ_1(P̄) + ‖Δ‖) ] }.

If

    N ≥ 200 τ* π_max ε̃_2^{−2} log(ε̃_2^{−1}) [ log(24 n τ*) + log log(ε̃_2^{−1}) ],

then with probability no less than

    1 − exp( − N / [ 200 τ* π_max ε̃_2^{−2} log(ε̃_2^{−1}) ] ),    (17)

we have

    MR(Ω̂_1, Ω̂_2, . . . , Ω̂_r) ≤ 64 (2 + ε_1) r [ ‖Δ‖ / σ_r(P̄) + 4 (ε_2 + 1.5 η)(‖Δ‖ + ‖P̄‖) / (π_min σ_r(P̄)) ]².    (18)

In Theorem 3, the ergodicity condition on the Markov matrix P and the mistake-rate condition η < π_min/2 guarantee that P can be learned well from a single trajectory. When ε_2 is small enough, ε̃_2 becomes ε_2, which makes the probability and trajectory-length lower bounds more interpretable.
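For small r, the misclustering rate of Definition 3 can be evaluated exactly by enumerating all label bijections; a sketch (function and variable names are our own):

```python
from itertools import permutations

def misclustering_rate(true_parts, est_parts):
    """MR from Definition 3: minimize, over all bijections [r] -> [r],
    the sum of per-cluster misclustered fractions."""
    r = len(true_parts)
    best = float("inf")
    for perm in permutations(range(r)):  # each perm is a bijection k
        err = sum(len(true_parts[j] - est_parts[perm[j]]) / len(true_parts[j])
                  for j in range(r))
        best = min(best, err)
    return best

# Toy check: one point of the second true cluster is misplaced.
true_parts = [{0, 1, 2}, {3, 4, 5}]
est_parts = [{0, 1, 2, 3}, {4, 5}]
print(misclustering_rate(true_parts, est_parts))  # 0.3333... (= 1/3)
```

Enumerating bijections costs r!, which is fine for the small r considered here; for large r, the same matching can be done in polynomial time with an assignment solver.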
Through further inspection of Theorem 3, we can see that the bounds improve as any of the following decreases: the number of modes n, the number of clusters r, the perturbation ‖Δ‖, the mixing time τ*, the condition number σ_1(P̄)/σ_r(P̄), and the disparities in the stationary distribution and cluster populations, namely π_max/π_min and |Ω_(1)|/|Ω_(r)|. The disparities play a role here because, as they increase, certain modes or clusters may be dominated by the others and become less likely to show up in the data. These modes are then learned less well by the algorithm, and the estimation and clustering errors increase accordingly.

Theorem 4. Under the same conditions as Theorem 3, if MR = 0, then with the same probability lower bound we have

    ‖P − P̃‖_∞ ≤ 12 √n π_min^{−1} σ_1(P) (ε_2 + 1.5 η) + 2 ‖Δ‖_∞.    (19)

Theorem 4 gives the upper bound on the approximation error of P̃, which can be used to upper bound the differences in stationary and transient behavior via Theorem 1. The limitation of the theorem is that the result holds only when the clustering error is 0.

5 Experiments

5.1 Synthetic Data

We first study the performance of our approach on synthetic data. In the Markov jump model, we let n_a = 3, n_c = 2, and the number of modes n = 50. For each mode, the dynamics are generated by sampling its poles uniformly on (−1, 1). We let the input u_t ∼ N(0, 1) and the noise n_t ∼ Unif(−n_max, n_max). The state space [n] is partitioned into r clusters Ω_{1:r} randomly such that every possible partition is sampled with equal probability. The mode transition probabilities P̄(Ω_k, :) for every k and the initial mode distribution π_0 are sampled from the uniform Dirichlet distribution. The error metrics we evaluate are: (i) the clustering error CE = n^{−1} min_{k∈K} Σ_{j=1}^r |{i : i ∈ Ω_j, i ∉ Ω̂_{k(j)}}|, where K is given in Definition 3; (ii) ‖π̃ − π‖_1, i.e.
the difference between P̃ and P in terms of stationary distributions. For each parameter setup, we record the average of these two metrics over 100 experiments.

5.1.1 Without Perturbation (Δ = 0)

We first evaluate how the performance depends on the number of clusters r and the noise magnitude n_max. We set the perturbation Δ = 0 for these test cases. The experimental results are given in Fig. 1(a-d). We set n_max = 0.1 in Fig. 1(a-b) and r = 6 in Fig. 1(c-d).

5.1.2 With Perturbation (Δ ≠ 0)

In this test case, we fix n = 50, r = 6, n_max = 0.05, N = 10^5. The space of Δ is a polytope, which makes it difficult to sample uniformly, so instead, for i ∈ Ω_k, we sample P(i, :) from a Dirichlet distribution with parameters α P̄(Ω_k, :) and record Δ = P − P̄. In this case, E[P(i, :)] = P̄(Ω_k, :), and α controls how much P(i, :) deviates from P̄(Ω_k, :). We sweep α and use the scatter plots in Fig. 1(e-f) to show how the error metrics vary with ‖Δ‖.

5.2 Practically Motivated Example: Patrol Robot

Now we consider a more realistic case involving a Markov jump system that can potentially benefit from our approach. Assume that in a region we have n stations, each at position p_i ∈ R, and at time t there is only one active station s_t that generates requests; the sequence of active stations s_{0:t} follows a Markov chain P. A robot with position x_t ∈ R at time t aims to reach the active station as quickly and as closely as possible. Assuming the dynamics and control law of the robot are given by

    x_{t+1} = x_t + u_t + n_t,    u_t = K (p_{s_t} − x_t),    (20)

the closed-loop dynamics take the form

    x_{t+1} = (1 − K) x_t + K p_{s_t} + n_t,    (21)

which is a Markov jump model.
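A minimal simulation sketch of the closed-loop model (20)-(21), on a small hypothetical instance rather than the n = 50 setup of the experiment (names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical small instance of the patrol-robot model.
n, K, T = 5, 0.7, 300
p = np.arange(1.0, n + 1.0)              # station positions p_i = i
P = rng.dirichlet(np.ones(n), size=n)    # active-station Markov chain

s = np.zeros(T + 1, dtype=int)           # active station s_t
x = np.zeros(T + 1)                      # robot position x_t
for t in range(T):
    # Closed-loop update x_{t+1} = (1 - K) x_t + K p_{s_t} + n_t.
    x[t + 1] = (1 - K) * x[t] + K * p[s[t]] + 0.1 * rng.standard_normal()
    s[t + 1] = rng.choice(n, p=P[s[t]])
```

Here the mode variable is the active station, and the per-mode parameters in the notation of (5) are w_k = [1 − K, K p_k]^T with φ_t = [x_t, 1]^T, so the clustering pipeline applies directly to the trajectory (x_t).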
In this setting, if the underlying Markov chain is aggregatable to some extent, we can use our approach to uncover the corresponding partition of modes as well as find an approximation of the Markov transition matrix whose stationary distribution is easier to compute. Understanding the similarities between the stations' activation schedules can be useful for designing improved control strategies for the robot. In the experiment, we set n = 50, p_i = i, K = 0.7, n_t ∼ N(0, 0.1), N = 10^6, and sample P̄ and P as in Section 5.1.1. Averaged over 100 runs, the clustering error is CE = 0.04 and ‖π̃ − π‖_1 = 0.07.

[Figure 1: Performance vs: (a, b) N and r; (c, d) N and n_max; (e, f) ‖Δ‖.]

6 Conclusions & Future Work

In this paper, we consider the problem of model aggregation for Markov jump systems from the perspective of clustering the modes based on their transition distributions. The proposed approach has a guaranteed upper bound on the clustering error and exhibits decent performance in the experiments.
There are several interesting directions for future work: (i) we will investigate how lumpable Markov chains can help reformulate the model reduction problem; (ii) in the algorithm, after obtaining an estimate of the Markov transition matrix, one might use it to get a better estimate of the mode sequence, so several iterations alternating between estimating the switching sequence and the Markov transition matrix may make both estimates more accurate; (iii) after the mode clustering, it is worth investigating whether a single mode could characterize the switching dynamics of all the modes within a cluster, so that we could truly reduce the number of modes in the model.

References

[1] V. Gupta, R. M. Murray, L. Shi, and B. Sinopoli, "Networked sensing, estimation and control systems," California Institute of Technology Report, 2009.
[2] P. Shi and F. Li, "A survey on Markovian jump systems: modeling and design," International Journal of Control, Automation and Systems, vol. 13, no. 1, pp. 1-16, 2015.
[3] V. Ugrinovskii and H. R. Pota, "Decentralized control of power systems via robust control of uncertain Markov jump parameter systems," International Journal of Control, vol. 78, no. 9, pp. 662-677, 2005.
[4] K. Loparo and F. Abdel-Malek, "A probabilistic approach to dynamic power system security," IEEE Transactions on Circuits and Systems, vol. 37, no. 6, pp. 787-798, 1990.
[5] W. Liu and I. Hwang, "Probabilistic trajectory prediction and conflict detection for air traffic control," Journal of Guidance, Control, and Dynamics, vol. 34, no. 6, pp. 1779-1789, 2011.
[6] K. Gopalakrishnan and H. Balakrishnan, "A comparative analysis of models for predicting delays in air traffic networks," ATM Seminar, 2017.
[7] L. E. Svensson, N. Williams et al., "Optimal monetary policy under uncertainty: A Markov jump-linear-quadratic approach," Federal Reserve Bank of St. Louis Review, vol. 90, no. 4, pp. 275-293, 2008.
[8] Y.
Zhu, Z. Zhong, W. X. Zheng, and D. Zhou, "HMM-based H∞ filtering for discrete-time Markov jump LPV systems over unreliable communication channels," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 12, pp. 2035-2046, 2017.
[9] J. Sanders, A. Proutière, and S.-Y. Yun, "Clustering in block Markov chains," arXiv preprint arXiv:1712.09232, 2017.
[10] A. Ganguly, T. Petrov, and H. Koeppl, "Markov chain aggregation and its applications to combinatorial reaction networks," Journal of Mathematical Biology, vol. 69, no. 3, pp. 767-797, 2014.
[11] K. Deng, Y. Sun, P. G. Mehta, and S. P. Meyn, "An information-theoretic framework to aggregate a Markov chain," in 2009 American Control Conference. IEEE, 2009, pp. 731-736.
[12] A. Zhang and M. Wang, "State compression of Markov processes via empirical low-rank estimation," arXiv preprint arXiv:1802.02920, 2018.
[13] L. Zhang and E.-K. Boukas, "Stability and stabilization of Markovian jump linear systems with partly unknown transition probabilities," Automatica, vol. 45, no. 2, pp. 463-468, 2009.
[14] L. Zhang, E.-K. Boukas, and J. Lam, "Analysis and synthesis of Markov jump linear systems with time-varying delays and partially known transition probabilities," IEEE Transactions on Automatic Control, vol. 53, no. 10, pp. 2458-2464, 2008.
[15] O. L. Costa and M. D. Fragoso, "Discrete-time LQ-optimal control problems for infinite Markov jump parameter systems," IEEE Transactions on Automatic Control, vol. 40, no. 12, pp. 2076-2088, 1995.
[16] J. Dong and G.-H. Yang, "Robust H2 control of continuous-time Markov jump linear systems," Automatica, vol. 44, no. 5, pp. 1431-1436, 2008.
[17] A. P. Gonçalves, A. R. Fioravanti, and J. C. Geromel, "H∞ filtering of discrete-time Markov jump linear systems through linear matrix inequalities," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1347-1351, 2009.
[18] L.
Zhang, B. Huang, and J. Lam, "H∞ model reduction of Markovian jump linear systems," Systems & Control Letters, vol. 50, no. 2, pp. 103-118, 2003.
[19] G. Kotsalis, A. Megretski, and M. A. Dahleh, "Model reduction of discrete-time Markov jump linear systems," in 2006 American Control Conference. IEEE, 2006.
[20] G. Kotsalis and A. Rantzer, "Balanced truncation for discrete time Markov jump linear systems," IEEE Transactions on Automatic Control, vol. 55, no. 11, pp. 2606-2611, 2010.
[21] W. J. Stewart, "A comparison of numerical techniques in Markov modeling," Commun. ACM, vol. 21, no. 2, pp. 144-152, Feb. 1978. [Online]. Available: http://doi.acm.org/10.1145/359340.359350
[22] A. V. Knyazev and M. E. Argentati, "Principal angles between subspaces in an A-based scalar product: algorithms and perturbation estimates," SIAM Journal on Scientific Computing, vol. 23, no. 6, pp. 2008-2040, 2002.
[23] P.-Å. Wedin, "Perturbation bounds in connection with singular value decomposition," BIT Numerical Mathematics, vol. 12, no. 1, pp. 99-111, 1972.
[24] H. Weyl, "Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung)," Mathematische Annalen, vol. 71, no. 4, pp. 441-479, Dec. 1912. [Online]. Available: https://doi.org/10.1007/BF01456804
[25] J. Lei, A. Rinaldo et al., "Consistency of spectral clustering in stochastic block models," The Annals of Statistics, vol. 43, no. 1, pp. 215-237, 2015.
[26] G. E. Cho and C. D. Meyer, "Comparison of perturbation bounds for the stationary distribution of a Markov chain," Linear Algebra and its Applications, vol. 335, no. 1-3, pp. 137-150, 2001.
[27] D. A. Levin and Y. Peres, Markov Chains and Mixing Times. American Mathematical Soc., 2017, vol. 107.
A Technical Lemmas

This section provides technical lemmas that are used to prove the main results of the paper. The proofs of the main theorems are given later in Section B.

A.1 Lemmas on Matrix Perturbation Theory

Note that in Line 7 of Algorithm 1, the SVD is performed on the matrix P̂. To analyze this step, in this section we give a few lemmas about how a perturbation of a matrix affects its left singular vector matrix.

Definition 4 (Distance between column spaces). For two matrices E, F ∈ R^{n×n} with rank(E) = rank(F) = r, let E, F denote their column spaces and Π_E, Π_F denote the corresponding projection matrices onto these column spaces. Then the principal angles [22] θ_1, . . . , θ_r between E and F can be shown to be the arcsines of the first r singular values of the matrix (I − Π_E) Π_F. Let sin Θ(E, F) := diag(sin(θ_1), . . . , sin(θ_r)). The sin Θ distance between the column spaces of E and F is defined as ‖sin Θ(E, F)‖_F, which satisfies

    ‖sin Θ(E, F)‖_F = ‖(I − Π_E) Π_F‖_F.    (22)

Lemma 5 (Wedin's Perturbation Theorem [23]). Let A, Â ∈ R^{m×n}. Let A = UΣV^T be the SVD of A, where the singular values on the diagonal of Σ are arranged in descending order. Let U_r = U(:, 1:r), Σ_r = Σ(1:r, 1:r), V_r = V(:, 1:r) denote the first r singular components of the matrix A. Similarly, let {Û_r, Σ̂_r, V̂_r} denote the first r singular components of the matrix Â. Then, if σ_r(Â) − σ_{r+1}(A) > 0,

    max{ ‖sin Θ(V_r, V̂_r)‖, ‖sin Θ(U_r, Û_r)‖ } ≤ max{ ‖(A − Â) V̂_r‖, ‖Û_r^T (A − Â)‖ } / ( σ_r(Â) − σ_{r+1}(A) ).    (23)

Otherwise (note that in this case we have σ_r(A) − σ_{r+1}(Â) ≥ 0), we have

    max{ ‖sin Θ(V_r, V̂_r)‖, ‖sin Θ(U_r, Û_r)‖ } ≤ max{ ‖(A − Â) V_r‖, ‖U_r^T (A − Â)‖ } / ( σ_r(A) − σ_{r+1}(Â) ).
(24)

Note that the norm in the above equations can be replaced with any unitarily invariant norm.

Lemma 6 (Weyl's Bound [24]). Given A, Â ∈ R^{m×n}, their singular values satisfy

    max_{i ≤ min{m,n}} |σ_i(A) − σ_i(Â)| ≤ ‖A − Â‖.    (25)

Lemma 7 (Combination of Wedin's Perturbation Theorem and Weyl's Bound). Let A, Â ∈ R^{m×n}. Let A = UΣV^T be the SVD of A, where the singular values on the diagonal of Σ are arranged in descending order. Let U_r = U(:, 1:r), Σ_r = Σ(1:r, 1:r), V_r = V(:, 1:r) denote the first r singular components of the matrix A, and similarly let {Û_r, Σ̂_r, V̂_r} denote the first r singular components of the matrix Â. Then

    max{ ‖sin Θ(V_r, V̂_r)‖, ‖sin Θ(U_r, Û_r)‖ } ≤ 2 ‖A − Â‖ / ( σ_r(A) − σ_{r+1}(A) ).    (26)

Proof. Note that if 2‖A − Â‖ ≥ σ_r(A) − σ_{r+1}(A), then (26) holds trivially since ‖sin Θ(·, ·)‖ ≤ 1. So it suffices to consider the case 2‖A − Â‖ < σ_r(A) − σ_{r+1}(A). Using Weyl's bound in Lemma 6, we have

    σ_r(Â) ≥ σ_r(A) − ‖A − Â‖,    (27)
    σ_{r+1}(Â) ≤ σ_{r+1}(A) + ‖A − Â‖.    (28)

For the case σ_r(Â) − σ_{r+1}(A) > 0, Wedin's perturbation theorem in Lemma 5 gives

    max{ ‖sin Θ(V_r, V̂_r)‖, ‖sin Θ(U_r, Û_r)‖ } ≤ ‖A − Â‖ / ( σ_r(Â) − σ_{r+1}(A) ).    (29)

Using (27), we have

    max{ ‖sin Θ(V_r, V̂_r)‖, ‖sin Θ(U_r, Û_r)‖ } ≤ ‖A − Â‖ / ( σ_r(A) − σ_{r+1}(A) − ‖A − Â‖ ).    (30)

The condition 2‖A − Â‖ < σ_r(A) − σ_{r+1}(A) implies that the RHS of (30) is nonnegative and strictly smaller than 1, so adding ‖A − Â‖ to both the numerator and the denominator of the RHS of (30) shows (26). For the case σ_r(Â) − σ_{r+1}(A) ≤ 0, the same result can be shown using Wedin's perturbation bound, equation (28), and a similar argument.

Lemma 8. Let U_1, U_2 ∈ R^{n×r}, where n ≥ r and U_1^T U_1 = U_2^T U_2 = I.
Then

$$\min_{O \in \mathcal{O}(r)} \|U_1 - U_2 O\|_F^2 \le 2\|\sin\Theta(U_1, U_2)\|_F^2,$$ (31)

where $\mathcal{O}(r)$ denotes the orthogonal group of dimension $r$, i.e., the set of all $r$-dimensional orthonormal matrices.

Proof. Let the singular value decomposition of $U_1^\top U_2$ be $U_1^\top U_2 = W\Sigma V^\top$. For any orthonormal $O$, we have

$$\begin{aligned}
\|U_1 - U_2 O\|_F^2 &= \mathrm{tr}[(U_1 - U_2 O)(U_1^\top - O^\top U_2^\top)] \\
&= \mathrm{tr}[U_1 U_1^\top + U_2 U_2^\top - 2 O U_1^\top U_2] \\
&= 2r - 2\,\mathrm{tr}(O W \Sigma V^\top) \\
&= 2r - 2\,\mathrm{tr}(\Sigma \tilde O) \qquad (\tilde O := V^\top O W) \\
&= 2r - 2\sum_{i=1}^r \Sigma(i, i)\,\tilde O(i, i).
\end{aligned}$$ (32)

Following the definition of $\tilde O$, we see $\tilde O$ is orthonormal, thus $\tilde O(i, i) \le 1$ for all $i \in [r]$. Also, since $\Sigma(i, i) \ge 0$, we have $\sum_{i=1}^r \Sigma(i, i)\tilde O(i, i) \le \sum_{i=1}^r \Sigma(i, i) = \mathrm{tr}(\Sigma)$, where equality holds when $\tilde O(i, i) = 1$ for all $i \in [r]$, i.e., $\tilde O = I$ and $O = VW^\top$. Therefore,

$$\min_{O \in \mathcal{O}(r)} \|U_1 - U_2 O\|_F^2 \le 2(r - \mathrm{tr}(\Sigma)).$$ (33)

For the RHS of (31), we have

$$\begin{aligned}
\|\sin\Theta(U_1, U_2)\|_F^2 &= \|(I - U_1 U_1^\top) U_2 U_2^\top\|_F^2 \\
&= \mathrm{tr}[U_2 U_2^\top (I - U_1 U_1^\top)(I - U_1 U_1^\top) U_2 U_2^\top] \\
&= \mathrm{tr}[(I - U_1 U_1^\top) U_2 U_2^\top] \\
&= \mathrm{tr}(U_2 U_2^\top) - \mathrm{tr}(U_1^\top U_2 U_2^\top U_1) \\
&= r - \mathrm{tr}(W\Sigma V^\top V \Sigma W^\top) = r - \mathrm{tr}(\Sigma^2),
\end{aligned}$$ (34)

where the first equality follows from Definition 4. Since $\Sigma(i, i) \le 1$ for all $i \in [r]$, we have $\mathrm{tr}(\Sigma) \ge \mathrm{tr}(\Sigma^2)$. Combining this with (33) and (34) concludes the proof.

A.2 Lemmas on Clustering

In this section, we present some lemmas that apply to the clustering analysis in our algorithm.

Definition 5 (Membership Matrix and Membership Matrix Set). We call a matrix $A \in \mathbb{R}^{n \times r}$, where $r \le n$, a membership matrix for $n$ points and $r$ clusters if each row of $A$ has exactly one element equal to 1 and 0's for the rest of its elements, with $A(i, j) = 1$ if and only if point $i$ belongs to cluster $j$. We let $\mathcal{M}_{n,r} := \{M \in \mathbb{R}^{n \times r} \mid M \text{ is a membership matrix}\}$ denote the set of all membership matrices for $n$ points and $r$ clusters.
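The minimizer in Lemma 8 is the classical orthogonal Procrustes rotation $O = VW^\top$ constructed in the proof. The following numerical sketch (a hypothetical random instance for illustration only, not part of the formal argument) checks inequality (31):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3

# Two random orthonormal bases U1, U2 in R^{n x r}.
U1, _ = np.linalg.qr(rng.standard_normal((n, r)))
U2, _ = np.linalg.qr(rng.standard_normal((n, r)))

# Optimal rotation from the proof: SVD U1^T U2 = W Sigma V^T gives O = V W^T.
W, sigma, Vt = np.linalg.svd(U1.T @ U2)
O = Vt.T @ W.T
lhs = np.linalg.norm(U1 - U2 @ O, "fro") ** 2

# sin-Theta distance via projection matrices (Definition 4):
# ||sin Theta(U1, U2)||_F = ||(I - U1 U1^T) U2 U2^T||_F.
Pi1, Pi2 = U1 @ U1.T, U2 @ U2.T
sin_theta_sq = np.linalg.norm((np.eye(n) - Pi1) @ Pi2, "fro") ** 2

assert lhs <= 2 * sin_theta_sq + 1e-10  # inequality (31)
```

Here $\|U_1 - U_2 O\|_F^2 = 2(r - \mathrm{tr}(\Sigma))$ and $\|\sin\Theta(U_1, U_2)\|_F^2 = r - \mathrm{tr}(\Sigma^2)$, matching (33) and (34).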
Remark. Note that a membership matrix is permutation invariant in the sense that, for any permutation matrix $Q \in \mathbb{R}^{r \times r}$, the membership matrices $A$ and $AQ$ represent the same membership information; the only difference is that $Q$ changes the cluster labels. So permutation invariance establishes an equivalence relation on the set $\mathcal{M}_{n,r}$.

Lemma 9. Let $X \in \mathbb{R}^{n \times m}$ be an arbitrary matrix with factorization $X = M_X C_X$, where $M_X \in \mathcal{M}_{n,r}$ is a membership matrix and $C_X \in \mathbb{R}^{r \times m}$ with $\mathrm{rank}(C_X) = r$. Let $\Omega_i = \{j \mid M_X(j, i) = 1\}$ for all $i \in [r]$, and let $U_r \Sigma_r V_r^\top$ be the SVD of $X$ that preserves only the first $r$ singular value components. Then, for any $i \in \Omega_k$ and any $j \in \Omega_l$,

$$\|U_r(i, :) - U_r(j, :)\| = \begin{cases} 0 & \text{if } k = l \\ \sqrt{\frac{1}{|\Omega_k|} + \frac{1}{|\Omega_l|}} & \text{if } k \ne l. \end{cases}$$ (35)

Proof. Since $X = M_X C_X$ and $C_X$ has full row rank, we see $\mathrm{rank}(X) = r$, thus $\Sigma_r V_r^\top$ has full row rank as well. From this we have: (i) for all $k$ and all $i, j \in \Omega_k$, $X(i, :) = X(j, :)$, so $U_r(i, :) = U_r(j, :)$; (ii) for all $k \ne l$, $i \in \Omega_k$, $j \in \Omega_l$, $X(i, :) \ne X(j, :)$, so $U_r(i, :) \ne U_r(j, :)$. Therefore, $U_r$ has the factorization $U_r = M_X C$ for some $C \in \mathbb{R}^{r \times r}$; in other words, there are only $r$ unique rows in $U_r$, with $C$ collecting the unique rows and $M_X$ being the membership matrix shared with $X$. Since $I = U_r^\top U_r = C^\top M_X^\top M_X C = C^\top \mathrm{diag}([|\Omega_1|, |\Omega_2|, \dots, |\Omega_r|]) C$, the matrix $C$ has orthogonal rows and $\|C(i, :)\| = \frac{1}{\sqrt{|\Omega_i|}}$ for all $i \in [r]$. Therefore, for all $i \in \Omega_k$, $j \in \Omega_l$,

$$\|U_r(i, :) - U_r(j, :)\| = \|C(k, :) - C(l, :)\| = \begin{cases} 0 & \text{if } k = l \\ \sqrt{\frac{1}{|\Omega_k|} + \frac{1}{|\Omega_l|}} & \text{if } k \ne l. \end{cases}$$ (36)

Lemma 10 (Approximate k-means error bound, Lemma 5.3 in [25]). For $\epsilon > 0$ and any two matrices $U, \bar U \in \mathbb{R}^{n \times r}$ such that $\bar U = \bar M \bar C$ with $\bar M \in \mathcal{M}_{n,r}$, $\bar C \in \mathbb{R}^{r \times r}$, let $\{M, C\}$ be a $(1 + \epsilon)$ solution to the k-means problem on $U$: $M \in \mathcal{M}_{n,r}$, $C \in \mathbb{R}^{r \times r}$ such that
$$\|MC - U\|_F^2 \le (1 + \epsilon) \min_{M' \in \mathcal{M}_{n,r},\, C' \in \mathbb{R}^{r \times r}} \|M'C' - U\|_F^2.$$

Let $\Omega_k = \{i \mid i \in [n], \bar M(i, k) = 1\}$ denote the set of all points belonging to cluster $k$. For any

$$\delta_k \le \min_{l \ne k} \|\bar C(l, :) - \bar C(k, :)\|, \quad \forall k \in [r],$$ (37)

define the set $S_k = \{i \mid i \in \Omega_k, \|(MC)(i, :) - \bar U(i, :)\| \ge \delta_k / 2\}$. Then

$$\sum_{k=1}^r |S_k| \delta_k^2 \le 4(4 + 2\epsilon)\|\bar U - U\|_F^2.$$ (38)

Moreover, define $G = \bigcup_{k=1}^r (\Omega_k \setminus S_k)$. If

$$(16 + 8\epsilon)\|\bar U - U\|_F^2 / \delta_k^2 < |\Omega_k|, \quad \forall k \in [r],$$ (39)

then there exists an $r \times r$ permutation matrix $J$ such that $\bar M(G, :) = M(G, :) J$, i.e., $M$ and $\bar M$ share the same membership information for points in the set $G$.

Remark. Lemma 10 can be used to bound the number of mis-clustered points by k-means. In this lemma, $U$ represents the (possibly dimension-reduced) data matrix for $n$ data points. One applies k-means to $U$ and obtains the membership matrix $M$ and cluster centers $C$. $\bar U$ represents the "clean" data, in which data points within the same true cluster have exactly the same rows, and its membership information $\bar M$ is what one wants to recover and compare against. The main takeaway is that, under (39), at least the points in the set $G$ are clustered correctly by any $(1 + \epsilon)$ solution of the k-means problem.

A.3 Lemmas on Markov Chain Concentration

In this section, Lemma 13 provides the estimation error bounds for Markov matrix estimation from a single trajectory generated by the Markov chain, and Lemma 14 analyzes the case when certain states in the trajectory are perturbed. Lemma 11 and Lemma 12 are building blocks toward Lemmas 13 and 14.

Lemma 11 (Lemma 5 in [12]). Let $\tau(\epsilon)$ be the mixing time of the Markov chain given in Definition 1. For any $\epsilon \le \delta < 1/2$, we have

$$\tau(\epsilon) \le \tau(\delta)\left(\frac{\log(\epsilon/\delta)}{\log(2\delta)} + 1\right).$$ (40)

Lemma 12 (Lemma 7 in [12]). Let $P \in \mathbb{R}^{n \times n}$ be an ergodic row-stochastic matrix with stationary distribution $\pi \in \mathbb{R}^n$.
Let $\pi_{\max} = \max_i \pi_i$ and $\pi_{\min} = \min_i \pi_i$. Let $\tau(\cdot)$ denote the mixing time of $P$, given in Definition 1, and let $F = \mathrm{diag}(\pi) P$ denote the stationary frequency matrix. Given a trajectory $X_{0:N}$ of the Markov chain, define $\hat\pi_0 \in \mathbb{R}^{1 \times n}$ and $\hat F_0 \in \mathbb{R}^{n \times n}$ as

$$\hat\pi_0(i) = \frac{1}{N}\sum_{t=1}^N \mathbf{1}\{X_{t-1} = i\},$$ (41)

$$\hat F_0(i, j) = \frac{1}{N}\sum_{t=1}^N \mathbf{1}\{X_{t-1} = i, X_t = j\}.$$ (42)

For any $\epsilon > 0$, let $\alpha = \tau(\min(\epsilon/2, \pi_{\max})) + 1$; then

$$\mathbb{P}\left(\|\hat F_0 - F\| \ge \epsilon\right) \le 2\alpha n \exp\left(\frac{-N\epsilon^2/8}{2\pi_{\max}\alpha + \epsilon\alpha/6}\right),$$ (43)

$$\mathbb{P}\left(\|\hat\pi_0 - \pi\|_\infty \ge \epsilon\right) \le 2\alpha n \exp\left(\frac{-N\epsilon^2/8}{2\pi_{\max}\alpha + \epsilon\alpha/6}\right).$$ (44)

Lemma 13. Consider the Markov chain given in Lemma 12 and its trajectory $X_{0:N}$. Define $\tau^* = \tau(1/4)$ and $\hat P_0 \in \mathbb{R}^{n \times n}$ as

$$\hat P_0(i, j) = \begin{cases} \dfrac{\sum_{t=1}^N \mathbf{1}\{X_{t-1} = i, X_t = j\}}{\sum_{t=1}^N \mathbf{1}\{X_{t-1} = i\}} & \text{if } \sum_{t=1}^N \mathbf{1}\{X_{t-1} = i\} \ne 0 \\ 1/n & \text{otherwise.} \end{cases}$$ (45)

For any $\epsilon > 0$, let $\tilde\epsilon = \min\{\pi_{\min}/2, \epsilon\}$; then we have

$$\mathbb{P}\left(\|\hat P_0 - P\| \le 4\pi_{\min}^{-1}\|P\|\epsilon\right) \ge 1 - 24 n \tau^* \log(\tilde\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\tilde\epsilon^{-1})\tilde\epsilon^{-2}}\right).$$ (46)

Proof. For simplicity, restrict $\epsilon < \pi_{\min}/2$ for now. From Lemma 12, we have the concentration results for $\hat F_0, \hat\pi_0$ given in (43) and (44); we first simplify them before applying them to this proof. Since $\alpha = \tau(\min(\epsilon/2, \pi_{\max})) + 1$ in Lemma 12 and we restrict $\epsilon < \pi_{\min}/2$ for now, we see $\alpha = \tau(\epsilon/2) + 1$. We can obtain an upper bound on $\alpha$ as follows:

$$\alpha = \tau(\epsilon/2) + 1 \overset{(i)}{\le} \left(\frac{\log(2\epsilon)}{\log(1/2)} + 1\right)\tau^* + 1 \overset{(ii)}{\le} \left(\frac{\log(2\epsilon)}{\log(1/2)} + 3\right)\tau^* = \left(\log_2(0.5\epsilon^{-1}) + 3\right)\tau^* \le \left(\log_2(\epsilon^{-1}) + 3\right)\tau^* \overset{(iii)}{\le} 4\tau^*\log_2(\epsilon^{-1}) \le 6\tau^*\log(\epsilon^{-1}),$$ (47)

where: (i) holds since $\epsilon/2 < \pi_{\min}/4 \le 1/(4n) \le 1/4$, so we can apply Lemma 11; (ii) holds since the mixing time satisfies $\tau^* \ge 1$; (iii) holds since $\epsilon^{-1} > 2\pi_{\min}^{-1} \ge 2n \ge 2$, thus $\log_2(\epsilon^{-1}) \ge 1$.
Plugging (47) into the RHS of (43) and (44), we have

$$2\alpha n \exp\left(\frac{-N\epsilon^2/8}{2\pi_{\max}\alpha + \epsilon\alpha/6}\right) \le 12 n \tau^* \log(\epsilon^{-1}) \exp\left(\frac{-N}{8\tau^*\log(\epsilon^{-1})\epsilon^{-1}(12\pi_{\max}\epsilon^{-1} + 1)}\right) \le 12 n \tau^* \log(\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\epsilon^{-1})\epsilon^{-2}}\right),$$ (48)

where the last line holds since $0.5\pi_{\max}\epsilon^{-1} > \pi_{\max}/\pi_{\min} \ge 1$. In the remainder of the proof, we require the conditions $\|\hat F_0 - F\| \le \epsilon$ and $\|\hat\pi_0 - \pi\|_\infty \le \epsilon$ to be satisfied. By applying the union bound to (43) and (44) and plugging in (48), we can see

$$\mathbb{P}\left(\|\hat F_0 - F\| \le \epsilon,\ \|\hat\pi_0 - \pi\|_\infty \le \epsilon\right) \ge 1 - 24 n \tau^* \log(\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\epsilon^{-1})\epsilon^{-2}}\right).$$ (49)

When $\|\hat\pi_0 - \pi\|_\infty \le \epsilon < \pi_{\min}/2$, it is easy to see that $\min_i \hat\pi_0(i) \ge \pi_{\min}/2 > 0$, which implies every state shows up at least once in the trajectory, since otherwise $\hat\pi_0$ would contain a 0 element. Also, by the definitions of $\hat F_0, \hat\pi_0, \hat P_0$, we can see that $\hat P_0$ is determined only by the first case of (45), so $\hat P_0 = \mathrm{diag}(\hat\pi_0)^{-1}\hat F_0$ holds. Now, consider $\|\hat P_0 - P\|$; we have

$$\begin{aligned}
\|\hat P_0 - P\| &= \|\mathrm{diag}(\hat\pi_0)^{-1}\hat F_0 - \mathrm{diag}(\pi)^{-1}F\| \\
&\le \|\mathrm{diag}(\hat\pi_0)^{-1}(\hat F_0 - F)\| + \|(\mathrm{diag}(\hat\pi_0)^{-1} - \mathrm{diag}(\pi)^{-1})F\| \\
&\le \|\mathrm{diag}(\hat\pi_0)^{-1}\|\,\|\hat F_0 - F\| + \|I - \mathrm{diag}(\pi/\hat\pi_0)\|\,\|\mathrm{diag}(\pi)^{-1}F\| \\
&= \left(\min_i \hat\pi_0(i)\right)^{-1}\|\hat F_0 - F\| + \max_i \frac{|\hat\pi_0(i) - \pi_i|}{\hat\pi_0(i)}\,\|P\| \\
&\le 2\pi_{\min}^{-1}\|\hat F_0 - F\| + \frac{\max_i |\hat\pi_0(i) - \pi_i|}{\min_j \hat\pi_0(j)}\,\|P\| \\
&\overset{(i)}{\le} 2\pi_{\min}^{-1}\epsilon + 2\pi_{\min}^{-1}\epsilon\|P\| \overset{(ii)}{\le} 4\pi_{\min}^{-1}\|P\|\epsilon,
\end{aligned}$$ (50)

where (i) holds as $\|\hat F_0 - F\| \le \epsilon$ and $\|\hat\pi_0 - \pi\|_\infty \le \epsilon$, and (ii) holds as $\|P\| \ge \|P\frac{1}{\sqrt n}\mathbf{1}\| = \|\frac{1}{\sqrt n}\mathbf{1}\| = 1$. Therefore, for any $\epsilon \le \pi_{\min}/2$, by combining (50) and (49) we have

$$\mathbb{P}\left(\|\hat P_0 - P\| \le 4\pi_{\min}^{-1}\|P\|\epsilon\right) \ge 1 - 24 n \tau^* \log(\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\epsilon^{-1})\epsilon^{-2}}\right).$$ (51)

Finally, we can remove the restriction $\epsilon \le \pi_{\min}/2$.
For any $\epsilon > 0$, let $\tilde\epsilon = \min\{\epsilon, \pi_{\min}/2\}$; then

$$\mathbb{P}\left(\|\hat P_0 - P\| \le 4\pi_{\min}^{-1}\|P\|\epsilon\right) \ge \mathbb{P}\left(\|\hat P_0 - P\| \le 4\pi_{\min}^{-1}\|P\|\tilde\epsilon\right) \ge 1 - 24 n \tau^* \log(\tilde\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\tilde\epsilon^{-1})\tilde\epsilon^{-2}}\right),$$ (52)

which concludes the proof.

Lemma 14. Consider all the conditions given in Lemmas 12 and 13, except that there are $N'$ perturbations in the trajectory of the Markov chain, i.e., $\sum_{t=0}^N \mathbf{1}\{\hat X_t \ne X_t\} = N'$, where $\hat X_{0:N}$ denotes the perturbed trajectory. Let

$$\hat P(i, j) = \begin{cases} \dfrac{\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i, \hat X_t = j\}}{\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i\}} & \text{if } \sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i\} \ne 0 \\ 1/n & \text{otherwise.} \end{cases}$$ (53)

Then, when $N' < \frac{N\pi_{\min}}{2}$, for all $\epsilon > 0$, let $\tilde\epsilon = \min\{\pi_{\min}/2 - N'/N, \epsilon\}$; we have

$$\mathbb{P}\left(\|\hat P - P\| \le 4\pi_{\min}^{-1}\|P\|(\epsilon + 1.5 N'/N)\right) \ge 1 - 24 n \tau^* \log(\tilde\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\tilde\epsilon^{-1})\tilde\epsilon^{-2}}\right).$$ (54)

Proof. For now, assume $\epsilon < \pi_{\min}/2 - N'/N$. Let $\hat\pi_0, \hat F_0, \hat P_0$ be defined as in Lemmas 12 and 13 with the unperturbed trajectory $X_{0:N}$; then, according to the proof of Lemma 13, we have

$$\mathbb{P}\left(\|\hat F_0 - F\| \le \epsilon,\ \|\hat\pi_0 - \pi\|_\infty \le \epsilon,\ \|\hat P_0 - P\| \le 4\pi_{\min}^{-1}\|P\|\epsilon\right) \ge 1 - 24 n \tau^* \log(\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\epsilon^{-1})\epsilon^{-2}}\right).$$ (55)

Let $\hat\pi(i) = \frac{1}{N}\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i\}$ and $\hat F(i, j) = \frac{1}{N}\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i, \hat X_t = j\}$. Then, for any $i \in [n]$, we can see

$$\hat\pi(i) = \frac{1}{N}\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i\} = \frac{1}{N}\left(\sum_{t=1}^N \mathbf{1}\{X_{t-1} = i\} + \sum_{t=1}^N \mathbf{1}\{X_{t-1} \ne i, \hat X_{t-1} = i\} - \sum_{t=1}^N \mathbf{1}\{X_{t-1} = i, \hat X_{t-1} \ne i\}\right) = \hat\pi_0(i) + \frac{1}{N}\left(\sum_{t=1}^N \mathbf{1}\{X_{t-1} \ne i, \hat X_{t-1} = i\} - \sum_{t=1}^N \mathbf{1}\{X_{t-1} = i, \hat X_{t-1} \ne i\}\right),$$ (56)

which gives

$$|\hat\pi(i) - \hat\pi_0(i)| \le \frac{N'}{N}.$$
(57)

Then, with probability no less than the bound given in (55), we have

$$\hat\pi(i) \ge \hat\pi_0(i) - \frac{N'}{N} \ge \pi(i) - \epsilon - \frac{N'}{N} \ge \frac{\pi_{\min}}{2} > 0,$$ (58)

$$|\hat\pi(i) - \pi(i)| \le |\hat\pi(i) - \hat\pi_0(i)| + |\hat\pi_0(i) - \pi(i)| \le \frac{N'}{N} + \epsilon,$$ (59)

and

$$\begin{aligned}
\|\hat F - F\| &\le \|\hat F - \hat F_0\| + \|\hat F_0 - F\| \le \|\hat F - \hat F_0\|_F + \epsilon \\
&= \frac{1}{N}\sqrt{\sum_{i, j \in [n]}\left(\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i, \hat X_t = j\} - \sum_{t=1}^N \mathbf{1}\{X_{t-1} = i, X_t = j\}\right)^2} + \epsilon \\
&\le \frac{1}{N}\sqrt{\left(\sum_{i, j \in [n]}\left|\sum_{t=1}^N \mathbf{1}\{\hat X_{t-1} = i, \hat X_t = j\} - \sum_{t=1}^N \mathbf{1}\{X_{t-1} = i, X_t = j\}\right|\right)^2} + \epsilon \\
&\overset{(i)}{\le} \frac{1}{N}\sqrt{(2N')^2} + \epsilon = \frac{2N'}{N} + \epsilon,
\end{aligned}$$ (60)

where (i) holds since $N'$ perturbations in the trajectory can ruin at most $2N'$ transition pairs in total. Since (58) guarantees $\hat\pi(i) > 0$ for any $i \in [n]$, similar to the derivation in (50), we have

$$\|\hat P - P\| \le \left(\min_i \hat\pi(i)\right)^{-1}\|\hat F - F\| + \frac{\max_i |\hat\pi(i) - \pi(i)|}{\min_i \hat\pi(i)}\,\|P\|.$$ (61)

Combining (58), (59), and (60), we have

$$\|\hat P - P\| \le 2\pi_{\min}^{-1}\left(\frac{2N'}{N} + \epsilon\right) + 2\pi_{\min}^{-1}\left(\frac{N'}{N} + \epsilon\right)\|P\| \le 2\pi_{\min}^{-1}\left(\frac{2N'}{N} + \epsilon\right)\|P\| + 2\pi_{\min}^{-1}\left(\frac{N'}{N} + \epsilon\right)\|P\| \le 4\pi_{\min}^{-1}\left(\frac{3N'}{2N} + \epsilon\right)\|P\|.$$ (62)

So,

$$\mathbb{P}\left(\|\hat P - P\| \le 4\pi_{\min}^{-1}\|P\|(\epsilon + 1.5N'/N)\right) \ge 1 - 24 n \tau^* \log(\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\epsilon^{-1})\epsilon^{-2}}\right).$$ (63)

Finally, we can remove the restriction $\epsilon < \pi_{\min}/2 - N'/N$. For any $\epsilon > 0$, let $\tilde\epsilon = \min\{\epsilon, \pi_{\min}/2 - N'/N\}$; then

$$\mathbb{P}\left(\|\hat P - P\| \le 4\pi_{\min}^{-1}\|P\|(\epsilon + 1.5N'/N)\right) \ge \mathbb{P}\left(\|\hat P - P\| \le 4\pi_{\min}^{-1}\|P\|(\tilde\epsilon + 1.5N'/N)\right) \ge 1 - 24 n \tau^* \log(\tilde\epsilon^{-1}) \exp\left(\frac{-N}{100\tau^*\pi_{\max}\log(\tilde\epsilon^{-1})\tilde\epsilon^{-2}}\right),$$ (64)

which concludes the proof.

B Proofs for Main Theorems

In this section, we give the proofs for the main theorems that appear in the paper. The main idea of the proof of Theorem 3 is inspired by [12], which was developed for discrete Markov chains.
We generalize the work in [12] to the case where the Markov matrix is not exactly aggregatable and there are perturbations in the Markov chain trajectory, i.e., the mode sequence is estimated from the observation trajectory of an underlying switched system rather than being directly observed.

B.1 Proof for Theorem 1

Proof. From Section 3.6 in [26], one can easily obtain (8). For (9), we have

$$\|\pi_t - \tilde\pi_t\|_1 \le \|\pi_t - \pi\|_1 + \|\tilde\pi_t - \tilde\pi\|_1 + \|\pi - \tilde\pi\|_1.$$ (65)

By the Markov convergence theorem [27], we can upper bound the first two terms, which finishes the proof.

B.2 Proof for Lemma 2

Proof. Suppose the pair $\{y_t, \phi_t\}$ is generated by $w_i$, i.e., $y_t = w_i^\top \phi_t + n_t$. Then, based on Line 3 in Algorithm 1, $\hat X_t = X_t$ when $|y_t - w_i^\top \phi_t| < |y_t - w_j^\top \phi_t|$ for all $j \ne i$, which is equivalent to

$$|n_t| < |(w_i - w_j)^\top \phi_t + n_t|.$$ (66)

A sufficient condition to guarantee (66) is $(w_i - w_j)^\top \phi_t > 2 n_{\max}$ for all $j \ne i$.

B.3 Proof for Theorem 3

We first consider the case where there is no estimation error in the empirical Markov matrix $\hat P$, i.e., $\hat P = P$, and then generalize this to Theorem 3.

Lemma 15. Assume: (i) the framework in Section 2.2 holds; (ii) in Algorithm 1, $\hat P = P$, i.e., the clustering is applied to the true Markov matrix; (iii) $\{\hat\Omega_1, \dots, \hat\Omega_r\}$ is a $(1 + \epsilon_1)$ solution to the k-means problem in Algorithm 1. Then, if

$$\|\Delta\| \le \frac{\sigma_r(\bar P)}{8\sqrt{(2 + \epsilon_1) r}}\sqrt{\frac{|\Omega_{(r)}|}{|\Omega_{(1)}|} + 1},$$ (67)

we have

$$MR(\hat\Omega_1, \hat\Omega_2, \dots, \hat\Omega_r) \le \frac{64(2 + \epsilon_1) r \|\Delta\|^2}{\sigma_r(\bar P)^2}.$$ (68)

Proof. From the conditions in Lemma 15, we see $\mathrm{rank}(\bar P) = r$. Let $\bar P = \bar U_r \bar\Sigma_r \bar V_r^\top$ be the SVD of $\bar P$ that preserves only the first $r$ singular value components, so $\bar U_r \in \mathbb{R}^{n \times r}$, $\bar\Sigma_r \in \mathbb{R}^{r \times r}$, $\bar V_r \in \mathbb{R}^{n \times r}$. Recall that $U_r$ defined in the algorithm contains the first $r$ left singular vectors of $P$.
We define $Q = \arg\min_{O \in \mathcal{O}(r)} \|U_r - \bar U_r O\|_F^2$, where $\mathcal{O}(r)$ denotes the orthogonal group of dimension $r$. Then, we have

$$\|U_r - \bar U_r Q\|_F \overset{(i)}{\le} \sqrt{2}\,\|\sin\Theta(U_r, \bar U_r)\|_F \le \sqrt{2r}\,\|\sin\Theta(U_r, \bar U_r)\| \overset{(ii)}{\le} \frac{2\sqrt{2r}\,\|\bar P - P\|}{\sigma_r(\bar P)} = \frac{2\sqrt{2r}\,\|\Delta\|}{\sigma_r(\bar P)},$$ (69)

where (i) follows from Lemma 8 and (ii) follows from Lemma 7. Also note that since $Q$ is orthogonal, for all $i \in \Omega_k$, $j \in \Omega_l$ we have

$$\|(\bar U_r Q)(i, :) - (\bar U_r Q)(j, :)\| = \|[\bar U_r(i, :) - \bar U_r(j, :)]Q\| = \|\bar U_r(i, :) - \bar U_r(j, :)\| = \begin{cases} 0 & \text{if } k = l \\ \sqrt{\frac{1}{|\Omega_k|} + \frac{1}{|\Omega_l|}} & \text{if } k \ne l, \end{cases}$$ (70)

where the last equality follows from Lemma 9. Recall that $\{\Omega_1, \dots, \Omega_r\}$ is the partition of the rows of $\bar P$ such that rows within the same cluster are identical. We can see that in the matrix $\bar U_r Q$, rows are the same if the corresponding rows of $\bar P$ are in the same cluster and differ if the corresponding rows of $\bar P$ are in different clusters. So, we can claim that the matrix $\bar U_r Q$ carries the same aggregation information as $\bar P$. Because of this, together with the fact that we apply k-means to $U_r$, we can apply Lemma 10 by replacing $\{\bar U, U\}$ in Lemma 10 with $\{\bar U_r Q, U_r\}$. To make condition (37) in Lemma 10 hold, based on (70), we can pick

$$\delta_k = \sqrt{\frac{1}{|\Omega_k|} + \frac{1}{|\Omega_{(1)}|}}, \quad \forall k \in [r].$$ (71)

To make condition (39) in Lemma 10 hold with this choice of $\delta_k$, it suffices to guarantee for all $k \in [r]$

$$\frac{|\Omega_k|}{|\Omega_{(1)}|} > 8(2 + \epsilon_1)\|\bar U_r Q - U_r\|_F^2 - 1,$$ (72)

which, by applying (69), can be guaranteed by the following condition:

$$\|\Delta\| \le \frac{\sigma_r(\bar P)}{8\sqrt{(2 + \epsilon_1) r}}\sqrt{\frac{|\Omega_{(r)}|}{|\Omega_{(1)}|} + 1}.$$ (73)

Therefore, under (73) and according to Lemma 10, the states in the set $G$ defined in Lemma 10 are correctly aggregated up to relabeling invariance. Moreover,

$$MR(\hat\Omega_1, \dots, \hat\Omega_r) \le \sum_{k=1}^r \frac{|S_k|}{|\Omega_k|} \le \sum_{k=1}^r |S_k|\delta_k^2 \overset{(i)}{\le} 8(2 + \epsilon_1)\|\bar U_r Q - U_r\|_F^2 \overset{(ii)}{\le} \frac{64(2 + \epsilon_1) r \|\Delta\|^2}{\sigma_r(\bar P)^2},$$ (74)

where $S_k$ is defined in Lemma 10, (i) follows from (38), and (ii) follows from (69).

Now, with the analysis under the assumption that $\hat P$ has no estimation error, we return to the proof of the general case, i.e., Theorem 3.

Proof for Theorem 3. Let $\hat\Delta = \Delta + (\hat P - P)$, so that $\hat P = \bar P + \hat\Delta$. By applying Lemma 15 to $\hat P$ and $\hat\Delta$, we can see that when

$$\|\hat\Delta\|^2 \le \frac{\sigma_r(\bar P)^2}{64(2 + \epsilon_1) r}\left(\frac{|\Omega_{(r)}|}{|\Omega_{(1)}|} + 1\right),$$ (75)

we have

$$MR(\hat\Omega_1, \hat\Omega_2, \dots, \hat\Omega_r) \le \frac{64(2 + \epsilon_1) r \|\hat\Delta\|^2}{\sigma_r(\bar P)^2}.$$ (76)

To guarantee (75), by the triangle inequality, it suffices to ensure

$$\|\hat P - P\| \le \frac{\sigma_r(\bar P)}{8\sqrt{(2 + \epsilon_1) r}}\sqrt{\frac{|\Omega_{(r)}|}{|\Omega_{(1)}|} + 1} - \|\Delta\|.$$ (77)

Now, for all $\epsilon_2 > 0$, let

$$\tilde\epsilon_2 = \min\left\{\epsilon_2,\ \frac{\pi_{\min}}{2} - \eta,\ \frac{\pi_{\min}}{4(\sigma_1(\bar P) + \|\Delta\|)}\left[\frac{\sigma_r(\bar P)}{8\sqrt{(2 + \epsilon_1) r}}\sqrt{\frac{|\Omega_{(r)}|}{|\Omega_{(1)}|} + 1} - \|\Delta\|\right]\right\}.$$

By Lemma 14, we can see that when $N \ge 200\tau^*\pi_{\max}\log(\tilde\epsilon_2^{-1})\tilde\epsilon_2^{-2}\left[\log(24 n \tau^*) + \log(\log(\tilde\epsilon_2^{-1}))\right]$, with probability no less than

$$1 - \exp\left(\frac{-N}{200\tau^*\pi_{\max}\log(\tilde\epsilon_2^{-1})\tilde\epsilon_2^{-2}}\right),$$ (78)

we have

$$\|\hat P - P\| \le 4\pi_{\min}^{-1}\|P\|(\epsilon_2 + 1.5\eta).$$ (79)

By the choice of $\tilde\epsilon_2$, we can also see that (78) gives a probability lower bound on the occurrence of (77), which further lower bounds the occurrence probability of (75). Finally, plugging (79) into (76), we have

$$\begin{aligned}
MR(\hat\Omega_1, \hat\Omega_2, \dots, \hat\Omega_r) &\le 64(2 + \epsilon_1) r \left(\frac{\|\hat\Delta\|}{\sigma_r(\bar P)}\right)^2 \le 64(2 + \epsilon_1) r \left(\frac{\|\Delta\| + \|\hat P - P\|}{\sigma_r(\bar P)}\right)^2 \\
&\le 64(2 + \epsilon_1) r \left(\frac{\|\Delta\| + 4\pi_{\min}^{-1}\|P\|(\epsilon_2 + 1.5\eta)}{\sigma_r(\bar P)}\right)^2 \\
&\le 64(2 + \epsilon_1) r \left(\frac{\|\Delta\|}{\sigma_r(\bar P)} + \frac{4(\epsilon_2 + 1.5\eta)(\|\Delta\| + \|\bar P\|)}{\pi_{\min}\sigma_r(\bar P)}\right)^2,
\end{aligned}$$ (80)

which concludes the proof.

B.4 Proof for Theorem 4

Proof. By the triangle inequality, we see

$$\|P - \tilde P\|_\infty \le \|P - \hat P\|_\infty + \|\hat P - \tilde P\|_\infty.$$
(81)

Assume $i \in \hat\Omega_s$. From Line 9 in Algorithm 1, we can see that $\tilde P(i, :)$ is a convex combination of $\hat P(j, :)$, $j \in \hat\Omega_s$. Therefore,

$$\begin{aligned}
\|\hat P(i, :)^\top - \tilde P(i, :)^\top\|_1 &\le \max_{k, j \in \hat\Omega_s} \|\hat P(k, :)^\top - \hat P(j, :)^\top\|_1 \\
&\le \max_{k, j \in \hat\Omega_s}\left[\|\hat P(k, :)^\top - P(k, :)^\top\|_1 + \|\hat P(j, :)^\top - P(j, :)^\top\|_1 + \|P(k, :)^\top - P(j, :)^\top\|_1\right] \\
&\le 2\|\hat P - P\|_\infty + \max_{k, j \in \hat\Omega_s}\left[\|\bar P(k, :)^\top - P(k, :)^\top\|_1 + \|\bar P(j, :)^\top - P(j, :)^\top\|_1 + \|\bar P(k, :)^\top - \bar P(j, :)^\top\|_1\right] \\
&\le 2\|\hat P - P\|_\infty + 2\|\Delta\|_\infty,
\end{aligned}$$ (82)

which gives

$$\|P - \tilde P\|_\infty \le 3\|\hat P - P\|_\infty + 2\|\Delta\|_\infty.$$ (83)

Plugging in $\|\hat P - P\| \le 4\pi_{\min}^{-1}\|P\|(\epsilon_2 + 1.5\eta)$, derived in the proof of Theorem 3, concludes the proof.
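The estimators $\hat P_0$ in (45) and $\hat P$ in (53) are simple row-normalized transition counts. As a complementary illustration (the chain, trajectory length, and error threshold below are hypothetical choices, not quantities from the theorems), the following sketch builds the estimator and checks it against the true Markov matrix:

```python
import numpy as np

def estimate_markov_matrix(traj, n):
    """Row-normalized transition counts as in (45)/(53);
    rows of unvisited states default to the uniform distribution 1/n."""
    counts = np.zeros((n, n))
    for s, t in zip(traj[:-1], traj[1:]):
        counts[s, t] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    P_hat = np.full((n, n), 1.0 / n)  # value for states never visited
    visited = row_sums[:, 0] > 0
    P_hat[visited] = counts[visited] / row_sums[visited]
    return P_hat

# Hypothetical 3-state ergodic chain and a sampled trajectory X_0..X_N.
rng = np.random.default_rng(1)
P = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.60, 0.20],
              [0.25, 0.25, 0.50]])
n, N = 3, 50_000
traj = [0]
for _ in range(N):
    traj.append(rng.choice(n, p=P[traj[-1]]))

P_hat = estimate_markov_matrix(traj, n)
assert np.allclose(P_hat.sum(axis=1), 1.0)  # each row is a distribution
assert np.linalg.norm(P_hat - P) < 0.1      # close to P for large N
```

This is exactly the quantity whose concentration around $P$ Lemmas 13 and 14 control; with mode-sequence estimation errors, `traj` plays the role of the perturbed trajectory $\hat X_{0:N}$.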