High Precision Variational Bayesian Inference of Sparse Linear Networks

Junyang Jin (a), Ye Yuan (b,*), and Jorge Gonçalves (a)

(a) Luxembourg Centre for Systems Biomedicine, avenue du Swing, 4367 Belvaux, Luxembourg
(b) School of Automation, Huazhong University of Science and Technology, 430074, Hubei, China
(*) For correspondence: yye@hust.edu.cn

Abstract

Sparse networks can be found in a wide range of applications, such as biological and communication networks. Inference of such networks from data has been receiving considerable attention lately, mainly driven by the need to understand and control internal working mechanisms. However, while most available methods have been successful at predicting many correct links, they also tend to infer many incorrect links. Precision is the ratio between the number of correctly inferred links and all inferred links, and should ideally be close to 100%. For example, 50% precision means that half of the inferred links are incorrect, and there is only a 50% chance of picking a correct one. In contrast, this paper infers links of discrete-time linear networks with very high precision, based on variational Bayesian inference and Gaussian processes. Our method can handle limited datasets, does not require full-state measurements, and effectively promotes both system stability and network sparsity. On several examples, Monte Carlo simulations illustrate that our method consistently has 100% or nearly 100% precision, even in the presence of noise and hidden nodes, outperforming several state-of-the-art methods. The method should be applicable to a wide range of network inference contexts, including biological networks and power systems.

Keywords: System Identification; Variational Inference; Dynamical Structure Function; Network Inference; Sparse Networks

1 INTRODUCTION

In systems biology, mathematical modelling has been central to the study of biological networks. Dynamical models are frequently developed to predict the behaviour of systems in response to external or internal stimuli, for example drug treatment or mutation. Yet often only input-output dynamics are learned, without exploring the topology, whereas in many other applications comprehensive knowledge of the network topology is required. For example, information about the structure of control systems is essential for fault diagnosis. Hence, both the inference of system dynamics and the detection of network topology are important.

Precision is the ratio between the number of correctly inferred links and all inferred links. It indicates whether inferred networks can be trusted. For example, if precision is close to or even below 50%, it is impossible to tell which inferred links are correct. Therefore, precision should be the first priority when solving network inference. However, most state-of-the-art methods rarely achieve 100% precision, meaning that not all inferred links are correct. The motivation of this work is to develop a method that prioritises precision over true positive rate (TPR). On several examples, the proposed method consistently achieved 100% or nearly 100% precision, outperforming other state-of-the-art methods.

Sparsity and stability are fundamental properties of most real-world networks. Communication networks, as artificial systems, are designed to be stable for robust operation and sparse to reduce energy consumption.
Thus, sparsity and stability are primary constraints in network inference, especially in the case of limited data or high levels of noise. When dealing with practical networks, often not all nodes can be measured, because of either high experimental cost or technical limitations. The difficulty here is that many inference methods nonetheless assume full-state measurements. It is important to reconsider this issue carefully.

In recent years, kernel-based methods have become popular in the system identification community [23]. For linear systems, these methods effectively impose system stability and greatly simplify the estimation of model complexity. Kernel-based methods have successfully identified SISO continuous linear time-invariant (LTI) models [23] and have been further extended to discrete-time LTI systems [5, 8, 23]. In particular, kernel-based methods have been used to infer sparse networks described by Granger causality [9]. They have been further developed to infer so-called sparse plus low-rank networks, where it is assumed that the majority of nodes can be described by a few other components that are not accessible for observation [37]. In addition, system identification of a variety of model classes has been considered, including NFIR, NARX, NARMAX, linear parameter-varying (LPV) Box-Jenkins models, Hammerstein models, and cascaded linear systems [11, 25, 27, 28]. System dynamics and network topology are controlled by the hyperparameters of the kernel functions. Under the Bayesian paradigm, kernel-based methods apply empirical Bayes to estimate hyperparameters (KEB). By incorporating Automatic Relevance Determination (ARD), kernel-based methods are able to enforce sparsity, where the sparsity profile of the solution implies the network topology [5, 9, 37]. Nevertheless, this framework is not ideal for topology detection: due to local optima, it is very difficult to achieve 100% precision.

Variational inference (VI), like empirical Bayes, is a deterministic Bayesian approximation technique and has been applied to a number of cases, including sparse regression models [1, 31] and neural networks [2, 14, 18]. Instead of estimating hyperparameters directly, VI searches for an approximation of the posterior distribution of the hyperparameters (functional estimation). With well-posed models of exponential families, VI is computationally as efficient as empirical Bayes [15]. Whilst rigorous evaluation remains elusive, Monte Carlo simulations show that VI can be more accurate than empirical Bayes [15, 16]. More importantly, VI is able to estimate the model evidence for each possible model structure: this enables evaluation of the relative confidence between models. However, VI is barely used in kernel-based system identification, probably due to the nonlinearities introduced by kernel functions (non-Gaussianity). VI does not admit closed-form updates in this setting, and it has to deal with high-dimensional, ill-conditioned covariance matrices, which seriously increases the computational burden and degrades numerical stability. Nevertheless, thanks to recent developments on the analysis of kernel functions [3], VI achieves considerably higher efficiency and robustness by using special kernel functions (the Tuned/Correlated kernel).
This paper combines Gaussian processes and VI to infer sparse networks. Dynamical structure functions (DSF) are used to describe sparse linear networks, where the information of hidden nodes is encoded via transfer functions. By expressing DSF models in a non-parametric way, the system can be identified without knowing the number of hidden nodes or the connectivity among them. VI is employed to identify system dynamics and infer network topology. Moreover, by applying a backward selection strategy, the proposed method encourages high inference precision. Monte Carlo simulations show that our method produces more reliable networks than KEB under various experimental conditions, such as different topologies, noise levels, kernel functions and numbers of data points. Most importantly, the proposed method always achieves 100% or nearly 100% precision, so that almost all inferred links are correct.

The paper is organized as follows. Section 2 introduces variational inference algorithms. Section 3 introduces the dynamical structure function and formulates the full Bayesian model. Section 4 discusses network inference using variational inference and analyses properties of the algorithm. Section 5 compares the method with other approaches via Monte Carlo simulations. Finally, Section 6 concludes and discusses further developments in this field.

Notation: $I_n$ denotes an $n \times n$ identity matrix. For $L \in \mathbb{R}^{n \times n}$, $\mathrm{diag}\{L\}$ denotes the vector consisting of the diagonal elements of $L$. $[L]_{ij}$ denotes the $ij$th entry and $L^j_i$ denotes the $i$th $j \times j$ diagonal block of $L$. For a series of matrices $\{L_i \mid i = 1, \ldots, n\}$, $\mathrm{blkdiag}\{L_1, \ldots, L_n\}$ denotes a block diagonal matrix. For $l \in \mathbb{R}^n$, $\mathrm{diag}\{l\}$ denotes a diagonal matrix whose diagonal elements come from the vector $l$, and $l_{ij}$ denotes the $j$th element of the $i$th group of $l$. The vector $y(t_1 : t_2)$ denotes the row vector $[\, y(t_1) \;\; y(t_1+1) \;\; \cdots \;\; y(t_2) \,]$. $\mathcal{N}(x \mid m, \Sigma)$ denotes a Gaussian distribution of $x$ with mean $m$ and covariance matrix $\Sigma$. $\mathrm{asc}\{a_1, \ldots, a_n\}$ rearranges the elements in ascending order of magnitude.

2 OVERVIEW OF VARIATIONAL INFERENCE

Variational inference (VI) approximates a full Bayesian model analytically so that intractable marginalizations or expectations can be easily calculated [19, 32]. Empirical Bayes has been more frequently used in kernel-based system identification, where the system dynamics are the main concern. In the context of network inference, model selection (detection of network topology) is another important aspect, and the model evidence is usually required to compare different model structures. Whilst empirical Bayes does not evaluate the model evidence, variational inference generates a lower bound for it, which motivates advanced strategies for model selection. Consider a model structure $\mathcal{M}_k$ with model parameters $\theta$ and data $y$. The model evidence $p(y \mid \mathcal{M}_k)$ is expressed as:

$$p(y \mid \mathcal{M}_k) = \int p(y \mid \theta, \mathcal{M}_k)\, p(\theta \mid \mathcal{M}_k)\, d\theta. \quad (1)$$
Assuming an arbitrary distribution $Q(\theta \mid \mathcal{M}_k)$ is used to approximate $p(\theta \mid y, \mathcal{M}_k)$, we have:

$$
\begin{aligned}
\ln p(y \mid \mathcal{M}_k) &= \ln \frac{p(\theta \mid y, \mathcal{M}_k)\, p(y \mid \mathcal{M}_k)}{p(\theta \mid y, \mathcal{M}_k)} = \ln \frac{p(y \mid \theta, \mathcal{M}_k)\, p(\theta \mid \mathcal{M}_k)}{p(\theta \mid y, \mathcal{M}_k)} \\
&= \int Q(\theta \mid \mathcal{M}_k) \ln \frac{p(y \mid \theta, \mathcal{M}_k)\, p(\theta \mid \mathcal{M}_k)}{p(\theta \mid y, \mathcal{M}_k)}\, d\theta \\
&= \int Q(\theta \mid \mathcal{M}_k) \ln \frac{p(y \mid \theta, \mathcal{M}_k)\, p(\theta \mid \mathcal{M}_k)}{Q(\theta \mid \mathcal{M}_k)}\, d\theta + \int Q(\theta \mid \mathcal{M}_k) \ln \frac{Q(\theta \mid \mathcal{M}_k)}{p(\theta \mid y, \mathcal{M}_k)}\, d\theta \\
&= \mathcal{L}[Q(\theta \mid y, \mathcal{M}_k)] + \mathrm{KL}\big[Q(\theta \mid \mathcal{M}_k)\,\|\,p(\theta \mid y, \mathcal{M}_k)\big], \quad (2)
\end{aligned}
$$

where

$$
\mathcal{L}[Q(\theta \mid y, \mathcal{M}_k)] = \int Q(\theta \mid \mathcal{M}_k) \ln \frac{p(y \mid \theta, \mathcal{M}_k)\, p(\theta \mid \mathcal{M}_k)}{Q(\theta \mid \mathcal{M}_k)}\, d\theta, \qquad
\mathrm{KL}\big[Q\,\|\,p\big] = \int Q(\theta \mid \mathcal{M}_k) \ln \frac{Q(\theta \mid \mathcal{M}_k)}{p(\theta \mid y, \mathcal{M}_k)}\, d\theta. \quad (3)
$$

The second term of (2) is the Kullback-Leibler (KL) divergence between $Q(\theta \mid \mathcal{M}_k)$ and $p(\theta \mid y, \mathcal{M}_k)$. The KL divergence is non-negative and equals zero if and only if $Q(\theta \mid \mathcal{M}_k) = p(\theta \mid y, \mathcal{M}_k)$; it therefore measures the difference between the true and the approximate distributions. More importantly, since $\mathrm{KL}[Q \,\|\, p] \geq 0$, we have $\ln p(y \mid \mathcal{M}_k) \geq \mathcal{L}[Q(\theta \mid y, \mathcal{M}_k)]$, meaning that $\mathcal{L}$ is a lower bound on the logarithm of the model evidence. Hence, it can be used as an approximation of the model evidence for model selection.

VI employs the KL divergence as the metric that measures the accuracy of the approximation. Therefore, the goal is to find the $Q(\theta \mid y, \mathcal{M}_k)$ that minimizes the KL divergence:

$$Q^{opt}(\theta \mid y, \mathcal{M}_k) = \arg\min_Q \mathrm{KL}\big[Q(\theta \mid \mathcal{M}_k)\,\|\,p(\theta \mid y, \mathcal{M}_k)\big]. \quad (4)$$

Equation (2) implies that $\mathrm{KL}[Q \,\|\, p] = \ln p(y \mid \mathcal{M}_k) - \mathcal{L}[Q(\theta \mid y, \mathcal{M}_k)]$, where $\ln p(y \mid \mathcal{M}_k)$ is independent of $Q$. Hence, the optimization problem is equivalent to maximizing the lower bound:

$$Q^{opt}(\theta \mid y, \mathcal{M}_k) = \arg\max_Q \mathcal{L}[Q(\theta \mid y, \mathcal{M}_k)]. \quad (5)$$

Without further constraints on $Q(\theta \mid y, \mathcal{M}_k)$, the solution is $Q(\theta \mid y, \mathcal{M}_k) = p(\theta \mid y, \mathcal{M}_k)$, which offers no help in resolving the intractable Bayesian estimation. To relax the complicated Bayesian model, $Q(\theta \mid y, \mathcal{M}_k)$ is assigned a simple structure. VI expresses $Q$ in a factorized form, based on the mean-field approximation from statistical physics:

$$Q(\theta \mid y, \mathcal{M}_k) = \prod_i q(\theta_i \mid \mathcal{M}_k), \quad (6)$$

where the $q(\theta_i \mid \mathcal{M}_k)$ are independent distributions for the elements of $\theta$. Consequently, problem (5) becomes:

$$
Q^{opt}(\theta \mid y, \mathcal{M}_k) = \arg\max_Q \mathcal{L}[Q(\theta \mid y, \mathcal{M}_k)]
\quad \text{subject to:} \quad Q(\theta \mid y, \mathcal{M}_k) = \prod_i q(\theta_i \mid \mathcal{M}_k), \quad \int q(\theta_i \mid \mathcal{M}_k)\, d\theta_i = 1, \; i = 1, 2, \ldots \quad (7)
$$

After substituting the constraints, the cost function becomes convex with respect to each factor $q(\theta_i \mid \mathcal{M}_k)$. According to the theory of variational calculus, the solution to the problem satisfies:

$$\ln q(\theta_i \mid \mathcal{M}_k) = E_{j \neq i}\big[\ln p(y, \theta)\big] + c_{\theta_{j \neq i}}, \quad (8)$$

where the expectation is taken with respect to the factors $q(\theta_j \mid \mathcal{M}_k)$ ($j \neq i$), and $c_{\theta_{j \neq i}}$ is a term independent of $\theta_i$. Although the equations in (8) give the consistency conditions of the optimal solution of problem (7), they cannot be solved analytically. To find the solution, the factors are updated cyclically, following the scheme of the coordinate descent method. Since the cost function is convex in each factor, convergence is guaranteed [1].
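To make the cyclic updates of (8) concrete, the following is a minimal sketch of coordinate-ascent variational inference for a toy conjugate model (a univariate Gaussian with unknown mean and precision, cf. [1]); the data, priors and variable names are illustrative assumptions, not the network model of Section 3.

```python
import numpy as np

# Mean-field VI for x_i ~ N(mu, tau^-1), with priors
# mu | tau ~ N(mu0, (lam0*tau)^-1) and tau ~ Gamma(a0, b0).
# Each factor update below is an instance of equation (8):
# ln q(theta_i) = E_{j != i}[ln p(y, theta)] + const.
rng = np.random.default_rng(0)
x = rng.normal(1.5, 0.5, size=200)            # synthetic data (assumption)
N, xbar = x.size, x.mean()
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3     # non-informative priors

E_tau = 1.0                                   # initialization
for _ in range(50):
    # Update q(mu) = N(mu_N, lam_N^-1) given E[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N) given E[mu] and E[mu^2]
    a_N = a0 + 0.5 * (N + 1)
    E_sq = 1.0 / lam_N + mu_N**2              # E[mu^2] under q(mu)
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * np.sum(x) * mu_N + N * E_sq
                      + lam0 * (E_sq - 2 * mu0 * mu_N + mu0**2))
    E_tau = a_N / b_N
print(mu_N, np.sqrt(b_N / a_N))               # approx. 1.5 and 0.5
```

Each pass holds all factors but one fixed and refreshes that factor in closed form; the lower bound (2) is non-decreasing along these sweeps, which is the convergence argument used above.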
3 MODEL FORMULATION

3.1 The dynamical structure function

The sparse network of $n$ nodes is described by a linear state-space model:

$$
x(t+1) = A x(t) + B_u u(t) + B_e e(t), \qquad y(t) = C x(t), \quad (9)
$$

where $x \in \mathbb{R}^n$ are the states of the system, each of which represents a node; $u \in \mathbb{R}^m$ denote inputs; $y \in \mathbb{R}^p$ are the measurements of the states; and $e \in \mathbb{R}^q$ is i.i.d. white Gaussian noise with zero mean and covariance matrix $P_e$. Without loss of generality, $P_e$ is assumed to be diagonal: if the covariance matrix is full, one can decompose it via singular value decomposition (SVD) as $P_e = R \Sigma R'$, and replacing $B_e$ with $B_e R$ yields noise with a diagonal covariance matrix. $A \in \mathbb{R}^{n \times n}$, $B_u \in \mathbb{R}^{n \times m}$, $B_e \in \mathbb{R}^{n \times q}$ and $C \in \mathbb{R}^{p \times n}$ are system matrices.

It is common in practice that some of the nodes are unobservable (hidden states). For example, in a gene regulatory network, the concentrations of proteins are usually not measured due to high experimental cost. Therefore, the target of inference is to build a network consisting of the measurable nodes; for a gene regulatory network, this means the network is inferred at the transcriptional level. Assuming the first $p < n$ states are observable (i.e. $C = [\,I \;\; 0\,]$), model (9) is rewritten as follows:

$$
\begin{bmatrix} y(t+1) \\ h(t+1) \end{bmatrix}
= \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} y(t) \\ h(t) \end{bmatrix}
+ \begin{bmatrix} B_{u1} \\ B_{u2} \end{bmatrix} u(t)
+ \begin{bmatrix} B_{e1} \\ B_{e2} \end{bmatrix} e(t), \quad (10)
$$

where $h \in \mathbb{R}^{n-p}$ are the hidden states. To avoid inferring hidden states, they are removed from the model. Dynamical structure functions (DSF) encode the information of the hidden states via transfer functions [35]:

$$Y = Q Y + P U + H E, \quad (11)$$

where $q$ denotes the time-shift operator ($y(t+1) = q\, y(t)$) and

$$Q = (qI - D)^{-1}(W - D), \qquad P = (qI - D)^{-1} V_u, \qquad H = (qI - D)^{-1} V_e, \quad (12)$$

with

$$
W = A_{11} + A_{12}(qI - A_{22})^{-1} A_{21}, \qquad
V_u = A_{12}(qI - A_{22})^{-1} B_{u2} + B_{u1}, \qquad
V_e = A_{12}(qI - A_{22})^{-1} B_{e2} + B_{e1}, \qquad
D = \mathrm{diag}\{W_{11}, W_{22}, \ldots, W_{pp}\}. \quad (13)
$$

$Q$, $P$ and $H$ are transfer matrices, each element of which is a strictly proper transfer function, indicating that the network is a causal system [34]. Matrix $Q$ encodes the connectivity among observable nodes; its diagonal elements are zero. Matrices $P$ and $H$ relate inputs and process noise to the nodes, respectively. The topology of the network (i.e. the model structure) is reflected by the zero structure of these three matrices. For example, if $[Q]_{ij}$ is zero, the $j$th node does not control the $i$th node. Model structures are denoted by $\mathcal{M}_k$, and $M_k$ denotes the number of links; in particular, $\mathcal{M}_0$ represents the fully-connected topology. The internal dynamics of the network are described by the transfer functions, and the order of a transfer function relates to the number of hidden states involved in that regulation pathway.

The input-output map of the network is associated with the DSF as follows:

$$Y = G_u U + G_e E, \quad (14)$$

where

$$G_u = (I - Q)^{-1} P, \qquad G_e = (I - Q)^{-1} H. \quad (15)$$

Ideally, the input-output map can be perfectly recovered from measurement data. Nevertheless, the corresponding DSF may not be unique, meaning that the network topology is unidentifiable. To ensure the inference problem is well posed, additional constraints are imposed.

Proposition 1 (Identifiability of DSF networks) [13] Given a $p \times (m+q)$ transfer matrix $G = [G_u, G_e]$, the DSF is identifiable if and only if $p-1$ elements in each column of $[Q, P, H]'$ are known, which uniquely specifies the component of $(Q, P, H)$ in the null space of $[G', I]$.
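As an illustration of (12)-(13), the sketch below evaluates the DSF transfer matrices at a single frequency point $z$ from a partitioned state-space model; the matrices, dimensions and evaluation point are arbitrary assumptions made for the example, and the noise channel is omitted for brevity.

```python
import numpy as np

def dsf_at(z, A, Bu, p):
    """Evaluate the DSF matrices Q(z), P(z) of (12)-(13) at one complex
    frequency z, for a partitioned A with p observed states."""
    n = A.shape[0]
    A11, A12 = A[:p, :p], A[:p, p:]
    A21, A22 = A[p:, :p], A[p:, p:]
    Bu1, Bu2 = Bu[:p, :], Bu[p:, :]
    inner = np.linalg.inv(z * np.eye(n - p) - A22)
    W = A11 + A12 @ inner @ A21                 # W(z) of (13)
    Vu = A12 @ inner @ Bu2 + Bu1                # V_u(z) of (13)
    D = np.diag(np.diag(W))                     # diagonal part of W(z)
    lhs = np.linalg.inv(z * np.eye(p) - D)      # (z I - D)^{-1}
    Q = lhs @ (W - D)                           # (12): zero diagonal
    P = lhs @ Vu
    return Q, P

# Toy system (assumption): 3 observed + 2 hidden states, 1 input.
rng = np.random.default_rng(1)
A = 0.3 * rng.standard_normal((5, 5))
Bu = rng.standard_normal((5, 1))
Q, P = dsf_at(1.2 + 0.0j, A, Bu, p=3)
print(np.diag(Q))   # diagonal of Q is zero by construction
```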
A sufficient condition for network identifiability is that matrix $H$ is diagonal, so that $p-1$ elements in each column of $[Q, P, H]'$ are known to be zero. In what follows, we make the following assumptions so that no prior knowledge of matrix $P$ (the structure of the input channels) is required.

Assumption 2 The noise matrix $H$ is diagonal, monic ($\lim_{q \to \infty} qH = I$) and minimum phase.

Stability and sparsity are basic properties of many practical networks such as biological networks and power systems. Therefore, we assume the target network is stable and sparse.

Assumption 3 Each element of the transfer matrices $Q$ and $P$ is stable. Matrices $Q$ and $P$ are sparse.

3.2 The likelihood distribution

After simple manipulations, the DSF in (11) can be rewritten as:

$$Y = F_y Y + F_u U + \bar{E}, \quad (16)$$

where

$$F_u = (qH)^{-1} P, \qquad F_y = I - (qH)^{-1}(I - Q), \qquad \bar{E} = q^{-1} E. \quad (17)$$

Model (16) is a valid causal system. By the assumptions, the transfer matrices $F_u$ and $F_y$ are stable. In addition, since $H$ is diagonal, $F_u$ and $F_y$ have the same zero structure as $P$ and $Q$. As a result, $F_u$ and $F_y$ are sparse matrices and imply the network topology.

Identifying model (16) is non-trivial. Since the number of hidden states and the connectivity among them are unknown, estimating the orders of the transfer functions requires an exhaustive search over all possible combinations, which is computationally prohibitive for large-scale networks. Additionally, imposing stable transfer matrices is problematic. To simplify the identification problem, we express model (16) in a non-parametric way. By doing so, the selection of model complexity is avoided and, more importantly, system stability can be promoted effectively. The dynamical system for the $i$th target node is formulated below:

$$
y_i(t) = \sum_{j=1}^{p} \sum_{k=1}^{\infty} h^y_{ij}(k)\, y_j(t-k) + \sum_{j=1}^{m} \sum_{k=1}^{\infty} h^u_{ij}(k)\, u_j(t-k) + \bar{e}_i(t), \quad (18)
$$

where $h^y_{ij}$ and $h^u_{ij}$ are the impulse responses of the transfer functions $[F_y]_{ij}$ and $[F_u]_{ij}$, respectively. The objective is to estimate these impulse responses.

For implementation purposes, the impulse responses are truncated after sample time $T$, which is set sufficiently large to capture the major dynamics of the impulse responses (i.e. $|h(k)| \approx 0$ for $k \geq T$). Assume the availability of time-series data collected from $L$ independent experiments for each node and input. For the $i$th target node with $\mathcal{M}_0$, we define the following matrices and vectors; for the other possible model structures $\mathcal{M}_k$, the corresponding terms are defined in the same way:

$$
Y_q = \begin{bmatrix} y_{q,i}(N_q) \\ \vdots \\ y_{q,i}(T_q+1) \end{bmatrix}, \qquad
w_q = \begin{bmatrix} w_{q,1} \\ \vdots \\ w_{q,p+m} \end{bmatrix}, \qquad
\Phi_q = [\,\Phi_{q,y} \;\; \Phi_{q,u}\,],
$$
$$
\Phi_{q,y} = \begin{bmatrix}
y_{q,1}(N_q-1 : N_q-T_q) & \cdots & y_{q,p}(N_q-1 : N_q-T_q) \\
\vdots & \ddots & \vdots \\
y_{q,1}(T_q : 1) & \cdots & y_{q,p}(T_q : 1)
\end{bmatrix}, \qquad
\Phi_{q,u} = \begin{bmatrix}
u_{q,1}(N_q-1 : N_q-T_q) & \cdots & u_{q,m}(N_q-1 : N_q-T_q) \\
\vdots & \ddots & \vdots \\
u_{q,1}(T_q : 1) & \cdots & u_{q,m}(T_q : 1)
\end{bmatrix},
$$
$$
\sigma^{-1} = E\{\bar{e}_i(t)^2\}, \quad (19)
$$

where the subscript $q$ indexes the experiments. Under different experimental conditions, data are produced from different internal dynamics (i.e. independent impulse responses) whilst the network topology is unchanged. $N_q$ is the number of data points. $Y_q \in \mathbb{R}^{N_q - T_q}$ contains the time series of the $i$th node; $w_q \in \mathbb{R}^{T_q(p+m)}$ contains $p+m$ groups of impulse responses, each of which corresponds to a transfer function of $F_y$ or $F_u$; $\Phi_q \in \mathbb{R}^{(N_q - T_q) \times T_q(p+m)}$ contains the time series of all nodes and inputs; and $\sigma$ is the noise precision. Note that the dimensions of these quantities vary with the model structure.
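The sketch below builds the regressor matrix of (19) from raw time series for one experiment; the function name, shapes and toy data are assumptions. Rows here ascend in time, whereas (19) lists them in reverse order, which is equivalent for estimation.

```python
import numpy as np

def build_regression(y, u, T):
    """Build Phi_q of (19): y is (N, p) node data, u is (N, m) input data,
    T is the impulse-response truncation length. Returns the regressor
    matrix; the target vector Y_q for node i is simply y[T:, i]."""
    N = y.shape[0]
    # Column k of a lagged block holds the signal delayed by k+1 samples:
    # row r (predicting time T+r) contains sig[T+r-1], ..., sig[r].
    def lagged(sig):
        return np.column_stack([sig[T - 1 - k : N - 1 - k] for k in range(T)])
    return np.hstack([lagged(y[:, j]) for j in range(y.shape[1])]
                     + [lagged(u[:, j]) for j in range(u.shape[1])])

# Toy usage (assumption): 2 nodes, 1 input, 100 samples, T = 5.
rng = np.random.default_rng(2)
y = rng.standard_normal((100, 2))
u = rng.standard_normal((100, 1))
Phi = build_regression(y, u, T=5)
Y_i = y[5:, 0]                      # target for node i = 1
print(Phi.shape, Y_i.shape)         # (95, 15) (95,)
```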
Based on Bayes' rule, the likelihood distribution of the $i$th target node with $\mathcal{M}_k$ is:

$$
p(Y \mid w, \sigma, \mathcal{M}_k) = \prod_{q=1}^{L} (2\pi\sigma^{-1})^{-\frac{N_q - T_q}{2}} \exp\Big\{-\frac{\sigma}{2}\,\|Y_q - \Phi_q w_q\|_2^2\Big\}. \quad (20)
$$

3.3 The prior distributions

The full Bayesian treatment deploys prior distributions for each random quantity to build up a hierarchical structure. The prior distributions play a role similar to penalties in regularized optimization problems; they are the key elements for incorporating prior knowledge and imposing the desired constraints.

Since the impulse responses of model (16) are stable (i.e. $\sum_{k=1}^{\infty} |h(k)| < \infty$), regularization for stability is imposed to incorporate this prior knowledge. Kernel machines provide an effective way to construct a functional space as the feasible domain of stable impulse responses [10]. A reproducing kernel Hilbert space (RKHS) containing stable impulse responses is established using a proper kernel function [12], and the impulse responses of the model are estimated by solving a regularized optimization problem in that RKHS [24]. From the Bayesian perspective, kernel machines can be formulated by introducing Gaussian processes for the impulse responses and solving a maximum a posteriori (MAP) problem [8, 22, 26]. Therefore, we assume the impulse responses of the model are independent Gaussian processes whose covariance functions are Tuned/Correlated (TC) kernels. The TC kernel has been widely used to characterize stable impulse responses [3]. Other valid kernels include the Diagonal/Correlated kernel (DC kernel) (1) and the second-order stable spline kernel (SS kernel) (2) [12, 23]. The reason why the TC kernel is applied in this paper will be explained in the following sections. As a result, the prior distribution for $w$ is:

$$
p(w \mid \lambda, \beta, \sigma, \mathcal{M}_k) = \prod_{q=1}^{L} \prod_{i=1}^{M_k} \mathcal{N}(w_{q,i} \mid 0,\; \sigma^{-1}\lambda_i^{-1} K_{q,i}), \quad (21)
$$

where $\beta$ are the hyperparameters of the TC kernels, which control the exponential decay rate of the impulse responses, and $\lambda$ are the scale variables of the kernel functions. In kernel machines, they introduce the effect of Automatic Relevance Determination (ARD), which promotes sparse estimates [9]. In the Bayesian model, $\lambda$ influences the probability of the model structure (i.e. the network topology): as $\lambda_i$ approaches infinity, the distribution $p(w_i \mid \lambda_i, \beta_i, \sigma)$ collapses to an impulse at the origin, enforcing zero impulse responses, in which case the $i$th node or input does not control the target node. Note that the hyperparameters $\beta$ and $\lambda$ are shared across all experiments. As is standard in variational inference, the noise precision $\sigma$ is also used to scale the covariance matrix. Here $K_{q,i} \in \mathbb{R}^{T_q \times T_q}$, $\lambda = [\lambda_1, \ldots, \lambda_{M_k}]'$, $\beta = [\beta_1, \ldots, \beta_{M_k}]'$ and

$$
[K_{q,i}]_{ts} = k(t, s; \beta_i), \qquad k(t, s; \beta_i) = \beta_i^{\max(t,s)}, \qquad 0 < \beta_i < 1, \quad \lambda_i \geq 0. \quad (22)
$$

Since $\sigma$ is non-negative, a Gamma distribution is assigned as its conjugate prior. Without a specific preference on $\sigma$, parameters $a_0$ and $b_0$ of the distribution are set to $0.001$, resulting in a non-informative prior:

$$
p(\sigma \mid a_0, b_0) = \mathrm{Gamma}(\sigma \mid a_0, b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)}\, \sigma^{a_0 - 1} e^{-b_0 \sigma}, \quad (23)
$$

where $\Gamma(\cdot)$ is the gamma function.
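Constructing the TC Gram matrix of (22) and the prior covariance of (21) is a one-liner; the sketch below uses illustrative values of $T$, $\beta$, $\sigma$ and $\lambda_i$ (all assumptions).

```python
import numpy as np

def tc_kernel(T, beta):
    """Gram matrix of the TC kernel (22): [K]_{ts} = beta**max(t, s),
    for t, s = 1, ..., T."""
    idx = np.arange(1, T + 1)
    return beta ** np.maximum.outer(idx, idx)

# Prior covariance of one impulse-response group in (21) is
# sigma^{-1} * lambda_i^{-1} * K; values below are assumptions.
K = tc_kernel(T=20, beta=0.8)
cov = (1.0 / 1.0) * (1.0 / 2.0) * K     # sigma = 1, lambda_i = 2
# Samples from N(0, cov) decay roughly like beta**t: stable responses.
w = np.random.default_rng(3).multivariate_normal(np.zeros(20), cov)
print(np.round(w[:5], 3))
```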
Footnotes: (1) $k_{DC}(t, s; \beta_1, \beta_2) = \beta_1^{(t+s)/2} \beta_2^{|t-s|}$, with $\beta_1 \in (0,1)$ and $\beta_2 \in (-1,1)$. (2) $k_{SS}(s, t; \beta) = \frac{\beta^{\,t+s+\max(t,s)}}{2} - \frac{\beta^{\,3\max(t,s)}}{6}$, with $\beta \in (0,1)$.

Finally, hyperpriors are assigned to the hyperparameters to complete the hierarchy. For hyperparameter $\lambda_i$, the Gamma distribution is applied as the conjugate prior. As for $\sigma$, a non-informative prior is adopted:

$$
p(\lambda_i \mid a_0, b_0) = \mathrm{Gamma}(\lambda_i \mid a_0, b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)}\, \lambda_i^{a_0 - 1} e^{-b_0 \lambda_i}. \quad (24)
$$

For hyperparameter $\beta_i$, the uniform distribution on $(0,1)$ is employed as the prior, i.e., $p(\beta_i) = 1$, $0 < \beta_i < 1$.

3.4 The full Bayesian model

By combining the likelihood and the prior distributions, the full Bayesian distribution for model (18) is:

$$
\begin{aligned}
p(w, \sigma, \lambda, \beta \mid Y, \mathcal{M}_k) \propto{}& \prod_{q=1}^{L} \Big\{ (2\pi\sigma^{-1})^{-\frac{N_q - T_q}{2}} \exp\Big\{-\frac{\sigma}{2}\|Y_q - \Phi_q w_q\|_2^2\Big\} \times |2\pi\sigma^{-1}\Lambda_q^{-1} K_q|^{-\frac{1}{2}} \exp\Big\{-\frac{\sigma}{2} w_q' \Lambda_q K_q^{-1} w_q\Big\} \Big\} \\
&\times \frac{b_0^{a_0}}{\Gamma(a_0)}\sigma^{a_0-1} e^{-b_0\sigma} \times \prod_{i=1}^{M_k} \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda_i^{a_0-1} e^{-b_0\lambda_i}, \quad (25)
\end{aligned}
$$

where $K_q = \mathrm{blkdiag}\{K_{q,1}, \ldots, K_{q,M_k}\}$ and $\Lambda_q = \mathrm{diag}\{\lambda\} \otimes I_{T_q}$.

3.5 Estimation of hyperparameters

Given (25), the impulse responses are estimated as the mean of the marginal posterior distribution (i.e. $p(w \mid Y) = \int p(w, \sigma, \lambda, \beta \mid Y)\, d\sigma\, d\lambda\, d\beta$). Nevertheless, the distribution $p(w \mid Y)$ is intractable because the full Bayesian model is highly nonlinear in the hyperparameters. Therefore, the estimate of the impulse responses cannot be calculated in closed form.

In the system identification community, deterministic Bayesian approximations have been widely used as a remedy for non-Gaussianity [1]. A candidate distribution $p(w)$ is proposed to approximate the marginal distribution $p(w \mid Y)$ analytically. A metric is designed to measure the error between the approximate and the true distributions, the approximate distribution is optimized by minimizing this metric, and the estimate of $w$ is finally calculated as the mean of the optimal candidate distribution.

Empirical Bayes and variational inference are two typical methods of deterministic Bayesian approximation, with empirical Bayes being more prevalent in kernel-based system identification. There, the conditional distribution $p(w \mid \lambda, \beta, \sigma, Y)$ is used to approximate the marginal distribution $p(w \mid Y)$. The hyperparameters are optimized by solving a type II or evidence maximization problem (i.e. $(\lambda^{opt}, \beta^{opt}, \sigma^{opt}) = \arg\min_{\lambda, \beta, \sigma} -\log p(\lambda, \beta, \sigma \mid Y)$). Consequently, the optimal conditional distribution is $p(w \mid \sigma^{opt}, \lambda^{opt}, \beta^{opt}, Y)$, and the estimate of $w$ is easily calculated as $\hat{w} = E_{w \mid \sigma, \lambda, \beta, Y}(w)$.

Empirical Bayes provides an effective way to estimate the hyperparameters of kernels, and this framework has been shown to be very powerful in exploring system dynamics [21]. Due to the effect of ARD, the estimated $\lambda$ can also be used for model selection (detection of the network topology), leading to sparse networks: the zero structure of $\lambda^{-1}$ indicates the network topology. Nevertheless, empirical Bayes does not evaluate the model evidence $p(Y \mid \mathcal{M}_k)$. Hence, it is difficult to assess the relative confidence of the estimated model structure over the other possible structures.
Compared with empirical Bayes, VI provides a reasonable approximation of the model evidence, which is essential for topology detection. VI has been applied to estimate models that can be cast as sparse linear regressions (i.e. sparse Bayesian learning [17, 20, 30, 33]), where Monte Carlo simulations have shown that VI outperforms empirical Bayes [16]. Nevertheless, VI is much less popular in kernel-based system identification: the updated factors in each iteration no longer have closed-form expressions due to the complex structure of the kernel functions, which requires inner sampling loops. In addition, operations on high-dimensional matrices are more involved in this case; in particular, VI has to calculate the inverses and determinants of the ill-conditioned covariance matrices constructed from kernel functions. Nevertheless, by using the TC kernel, the computational efficiency and robustness of VI are dramatically improved, which makes VI applicable to practical problems.

4 VARIATIONAL INFERENCE OF DYNAMICAL STRUCTURE FUNCTIONS

4.1 Update of random quantities

For each model structure $\mathcal{M}_k$ of (18), the corresponding full Bayesian model $p(w, \sigma, \lambda, \beta \mid Y, \mathcal{M}_k)$ is approximated by a candidate distribution $Q(w, \sigma, \lambda, \beta \mid \mathcal{M}_k)$ using the mean-field factorization:

$$Q(w, \sigma, \lambda, \beta \mid \mathcal{M}_k) = q(w, \sigma \mid \mathcal{M}_k)\, q(\lambda \mid \mathcal{M}_k)\, q(\beta \mid \mathcal{M}_k), \quad (26)$$

where $Q(\cdot)$ and $q(\cdot)$ are valid probability distributions. Hereafter, the symbol $\mathcal{M}_k$ is suppressed to simplify the notation.

The factors of (26) are derived according to (8). In what follows, terms independent of the random variables of the factor under consideration are ignored for convenience. To begin with, factor $q(w, \sigma)$ is solved as a Gaussian-Gamma distribution:

$$\ln q(w_q, \sigma) = \ln \mathcal{N}(w_q \mid \mu_q, \sigma^{-1}\Sigma_q) + \ln \mathrm{Gamma}(\sigma \mid a_\sigma, b_\sigma), \quad (27)$$

where

$$
\begin{aligned}
&\Sigma_q^{-1} = \Phi_q' \Phi_q + E_\lambda(\Lambda_q)\, E_\beta(K_q^{-1}), \qquad \mu_q = \Sigma_q \Phi_q' Y_q, \\
&a_\sigma = \sum_{q=1}^{L} \frac{N_q - T_q}{2} + a_0, \qquad b_\sigma = b_0 + \frac{1}{2}\sum_{q=1}^{L}\big(Y_q' Y_q - \mu_q' \Sigma_q^{-1} \mu_q\big), \\
&E_{\sigma,w}(\sigma w_q w_q') = \frac{a_\sigma}{b_\sigma}\mu_q\mu_q' + \Sigma_q, \qquad E_{\sigma,w}(\sigma w_q) = \frac{a_\sigma}{b_\sigma}\mu_q. \quad (28)
\end{aligned}
$$

Following the same procedure, factor $q(\lambda)$ is solved as independent Gamma distributions:

$$\ln q(\lambda_i) = \ln \mathrm{Gamma}(\lambda_i \mid a_{\lambda_i}, b_{\lambda_i}), \quad (29)$$

where

$$
a_{\lambda_i} = \sum_{q=1}^{L}\frac{T_q}{2} + a_0, \qquad E_\lambda(\lambda_i) = \frac{a_{\lambda_i}}{b_{\lambda_i}}, \qquad
b_{\lambda_i} = b_0 + \frac{1}{2}\,\mathrm{trace}\Bigg[\sum_{q=1}^{L} E_\beta(K_{q,i}^{-1})\, E_{w,\sigma}(\sigma w_q w_q')^{T_q}_i\Bigg]. \quad (30)
$$

Finally, factor $q(\beta)$ is calculated as:

$$\ln q(\beta) = E_{w,\sigma,\lambda}[\ln p(w \mid \sigma, \lambda, \beta) + \ln p(\beta)] + c_{w,\sigma,\lambda} = E_{w,\sigma,\lambda}[\ln p(w \mid \sigma, \lambda, \beta)] + \ln p(\beta) + c_{w,\sigma,\lambda}, \quad (31)$$

where

$$
E_{w,\sigma,\lambda}[\ln p(w \mid \sigma, \lambda, \beta)] = \sum_{q=1}^{L}\Big\{ -\frac{1}{2}\ln|K_q| - \frac{1}{2}\mathrm{trace}\big[E_\lambda(\Lambda_q) K_q^{-1} E_{w,\sigma}(\sigma w_q w_q')\big]\Big\}, \qquad \ln p(\beta) = 0. \quad (32)
$$

Unlike the other random variables, factor $q(\beta)$ has no closed-form expression. Nevertheless, the elements of $\beta$ are independently distributed as:

$$
q(\beta_i) = c_i \prod_{q=1}^{L} |K_{q,i}|^{-\frac{1}{2}} \exp\Big\{-\frac{1}{2} E_{\lambda_i}(\lambda_i)\,\mathrm{trace}\big[K_{q,i}^{-1} E_{w,\sigma}(\sigma w_q w_q')^{T_q}_i\big]\Big\}, \quad (33)
$$

where $c_i$ is an unknown normalization constant.
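The closed-form updates (28) and (30) are plain linear algebra once $E_\beta(K^{-1})$ is available; the following is a minimal single-experiment sketch ($L = 1$; function names, shapes and hyperprior values are assumptions).

```python
import numpy as np

def update_q_w_sigma(Phi, Y, E_Lam, E_Kinv, a0=1e-3, b0=1e-3):
    """q(w, sigma) update of (28) for a single experiment:
    Sigma^{-1} = Phi'Phi + E[Lambda] E[K^{-1}],  mu = Sigma Phi' Y."""
    Prec = Phi.T @ Phi + E_Lam @ E_Kinv
    Sigma = np.linalg.inv(Prec)
    mu = Sigma @ Phi.T @ Y
    a_sig = a0 + 0.5 * Y.size                  # (N - T)/2 + a0
    b_sig = b0 + 0.5 * (Y @ Y - mu @ Prec @ mu)
    return mu, Sigma, a_sig, b_sig

def update_q_lambda(mu, Sigma, a_sig, b_sig, E_Kinv_blocks, T,
                    a0=1e-3, b0=1e-3):
    """q(lambda_i) update of (30): one Gamma factor per group, using the
    i-th T x T diagonal block of E[sigma w w']."""
    S = (a_sig / b_sig) * np.outer(mu, mu) + Sigma
    E_lam = []
    for i, Kinv_i in enumerate(E_Kinv_blocks):
        blk = S[i * T:(i + 1) * T, i * T:(i + 1) * T]
        a_l = a0 + 0.5 * T
        b_l = b0 + 0.5 * np.trace(Kinv_i @ blk)
        E_lam.append(a_l / b_l)
    return np.array(E_lam)
```

Large values of `E_lam[i]` correspond to groups being switched off by ARD, which is what ultimately exposes the network topology.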
4.2 Lower bound of the model evidence

Combining all the factors above, the lower bound $\mathcal{L}[Q(w, \sigma, \lambda, \beta \mid \mathcal{M}_k)]$ of the model evidence $p(Y \mid \mathcal{M}_k)$ is:

$$
\begin{aligned}
\mathcal{L}[Q(w, \sigma, \lambda, \beta \mid \mathcal{M}_k)] ={}& \frac{1}{2}\sum_{q=1}^{L}\Big\{\ln|\Sigma_q| - \frac{a_\sigma}{b_\sigma}\big[\|Y_q - \Phi_q\mu_q\|_2^2 + \mathrm{trace}(\Phi_q\Sigma_q\Phi_q')\big]\Big\} - b_0\frac{a_\sigma}{b_\sigma} - a_\sigma \ln b_\sigma \\
&- \sum_{i=1}^{M_k} a_{\lambda_i}\ln b_{\lambda_i} + \frac{M_k \sum_{q=1}^{L} T_q}{2} + M_k\big[a_0 \ln b_0 - \ln\Gamma(a_0)\big] \\
&- \sum_{i=1}^{M_k}\Big[b_0\frac{a_{\lambda_i}}{b_{\lambda_i}} - a_{\lambda_i} - \ln\Gamma(a_{\lambda_i}) + \ln c_i\Big], \quad (34)
\end{aligned}
$$

where the constant terms independent of $\mathcal{M}_k$ are ignored.

4.3 Algorithm for variational inference

The factors $q(w, \sigma)$, $q(\lambda)$ and $q(\beta)$ cannot be solved jointly in closed form. Hence, they are updated cyclically in each iteration of the algorithm. The procedure is summarized in Algorithm 1.

Algorithm 1 Variational inference of DSF
1: Initialize $\mu$, $\Sigma$, $a_\sigma$, $b_\sigma$, $a_{\lambda_i}$, $b_{\lambda_i}$ and $E_\beta(K^{-1})$
2: for $k = 1 : Max$ do
3: Update $q(w, \sigma)$ as a Gaussian-Gamma distribution:
$$
\Sigma_q^{-1} = \Phi_q'\Phi_q + E_\lambda(\Lambda_q) E_\beta(K_q^{-1}), \quad \mu_q = \Sigma_q \Phi_q' Y_q, \quad a_\sigma = \sum_{q=1}^{L}\frac{N_q - T_q}{2} + a_0, \quad b_\sigma = b_0 + \frac{1}{2}\sum_{q=1}^{L}(Y_q'Y_q - \mu_q'\Sigma_q^{-1}\mu_q) \quad (35)
$$
4: Update $q(\lambda)$ as a Gamma distribution:
$$
a_{\lambda_i} = \sum_{q=1}^{L}\frac{T_q}{2} + a_0, \quad b_{\lambda_i} = b_0 + \frac{1}{2}\mathrm{trace}\Bigg[\sum_{q=1}^{L} E_{\beta_i}(K_{q,i}^{-1})\Big(\frac{a_\sigma}{b_\sigma}\mu_q\mu_q' + \Sigma_q\Big)^{T_q}_i\Bigg], \quad E(\lambda_i) = \frac{a_{\lambda_i}}{b_{\lambda_i}} \quad (36)
$$
5: Update $q(\beta_i)$ and $E_{\beta_i}(K_i^{-1})$ according to:
$$
q(\beta_i) = c_i \prod_{q=1}^{L}|K_{q,i}|^{-\frac{1}{2}}\exp\Bigg\{-\frac{a_{\lambda_i}}{2 b_{\lambda_i}}\mathrm{trace}\Big[K_{q,i}^{-1}\Big(\frac{a_\sigma}{b_\sigma}\mu_q\mu_q' + \Sigma_q\Big)^{T_q}_i\Big]\Bigg\} \quad (37)
$$
6: Update the lower bound $\mathcal{L}[Q]$ of the model evidence according to (34).
7: if $\mathcal{L}[Q]^k - \mathcal{L}[Q]^{k-1} < \epsilon$ (a small convergence tolerance) then
8: Break
9: end if
10: end for
11: Estimate $w$ as $\hat{w} = \mu$ and store $\mathcal{L}[Q]$ for topology detection

In Algorithm 1, the term $E(K^{-1})$ must be estimated in each iteration. However, since $q(\beta)$ is only known up to a normalization constant, $E(K^{-1})$ cannot be calculated analytically; numerical sampling methods are used to estimate the expectation instead. Since the hyperparameters $\beta_i \in (0,1)$ are independently distributed, they can be sampled in parallel. We use the Metropolis-Hastings method to draw samples from (33), with a uniform distribution as the proposal for $\beta_i$. Let $\hat{p}(\beta_i) = \prod_{q=1}^{L}|K_{q,i}|^{-\frac{1}{2}}\exp\{-\frac{a_{\lambda_i}}{2b_{\lambda_i}}\mathrm{trace}[K_{q,i}^{-1}(\frac{a_\sigma}{b_\sigma}\mu_q\mu_q' + \Sigma_q)^{T_q}_i]\}$. At the current state $\beta_i^k$, a proposal $\beta_i^p$ is drawn from:

$$
q_p(\beta_i^p \mid \beta_i^k) =
\begin{cases}
U\big(\beta_i^k - \tfrac{\varepsilon}{2},\; \beta_i^k + \tfrac{\varepsilon}{2}\big) & \tfrac{\varepsilon}{2} < \beta_i^k < 1 - \tfrac{\varepsilon}{2}, \\
U(0, \varepsilon) & \beta_i^k \leq \tfrac{\varepsilon}{2}, \\
U(1 - \varepsilon, 1) & \beta_i^k \geq 1 - \tfrac{\varepsilon}{2},
\end{cases} \quad (38)
$$

where $U(a, b)$ is the uniform distribution on $(a, b)$ and $\varepsilon$ is the selection window for sampling, set to $0.1$. The acceptance ratio is $r(\beta_i^p \mid \beta_i^k) = \min\{1, \frac{\hat{p}(\beta_i^p)\, q_p(\beta_i^k \mid \beta_i^p)}{\hat{p}(\beta_i^k)\, q_p(\beta_i^p \mid \beta_i^k)}\}$. If the proposal is accepted, $\beta_i^{k+1} = \beta_i^p$; otherwise, $\beta_i^{k+1} = \beta_i^k$. With $N$ samples $\{\beta_i^k \mid k = 1, \ldots, N\}$, $E(K_{q,i}^{-1})$ is estimated as $E(K_{q,i}^{-1}) = \frac{1}{N}\sum_{k=1}^{N}(K_{q,i}^k)^{-1}$.

In addition, to calculate the lower bound of the model evidence, one has to estimate the normalization constant $c_i$ of $q(\beta_i)$ in (33). Since $\beta_i$ is a scalar on $(0,1)$, numerical integration methods (e.g. adaptive quadrature [29]) are sufficient to give an accurate estimate based on $c_i^{-1} = \int \hat{p}(\beta_i)\, d\beta_i$.
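Below is a minimal sketch of the Metropolis-Hastings step with the windowed uniform proposal (38); the unnormalized log-density `log_p_hat` is left abstract (the caller would supply the log of (33)), and the toy target used in the usage line is an assumption.

```python
import numpy as np

def mh_beta(log_p_hat, n_samples=2000, eps=0.1, beta0=0.5, seed=0):
    """Metropolis-Hastings on (0, 1) with the windowed uniform proposal (38).
    log_p_hat: callable returning the log of the unnormalized density."""
    rng = np.random.default_rng(seed)

    def window(b):
        # Proposal support around b, clipped to (0, 1) as in (38).
        if eps / 2 < b < 1 - eps / 2:
            return b - eps / 2, b + eps / 2
        return (0.0, eps) if b <= eps / 2 else (1.0 - eps, 1.0)

    beta, samples = beta0, []
    for _ in range(n_samples):
        lo, hi = window(beta)
        prop = rng.uniform(lo, hi)
        # Hastings correction: all windows have width eps, so the density
        # ratio is 1 unless clipping makes the move asymmetric.
        fwd = 1.0 / (hi - lo)
        lo_b, hi_b = window(prop)
        bwd = (1.0 / (hi_b - lo_b)) if lo_b <= beta <= hi_b else 0.0
        ratio = np.exp(log_p_hat(prop) - log_p_hat(beta)) * (bwd / fwd)
        if rng.uniform() < min(1.0, ratio):
            beta = prop
        samples.append(beta)
    return np.array(samples)

# Toy usage (assumption): a Beta(3, 2)-shaped unnormalized target.
s = mh_beta(lambda b: 2 * np.log(b) + np.log(1 - b))
print(s.mean())   # close to 3 / 5
```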
4.4 Algorithm implementation

The algorithm requires the inverse and determinant of the covariance matrix $K_{q,i}$ in each iteration. These two operations are computationally heavy because $K_{q,i}$ is a full matrix, and its inverse and determinant must be calculated thousands of times in the sampling loop of $\beta_i$ per iteration. With standard methods, both demand $O(T_q^3)$ work. More importantly, $K_i$ may be ill-conditioned if hyperparameter $\beta_i$ is close to 0, causing numerical instability during implementation [7]. Therefore, it is necessary to find a robust and efficient way to handle these matrix operations.

It has been shown that the covariance matrix constructed using the TC or DC kernels can be decomposed analytically, whilst that of the SS kernel cannot [4, 6]. As a result, the SS kernel is not adopted in our VI framework. Rather, the TC kernel, which possesses the simplest structure, is applied to improve the robustness and reduce the computational cost of the algorithm. The inverse covariance matrix $K^{-1} \in \mathbb{R}^{T \times T}$ constructed using the TC kernel (i.e. $[K]_{ts} = k(t, s; \beta)$) can be decomposed as [3]:

$$K^{-1} = U' W U, \quad (39)$$

where $U$ is an upper triangular matrix and $W$ is a diagonal matrix:

$$
U = \begin{bmatrix}
1 & -1 & \cdots & 0 \\
0 & \ddots & \ddots & \vdots \\
\vdots & & \ddots & -1 \\
0 & \cdots & 0 & 1
\end{bmatrix}, \qquad
W = (1-\beta)^{-1}\,\mathrm{diag}\Big\{\beta^{-1}, \beta^{-2}, \ldots, \beta^{-T+1}, \frac{1-\beta}{\beta^{T}}\Big\}. \quad (40)
$$

Consequently, the matrix inversion only demands $O(T)$ work given the sparse structure of $U$ and $W$, and the matrix determinant ($|K| = \beta^{\frac{T(T+1)}{2}}(1-\beta)^{T-1}$) requires $O(1)$. With $N$ samples of $\beta$ ($\{\beta^k \mid k = 1, \ldots, N\}$), the expectation of the inverse matrix costs $O(NT)$ work:

$$E(K^{-1}) = \frac{1}{N} U' \Bigg[\sum_{k=1}^{N} W_k\Bigg] U. \quad (41)$$
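The factorization (39)-(40) and the closed-form determinant are easy to check numerically; in the small sketch below, the size and $\beta$ are arbitrary assumptions.

```python
import numpy as np

def tc_inverse_factors(T, beta):
    """U and W of (39)-(40) for the TC kernel: K^{-1} = U' W U."""
    U = np.eye(T) - np.diag(np.ones(T - 1), k=1)     # 1 on diag, -1 above
    w = beta ** -np.arange(1, T + 1, dtype=float)    # beta^-1, ..., beta^-T
    w[-1] *= (1 - beta)                              # last entry: (1-beta)/beta^T
    W = np.diag(w / (1 - beta))
    return U, W

T, beta = 6, 0.7
idx = np.arange(1, T + 1)
K = beta ** np.maximum.outer(idx, idx)               # TC Gram matrix (22)
U, W = tc_inverse_factors(T, beta)
print(np.allclose(U.T @ W @ U, np.linalg.inv(K)))    # True
# Closed-form determinant: |K| = beta^{T(T+1)/2} (1-beta)^{T-1}
print(np.isclose(np.linalg.det(K),
                 beta ** (T * (T + 1) / 2) * (1 - beta) ** (T - 1)))  # True
```

Since $U$ is unit bidiagonal and $W$ is diagonal, applying $K^{-1}$ to a vector costs $O(T)$, which is what makes the inner sampling loop affordable.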
4.5 Detection of network topology

The hyperparameters $\lambda^{-1}$ are ARD variables whose zero structure determines the network topology. Nevertheless, due to local optima and numerical errors, none of the estimated variables is exactly zero in a practical implementation. To improve the inference accuracy, a backward selection method is used for model selection.

Model selection is based on the posterior distribution of the model structures (i.e. $p(\mathcal{M}_k \mid y) \propto p(y \mid \mathcal{M}_k)\, p(\mathcal{M}_k)$). Assuming equal prior probability for each model structure, this distribution is proportional to the model evidence, $p(\mathcal{M}_k \mid y) \propto p(y \mid \mathcal{M}_k)$. As a by-product, VI provides a lower bound on the model evidence, which can be used to determine the most probable model structure (network topology). Nevertheless, the complete evaluation of the model evidence requires an exhaustive search over all possible model structures, which is computationally prohibitive for large-scale networks. To tackle this problem, we adopt a heuristic strategy that applies backward selection to reduce the computational burden. To begin with, VI is conducted to infer a fully-connected network (i.e. $\mathcal{M}_0$). The confidence of each inferred link is measured by the norm of its impulse response. Links are then removed from the model one by one, each removal producing a particular model structure, and the VI algorithm is run repeatedly on the proposed model structures until the network becomes empty. The best model structure is the one with the highest lower bound on the model evidence. The procedure is summarized in Algorithm 2 (a code sketch follows the listing).

Algorithm 2 Variational inference of network topology
1: Implement Algorithm 1 with model structure $\mathcal{M}_0$.
2: Set the thresholds:
3: $R = \mathrm{asc}\{\sum_q [\,\|w_{q,1}\|, \ldots, \|w_{q,p+m}\|\,]\}$.
4: for $k = 1 : p+m-1$ do
5: $I = \{i \mid \sum_q \|w_{q,i}\| \leq R_k\}$
6: Remove the links in index set $I$, producing $\mathcal{M}_k$
7: Run Algorithm 1 with $\mathcal{M}_k$
8: Store the value of $\mathcal{L}[Q(w, \sigma, \lambda, \beta \mid \mathcal{M}_k)]$
9: end for
10: $\mathcal{M}_{opt} = \arg\max_{\mathcal{M}_k} \mathcal{L}[Q(w, \sigma, \lambda, \beta \mid \mathcal{M}_k)]$
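A compact sketch of the backward-selection loop of Algorithm 2 follows; `run_vi` is a stand-in (assumption) for Algorithm 1, returning the summed impulse-response norm per candidate link and the evidence lower bound.

```python
import numpy as np

def backward_selection(run_vi, n_links):
    """Backward selection over model structures (Algorithm 2).
    run_vi(active): runs Algorithm 1 on the structure whose links are
    flagged in the boolean mask `active`; returns (w_norms, lower_bound)."""
    active = np.ones(n_links, dtype=bool)       # M_0: fully connected
    w_norms, lb = run_vi(active)
    best_lb, best_active = lb, active.copy()
    order = np.argsort(w_norms)                 # ascending link confidence
    for idx in order[:-1]:                      # never empty the model
        active[idx] = False                     # drop the weakest link next
        _, lb = run_vi(active)
        if lb > best_lb:
            best_lb, best_active = lb, active.copy()
    return best_active, best_lb
```

Removing links cumulatively in ascending order of their norms reproduces the index sets $I$ of Algorithm 2 while calling the (expensive) VI routine only $p+m$ times.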
4.6 Theoretical analysis of the algorithm

Since the lower bound depends on four variables $\Sigma$, $\mu$, $b_\lambda$ and $c$ (the other variables are either constant or determined by these four), let $\theta = [\Sigma, \mu, b_\lambda, c]$. Define $\{\theta^k\}$ as the iterates generated by coordinate descent. It is known that convergence of the lower bound is guaranteed if all iterates are calculated exactly, without approximations [1, 16]. This section shows that the proposed algorithm still converges even with the stochastic approximations (MCMC) used in our framework, and that it effectively imposes sparse topologies. In what follows, $\{\theta^k\}$ denote the iterates without approximations, whereas $\{\theta^k_{N_1, \cdots, N_{k-1}}\}$ are those produced by the proposed algorithm, where $N_k$ is the number of samples generated by MCMC in the $k$th iteration.

Proposition 4 Define sequences $\{\Sigma_M\}$, $\{\mu_M\}$, $\{[b_\lambda]_M\}$, $\{c_M\}$, $\{q_M(\beta; \Sigma_M, \mu_M, [b_\lambda]_M, c_M)\}$ and the limit $q(\beta; \Sigma, \mu, b_\lambda, c)$. If the sequences are such that $\lim_{M\to\infty}\Sigma_M = \Sigma$, $\lim_{M\to\infty}\mu_M = \mu$ and $\lim_{M\to\infty}[b_\lambda]_M = b_\lambda$, then the normalization constants $c$ and $\{c_M\}$ are well defined and $\lim_{M\to\infty} c_M = c$. Assuming the Markov chains $\{\beta^k_M\}$ sampled from $q_M(\beta)$ are Harris recurrent, we have $\lim_{M,N\to\infty}\frac{1}{N}\sum_{k=1}^{N} V(\beta^k_M) = E_q[V(\beta)]$ with probability 1 for any bounded continuous function $V(\cdot)$.

Proof: Without loss of generality, consider the distribution $q(\beta)$ of (33) for a single experiment:

$$q(\beta) = c\,|K|^{-\frac{1}{2}}\exp\Big\{-\frac{1}{2}E_\lambda(\lambda)\,\mathrm{trace}\big[K^{-1}E_{w,\sigma}(\sigma ww')\big]\Big\}. \quad (42)$$

By using the decomposition (39) of matrix $K$, this expression can be expanded as:

$$
q(\beta; c, \alpha) = c\, p(\beta; \alpha), \qquad p(\beta; \alpha) = \prod_{j=1}^{T} W_{jj}^{\frac{1}{2}}\exp\Big\{-\frac{1}{2}\alpha_j W_{jj}\Big\}, \qquad c^{-1} = \int_0^1 p(\beta)\, d\beta, \qquad
W_{jj} = \begin{cases}\dfrac{1}{\beta^{j}(1-\beta)} & j < T, \\[4pt] \dfrac{1}{\beta^{T}} & j = T, \end{cases} \quad (43)
$$

where $\alpha > 0$ is a function of $\Sigma$, $\mu$ and $b_\lambda$. According to (28) and (30), and under the conditions of the proposition, we have $\lim_{M\to\infty}\alpha_M = \alpha$. Consider the following function $f(x)$ with $x \in [0, +\infty)$ and $\alpha > 0$:

$$f(x) = x^{\frac{1}{2}}\exp\Big\{-\frac{1}{2}\alpha x\Big\}. \quad (44)$$

Clearly, $f(x)$ is non-negative and attains its maximum at $x^\star = \frac{1}{\alpha}$ with $f(x^\star) = \frac{1}{\sqrt{e\alpha}}$. Hence $p(\beta; \alpha)$ is also bounded. Since $p(\beta; \alpha)$ is continuous, it is Riemann integrable on $[0,1]$ and hence Lebesgue integrable. Since $\lim_{M\to\infty}\alpha_M = \alpha$, we have $\lim_{M\to\infty} p_M(\beta; \alpha_M) = p(\beta; \alpha)$ pointwise. As $\{\alpha_M\}$ converges, there exist $\underline{\alpha}$ and $m_\alpha$ such that $\alpha_M \geq \underline{\alpha} > 0$ for $M > m_\alpha$. As a result, $p_M(\beta; \alpha_M) \leq p(\beta; \underline{\alpha})$ for $\beta \in [0,1]$, $M > m_\alpha$. Since $p(\beta; \underline{\alpha})$ is Lebesgue integrable, $\lim_{M\to\infty}\int_0^1 p_M(\beta; \alpha_M)\, d\beta = \int_0^1 p(\beta; \alpha)\, d\beta$ by the dominated convergence theorem. Eventually, $c$ and $c_M$ are well defined, and $\lim_{M\to\infty} c_M = c$. It is then clear that $\lim_{M\to\infty} q_M(\beta; c_M, \alpha_M) = q(\beta; c, \alpha)$ pointwise. By Scheffé's theorem, the random variables $\beta_M$ (distributed as $q_M(\beta)$) converge to $\beta$ in distribution. Therefore, $\lim_{M\to\infty} E_{q_M}[V(\beta_M)] = E_q[V(\beta)]$. With Harris recurrent Markov chains, we have $\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} V(\beta^k_M) = E_{q_M}[V(\beta_M)]$ with probability 1. Assuming the MCMC proposals are well designed so that the convergence is uniform in $M$, we have $\lim_{M,N\to\infty}\frac{1}{N}\sum_{k=1}^{N} V(\beta^k_M) = E_q[V(\beta)]$ by the theory of double sequences.

With Proposition 4, we are able to prove the convergence of the algorithm.

Proposition 5 The lower bound $\mathcal{L}(\theta^k_{N_1, \cdots, N_{k-1}})$ converges to its maximum $\mathcal{L}^\star = \max \mathcal{L}(\theta)$ with probability 1 as the number of samples ($N_k$) and iterations ($k$) approach infinity.

Proof: Since $\mathcal{L}[Q(\theta)]$ is convex with respect to each factor $q_i(\theta_i)$, convergence of the lower bound is guaranteed by the coordinate descent method if the updates are calculated exactly [1, 36]. Hence $\lim_{k\to\infty}\mathcal{L}(\theta^k) = \mathcal{L}^\star$, where $\theta^k_j = \arg\min \mathcal{L}(\theta_j; \theta^k_{i \neq j})$ (the parameters are updated in sequence). However, in this paper some components $\theta^k_j$ are intractable and are instead estimated by sampling. The objective is to show that $\lim_{N_1, \cdots, N_{k-1}\to\infty}\theta^k_{N_1, \cdots, N_{k-1}} = \theta^k$ for all $k$. The proof proceeds by mathematical induction. Assuming $\lim_{N_1, \cdots, N_{k-2}\to\infty}\theta^{k-1}_{N_1, \cdots, N_{k-2}} = \theta^{k-1}$ holds at iteration $k-1$, then $\lim_{N_1, \cdots, N_{k-1}\to\infty}\theta^k_{N_1, \cdots, N_{k-1}} = \theta^k$ also holds at iteration $k$, according to Proposition 4. In addition, we have $\lim_{N_1\to\infty}\theta^2_{N_1} = \theta^2$ with probability 1 after the initialization $\theta^1$. Hence $\lim_{N_1, \cdots, N_{k-1}\to\infty}\theta^k_{N_1, \cdots, N_{k-1}} = \theta^k$ for all $k$. Since the lower bound (34) is continuous with respect to all parameters, we have $\lim_{k\to\infty}\lim_{N_1, \cdots, N_{k-1}\to\infty}\mathcal{L}(\theta^k_{N_1, \cdots, N_{k-1}}) = \mathcal{L}^\star$.

Finally, we argue that the proposed algorithm promotes sparse topologies and is at least competitive with empirical Bayes [9, 36].

Remark 6 The proposed algorithm imposes sparse topologies and is at least as efficient as empirical Bayes. For any fixed $a_\sigma$, $b_\sigma$ and $q(\beta)$, the proposed algorithm only needs to update $E(\lambda)$, which determines the sparsity of the topology since $w_i \to 0$ as $E^{-1}(\lambda_i) \to 0$. Consider the following optimization problem derived from empirical Bayes [36]:

$$\arg\min_\gamma \hat{Y}'(\sigma I + \hat{\Phi}\Gamma\hat{K}\hat{\Phi}')^{-1}\hat{Y} + \ln|\sigma I + \hat{\Phi}\Gamma\hat{K}\hat{\Phi}'|, \quad (45)$$

where $\sigma = (\frac{a_\sigma}{b_\sigma})^{-\frac{1}{2}}$, $\hat{\Phi} = (\frac{a_\sigma}{b_\sigma})^{-\frac{1}{4}}\Phi$, $\hat{Y} = (\frac{a_\sigma}{b_\sigma})^{\frac{1}{4}}Y$ and $\hat{K} = [E(K^{-1})]^{-1}$. $\Gamma$ is a block diagonal matrix whose zero structure determines the sparsity of the topology. By using the expectation-maximization (EM) method to solve this problem, the resulting update of $\Gamma$ is exactly the same as $E^{-1}(\lambda)$ in the proposed algorithm.

5 SIMULATION

We conducted Monte Carlo simulations and compared our method with another kernel-based system identification approach. The state-of-the-art method applies empirical Bayes to estimate hyperparameters and to detect the network topology, where the DC (Kernel DC), SS (Kernel SS) and TC (Kernel TC) kernels were all used for inference. KEB solves the following optimization problem; more details can be found in [9]:

$$\arg\min_{\sigma, \gamma, \beta} Y'(\sigma I + \Phi\Gamma K\Phi')^{-1}Y + \ln|\sigma I + \Phi\Gamma K\Phi'|, \quad (46)$$

where $\Gamma$ is equivalent to $\Lambda^{-1}$ in our framework and $K$ is the covariance matrix constructed by the kernel functions. If $\gamma_i$ is 0, the $i$th node does not control the target node. To select the inferred links, a similar backward selection method is used [9].

The DSF networks used for testing were generated randomly with diverse topologies (including an extremely sparse type, the ring structure), simulated under various noise levels, and inferred using time series of different lengths.

Two criteria are applied to evaluate the performance of the algorithms, namely the True Positive Rate (TPR) and the Precision (PREC). TPR is the percentage of the true links of the ground truth that are successfully inferred; it reflects the information richness of the inference result. PREC is the ratio of correct links over all inferred links; it indicates the reliability, or accuracy, of the inferred network.
Hence, ensuring a high PREC is the first priority in real applications: if, for example, PREC is below 50%, one cannot tell which inferred links are correct. To investigate whether the inferred networks have internal dynamics consistent with the ground truths, the estimated models were simulated to predict a validation dataset that was not used for inference. The prediction accuracy is measured by the following metric:

$$\mathrm{fitness} = 100\Big(1 - \frac{\|y - \hat{y}\|}{\|y - \bar{y}\|}\Big), \quad (47)$$

where $y$ are the validation data of a given node, $\hat{y}$ is the predicted output and $\bar{y}$ is the mean of the validation data. The fitness of all nodes was averaged to draw conclusions.
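The metric (47) is straightforward to compute; a small sketch with toy data (the signals are assumptions):

```python
import numpy as np

def fitness(y, y_hat):
    """Prediction fitness (47): 100 * (1 - ||y - y_hat|| / ||y - mean(y)||)."""
    return 100.0 * (1.0 - np.linalg.norm(y - y_hat)
                    / np.linalg.norm(y - y.mean()))

# Toy usage (assumption): a near-perfect and a mean-level prediction.
y = np.sin(np.linspace(0, 10, 200))
print(round(fitness(y, y + 0.01), 1))                    # close to 100
print(round(fitness(y, np.full_like(y, y.mean())), 1))   # 0.0
```

A fitness of 100 means a perfect prediction, while 0 means the model predicts no better than the validation mean; negative values are possible for predictions worse than the mean.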
5.1 Random DSF networks

100 networks were generated with random topologies and internal dynamics. All networks contained 15 nodes in total. Each node was independently driven by an input that was known and by process noise that was assumed unknown during inference. DSF models were produced from state-space models (9). Specifically, a sparse stable matrix $A \in \mathbb{R}^{15\times 15}$ was first generated randomly using the function sprandn(n, n, density) in Matlab. A brute-force strategy was applied to guarantee that matrix $A$ was Schur stable (that is, no eigenvalue was outside the unit circle of the complex plane) and that no isolated nodes existed in the network.
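A sketch of that generation step in Python follows (the paper's setup uses Matlab's sprandn; the entry scale and density below are assumptions chosen so that rejections in the brute-force loop are rare):

```python
import numpy as np
from scipy.sparse import random as sprandom

def random_sparse_schur(n=15, density=0.15, scale=0.4, seed=None):
    """Brute-force generation of a sparse, Schur-stable A (spectral radius
    < 1) with no isolated node, mirroring the sprandn-based setup."""
    rng = np.random.default_rng(seed)
    while True:
        A = sprandom(n, n, density=density, random_state=rng,
                     data_rvs=lambda k: scale * rng.standard_normal(k)).toarray()
        stable = np.abs(np.linalg.eigvals(A)).max() < 1
        no_isolated = ((np.abs(A).sum(0) + np.abs(A).sum(1)) > 0).all()
        if stable and no_isolated:
            return A

A = random_sparse_schur(seed=4)
print(np.abs(np.linalg.eigvals(A)).max())   # < 1
```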
To simulate the models, the inputs and the process noise were both i.i.d. white Gaussian signals. The variance of the inputs was fixed to 1 whilst that of the process noise varied. The signal-to-noise ratio is defined as $SNR = 10\log\frac{\sigma_u}{\sigma_e}$, where $\sigma_u$ and $\sigma_e$ are the signal variances of the inputs and the noise, respectively. The first 10 states of the models were measured, leaving the remaining 5 as hidden nodes. Figure 1 displays one example of the resulting networks.

Fig. 1. The structure of a randomly generated network. Solid lines with arrows represent links; red circles denote nodes.

The length of the impulse responses after truncation was set to 20. The data for inference were collected with different lengths. The average TPR and PREC over 100 trials are recorded in Tables 1-3.

Table 1. Inference of random networks with no noise (PREC / TPR, %).

| Method    | 45 points   | 65 points   | 85 points   |
|-----------|-------------|-------------|-------------|
| Kernel DC | 94.3 / 56.9 | 97.7 / 73.1 | 97.9 / 85.2 |
| Kernel SS | 84.7 / 58.3 | 91.4 / 91.7 | 100 / 100   |
| Kernel TC | 43.3 / 16.5 | 71.1 / 23.4 | 98.3 / 42.9 |
| VI        | 100 / 75.0  | 99.7 / 98.5 | 100 / 100   |

Table 2. Inference of random networks with 10 dB SNR (PREC / TPR, %).

| Method    | 100 points  | 200 points  | 300 points  |
|-----------|-------------|-------------|-------------|
| Kernel DC | 95.6 / 77.1 | 98.6 / 84.2 | 96.5 / 86.9 |
| Kernel SS | 74.1 / 82.5 | 79.7 / 88.3 | 86.3 / 91.0 |
| Kernel TC | 92.5 / 46.5 | 100 / 66.2  | 99.6 / 75.7 |
| VI        | 100 / 68.2  | 99.7 / 81.3 | 100 / 86.4  |

Table 3. Inference of random networks with pure noise (no input; PREC / TPR, %).

| Method    | 300 points  | 500 points  | 1000 points |
|-----------|-------------|-------------|-------------|
| Kernel DC | 77.6 / 72.9 | 89.1 / 75.6 | 96.9 / 79.1 |
| Kernel SS | 64.3 / 69.6 | 81.7 / 76.1 | 88.0 / 77.1 |
| Kernel TC | 77.9 / 60.3 | 87.0 / 65.7 | 93.4 / 70.9 |
| VI        | 100 / 56.5  | 100 / 65.0  | 100 / 74.9  |

In the best-case scenario (no noise), VI outperforms all the other methods. In particular, almost all links inferred by VI are correct ($PREC \approx 100\%$), regardless of the number of data points. Meanwhile, VI is able to identify most true links of the ground truths; with 85 data points, VI recovers the network perfectly. In contrast, Kernel TC gives the weakest result: its PREC stays low unless more data points are available for inference. The poor performance of Kernel TC underlines the effectiveness of VI, which uses the same kernel function.

As the SNR decreases to 10 dB, all methods require more data points to counteract the interference of the process noise. Although the TPR of VI is slightly lower than that of Kernel SS, the PREC of VI is much higher than that of Kernel SS and close to 100%. Kernel TC also achieves accurate results; nevertheless, many true links are missed ($TPR < 80\%$).

It is remarkable that the networks inferred by VI are highly reliable ($PREC \approx 100\%$) even under the worst-case scenario (pure noise). The lack of data points only affects the TPR of VI whilst the PREC remains high. The gap in TPR between Kernel SS, Kernel DC and VI decreases gradually as more data points become available. As in the cases above, Kernel TC is outperformed by VI.

Based on the simulations, VI presents clear advantages in inference accuracy over the other methods. Almost all links inferred by VI are correct, regardless of the noise level and the number of data points. Generally, the TPR of VI is slightly lower than that of Kernel SS. However, VI actually missed only a few more true links than Kernel SS, since the target networks were sparse: on average, there were 18.25 links per network in the simulations, and according to the results VI missed at most 2.7 links more than Kernel SS.

The validation result is shown by the box plots in Figure 2.

Fig. 2. Prediction of randomly generated networks: prediction fitness versus number of data points for the no-noise, SNR = 10 dB and pure-noise cases (methods: Kernel DC, Kernel SS, Kernel TC, VI).

With negligible process noise, VI clearly outperforms all the other methods: the prediction fitness of VI is above 80% and reaches approximately 100% given enough data. As the SNR decreases to 10 dB, all methods except Kernel TC show similar performance. When the process noise overwhelms the inputs, the prediction fitness of all methods drops to below 30%; in this case, the performance of VI is slightly better than the others. Note that Kernel TC gives the weakest result under all noise levels, implying that VI outperforms KEB at least with the TC kernel.

5.2 Ring networks

100 networks with the fixed ring structure shown in Figure 3 were generated and simulated following the same protocol as in the previous section. Each node was driven by independent process noise. Only one input entered the network, through a single node.

Fig. 3. A network with the ring structure; the symbol '~' marks where the input signal enters.

Since the network contains a feedback loop and is extremely sparse, it is more challenging to infer. Table 4 presents the inference results.

Table 4. Inference of ring networks with 10 dB SNR (PREC / TPR, %).

| Method    | 100 points  | 200 points  | 300 points  | 400 points  |
|-----------|-------------|-------------|-------------|-------------|
| Kernel DC | 51.0 / 75.0 | 75.8 / 81.0 | 86.2 / 84.0 | 93.8 / 84.5 |
| Kernel SS | 42.2 / 76.5 | 64.2 / 83.5 | 67.8 / 86.5 | 74.6 / 88.5 |
| Kernel TC | 76.9 / 19.3 | 99.3 / 32.5 | 97.3 / 55.0 | 98.5 / 68.5 |
| VI        | 100 / 27.5  | 100 / 66.0  | 100 / 73.5  | 100 / 80.0  |

VI still produces reliable networks with the highest PREC among all methods ($PREC = 100\%$). More importantly, the PREC of the other three methods depends strongly on the number of data points, whilst that of VI does not. The TPR of VI and Kernel SS are very close; since the ring network contained only 10 links, VI missed at most 3 true links. The simulations indicate that VI provides reliable inference results and identifies most true links of the ground truths even if the target networks are extremely sparse. More importantly, the performance of VI is robust ($PREC \approx 100\%$), which is crucial in real applications where the measurements may not be sufficient for inference.

The validation result is shown in Figure 4.

Fig. 4. Prediction of ring networks: prediction fitness at 10 dB SNR versus number of data points (methods: Kernel DC, Kernel SS, Kernel TC, VI).

Since the ring network was mostly driven by process noise, the prediction fitness of all methods is below 30%. The prediction accuracy of VI is competitive with Kernel DC; Kernel TC gives the weakest result.

6 CONCLUSION AND DISCUSSION

This paper has applied variational inference to identify DSF models from measured time-series data. No prior knowledge of the hidden nodes, including their number and internal connectivity, is required. Both the system dynamics and the topology of sparse linear networks can be inferred accurately. The proposed method achieves this by incorporating kernel-based methods to promote system stability and by introducing VI to impose network sparsity. By decomposing the kernel function, the resulting algorithm becomes computationally efficient and robust. Monte Carlo simulations show that our method consistently produces reliable inference results regardless of challenging experimental conditions (for example, small numbers of data points, high levels of noise, and extremely sparse topologies). The inference of links is highly accurate (with nearly 100% confidence): only a few true links are missed. Therefore, the developed method appears highly reliable for real-world applications.

Overall, the value of this approach is that it outperforms KEB, at least with the TC kernel, and raises the reliability of the inference results to the highest level (100% PREC). In particular, our method is applicable to real-world networks where full-state measurements are normally unavailable, such as gene regulatory networks.

The performance of our method may be further improved using other kernel functions (e.g. DC and SS); however, the computational cost of doing so is heavy and the robustness of the algorithm is not guaranteed. Future developments should include two aspects. The first is to find an effective decomposition for other kernel functions so that they can be used in our framework. The second is to further improve the TPR of the inferred networks while maintaining high PREC. Considering that most real-world networks are nonlinear, it is also necessary to extend our framework to black-box nonlinear systems.

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[2] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[3] G. Bottegal, A. Y. Aravkin, H. Hjalmarsson, and G. Pillonetto. Robust EM kernel-based methods for linear system identification. Automatica, 67:114-126, 2016.
[4] F. P. Carli, T. Chen, and L. Ljung. Maximum entropy kernels for system identification. IEEE Transactions on Automatic Control, 62(3):1471-1477, 2017.
[5] T. Chen, M. S. Andersen, L. Ljung, A. Chiuso, and G. Pillonetto. System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques. Automatica, 59(11):1-33, 2014.
[6] T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto. Maximum entropy properties of discrete-time first-order stable spline kernel. Automatica, 66:34-38, 2016.
[7] T. Chen and L. Ljung. Implementation of algorithms for tuning parameters in regularized least squares problems in system identification. Automatica, 49(7):2213-2220, 2013.
[8] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica, 48(8):1525-1535, 2012.
[9] A. Chiuso and G. Pillonetto. A Bayesian approach to sparse dynamic network identification. Automatica, pages 1553-1565, 2012.
[10] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1-49, 2002.
[11] M. Darwish, P. Cox, G. Pillonetto, and R. Tóth. Bayesian identification of LPV Box-Jenkins models. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 66-71, 2015.
[12] F. Dinuzzo. Kernels for linear time invariant system identification. SIAM Journal on Control and Optimization, 53(5):3299-3317, 2015.
[13] J. Gonçalves and S. Warnick. Necessary and sufficient conditions for dynamical structure reconstruction of LTI networks. IEEE Transactions on Automatic Control, 53(7):1670-1674, 2008.
[14] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pages 2348-2356. Curran Associates, Inc., 2011.
[15] M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. arXiv e-prints, June 2012.
[16] W. R. Jacobs, T. Baldacchino, T. J. Dodd, and S. R. Anderson. Sparse Bayesian nonlinear system identification using variational inference. IEEE Transactions on Automatic Control, 2018.
[17] J. Jin, Y. Yuan, W. Pan, C. Tomlin, A. A. Webb, and J. Gonçalves. Identification of nonlinear sparse networks using sparse Bayesian learning. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 6481-6486, 2017.
[18] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. arXiv e-prints, January 2014.
[19] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[20] W. Pan, Y. Yuan, J. Gonçalves, and G. Stan. A sparse Bayesian approach to the identification of nonlinear state-space systems. IEEE Transactions on Automatic Control, 61(1):182-187, 2016.
[21] G. Pillonetto and A. Chiuso. Tuning complexity in regularized kernel-based regression and linear system identification: The robustness of the marginal likelihood estimator. Automatica, 58:106-117, 2015.
[22] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: A nonparametric Gaussian regression approach. Automatica, 47(2):291-305, 2011.
[23] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3):657-682, 2014.
[24] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81-93, 2010.
[25] G. Pillonetto, M. H. Quang, and A. Chiuso. A new kernel-based approach for nonlinear system identification. IEEE Transactions on Automatic Control, 56(12):2825-2840, 2011.
[26] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[27] R. S. Risuleo. System Identification with Input Uncertainties: An EM Kernel-Based Approach. PhD thesis, KTH School of Electrical Engineering, 2016.
[28] R. S. Risuleo, G. Bottegal, and H. Hjalmarsson. Kernel-based system identification from noisy and incomplete input-output data. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 2061-2066, 2016.
[29] L. F. Shampine. Vectorized adaptive quadrature in MATLAB. Journal of Computational and Applied Mathematics, 211(2):131-140, 2008.
[30] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.
[31] M. K. Titsias and M. Lázaro-Gredilla. Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems 24, pages 2339-2347. Curran Associates, Inc., 2011.
[32] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[33] D. P. Wipf, B. D. Rao, and S. Nagarajan. Latent variable Bayesian models for promoting sparsity. IEEE Transactions on Information Theory, 57(9):6236-6255, 2011.
[34] Y. Yuan, A. Rai, E. Yeung, G. Stan, S. Warnick, and J. Gonçalves. A minimal realization technique for the dynamical structure function of a class of LTI systems. IEEE Transactions on Control of Network Systems, 4(2):301-311, 2017.
[35] Y. Yuan, G. Stan, S. Warnick, and J. Gonçalves. Robust dynamical network structure reconstruction. Automatica, 47:1230-1235, 2011.
[36] Z. Zhang and B. Rao. Extension of SBL algorithms for the recovery of block sparse signals with intra-block correlation. IEEE Transactions on Signal Processing, 61:2009-2015, 2013.
[37] M. Zorzi and A. Chiuso. Sparse plus low rank network identification: A nonparametric approach. Automatica, 76:355-366, 2016.