A Full Bayesian Approach to Sparse Network Inference using Heterogeneous Datasets
Authors: Junyang Jin, Ye Yuan, Jorge Goncalves
Abstract—Network inference has been attracting increasing attention in several fields, notably systems biology, control engineering and biomedicine. To develop a therapy, it is essential to understand the connectivity of biochemical units and the internal working mechanisms of the target network. A network is mainly characterized by its topology and internal dynamics. In particular, sparse topology and stable system dynamics are fundamental properties of many real-world networks. In recent years, kernel-based methods have been popular in the system identification community. By incorporating empirical Bayes, this framework, which we call KEB, is able to promote system stability and impose sparse network topology. Nevertheless, KEB may not be ideal for topology detection due to local optima and numerical errors. Here, therefore, we propose an alternative, data-driven method that is designed to greatly improve inference accuracy compared with KEB. The proposed method uses dynamical structure functions to describe networks so that the information of unmeasurable nodes is encoded in the model. A powerful numerical sampling method, namely reversible jump Markov chain Monte Carlo (RJMCMC), is applied to explore full Bayesian models effectively. Monte Carlo simulations indicate that our approach produces more accurate networks compared with KEB methods. Furthermore, simulations of a synthetic biological network demonstrate that the performance of the proposed method is superior to that of the state-of-the-art method, namely iCheMA. The implication is that the proposed method can be used in a wide range of applications, such as controller design, machinery fault diagnosis and therapy development.
Index Terms—System Identification, Reversible Jump Markov Chain Monte Carlo, Dynamical Structure Function, Network Inference, Sparse Networks.

I. INTRODUCTION

This paper is concerned with the network inference problem for both industrial and biological systems. In industry, communication systems are typically designed with a sparse and stable structure to reduce energy consumption and to ensure long-term operation. In biology, most networks are inherently stable, with biochemical species maintained at normal levels. Their internal connectivity is also sparse, enabling efficient working mechanisms. It follows that, for network inference, sparsity and stability are essential preliminary conditions.

In recent years, kernel-based non-parametric system identification methods have been prevalent. Such methods have several advantages in real-world applications. When identifying linear models, the estimation of model complexity is avoided. More importantly, by using proper kernel functions, kernel-based methods enforce system stability effectively. The kernel machine is associated with Gaussian processes [1], [2]. Under the Bayesian paradigm, empirical Bayes has been widely applied for the estimation of hyperparameters of kernel functions [3], [4]. As hyperparameters control the properties of system dynamics, this combined framework (KEB) greatly improves the estimation accuracy of input-output maps of target systems [5], [6].

Junyang Jin is with the Circadian Signal Transduction Group, Department of Plant Sciences, University of Cambridge. Ye Yuan is with the School of Automation, Huazhong University of Science and Technology. Jorge Gonçalves is with the Department of Engineering, University of Cambridge and the Luxembourg Centre for Systems Biomedicine. (Corresponding author: Ye Yuan.)
Kernel-based methods have been discussed in a wide range of contexts, including linear continuous/discrete time systems [7]–[9], and nonlinear NARX, NARMAX and NFI models [10]. Moreover, KEB is endowed with the mechanism of Automatic Relevance Determination (ARD) and so is able to promote sparse solutions. For example, KEB has been used to solve multi-kernel selection problems [11].

In network inference, a further crucial task is topology detection (selection of model structure). Under the framework of KEB, the sparsity profile of the ARD parameters determines the network topology [12], which requires accurate estimation of hyperparameters. KEB has been shown to be robust to locally optimal solutions: identified models can represent the system dynamics of ground truths quite well, even if only suboptimal solutions are achieved [13].

Hyperparameter estimation becomes more challenging, however, in the context of network inference. Where the target network is very sparse, a small estimation error in the zero structure of the hyperparameters can seriously degrade the reliability of inference. This can be caused by either local optima or numerical errors during implementation. Model selection strategies (e.g. backward selection [12]) may be applied as a remedy. In real-world applications, however, they have limitations: the confidence of inference cannot be evaluated, and the computational cost is greatly increased, especially for large-scale networks.

Monte Carlo (MC) techniques provide an alternative approximate inference [3]. They are stochastic approximations based on numerical sampling, rather than on approximating Bayesian models analytically as KEB does. Reversible jump Markov chain Monte Carlo (RJMCMC) is one of the MC approaches, originally developed for Bayesian model selection [14]. RJMCMC is able to draw samples from a distribution whose random variables are of varying dimension.
RJMCMC has been applied in many research fields, including optimization [15], [16], machine learning [17], [18], signal processing [19], and system identification [20], [21]. For a network, the dimension of the model parameters depends on the network topology: only true links are endowed with parameters. Therefore, RJMCMC represents a promising method for network inference.

This paper combines kernel-based methods and RJMCMC to infer sparse networks. Dynamical structure functions are used to describe networks so that the information of hidden nodes can be encoded via transfer functions. As a non-parametric method, the kernel machine is applied to impose stable impulse responses of networks. RJMCMC is adopted to explore the resulting Bayesian model, whose sample space consists of multiple subspaces of different dimensionality. By traversing these subspaces, RJMCMC provides a highly efficient way to infer system dynamics and to detect network topology. In particular, the effect of ARD is maximally activated by the merit of RJMCMC, thus encouraging sparse topologies. Monte Carlo simulations indicate that our method further improves inference accuracy compared with KEB. The performance improvement is greatest when inferring a synthetic biological network. Thus the contribution of our study lies in the provision of a method that is more reliable than KEB for real-world applications.

The paper is organized as follows. Section II overviews MH-within-PCG samplers and RJMCMC. Section III introduces the dynamical structure function and formulates the full Bayesian model. Section IV discusses network inference using RJMCMC. Section V compares the method with other approaches via Monte Carlo simulations. Finally, Section VI concludes and discusses further development in this field.

Notation: The notation in this paper is standard. $I_m$ denotes the $m \times m$ identity matrix. For $L \in \mathbb{R}^{n \times n}$, $\mathrm{diag}\{L\}$ denotes a vector which consists of the diagonal elements of matrix $L$.
$[L]_{ij}$ denotes the $ij$th entry of $L$ and $L(:, i:j)$ the columns from $i$ to $j$. $\mathrm{blkdiag}\{L_1, \ldots, L_n\}$ is a block diagonal matrix. For $l \in \mathbb{R}^n$, $\mathrm{diag}\{l\}$ denotes a diagonal matrix whose diagonal elements come from vector $l$. $[l]_{ij}$ denotes the $j$th element of the $i$th group of $l$. $l \geq 0$ means each element of the vector is non-negative. $y(t_1:t_2)$ denotes the row vector $[y(t_1)\ y(t_1+1)\ \cdots\ y(t_2)]$.

II. OVERVIEW OF MARKOV CHAIN MONTE CARLO

A. MH-within-PCG Sampler

Markov chain Monte Carlo (MCMC) is widely applied to draw samples from probabilistic models. The samples constitute a Markov chain that asymptotically distributes as the target distribution. The samples can be used to evaluate marginal distributions and to estimate expectations of random variables that cannot be calculated in closed form. Gibbs sampling and the Metropolis-Hastings (MH) method are two typical MCMC techniques [4], each with its own strengths and weaknesses. Gibbs samplers have a simple structure but require analytical conditional distributions. MH samplers can sample from a distribution known only up to a constant, but their design is more involved. In practice, Gibbs and MH samplers are often modified and combined to handle complex distributions. For example, blocked Gibbs samplers are a minor variant of traditional Gibbs samplers, which group two or more variables and sample from their joint conditional distributions so that a better convergence property is achieved [22]. Another example is single-site updating MH sampling, where only one component of the Markov state is updated at a time so that the proposal distributions can be simplified [23]. Gibbs and MH sampling can also be combined to build a more powerful and efficient sampler.
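As a concrete illustration of the MH mechanism described above (a minimal sketch of our own, not the sampler developed in this paper), consider a random-walk MH update for a one-dimensional target known only up to a normalizing constant; all function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log-density: here a standard Gaussian, known only up to a constant.
    return -0.5 * x * x

def random_walk_mh(n_iter=20000, step=1.0):
    """Random-walk Metropolis-Hastings: the proposal is symmetric, so the
    acceptance ratio reduces to the ratio of (unnormalized) target densities."""
    x = 0.0
    samples = np.empty(n_iter)
    for t in range(n_iter):
        x_prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x = x_prop          # accept the proposal
        samples[t] = x          # on rejection the current state is repeated
    return samples

samples = random_walk_mh()
```

The empirical mean and standard deviation of the chain approach 0 and 1, the moments of the target, which is the sense in which the chain "asymptotically distributes as the target distribution".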
If the conditional distributions of some sampling steps of a Gibbs sampler have no closed form, one can replace these steps with MH sampling schemes, leading to a hybrid sampler (MH-within-Gibbs) [24]. By marginalizing out certain random variables, the convergence property of a sampler can be further improved [25]. Modified Gibbs and MH-within-Gibbs samplers based on this principle are called the partially collapsed Gibbs sampler (PCG) and the Metropolis-Hastings within partially collapsed Gibbs sampler (MH-within-PCG), respectively [24].

Deducing a PCG from a Gibbs sampler is nontrivial because the full conditional distributions of the sampling steps cannot be marginalized directly; otherwise, the invariant distribution of the Markov chain may be changed. This issue was not sufficiently appreciated in much of the previous research. It has been shown that certain rules must be followed to preserve the invariant distribution [24]. To reduce the number of conditioned random variables, the steps called marginalization, permutation and trimming are executed in sequence [25]. Marginalization moves components of the Markov state from being conditioned on to being sampled: for example, one can safely replace sampling from $p(x \mid y, z)$ with sampling from $p(x, y \mid z)$. Permutation switches the order of sampling steps. Finally, trimming discards a subset of components that are not conditioned on in the next step. For instance, if the sampling step of $p(x, y \mid z)$ is followed by that of $p(y \mid x, z)$, then $p(x \mid z)$ can be sampled instead. Nevertheless, $p(y \mid z)$ is not a valid replacement, since $x$ is conditioned on in the next step.

It is important to realize that the sampling steps of a PCG sampler cannot be replaced by their MH counterparts directly. An MH-within-PCG sampler must be derived from the original MH-within-Gibbs sampler following similar rules to those of PCG [24].
The key point is that a full MH step of an MH-within-Gibbs sampler can be replaced by a reduced MH step only if a direct draw from the conditional distribution of the reduced quantities follows immediately [24]. For instance, the MH step to sample from $p(x \mid y, z)$ in an MH-within-Gibbs sampler can be replaced by the reduced MH step to sample from $p(x \mid z)$, followed immediately by sampling from $p(y \mid x, z)$. The sampling step of $p(y \mid x, z)$ may or may not be trimmed, depending on the next step.

B. Reversible Jump Markov Chain Monte Carlo

Traditional MCMC is used to draw samples from a distribution of random variables whose dimension is fixed. There are cases, however, where the dimension of the random variables varies. To explore such a distribution, MCMC samplers must be able to jump between parameter subspaces of different dimensionality. Reversible jump Markov chain Monte Carlo (RJMCMC) was designed for this purpose [26], [27].

For a countable collection of Bayesian models $\{\mathcal{M}_k, k \in \mathbb{Z}^+\}$, each model is characterized by a parameter vector $\theta_k \in \mathbb{R}^{d_k}$, where the dimension $d_k$ may differ from model to model. The random variable to be sampled is $x = (k, \theta_k)$, which lies in the subspace $\mathcal{S}_k = \{k\} \times \mathbb{R}^{d_k}$ given $k$. Hence, the entire parameter space is $\mathcal{S} = \bigcup_{k \in \mathbb{Z}^+} \mathcal{S}_k$.

Suppose $p(x)$ is the probability density function of interest. To draw samples from $p(x)$, a reversible Markov chain $\{X^t, t \in \mathbb{Z}^+\}$ is produced with $p(x)$ as the invariant distribution. Each Markov state $X^t$ consists of two components, $k^t$ and $\theta^t_k$, where $k^t$ is the model index and $\theta^t_k$ is the corresponding unknown model parameter. To traverse the parameter space $\mathcal{S}$, different types of moves are proposed, among which only one move is executed per iteration. These moves are selected randomly. Proposal distributions are carefully designed so that 'detailed balance' is achieved for each move type.
The resulting transition distribution of the Markov chain is the mixture of those of all moves; consequently, the invariant distribution is preserved.

Let $(k, \theta)$ be the current state $X^t$ of the Markov chain, where $\theta \in \mathbb{R}^{d_k}$. Based on the proposed moves, the probability to jump from the current model $k$ to the next one $k'$ is $p_{kk'}$, where $\sum_{k'} p_{kk'} = 1$. If $k' = k$, only the model parameters are updated in the next state. In addition, it is possible that not all models can be reached from the current state, depending on the moves available. Given the proposed $k'$ with probability $p_{kk'}$, $\theta' \in \mathbb{R}^{d_{k'}}$ is generated as the proposal for the model parameter. One way to generate $\theta'$ is to first produce a random quantity $U$ with probability density $q_{kk'}(u \mid \theta)$ and then map $\theta$ and $U$ to $\mathbb{R}^{d_{k'}}$. As a result, $\theta' = g^1_{kk'}(\theta, U)$, where $U \in \mathbb{R}^{d_{kk'}}$ and $g^1_{kk'}: \mathbb{R}^{d_k + d_{kk'}} \to \mathbb{R}^{d_{k'}}$ is a deterministic map [27]. The proposal $X^{prop} = (k', \theta')$ is then accepted with probability $A_{kk'}(\theta' \mid \theta)$. If accepted, $X^{t+1} = X^{prop}$; if not, $X^{t+1} = X^t$.

For the move from $(k, \theta)$ to $(k', \theta')$ and the reverse move from $(k', \theta')$ to $(k, \theta)$, the corresponding proposals $(\theta, U)$ and $(\theta', U')$ must have equal dimension. This is called dimension matching: $d_k + d_{kk'} = d_{k'} + d_{k'k}$ [14]. In addition, there must exist a deterministic map $g^2_{kk'}: \mathbb{R}^{d_k + d_{kk'}} \to \mathbb{R}^{d_{k'k}}$ such that $(\theta', U') = g_{kk'}(\theta, U) = (g^1_{kk'}(\theta, U), g^2_{kk'}(\theta, U))$, where the map $g_{kk'}$ is bijective and differentiable [27]. Finally, to achieve 'detailed balance', the following equation must be satisfied [27]:

$$\pi(k, \theta)\, p_{kk'}\, q_{kk'}(U \mid \theta)\, A_{kk'} = \pi(k', \theta')\, p_{k'k}\, q_{k'k}(U' \mid \theta')\, A_{k'k} \left| \frac{\partial g_{kk'}(\theta, U)}{\partial (\theta, U)} \right|, \quad (1)$$

where

$$\theta' = g^1_{kk'}(\theta, U), \qquad U' = g^2_{kk'}(\theta, U). \quad (2)$$

As a result, the acceptance probability equates to:

$$A_{kk'}(\theta' \mid \theta) = \min\left\{ 1,\; \frac{\pi(k', \theta')\, p_{k'k}\, q_{k'k}(U' \mid \theta')}{\pi(k, \theta)\, p_{kk'}\, q_{kk'}(U \mid \theta)} \left| \frac{\partial g_{kk'}(\theta, U)}{\partial (\theta, U)} \right| \right\}. \quad (3)$$

III. MODEL SPECIFICATION

A. The dynamical structure function

We consider a network of $p$ measurable nodes, whose number of hidden nodes is unknown. The network can be described by a dynamical structure function (DSF) as follows [28], [29]:

$$Y = Q(q; \theta) Y + P(q; \theta) U + H(q; \theta) E, \quad (4)$$

where $q$ denotes the time shift operator ($y(t+1) = q y(t)$), $Y \in \mathbb{R}^p$ are the measurable nodes, $U \in \mathbb{R}^m$ are the inputs, $E \in \mathbb{R}^q$ is i.i.d. Gaussian noise, and $\theta$ are the model parameters. $Q$, $P$ and $H$ are transfer matrices, each element of which is a transfer function, indicating that the network is a causal system. Matrix $Q$ encodes the connectivity among observable nodes; its transfer functions are strictly proper and its diagonal elements are zero. Matrices $P$ and $H$ relate inputs and process noise to nodes, respectively. The transfer functions of matrix $P$ are strictly proper, whilst those of matrix $H$ are proper. The topology of the network (i.e. the model structure) is reflected by the zero structure of these three matrices. For example, if $[Q]_{ij}$ is zero, the $j$th node does not control the $i$th node. $\mathcal{M}_k$ denotes a model structure and $M_k$ the corresponding number of links. In particular, $\mathcal{M}_1$ represents the fully-connected topology. The internal dynamics of the network are described by the transfer functions, whose order is unknown and related to the number of hidden states and their internal connectivity.

The input-output map of the network is deduced from the DSF as follows:

$$Y = G_u U + G_e E, \quad (5)$$

where $G_u = (I - Q)^{-1} P$ and $G_e = (I - Q)^{-1} H$. Identifiability of the network depends on whether the input-output map is associated with a unique DSF.
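To make the dimension-matching recipe concrete, the following is a deliberately small illustration of our own (not from this paper): data with unknown mean, where model $k = 0$ fixes $\mu = 0$ (no parameters) and model $k = 1$ has a free mean $\mu$. The birth move draws $U = \mu'$ from a proposal $q$ and uses the identity map, so the Jacobian in (3) is 1. The prior $\mu \sim \mathcal{N}(0, 3^2)$, the proposal $\mathcal{N}(0, 0.5^2)$ and the equal model priors are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.linspace(-1.0, 1.0, 50)    # toy data; the sample mean is exactly 0

def norm_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2

def log_lik(mu):
    return norm_logpdf(y, mu, 1.0).sum()

q_mean, q_sd = 0.0, 0.5           # birth proposal q(u) for the new parameter

k, mu = 0, None                   # start in the parameter-free model
counts = [0, 0]
for _ in range(20000):
    if k == 0:
        # Birth: U = mu' ~ q, identity map (Jacobian 1), equal model priors.
        mu_p = rng.normal(q_mean, q_sd)
        log_a = (log_lik(mu_p) + norm_logpdf(mu_p, 0.0, 3.0)
                 - log_lik(0.0) - norm_logpdf(mu_p, q_mean, q_sd))
        if np.log(rng.uniform()) < log_a:
            k, mu = 1, mu_p
    else:
        # Death: the exact reverse of the birth move.
        log_a = (log_lik(0.0) + norm_logpdf(mu, q_mean, q_sd)
                 - log_lik(mu) - norm_logpdf(mu, 0.0, 3.0))
        if np.log(rng.uniform()) < log_a:
            k, mu = 0, None
    counts[k] += 1
```

Because the data carry no evidence for a nonzero mean, the chain spends most of its time in the smaller model, which is exactly the Occam-style pruning effect that the paper exploits for topology detection.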
To ensure the inference problem is well-posed, additional constraints are imposed on the structure of the transfer matrices.

Proposition 1 (Identifiability of DSF networks [30]): Given a $p \times (m + q)$ transfer matrix $G = [G_u, G_e]$, the corresponding DSF is unique if and only if $p - 1$ elements in each column of $[Q, P, H]'$ are known, which uniquely specifies the component of $(Q, P, H)$ in the null space of $[G', I]$.

A sufficient condition for network identifiability is that matrix $H$ is diagonal, so that $p - 1$ elements in each column of $[Q, P, H]'$ are known to be zero. In what follows, we make the following assumptions so that no prior knowledge of matrix $P$ is required to guarantee network identifiability.

Assumption 1: Noise matrix $H$ is diagonal, monic ($\lim_{q \to \infty} H = I$) and minimum phase.

The target networks we consider are sparse and stable. Hence, we make a further assumption regarding network properties.

Assumption 2: The transfer matrices $Q$ and $P$ are stable and sparse.

B. The likelihood distribution

After simple manipulations, the DSF in (4) can be reformulated as:

$$Y = F_y(q; \theta) Y + F_u(q; \theta) U + E, \quad (6)$$

where

$$F_u(q; \theta) = H^{-1} P, \qquad F_y(q; \theta) = I - H^{-1}(I - Q). \quad (7)$$

According to the assumptions, the transfer matrices $F_u$ and $F_y$ are also stable. More importantly, since $H$ is diagonal, $F_u$ and $F_y$ have the same zero structure as $P$ and $Q$, respectively.

Identifying the transfer functions of model (6) is non-trivial. Since the number of hidden states is unknown, estimating the order of the transfer functions requires an exhaustive search of all possibilities, which is computationally prohibitive for large-scale networks. Additionally, imposing stable transfer matrices is problematic. To simplify the identification problem, we express model (6) in a non-parametric way. By doing so, the selection of model complexity is avoided and, more importantly, system stability can be promoted effectively.
The dynamical system for the $i$th target node is formulated below:

$$y_i(t) = \sum_{j=1}^{p} \sum_{k=1}^{\infty} h^y_{ij}(k)\, y_j(t-k) + \sum_{j=1}^{m} \sum_{k=1}^{\infty} h^u_{ij}(k)\, u_j(t-k) + e_i(t), \quad (8)$$

where $h^y_{ij}$ and $h^u_{ij}$ are the impulse responses of the transfer functions $[F_y]_{ij}$ and $[F_u]_{ij}$, respectively, and $e_i(t)$ is i.i.d. Gaussian noise. The objective is to estimate the impulse responses.

For implementation purposes, the impulse responses are truncated after sample time $T$. $T$ is set sufficiently large in order to capture the major dynamics of the impulse responses (i.e. $|h(k)| \approx 0$ for $k \geq T$). Assume the availability of time-series data collected at discrete time indices $1$ to $N$ for each node and input. For the $i$th target node with $\mathcal{M}_1$ and a single experiment, we define the following matrices and vectors (for other possible model structures $\mathcal{M}_k$, the corresponding terms are defined in the same way):

$$Y = \begin{bmatrix} y_i(N) \\ \vdots \\ y_i(T+1) \end{bmatrix}, \quad W = \begin{bmatrix} w_1 \\ \vdots \\ w_{p+m} \end{bmatrix}, \quad \Phi = \begin{bmatrix} \Phi^y & \Phi^u \end{bmatrix},$$

$$\Phi^y = \begin{bmatrix} y_1(N-1:N-T) & \cdots & y_p(N-1:N-T) \\ \vdots & \ddots & \vdots \\ y_1(T:1) & \cdots & y_p(T:1) \end{bmatrix}, \quad \Phi^u = \begin{bmatrix} u_1(N-1:N-T) & \cdots & u_m(N-1:N-T) \\ \vdots & \ddots & \vdots \\ u_1(T:1) & \cdots & u_m(T:1) \end{bmatrix},$$

$$\sigma = E\{e_i(t)^2\}, \quad (9)$$

where $Y \in \mathbb{R}^{N-T}$ is the time series of the $i$th node, $W \in \mathbb{R}^{T(p+m)}$ contains $p + m$ groups of impulse responses, each of which corresponds to a transfer function of $F_y$ or $F_u$, $\Phi \in \mathbb{R}^{(N-T) \times T(p+m)}$ includes the time series of all nodes and inputs, and $\sigma$ is the noise variance. Note that the dimension of these quantities varies with the model structure. For example, if node $j$ does not control node $i$, then $w_j$ and $\Phi(:, T(j-1):Tj)$ must be removed from the corresponding vector and matrix. As a result, for model structure $\mathcal{M}_k$, $Y \in \mathbb{R}^{N-T}$, $W \in \mathbb{R}^{T M_k}$ and $\Phi \in \mathbb{R}^{(N-T) \times T M_k}$.
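The regressor matrix in (9) is simply stacked lags of the measured series. A small sketch of one $\Phi$ block (function name and the 0-based storage convention, where array position $k-1$ holds the sample at time $k$, are ours):

```python
import numpy as np

def lag_block(signal, N, T):
    """One block of Phi in (9): row r corresponds to output time t = N - r and
    holds the T past samples [signal(t-1), ..., signal(t-T)].
    `signal` stores the series so that signal[k-1] is the sample at time k."""
    return np.vstack([signal[t - T - 1 : t - 1][::-1] for t in range(N, T, -1)])

# With signal(k) = k the row for t = N must read [N-1, N-2, ..., N-T].
sig = np.arange(1.0, 11.0)          # signal(k) = k for k = 1..10, so N = 10
Phi_block = lag_block(sig, N=10, T=3)
```

Stacking one such block per node and per input side by side yields $\Phi = [\Phi^y\ \Phi^u]$, and deleting the block of a candidate link is exactly the dimension change that the model structure $\mathcal{M}_k$ induces.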
Based on Bayes' rule, the likelihood distribution of the $i$th target node with $\mathcal{M}_k$ is:

$$p(Y \mid W, \sigma, D, \mathcal{M}_k) = (2\pi\sigma)^{-\frac{N-T}{2}} \exp\left\{ -\frac{1}{2\sigma} \| Y - \Phi W \|_2^2 \right\}, \quad (10)$$

where $D$ denotes the measurements of the other nodes and inputs. To simplify the notation, $D$ is suppressed in the following discussion.

C. The prior distributions

The full Bayesian treatment deploys prior distributions for each random quantity to build up a hierarchical structure. The prior distributions reflect prior knowledge and assumptions about the networks.

Under the Bayesian paradigm, impulse responses are assumed to be independent Gaussian processes [12]. To impose stable impulse responses, the covariance function (kernel function) must be chosen carefully. It has been shown that the Tuned/Correlated kernel (TC), the Diagonal/Correlated kernel (DC) and the second-order stable spline kernel (SS) are all capable of characterizing a reproducing kernel Hilbert space (RKHS) of stable impulse responses. They have been frequently applied in the system identification community [7], [31]. Hence, all three kernel functions are considered in our framework. The prior distribution for $W$ is:

$$p(W \mid \lambda, \beta, \mathcal{M}_k) = \prod_{i=1}^{M_k} \mathcal{N}(w_i \mid 0, \lambda_i K_i), \quad (11)$$

where $K_i \in \mathbb{R}^{T \times T}$, $\lambda = [\lambda_1, \ldots, \lambda_{M_k}]'$ and $\beta = [\beta_1, \ldots, \beta_{M_k}]'$. Note that for the DC kernel, $\beta_i$ is a row vector consisting of two elements, whilst it is a scalar for the TC and SS kernels. With $[K_i]_{ts} = k(t, s; \beta_i)$ and $\lambda_i \geq 0$:

$$k_{TC}(t, s; \beta_i) = \beta_i^{\max(t,s)}, \quad \beta_i \in (0, 1),$$
$$k_{DC}(t, s; \beta_i) = \beta_{i1}^{\frac{t+s}{2}} \beta_{i2}^{|t-s|}, \quad \beta_{i1} \in (0, 1),\ \beta_{i2} \in (-1, 1),$$
$$k_{SS}(t, s; \beta_i) = \frac{\beta_i^{t+s+\max(t,s)}}{2} - \frac{\beta_i^{3\max(t,s)}}{6}, \quad \beta_i \in (0, 1). \quad (12)$$

Here, $\beta$ are hyperparameters of the kernel functions, which control the exponential decay rate of the impulse responses [8]. $\lambda$ are scale variables of the kernel functions. They play the role of Automatic Relevance Determination (ARD) parameters that control sparsity [12].
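The three kernels in (12) are straightforward to evaluate. A sketch (function and argument names are ours):

```python
import numpy as np

def kernel_matrix(T, beta, kind="TC"):
    """T x T kernel matrix from (12), evaluated on the grid t, s = 1, ..., T."""
    t = np.arange(1, T + 1)
    S, R = np.meshgrid(t, t)            # R[i, j] = t_i (row time), S[i, j] = t_j
    mx = np.maximum(R, S)
    if kind == "TC":
        return beta ** mx
    if kind == "DC":
        b1, b2 = beta
        return b1 ** ((R + S) / 2.0) * b2 ** np.abs(R - S)
    if kind == "SS":
        return beta ** (R + S + mx) / 2.0 - beta ** (3 * mx) / 6.0
    raise ValueError(f"unknown kernel: {kind}")

K_tc = kernel_matrix(5, 0.6, "TC")
```

The resulting matrix is symmetric positive definite, and its entries decay exponentially along the diagonal, which is what forces the sampled impulse responses to be stable; the scale $\lambda_i$ then multiplies $K_i$ so that the prior covariance of $w_i$ is $\lambda_i K_i$ as in (11).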
If $\lambda_i$ approaches zero, the corresponding impulse response $w_i$ is forced to zero, meaning it can be removed from the model.

Since $\sigma$ is non-negative, an Inverse-Gamma distribution is assigned as its conjugate prior. Without a specific preference on $\sigma$, parameters $a_0$ and $b_0$ of the distribution are set to $0.001$, resulting in a non-informative prior:

$$p(\sigma; a_0, b_0) = \mathcal{IG}(\sigma; a_0, b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)} \sigma^{-a_0-1} e^{-\frac{b_0}{\sigma}}, \quad (13)$$

where $\Gamma(\cdot)$ is the gamma function.

Instead of assigning equal probability to all model structures, the prior distribution for $\mathcal{M}_k$ depends on the number of links, $M_k$. The cardinality of the set of all possible model structures is $|\mathcal{M}| = \sum_{i=0}^{M_1-1} C(M_1-1, i)$, where $C$ denotes the combination operator (i.e. $C(n, m) = \frac{n(n-1)\cdots(n-m+1)}{m!}$). The prior of $\mathcal{M}_k$ is a minor variant of the truncated Poisson distribution:

$$p(\mathcal{M}_k \mid \alpha) = \frac{\alpha^{M_k} (M_k!)^{-1}}{\sum_{i=1}^{|\mathcal{M}|} \alpha^{M_i} (M_i!)^{-1}}, \quad (14)$$

where $\alpha$ is the rate parameter of the Poisson distribution. Distribution (14) favours sparse topologies, since higher probability is assigned to model structures with a lower number of links. In addition, different topologies that have the same number of links are equally probable.

Finally, hyperpriors are assigned to the hyperparameters to complete the hierarchy. For the non-negative hyperparameter $\lambda_i$, an Inverse-Gamma distribution is applied as the conjugate prior: $p(\lambda_i; a_1, b_1) = \mathcal{IG}(\lambda_i; a_1, b_1)$. To impose sparsity, we set $a_1 = 2$ and $b_1 = 1$ so that the distribution has infinite variance (to support a wide domain) but puts most of its weight on small values. For hyperparameter $\beta_i$, a uniform distribution is employed as the prior:

$$TC/SS: \quad p(\beta_i) = 1, \quad \beta_i \in (0, 1),$$
$$DC: \quad p(\beta_i) = \frac{1}{2}, \quad \beta_{i1} \in (0, 1),\ \beta_{i2} \in (-1, 1). \quad (15)$$
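Under (14), every topology with $m$ links shares the weight $\alpha^m / m!$, and when $n$ candidate links are available there are $C(n, m)$ such topologies. A small sketch of the normalized log prior of one specific topology (names are ours):

```python
import math

def structure_log_prior(m, n, alpha):
    """Log of (14) for one specific topology with m active links out of n
    candidates; the normalizer sums alpha^i / i! over all 2^n structures,
    grouped by link count (C(n, i) structures have exactly i links)."""
    Z = sum(math.comb(n, i) * alpha**i / math.factorial(i) for i in range(n + 1))
    return m * math.log(alpha) - math.log(math.factorial(m)) - math.log(Z)

# With alpha < 1 the prior mass strictly decreases with the number of links.
lp = [structure_log_prior(m, 5, 0.5) for m in range(6)]
```

Summing the prior over all $C(n, m)$ topologies for each $m$ recovers a total mass of one, and the monotone decay in $m$ is the precise sense in which (14) "favours sparse topologies".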
The conjugate Gamma distribution is assigned to hyperparameter $\alpha$:

$$p(\alpha; a_2, b_2) = \mathrm{Gamma}(\alpha; a_2, b_2) = \frac{b_2^{a_2}}{\Gamma(a_2)} \alpha^{a_2-1} e^{-b_2\alpha}, \quad (16)$$

where $\alpha$ equals the mean of the Poisson distribution. To promote sparse topologies, we set $a_2 = 0.1$ and $b_2 = 1$ so that $p(\alpha; a_2, b_2)$ approaches infinity as $\alpha \to 0$. Nevertheless, since parameters $a_2$ and $b_2$ are deep in the hierarchy, they have little impact on the model.

D. The posterior distributions

Consider a heterogeneous dataset containing $L$ independent time series of the target network, collected under different experimental conditions. It is reasonable to assume that the internal dynamics of the network may vary with experimental conditions but that the network topology remains unchanged. Therefore, impulse responses under different experimental conditions are independently distributed. Based on Bayes' rule and after completing squares, the posterior distribution of DSF (8) is as follows:

$$p(\mathcal{M}_k, W, \beta, \lambda, \sigma, \alpha \mid Y) \propto p(Y \mid W, \sigma, \mathcal{M}_k) \left[ \prod_{i=1}^{L} p(W_i \mid \beta, \lambda, \sigma, \mathcal{M}_k) \right] p(\sigma)\, p(\beta \mid \mathcal{M}_k)\, p(\lambda \mid \mathcal{M}_k)\, p(\mathcal{M}_k \mid \alpha)\, p(\alpha)$$
$$\propto \prod_{j=1}^{L} (2\pi\sigma_j)^{-\frac{N_j-T}{2}} \exp\left\{ -\frac{1}{2} Y_j' (\sigma_j I + \Phi_j \Lambda K \Phi_j')^{-1} Y_j \right\} \times |2\pi\Lambda K|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} (W_j - \mu_j)' \Sigma_j^{-1} (W_j - \mu_j) \right\}$$
$$\times \sigma_j^{-a_0-1} \exp\left\{ -\frac{b_0}{\sigma_j} \right\} \times \alpha^{a_2-1} \exp\{-b_2\alpha\} \times \frac{\alpha^{M_k}(M_k!)^{-1}}{\sum_{i=1}^{|\mathcal{M}|} \alpha^{M_i}(M_i!)^{-1}} \prod_{i=1}^{M_k} \frac{b_1^{a_1}}{2\Gamma(a_1)} \lambda_i^{-a_1-1} \exp\left\{ -\frac{b_1}{\lambda_i} \right\}, \quad (17)$$

where subscript $j$ denotes the index of the experiment. Note that the hyperparameters are shared by the different experiments, reflecting the initial belief that the variation of system dynamics is limited. The number of measurements of the $j$th experiment is $N_j$, and

$$K = \mathrm{blkdiag}\{K_1, \cdots, K_{M_k}\}, \qquad \Lambda = \mathrm{diag}\{\lambda\} \otimes I_T,$$
$$\Sigma_j^{-1} = \frac{1}{\sigma_j} \Phi_j' \Phi_j + (\Lambda K)^{-1}, \qquad \mu_j = \frac{1}{\sigma_j} \Sigma_j \Phi_j' Y_j. \quad (18)$$
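The per-experiment posterior moments in (18) are a standard Gaussian-linear computation. A sketch (variable names are ours; $\Lambda K$ is passed as a single matrix):

```python
import numpy as np

def posterior_moments(Phi, Y, sigma, LamK):
    """mu_j and Sigma_j from (18):
    Sigma^{-1} = Phi'Phi / sigma + (Lambda K)^{-1}, mu = Sigma Phi'Y / sigma."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma + np.linalg.inv(LamK))
    mu = Sigma @ Phi.T @ Y / sigma
    return mu, Sigma

# Sanity run: with weak noise and a broad prior, mu recovers the true weights.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 4))
w_true = np.array([1.0, -0.5, 0.0, 0.2])
Y = Phi @ w_true + 0.01 * rng.normal(size=30)
mu, Sigma = posterior_moments(Phi, Y, sigma=1e-4, LamK=np.eye(4))
```

In the sampler these two quantities are exactly what the direct draw of $W_j$ conditions on, so each Gibbs step for $W$ reduces to one multivariate Gaussian draw per experiment.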
According to the full Bayesian model (17), the conditional posterior distributions of the random variables are listed below for further discussion:

$$p(W \mid \beta, \lambda, \sigma, \alpha, \mathcal{M}_k, Y) = \prod_{j=1}^{L} \mathcal{N}(W_j \mid \mu_j, \Sigma_j),$$
$$p(\sigma \mid W, \beta, \lambda, \alpha, \mathcal{M}_k, Y) = \prod_{j=1}^{L} \mathcal{IG}(\sigma_j; a_{\sigma_j}, b_{\sigma_j}),$$
$$p(\alpha \mid W, \beta, \lambda, \sigma, \mathcal{M}_k, Y) \propto \frac{\alpha^{a_2-1+M_k} (M_k!)^{-1}}{\sum_{i=1}^{|\mathcal{M}|} \alpha^{M_i} (M_i!)^{-1}} e^{-b_2\alpha}, \quad (19)$$

where

$$a_{\sigma_j} = a_0 + \frac{N_j - T}{2}, \qquad b_{\sigma_j} = b_0 + \frac{\| Y_j - \Phi_j W_j \|_2^2}{2}. \quad (20)$$

By marginalizing $W$ out of the full Bayesian model (17), the reduced joint posterior distribution of $\mathcal{M}_k$, $\beta$ and $\lambda$ is as follows:

$$p(\beta, \lambda, \mathcal{M}_k \mid \sigma, \alpha, Y) \propto \prod_{j=1}^{L} |\sigma_j I + \Phi_j \Lambda K \Phi_j'|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} Y_j' (\sigma_j I + \Phi_j \Lambda K \Phi_j')^{-1} Y_j \right\}$$
$$\times \frac{\alpha^{M_k}(M_k!)^{-1}}{\sum_{i=1}^{|\mathcal{M}|} \alpha^{M_i}(M_i!)^{-1}} \prod_{i=1}^{M_k} \frac{b_1^{a_1}}{2\Gamma(a_1)} \lambda_i^{-a_1-1} \exp\left\{ -\frac{b_1}{\lambda_i} \right\}. \quad (21)$$

IV. NETWORK INFERENCE USING RJMCMC

A. Sampler with fixed topology

With the full Bayesian model, we are interested in the posterior distribution of model structures, $p(\mathcal{M}_k \mid Y)$, by which we can determine the most likely network topology. Given the estimated model structure, we can evaluate the impulse responses and noise variance by calculating their expectations, which requires exploring $p(W \mid \mathcal{M}_k, Y)$ and $p(\sigma \mid \mathcal{M}_k, Y)$. However, the distributions $p(\mathcal{M}_k \mid Y)$, $p(W \mid \mathcal{M}_k, Y)$ and $p(\sigma \mid \mathcal{M}_k, Y)$ are intractable, since they require high-dimensional integrals of the nonlinear Bayesian model in (17). To solve the problem, numerical sampling methods are applied in our framework.

To begin with, assume that the topology of the target network is known a priori (e.g. $\mathcal{M}_k$). Since the dimension of the random variables is unchanged, a traditional MCMC algorithm is sufficient to draw samples from distribution (17). To improve the convergence property, some random quantities are marginalized out in certain sampling steps.
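The $W$-marginalized density (21) is what the hyperparameter and topology moves must repeatedly evaluate: each experiment contributes a Gaussian marginal-likelihood factor. A sketch of that factor (names are ours), which also illustrates the ARD effect: with signal present, shrinking the kernel scale towards zero destroys the fit and the marginal likelihood drops:

```python
import numpy as np

def log_marginal_term(Y, Phi, LamK, sigma):
    """Per-experiment factor of (21), up to an additive constant:
    -0.5*log|S| - 0.5*Y'S^{-1}Y with S = sigma*I + Phi (Lambda K) Phi'."""
    S = sigma * np.eye(len(Y)) + Phi @ LamK @ Phi.T
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * logdet - 0.5 * Y @ np.linalg.solve(S, Y)

# Active kernel scale vs. a scale driven (almost) to zero by ARD.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))
Y = Phi @ np.array([1.0, 1.0, 1.0]) + 0.05 * rng.normal(size=20)
lm_active = log_marginal_term(Y, Phi, 1.0 * np.eye(3), sigma=0.01)
lm_pruned = log_marginal_term(Y, Phi, 1e-8 * np.eye(3), sigma=0.01)
```

Conversely, when a block of $\Phi$ carries no signal, the determinant penalty makes the pruned scale the better-scoring option, which is how the sampler deactivates spurious links.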
The resulting MH-within-PCG sampler is designed following the rules of marginalization, permutation and trimming. The sampler is then further modified to explore networks with unknown topology. Gibbs sampling is adopted to construct the basic sampler (Sampler 1) for drawing samples.

Sampler 1: Blocked Gibbs sampler
1: Sample $p(W^{t+1} \mid \beta^t, \lambda^t, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$
2: Sample $p(\beta^{t+1}, \lambda^{t+1} \mid W^{t+1}, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$
3: Sample $p(\sigma^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \alpha^t, \mathcal{M}_k, Y)$
4: Sample $p(\alpha^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \sigma^{t+1}, \mathcal{M}_k, Y)$

Since the distributions in steps 2 and 4 are known only up to a normalization constant, these two sampling steps should be replaced by the MH method, leading to an MH-within-Gibbs sampler (Sampler 2). Such a replacement maintains the invariant distribution because no marginal distributions are called in the sampler.

Sampler 2: MH-within-Gibbs sampler
1: Sample $p(W^{t+1} \mid \beta^t, \lambda^t, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$
2: Sample $p(\beta^{t+1}, \lambda^{t+1} \mid W^{t+1}, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$ using MH
3: Sample $p(\sigma^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \alpha^t, \mathcal{M}_k, Y)$
4: Sample $p(\alpha^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \sigma^{t+1}, \mathcal{M}_k, Y)$ using MH

As indicated in (21), one can marginalize $W$ out of the distribution $p(\beta, \lambda \mid W, \sigma, \alpha, \mathcal{M}_k, Y)$ in step 2 of Sampler 2. According to the rule of marginalization, the reduced sampling step is followed immediately by a direct draw from the conditional distribution of $W$, resulting in an MH-within-PCG sampler (Sampler 3). A quantity that is not conditioned on in the next step is labelled with an asterisk.

Sampler 3: Marginalization
1: Sample $p(W^\star \mid \beta^t, \lambda^t, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$
2: Sample $p(\beta^{t+1}, \lambda^{t+1} \mid \sigma^t, \alpha^t, \mathcal{M}_k, Y)$ using MH
3: Sample $p(W^{t+1} \mid \beta^{t+1}, \lambda^{t+1}, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$
4: Sample $p(\sigma^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \alpha^t, \mathcal{M}_k, Y)$
5: Sample $p(\alpha^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \sigma^{t+1}, \mathcal{M}_k, Y)$ using MH

To further simplify Sampler 3, permutation and trimming are applied. Note that, in order to maintain the invariant distribution, steps 2 and 3 can neither be separated nor swapped. Since the sampled $W$ in step 1 is not conditioned on in step 2, step 1 is trimmed out safely. The resulting sampler is presented in Sampler 4.

Sampler 4: Permutation and Trimming
1: Sample $p(\beta^{t+1}, \lambda^{t+1} \mid \sigma^t, \alpha^t, \mathcal{M}_k, Y)$ using MH
2: Sample $p(W^{t+1} \mid \beta^{t+1}, \lambda^{t+1}, \sigma^t, \alpha^t, \mathcal{M}_k, Y)$
3: Sample $p(\sigma^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \alpha^t, \mathcal{M}_k, Y)$
4: Sample $p(\alpha^{t+1} \mid W^{t+1}, \beta^{t+1}, \lambda^{t+1}, \sigma^{t+1}, \mathcal{M}_k, Y)$ using MH

The sampling steps of Sampler 4 can be rearranged in many other ways. For example, Sampler 4 can be reordered as 4 → 3 → 1 → 2. However, not all arrangements are valid. For instance, the sequence 2 → 1 → 3 → 4, derived by trimming out step 3 of Sampler 3, is incorrect because it violates the rule of trimming. The point is that, for a PCG sampler, the sampling steps cannot be replaced by their MH counterparts directly; otherwise, the invariant distribution may be changed. To avoid this type of error, it is necessary to deduce an MH-within-PCG sampler step-by-step.

According to (19), steps 2 and 3 of Sampler 4 can be implemented directly. Only steps 1 and 4 require further discussion. Since these two steps employ the MH approach, proposal distributions for the Markov states are first designed to produce candidate samples. Since $\lambda$ are non-negative, a truncated Gaussian distribution is adopted to draw the proposal.
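Because λ must stay non-negative, the proposal density is a Gaussian truncated to an interval. A sketch of that density, written directly from the error function (function names are ours):

```python
from math import erf, exp, pi, sqrt

def trunc_normal_pdf(x, mu0, sd0, lo, hi):
    """Density of a Gaussian N(mu0, sd0^2) truncated to (lo, hi); zero outside."""
    if not lo < x < hi:
        return 0.0
    F = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF
    f = exp(-0.5 * ((x - mu0) / sd0) ** 2) / sqrt(2.0 * pi)
    return f / (sd0 * (F((hi - mu0) / sd0) - F((lo - mu0) / sd0)))
```

Truncating a zero-mean unit Gaussian to $(0, \infty)$ simply doubles the density; more generally, the $1 + \mathrm{erf}(\cdot)$ normalizers of the forward and reverse proposals do not cancel, which is why erf correction factors appear in the MH acceptance ratio.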
For a Gaussian distribution of θ with mean μ_0 and standard deviation σ_0, the truncated probability density function on (l, u) is

\[
p_N(\theta; \mu_0, \sigma_0, l, u) = \frac{f\!\left(\frac{\theta - \mu_0}{\sigma_0}\right)}{\sigma_0 \left[ F\!\left(\frac{u - \mu_0}{\sigma_0}\right) - F\!\left(\frac{l - \mu_0}{\sigma_0}\right) \right]}, \tag{22}
\]

where

\[
f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \qquad F(x) = \frac{1}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right), \qquad \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt. \tag{23}
\]

The proposal distribution for λ is $q(\lambda \mid \lambda^t) = \prod_{i=1}^{M_k} p_N(\lambda_i; \lambda_i^t, 0.05, 0, +\infty)$. To avoid a high rejection rate, the proposed λ^p deviates from the current state λ^t only with small variance. For the hyperparameters β, the proposal for each element is drawn independently as follows. For a random variable θ ∈ (l, u) with expected value \(\bar{\theta}\):

\[
p_U(\theta; \bar{\theta}, l, u, \varepsilon) =
\begin{cases}
U\!\left(\bar{\theta} - \frac{\varepsilon}{2},\, \bar{\theta} + \frac{\varepsilon}{2}\right) & l + \frac{\varepsilon}{2} < \bar{\theta} < u - \frac{\varepsilon}{2} \\[2pt]
U(l,\, l + \varepsilon) & \bar{\theta} \le l + \frac{\varepsilon}{2} \\[2pt]
U(u - \varepsilon,\, u) & \bar{\theta} \ge u - \frac{\varepsilon}{2},
\end{cases} \tag{24}
\]

where U(a, b) is the uniform distribution on (a, b) and ε is the selection window for sampling. Hence, the proposal distribution for β_i is

\[
\begin{aligned}
\text{TC/SS:}\quad & q(\beta_i \mid \beta_i^t) = p_U(\beta_i; \beta_i^t, 0, 1, 0.1) \\
\text{DC:}\quad & q(\beta_i \mid \beta_i^t) = p_U(\beta_{i1}; \beta_{i1}^t, 0, 1, 0.1)\, p_U(\beta_{i2}; \beta_{i2}^t, -1, 1, 0.1).
\end{aligned} \tag{25}
\]

According to detailed balance, the acceptance probability for step 1 is calculated as follows:

\[
A_U(\beta^p, \lambda^p \mid \beta^t, \lambda^t) = \min\left\{ 1,\; \frac{p(\beta^p, \lambda^p \mid \sigma^t, \alpha^t, M_k, Y)\, q(\beta^t \mid \beta^p)\, q(\lambda^t \mid \lambda^p)}{p(\beta^t, \lambda^t \mid \sigma^t, \alpha^t, M_k, Y)\, q(\beta^p \mid \beta^t)\, q(\lambda^p \mid \lambda^t)} \right\} = \min\{1,\; r_U(\beta^p, \lambda^p \mid \beta^t, \lambda^t)\}, \tag{26}
\]

where

\[
r_U(\beta^p, \lambda^p \mid \beta^t, \lambda^t) = \prod_{j=1}^{L} \frac{\exp\{-\frac{1}{2} Y_j' (\sigma_j^t I + \Phi_j \Lambda^p K^p \Phi_j')^{-1} Y_j\}\, |\sigma_j^t I + \Phi_j \Lambda^p K^p \Phi_j'|^{-\frac{1}{2}}}{\exp\{-\frac{1}{2} Y_j' (\sigma_j^t I + \Phi_j \Lambda^t K^t \Phi_j')^{-1} Y_j\}\, |\sigma_j^t I + \Phi_j \Lambda^t K^t \Phi_j'|^{-\frac{1}{2}}} \times \prod_{i=1}^{M_k} \left(\frac{\lambda_i^p}{\lambda_i^t}\right)^{-a_1 - 1} \exp\!\left\{ \frac{b_1 (\lambda_i^p - \lambda_i^t)}{\lambda_i^p \lambda_i^t} \right\} \frac{1 + \mathrm{erf}\!\left(\frac{\lambda_i^t}{\sqrt{2}\,\sigma_0}\right)}{1 + \mathrm{erf}\!\left(\frac{\lambda_i^p}{\sqrt{2}\,\sigma_0}\right)}. \tag{27}
\]

In practice, the variance of the proposal distributions (i.e.
σ_0 and ε) is tuned during inference so that acceptances account for about 40% of the total iterations, following a heuristic rule in [32].

The MH sampling step for α in Sampler 4 is designed in the same way. The proposal is drawn from a Gamma distribution: $q(\alpha \mid \alpha^t) \propto \alpha^{a_2 - 1 + M_k} e^{-(1 + b_2)\alpha}$. Therefore, the acceptance probability is

\[
A(\alpha^p \mid \alpha^t) = \min\left\{ 1,\; \frac{e^{-\alpha^t} \sum_{i=1}^{|\mathcal{M}|} (\alpha^t)^{M_i} (M_i!)^{-1}}{e^{-\alpha^p} \sum_{i=1}^{|\mathcal{M}|} (\alpha^p)^{M_i} (M_i!)^{-1}} \right\}. \tag{28}
\]

Note that step 4 is independent of the other sampling steps, because the hyperparameter α is only related to M_k, which is fixed in this section. As a result, step 4 could be removed from the sampler without affecting its convergence properties. However, if M_k needs to be sampled (unknown topology), step 4 must be retained.

B. Sampler with unknown topology

If the network topology is unknown, M_k is treated as a random variable and must be sampled for topology detection. To sample from (17), a blocked Gibbs sampler (Sampler 5) is applied, as in the last section. Since the dimensions of W, β and λ depend on M_k, these random variables are grouped together.

Sampler 5: Blocked Gibbs sampler
1: Sample p(W^{t+1}, β^{t+1}, λ^{t+1}, M_k^{t+1} | σ^t, α^t, Y)
2: Sample p(σ^{t+1} | W^{t+1}, β^{t+1}, λ^{t+1}, α^t, M_k^{t+1}, Y)
3: Sample p(α^{t+1} | W^{t+1}, β^{t+1}, λ^{t+1}, σ^{t+1}, M_k^{t+1}, Y)

Since the distributions in steps 1 and 3 of Sampler 5 cannot be sampled directly, these two steps are implemented using the MH method, leading to an MH-within-Gibbs sampler (Sampler 6).
Sampler 6: MH-within-Gibbs sampler
1: Sample p(W^{t+1}, β^{t+1}, λ^{t+1}, M_k^{t+1} | σ^t, α^t, Y) using MH
2: Sample p(σ^{t+1} | W^{t+1}, β^{t+1}, λ^{t+1}, α^t, M_k^{t+1}, Y)
3: Sample p(α^{t+1} | W^{t+1}, β^{t+1}, λ^{t+1}, σ^{t+1}, M_k^{t+1}, Y) using MH

According to the rule of marginalization, after marginalizing w out of step 1, one must sample w immediately from its conditional distribution in the next step. The resulting MH-within-PCG sampler is shown as Sampler 7.

Sampler 7: MH-within-PCG
1: Sample p(β^{t+1}, λ^{t+1}, M_k^{t+1} | σ^t, α^t, Y) using MH
2: Sample p(W^{t+1} | β^{t+1}, λ^{t+1}, σ^t, α^t, M_k^{t+1}, Y)
3: Sample p(σ^{t+1} | W^{t+1}, β^{t+1}, λ^{t+1}, α^t, M_k^{t+1}, Y)
4: Sample p(α^{t+1} | W^{t+1}, β^{t+1}, λ^{t+1}, σ^{t+1}, M_k^{t+1}, Y) using MH

Since M_k is fixed in steps 2, 3 and 4 of Sampler 7, the dimensions of their random variables are unchanged; the corresponding steps of Sampler 4 can therefore be applied here directly. Sampler 7 explores different model structures (topologies) via step 1. Since the sampling space of step 1 is composed of multiple subspaces of differing dimensionality, the sampler must be capable of traversing these subspaces in order to explore the Bayesian model sufficiently. Towards this end, the RJMCMC scheme is applied. To realize effective jumps between the parameter subspaces, different types of moves are proposed for the Markov chain. Successful moves should both allow the Markov chain to visit all possible subspaces and promote a reasonable acceptance probability. Motivated by this idea, we propose three types of moves, as follows:

Birth Move: The number of links in the next state is one more than that of the current state (i.e. M_k^{t+1} = M_k^t + 1). Furthermore, the zero structures of M_k^{t+1} and M_k^t differ at only one entry. For example, the Boolean structure of M_k^t is [1 0 0] and that of M_k^{t+1} is [1 0 1].
Death Move: The number of links in the next state is one less than that of the current state (i.e. M_k^{t+1} = M_k^t - 1). Furthermore, the zero structures of M_k^{t+1} and M_k^t differ at only one entry. For example, the Boolean structure of M_k^t is [1 0 1] and that of M_k^{t+1} is [1 0 0].

Update Move: The topology of the network is unchanged in the next state (i.e. M_k^{t+1} = M_k^t) but the other random variables are updated.

The birth and death moves of RJMCMC encourage a global search of the parameter subspaces, leading to a thorough exploration of network topology. The death move is equivalent to setting an ARD parameter λ_i to zero, whilst the birth move reverses this process by retrieving a non-zero λ_i. As a result, the effect of ARD is maximally activated. The update move infers the internal dynamics of the network that interpret the dataset.

Kernel-based methods often apply ARD for topology detection, where the sparsity profile of the ARD parameters determines the network topology. Nevertheless, due to local optima and numerical errors, the estimated ARD parameters are often not strictly zero. One can try different initial points to mitigate local optima, or employ model selection strategies (e.g. backward selection) to enforce sparsity. However, since these schemes either pick local and global optima equally likely or run the algorithm repeatedly, they raise the computational cost and can be highly inefficient.

The main advantage of RJMCMC is that it explores the parameter space in a highly effective way. A jump proposed by RJMCMC is not always accepted. Rather, the acceptance probability involves a trade-off between data fit and sparsity penalties: for example, r_U contains the ratio of the cost functions of different model structures, and these are the very cost functions minimized by the kernel-based methods.
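Concretely, a birth or death move flips exactly one entry of the Boolean link-indicator vector. A minimal sketch (function and variable names are illustrative, not from the paper; the paper additionally always retains the auto-regression terms in the death move, which is omitted here):

```python
import random

random.seed(1)

def birth(structure):
    """Turn one absent link (a 0-entry, chosen uniformly) into a present link."""
    zeros = [i for i, s in enumerate(structure) if s == 0]
    i = random.choice(zeros)
    return structure[:i] + [1] + structure[i + 1:]

def death(structure):
    """Turn one present link (a 1-entry, chosen uniformly) into an absent link."""
    ones = [i for i, s in enumerate(structure) if s == 1]
    i = random.choice(ones)
    return structure[:i] + [0] + structure[i + 1:]

state = [1, 0, 0]              # Boolean structure of M_k^t
after_birth = birth(state)     # one more link, e.g. [1, 0, 1]
after_death = death(after_birth)

assert sum(after_birth) == sum(state) + 1
assert sum(after_death) == sum(after_birth) - 1
```

The update move is the complementary case: the indicator vector is left untouched and only the continuous quantities (W, β, λ, σ) are refreshed.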
Compared with the KEB approaches, which mainly search the single parameter space of the highest dimensionality (M_1), RJMCMC does not necessarily step into every parameter subspace: the Markov chain (consisting of the accepted Markov states) contains not all, but only a number of, the model structures embedded in the lower-dimensional subspaces. In addition, some model structures are visited only with low frequency, implying that they are unlikely to be the ground truth. As a result, RJMCMC is able to focus on exploring the parameter subspaces whose corresponding model structures are closer to the ground truth. Therefore, many local optima of the impulse responses and hyperparameters are avoided. Compared with KEB, RJMCMC greatly increases inference accuracy and improves computational efficiency.

To realize the birth and death moves, the following algorithms (Algorithms 1 and 2) are proposed.

Algorithm 1 Birth Move
1: With probability P_B, choose the Birth move.
2: Select a node to be added to the current topology at random from the uniform distribution: $q_B(i \mid \mathcal{M}_k^t) = \frac{1}{M_1 - M_k^t}$.
3: Draw proposals β_i^p and λ_i^p from q_B(β_i, λ_i) = q_B(β_i) q_B(λ_i), with β^t and λ^t unchanged, where

\[
\begin{aligned}
& q_B(\lambda_i) = \mathcal{IG}(\lambda_i; a_1, b_1) \\
\text{TC/SS:}\quad & q_B(\beta_i) = U(\beta_i; 0, 1) \\
\text{DC:}\quad & q_B(\beta_i) = U(\beta_{i1}; 0, 1)\, U(\beta_{i2}; -1, 1)
\end{aligned} \tag{29}
\]

4: Accept with probability A_B. If accepted, combine β_i^p and λ_i^p with β^t and λ^t to generate β^{t+1} and λ^{t+1}.

Algorithm 2 Death Move
1: With probability P_D, choose the Death move.
2: Select a node to be removed from the current topology at random from the uniform distribution: $q_D(i \mid \mathcal{M}_k^t) = \frac{1}{M_k^t - 1}$, where the auto-regression terms are always retained.
3: Remove β_i^t and λ_i^t from β^t and λ^t, respectively, with the other elements unchanged.
4: Accept with probability A_D.
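Step 3 of the birth move draws the new link's hyperparameters from their priors; assuming IG in (29) denotes an inverse-Gamma distribution, the draw can be sketched as follows. The shape/scale values a1 = 2, b1 = 1 are placeholders, not the paper's settings:

```python
import random

random.seed(2)

a1, b1 = 2.0, 1.0  # assumed inverse-Gamma shape and scale (illustrative values)

def propose_birth_parameters():
    """Draw (beta, lambda) for a newly born link from their priors (TC/SS kernel case)."""
    beta_p = random.uniform(0.0, 1.0)          # beta ~ U(0, 1), per (29)
    # If X ~ Gamma(a1, 1), then b1 / X ~ Inverse-Gamma(a1, b1).
    lam_p = b1 / random.gammavariate(a1, 1.0)  # lambda ~ IG(a1, b1), per (29)
    return beta_p, lam_p

beta_p, lam_p = propose_birth_parameters()
assert 0.0 <= beta_p <= 1.0 and lam_p > 0.0
```

For the DC kernel, the only change is that β has two components, drawn from U(0, 1) and U(-1, 1) respectively.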
The acceptance probabilities for the birth and death moves are calculated based on detailed balance:

\[
\begin{aligned}
A_B(\beta^p, \lambda^p, \mathcal{M}_k^p \mid \beta^t, \lambda^t, \mathcal{M}_k^t) &= \min\{1,\; r_B(\beta^p, \lambda^p, \mathcal{M}_k^p \mid \beta^t, \lambda^t, \mathcal{M}_k^t)\} \\
A_D(\beta^p, \lambda^p, \mathcal{M}_k^p \mid \beta^t, \lambda^t, \mathcal{M}_k^t) &= \min\{1,\; r_D(\beta^p, \lambda^p, \mathcal{M}_k^p \mid \beta^t, \lambda^t, \mathcal{M}_k^t)\},
\end{aligned} \tag{30}
\]

where

\[
\begin{aligned}
r_B(\beta^p, \lambda^p, \mathcal{M}_k^p \mid \beta^t, \lambda^t, \mathcal{M}_k^t) &= \prod_{j=1}^{L} \frac{\exp\{-\frac{1}{2} Y_j' (\sigma_j^t I + \Phi_j \Lambda^p K^p \Phi_j')^{-1} Y_j\}}{\exp\{-\frac{1}{2} Y_j' (\sigma_j^t I + \Phi_j \Lambda^t K^t \Phi_j')^{-1} Y_j\}} \cdot \frac{|\sigma_j^t I + \Phi_j \Lambda^p K^p \Phi_j'|^{-\frac{1}{2}}}{|\sigma_j^t I + \Phi_j \Lambda^t K^t \Phi_j'|^{-\frac{1}{2}}} \times \frac{P_D}{P_B} \cdot \frac{\alpha^t (M_1 - M_k^t)}{M_k^p (M_k^p - 1)} \\
r_D(\beta^p, \lambda^p, \mathcal{M}_k^p \mid \beta^t, \lambda^t, \mathcal{M}_k^t) &= r_B^{-1}(\beta^t, \lambda^t, \mathcal{M}_k^t \mid \beta^p, \lambda^p, \mathcal{M}_k^p).
\end{aligned} \tag{31}
\]

Finally, the update move is shown in Algorithm 3. Since the topology is fixed, the proposal distributions and acceptance probability are exactly the same as those of Sampler 4.

Algorithm 3 Update Move
1: With probability P_U, choose the Update move.
2: Propose β^p and λ^p using the uniform and truncated Gaussian distributions, respectively.
3: Accept with probability A_U.

Note that the probabilities of the three moves depend on M_k^t as follows:

\[
P_B = \begin{cases} 0.3 & 1 < M_k^t < M_1 \\ 0 & M_k^t = M_1 \\ 0.6 & M_k^t = 1, \end{cases} \qquad
P_D = \begin{cases} 0.3 & 1 < M_k^t < M_1 \\ 0.6 & M_k^t = M_1 \\ 0 & M_k^t = 1, \end{cases} \qquad
P_U = 1 - P_B - P_D. \tag{32}
\]

To conclude, Algorithm 4 presents network inference using RJMCMC.

Algorithm 4 RJMCMC for network inference
1: Initialize W^0, β^0, λ^0, σ^0, α^0, M_k^0.
2: for t = 1 : t_max do
3:   Sample P_move from U(0, 1).
4:   if P_move ≤ P_B then
5:     Execute the Birth move (Algorithm 1).
6:   else if P_move ≤ P_B + P_D then
7:     Execute the Death move (Algorithm 2).
8:   else
9:     Execute the Update move (Algorithm 3).
10:  end if
11:  Sample W^t according to (19).
12:  Sample σ^t according to (19).
13:  Sample α^t from p(α | W^t, β^t, λ^t, σ^t, M_k^t) using step 4 of Sampler 4.
14: end for
15: Store {W^t}, {M_k^t} and {σ^t}.

C.
Detection of topology and estimation of model parameters

Detection of the network topology is based on the posterior distribution of model structures, p(M_k | Y). By the merit of RJMCMC, one can estimate the true distribution using the empirical distribution constructed from the samples:

\[
P(\mathcal{M}_k = \mathcal{M}_i \mid Y) = \frac{1}{t_{\max}} \sum_{t=1}^{t_{\max}} \mathbb{1}_{\mathcal{M}_i}(\mathcal{M}_k^t), \tag{33}
\]

where

\[
\mathbb{1}_x(y) = \begin{cases} 1 & y = x \\ 0 & y \ne x. \end{cases} \tag{34}
\]

Given the empirical distribution (33), the most likely network topology is estimated by maximum a posteriori (MAP): $\mathcal{M}_{opt} = \arg\max_{\mathcal{M}_k} P(\mathcal{M}_k \mid Y)$.

In biology, practitioners often prefer to evaluate the probability of each possible link. The generated topology is then fully connected, and the confidence of each link is measured by its probability. To achieve this, one can evaluate the probability of the link from node j to node i (j → i) as follows:

\[
P(j \to i \mid Y) = \sum_{k=1}^{|\mathcal{M}|} P(j \to i, \mathcal{M}_k \mid Y) = \frac{1}{t_{\max}} \sum_{\mathcal{M}_q \,\mid\, j \to i \,\in\, \mathcal{M}_q} \; \sum_{t=1}^{t_{\max}} \mathbb{1}_{\mathcal{M}_q}(\mathcal{M}_k^t), \tag{35}
\]

where j → i ∈ M_q means that link j → i is contained in topology M_q. Finally, the impulse responses and noise variance are estimated for model simulation and prediction as follows:

\[
\hat{w} = E(w \mid \mathcal{M}_k = \mathcal{M}_{opt}, Y) = \frac{\sum_{t=1}^{t_{\max}} \mathbb{1}_{\mathcal{M}_{opt}}(\mathcal{M}_k^t)\, w^t}{\sum_{t=1}^{t_{\max}} \mathbb{1}_{\mathcal{M}_{opt}}(\mathcal{M}_k^t)}, \qquad
\hat{\sigma} = E(\sigma \mid \mathcal{M}_k = \mathcal{M}_{opt}, Y) = \frac{\sum_{t=1}^{t_{\max}} \mathbb{1}_{\mathcal{M}_{opt}}(\mathcal{M}_k^t)\, \sigma^t}{\sum_{t=1}^{t_{\max}} \mathbb{1}_{\mathcal{M}_{opt}}(\mathcal{M}_k^t)}. \tag{36}
\]

V. SIMULATION

To compare our method with KEB inference approaches, we conducted two series of Monte Carlo simulations. Different kernel functions, including the DC, SS and TC kernels, were used for inference. KEB solves the following optimization problem (more details can be found in [12]):

\[
\arg\min_{\sigma, \gamma, \beta} \; Y' (\sigma I + \Phi \Lambda K \Phi')^{-1} Y + \ln |\sigma I + \Phi \Lambda K \Phi'|, \tag{37}
\]

where Λ ∈ R^{(N−T)×T(p+m)} contains all the ARD parameters, whose sparsity profile determines the network topology. To further improve the detection of zero ARD elements, the backward selection method is used [12].
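Equations (33)-(35) reduce to simple frequency counts over the stored topology samples {M_k^t}. A short sketch, with each sampled topology represented as a set of links (this representation and all names are illustrative):

```python
from collections import Counter

# Four stored topology samples; each is a frozenset of directed links (j, i).
samples = [frozenset({(1, 2)}), frozenset({(1, 2)}),
           frozenset({(1, 2), (2, 3)}), frozenset({(2, 3)})]
t_max = len(samples)

# (33): empirical posterior P(M_k | Y), and the MAP topology.
posterior = {m: c / t_max for m, c in Counter(samples).items()}
map_topology = max(posterior, key=posterior.get)

# (35): link confidence P(j -> i | Y) = fraction of samples containing the link.
def link_probability(j, i):
    return sum(1 for m in samples if (j, i) in m) / t_max

assert map_topology == frozenset({(1, 2)})
assert link_probability(1, 2) == 0.75
assert link_probability(2, 3) == 0.5
```

The conditional estimates (36) follow the same pattern: average only those w^t and σ^t whose iteration visited the MAP topology.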
In the first part of the simulations, random DSF networks were generated with different types of topologies (including a ring structure). They were simulated under various noise levels and inferred using time series data of various lengths. To investigate the algorithm's performance when inferring real-world networks, our method was further tested on a synthetic gene regulatory network of the circadian clock of Arabidopsis thaliana. The method was compared with iCheMA, a state-of-the-art approach developed to infer biological networks [33].

For the DSF networks, two criteria are used to evaluate the performance of the algorithms, namely the True Positive Rate (TPR) and Precision (PREC). TPR is the percentage of the true links in the ground truth that are successfully inferred; PREC is the rate of correct links over all inferred links. Together, TPR and PREC indicate the accuracy of the inferred networks: if TPR is low, the inferred network misses many true links and thus lacks useful information; if PREC is low, the generated network is not reliable. To investigate the accuracy of the estimated system dynamics, the identified models were applied to predict a validation dataset that was not used for inference. The prediction accuracy is measured by the following metric:

\[
\text{fitness} = 100 \left( 1 - \frac{\| y - \hat{y} \|}{\| y - \bar{y} \|} \right), \tag{38}
\]

where y is the validation data of a certain node, ŷ is the predicted output and ȳ is the mean of the validation data. The average fitness over all nodes is calculated for discussion.

For the gene regulatory network, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPREC) are applied instead. These two criteria are widely used in systems biology to evaluate the accuracy of inferred biological networks. The receiver operating characteristic curve and the precision-recall curve are plotted based on the link confidence.
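The fitness metric (38) can be computed directly; a minimal sketch, assuming the norm in (38) is Euclidean:

```python
import math

def fitness(y, y_hat):
    """Fitness (38): 100 * (1 - ||y - y_hat|| / ||y - y_bar||), Euclidean norm assumed."""
    y_bar = sum(y) / len(y)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)))
    den = math.sqrt(sum((a - y_bar) ** 2 for a in y))
    return 100.0 * (1.0 - num / den)

y = [1.0, 2.0, 3.0, 4.0]
assert fitness(y, y) == 100.0                      # perfect prediction scores 100
assert fitness(y, [sum(y) / len(y)] * len(y)) == 0.0  # predicting the mean scores 0
```

A model that predicts worse than the validation mean yields a negative fitness, which is why negative values appear in the prediction figures.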
The areas under these curves are then calculated. AUROC and AUPREC reveal similar information to TPR and PREC, respectively.

A. Random DSF networks

100 networks were generated with random topologies and internal dynamics. All networks contained 15 nodes, of which 10 were measured. Each node was independently driven by a measured input and by process noise. To generate a state-space model, a sparse stable matrix A ∈ R^{15×15} was first generated randomly using the function sprandn(n, n, density) in Matlab. Matrix A was guaranteed to be stable (i.e. no eigenvalue outside the unit circle of the complex plane) using a brute-force strategy. No isolated nodes existed in the network. Figure 1 displays one example of the resulting networks.

To simulate the models, the inputs and process noise were both i.i.d. white Gaussian signals. The variance of the inputs was fixed to 1, whilst that of the process noise varied. The signal-to-noise ratio is defined as $SNR = 10 \log \frac{\sigma_u}{\sigma_e}$, where σ_u and σ_e are the signal variances of the inputs and noise, respectively. Only the first 10 states of the models were measured. The truncation length of the impulse responses was set to 20. The time series data were collected for inference with various lengths between 45 and 1000.

The average TPR and PREC over 100 trials are recorded in Tables I–III. In the best-case scenario (no process noise), RJMCMC outperforms the KEB methods in all cases, and RJMCMC endowed with different kernel functions presents similar results. In particular, given sufficiently long time series (> 65 points), RJMCMC offers nearly perfect inference. In contrast, Kernel_TC presents the weakest result. The TPR of Kernel_DC stays below 90%. Kernel_SS is always outperformed by RJMCMC unless 85 data points are used. As the SNR decreases to 10 dB, RJMCMC exhibits different performance with distinct kernel functions: RJMCMC_DC and RJMCMC_TC are both superior to RJMCMC_SS, and they show closely similar performance.
In particular, RJMCMC_DC and RJMCMC_TC are always capable of producing reliable networks (PREC > 97%). As the number of data points increases, they successfully capture most of the true links (TPR ≈ 90%). As in the previous simulation, in all cases the KEB methods are no better than RJMCMC.

Under the worst-case scenario (no inputs), the performance improvement of RJMCMC over the KEB methods is clearly evident. It is remarkable that the PREC of RJMCMC_DC and RJMCMC_TC is always above 95%, indicating that the inferred networks are highly reliable.

The validation results are shown in the box plots of Figure 2. The advantage of RJMCMC over the KEB methods is evident under the best-case scenario, whilst in the other cases it is less so: the prediction accuracy of RJMCMC is only slightly better than that of the KEB methods in the SNR = 10 dB and pure-noise cases. Nevertheless, Kernel_TC always presents the weakest result.

[Fig. 1. The structure of a randomly generated network. Solid lines with arrows represent links; red circles denote nodes.]

[Fig. 2. Prediction of randomly generated networks: box plots of prediction fitness (%) versus number of data points for the no-noise, SNR = 10 dB and pure-noise cases, comparing Kernel_DC, Kernel_SS, Kernel_TC, RJMCMC_DC, RJMCMC_SS and RJMCMC_TC.]

B.
Ring networks

TABLE I
INFERENCE OF RANDOM NETWORKS WITH NO NOISE

No noise       45 points       65 points       85 points
               PREC   TPR      PREC   TPR      PREC   TPR
Kernel_DC      91.2   54.7     95.7   73.5     99.4   84.2
Kernel_SS      82.8   60.7     89.6   93.2     99.9   99.9
Kernel_TC      50.1   17.0     76.9   29.4     91.2   40.5
RJMCMC_DC      99.4   90.5     100    99.3     100    100
RJMCMC_SS      93.4   91.2     100    99.3     100    100
RJMCMC_TC      99.6   91.6     100    99.3     100    100

TABLE II
INFERENCE OF RANDOM NETWORKS WITH NOISE THAT HAS 10 dB SNR

10 dB          100 points      200 points      300 points
               PREC   TPR      PREC   TPR      PREC   TPR
Kernel_DC      91.9   75.5     97.2   84.0     98.0   87.1
Kernel_SS      74.7   82.9     80.7   88.8     87.8   89.0
Kernel_TC      87.3   47.5     99.6   67.6     100    74.2
RJMCMC_DC      97.0   80.0     98.2   87.4     98.5   89.4
RJMCMC_SS      68.3   85.8     80.7   90.1     82.4   93.2
RJMCMC_TC      98.1   81.6     98.0   88.5     98.7   90.1

TABLE III
INFERENCE OF RANDOM NETWORKS WITH PURE NOISE

No input       300 points      500 points      1000 points
               PREC   TPR      PREC   TPR      PREC   TPR
Kernel_DC      81.0   71.5     85.6   73.9     97.2   75.3
Kernel_SS      66.4   68.8     80.5   70.9     82.0   72.8
Kernel_TC      81.6   59.5     89.2   66.2     93.4   71.3
RJMCMC_DC      96.2   69.2     98.8   76.3     97.5   83.0
RJMCMC_SS      78.6   74.3     87.1   76.5     88.4   84.5
RJMCMC_TC      95.7   70.9     96.7   76.7     98.9   81.8

100 networks with the fixed ring structure (Figure 3) were generated and simulated following the same protocol as the random DSFs. Each node was driven by independent process noise, and only one input entered the network, through a single node. Since the network forms a closed feedback loop and is extremely sparse, it is more challenging to infer. Table IV presents the inference results. The simulations indicate that RJMCMC is superior to the KEB methods. In particular, RJMCMC_DC and RJMCMC_TC present the best results.

[Fig. 3. A network with the ring structure. The symbol '~' denotes the input signal.]

[Fig. 4. Prediction of ring networks: prediction fitness (%) versus number of data points at 10 dB SNR, comparing Kernel_DC, Kernel_SS, Kernel_TC, RJMCMC_DC, RJMCMC_SS and RJMCMC_TC.]
The PREC of these two cases always exceeds 90% and increases to 98% given 400 data points, while only 1 true link is missed. For the KEB methods, either PREC or TPR is lower than in the corresponding RJMCMC cases. The validation results in Figure 4 show that RJMCMC outperforms the KEB methods, especially given lower numbers of data points.

TABLE IV
INFERENCE OF RING NETWORKS WITH 10 dB SNR

10 dB          100 points      200 points      300 points      400 points
               PREC   TPR      PREC   TPR      PREC   TPR      PREC   TPR
Kernel_DC      54.4   77.0     75.8   82.3     85.0   85.5     90.5   86.5
Kernel_SS      41.2   78.5     62.8   83.5     71.8   90.0     71.3   89.0
Kernel_TC      76.9   19.3     98.4   39.3     96.8   56.0     97.6   67.8
RJMCMC_DC      92.0   67.3     94.5   79.3     94.9   85.5     98.1   88.8
RJMCMC_SS      60.7   70.0     81.1   80.3     81.2   85.5     85.9   88.3
RJMCMC_TC      91.2   66.3     95.3   80.8     96.6   86.8     98.0   89.3

To conclude, RJMCMC markedly improves on the KEB methods in both inferring the network topology and identifying the internal dynamics. RJMCMC produces more accurate inference results; the generated networks are highly reliable and contain most of the true links in the ground truth.

C. Synthetic circadian clock network

In the above two simulations, the ground-truth models fall exactly within the proposed model class. Nevertheless, many real-world networks are nonlinear. Under our framework, linear models are used as an approximation in order to deal with unmeasurable nodes. To check the effectiveness of our method, a synthetic model of the circadian clock (Millar 10 [34]) was employed for testing. In addition, we compared our method with a state-of-the-art technique, iCheMA, which has been shown to outperform many existing inference methods, including hierarchical Bayesian regression (HBR), LASSO and the elastic net, through Monte Carlo simulations on the Millar 10 model [33]. Millar 10 describes a circadian clock consisting of 7 genes along with their associated proteins, amounting to 19 nodes in total. The system is driven by light signals. The detailed mathematical model can be found in [34].
The simulation aimed to produce synthetic microarray data. The time window for data collection was 44 hours and the sampling interval was 1 hour; as a consequence, only 44 data points were used for each trial. Most importantly, the protein data were not available for inference. Therefore, the network was inferred at the transcriptional level, describing the connectivity among the 7 clock genes. The model was simulated for four days of light-dark cycles (LD for one day) followed by three days of constant light (LL for one day). The simulation was repeated 50 times. To avoid transients due to the initial condition, the simulated data of the first two days were discarded. Time windows of LDLD (0h-44h), LDLL (24h-68h), LLLL (48h-92h) and steady state (72h-116h) were adopted for data collection. Considering that only 44 data points were available for inference, the length of the truncated impulse responses was set to 10. For the kernel methods, we resorted to [35] to calculate the confidence of the inferred links as $P(j \to i \mid Y) = \frac{\|w_j\|}{\|w\|}$.

The inference results are presented in Table V. RJMCMC_SS outperforms all the other methods in most cases. In particular, under time window LDLL, both the AUPREC and AUROC of RJMCMC_SS are above 70%. Since this time window contains the richest light transitions, this result indicates that RJMCMC is able to infer complex system dynamics. More importantly, the inference accuracy of RJMCMC is markedly improved compared with the KEB methods, which perform poorly on Millar 10: their inferred networks are unreliable and miss most of the true links. iCheMA is only slightly superior to RJMCMC_SS under time window LLLL and is outperformed by RJMCMC_SS in the other cases. The simulations imply that our method is reliable when dealing with real-world networks, especially in cases where full state measurements are unavailable.
Therefore, our method can be applied in a wide range of contexts, such as biological networks, power grids and communication systems.

TABLE V
INFERENCE RESULTS OF THE CIRCADIAN CLOCK MODEL

            LDLD             LDLL             LLLL             Steady state
            AUROC  AUPREC    AUROC  AUPREC    AUROC  AUPREC    AUROC  AUPREC
iCheMA      66.4   62.3      65.4   64.2      69.4   66.8      64.7   56.4
Kernel_DC   54.9   45.4      63.1   55.3      53.6   39.4      48.9   35.0
Kernel_SS   51.3   38.8      63.3   51.6      54.2   39.7      51.7   37.4
Kernel_TC   48.6   36.5      55.5   43.7      51.7   38.9      47.4   35.5
RJMCMC_DC   64.8   61.1      68.7   66.7      61.8   57.2      58.2   54.0
RJMCMC_SS   69.4   63.8      76.5   73.4      66.3   62.7      64.6   62.8
RJMCMC_TC   68.0   61.8      72.2   68.9      61.2   57.0      59.1   54.5

VI. CONCLUSION

This paper combines kernel-based system identification methods and RJMCMC to infer sparse networks. DSF models are used to describe the target network so that the information of hidden nodes is encoded via transfer functions. The models are expressed in a non-parametric way; by doing so, inference can be conducted without prior knowledge of the number of hidden nodes and their connectivity. The kernel machine is used to impose stable impulse responses. To explore the full Bayesian model sufficiently, RJMCMC is applied to draw samples from a space composed of subspaces of different dimensionality. By traversing these subspaces, RJMCMC greatly improves the accuracy of topology detection. Monte Carlo simulations demonstrate that our method is superior to the KEB methods. In particular, the proposed method achieves marked advantages over KEB when inferring synthetic biological networks.

Overall, the value of this approach is that it consistently generates reliable inference results and is robust to experimental conditions, including the number of data points, the type of topology and the noise level.
Given a sufficient data source, our method is able to infer most of the true links and is applicable to a wide range of real-world networks where full state measurements are not available. According to the simulations, our method can be used to study circadian clocks; for example, it can be applied to infer the Ca2+ signalling network of Arabidopsis.

The method does, however, have some limitations. The computational cost is heavy when dealing with large-scale networks. Furthermore, identification of continuous-time systems requires a high sampling frequency and equal sampling steps. In addition, the DSF is not well defined for stochastic differential equations (SDEs), since the Wiener process in the model is almost surely nowhere differentiable. Further work is required to extend the method to continuous-time networks described by SDEs.

REFERENCES
[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[2] G. Wahba, Spline Models for Observational Data. SIAM: Society for Industrial and Applied Mathematics, 1990.
[3] C. Bishop, Pattern Recognition and Machine Learning. Springer New York, 2006.
[4] K. P. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[5] T. Chen, H. Ohlsson, and L. Ljung, "On the estimation of transfer functions, regularizations and Gaussian processes - revisited," Automatica, vol. 48, no. 8, pp. 1525-1535, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109812001999
[6] G. Pillonetto, A. Chiuso, and G. De Nicolao, "Prediction error identification of linear systems: a nonparametric Gaussian regression approach," Automatica, vol. 47, no. 2, pp. 291-305, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109810004875
[7] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L.
Ljung, "Kernel methods in system identification, machine learning and function estimation: a survey," Automatica, vol. 50, no. 3, pp. 657-682, 2014.
[8] G. Pillonetto and G. De Nicolao, "A new kernel-based approach for linear system identification," Automatica, vol. 46, no. 1, pp. 81-93, 2010.
[9] G. Bottegal, A. Y. Aravkin, H. Hjalmarsson, and G. Pillonetto, "Robust EM kernel-based methods for linear system identification," Automatica, vol. 67, pp. 114-126, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109816000376
[10] G. Pillonetto, M. H. Quang, and A. Chiuso, "A new kernel-based approach for nonlinear system identification," IEEE Transactions on Automatic Control, vol. 56, no. 12, pp. 2825-2840, Dec. 2011.
[11] T. Chen, M. Andersen, L. Ljung, A. Chiuso, and G. Pillonetto, "System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques," Automatica, vol. 59, no. 11, pp. 1-33, 2014.
[12] A. Chiuso and G. Pillonetto, "A Bayesian approach to sparse dynamic network identification," Automatica, pp. 1553-1565, 2012.
[13] G. Pillonetto and A. Chiuso, "Tuning complexity in regularized kernel-based regression and linear system identification: the robustness of the marginal likelihood estimator," Automatica, vol. 58, pp. 106-117, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109815002113
[14] P. Green, "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination," Biometrika, vol. 82, pp. 711-732, 1995.
[15] S. Brooks and B. Morgan, "Optimization using simulated annealing," Journal of the Royal Statistical Society, vol. 44, pp. 241-257, 1995.
[16] S. P. Brooks, N. Friel, and R. King, "Classical model selection via simulated annealing," Journal of the Royal Statistical Society, vol. 65, pp. 503-520, 2003.
[17] C. Andrieu, N. de Freitas, and A.
Doucet, "Reversible jump MCMC simulated annealing for neural networks," in Uncertainty in Artificial Intelligence Proceedings, 2000, pp. 11-18.
[18] C. Andrieu, N. de Freitas, and A. Doucet, "Robust full Bayesian learning for radial basis networks," Neural Computation, vol. 13, pp. 2359-2407, 2001.
[19] C. Andrieu and A. Doucet, "Joint Bayesian model selection and estimation of noisy sinusoids via reversible jump MCMC," IEEE Transactions on Signal Processing, vol. 47, pp. 2667-2676, 1999.
[20] T. Baldacchino, S. R. Anderson, and V. Kadirkamanathan, "Computational system identification for Bayesian NARMAX modelling," Automatica, vol. 49, pp. 2641-2651, 2013.
[21] J. Vermaak, C. Andrieu, A. Doucet, and S. J. Godsill, "Reversible jump Markov chain Monte Carlo strategies for Bayesian model selection in autoregressive processes," Journal of Time Series Analysis, vol. 25, pp. 785-809, 2004.
[22] G. O. Roberts and S. K. Sahu, "Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler," Journal of the Royal Statistical Society, vol. 59, pp. 291-317, 1997.
[23] A. A. Johnson, G. L. Jones, and R. C. Neath, "Component-wise Markov chain Monte Carlo: uniform and geometric ergodicity under mixing and composition," Statistical Science, vol. 28, pp. 360-375, 2013.
[24] D. A. van Dyk and X. Jiao, "Metropolis-Hastings within partially collapsed Gibbs samplers," Journal of Computational and Graphical Statistics, vol. 24, pp. 301-327, 2015.
[25] D. A. van Dyk and T. Park, "Partially collapsed Gibbs samplers: theory and methods," Journal of the American Statistical Association, vol. 103, pp. 790-796, 2008.
[26] D. I. Hastie, "Towards automatic reversible jump Markov chain Monte Carlo," Ph.D. dissertation, University of Bristol, 2005.
[27] R. Waagepetersen and D. Sorensen, "A tutorial on reversible jump MCMC with a view toward applications in QTL-mapping," International Statistical Review, vol. 69, no. 1, pp.
49-61, 2001. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1751-5823.2001.tb00479.x
[28] Y. Yuan, G. Stan, S. Warnick, and J. Gonçalves, "Robust dynamical network structure reconstruction," Automatica, vol. 47, pp. 1230-1235, 2011.
[29] Y. Yuan, A. Rai, E. Yeung, G. Stan, S. Warnick, and J. Gonçalves, "A minimal realization technique for the dynamical structure function of a class of LTI systems," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 301-311, June 2017.
[30] J. Gonçalves and S. Warnick, "Necessary and sufficient conditions for dynamical structure reconstruction of LTI networks," IEEE Transactions on Automatic Control, vol. 53, no. 7, pp. 1670-1674, 2008.
[31] F. Dinuzzo, "Kernels for linear time invariant system identification," SIAM Journal on Control and Optimization, vol. 53, no. 5, pp. 3299-3317, 2015. [Online]. Available: https://doi.org/10.1137/130920319
[32] C. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 2004.
[33] A. Aderhold, D. Husmeier, and M. Grzegorczyk, "Approximate Bayesian inference in semi-mechanistic models," Statistics and Computing, vol. 27, p. 1003, 2017.
[34] A. Pokhilko, S. Hodge, K. Stratford, K. Knox, K. Edwards, A. Thomson, T. Mizuno, and A. Millar, "Data assimilation constrains new connections and components in a complex, eukaryotic circadian clock model," Molecular Systems Biology, vol. 6, pp. 1-10, 2010.
[35] A. Aderhold, D. Husmeier, and M. Grzegorczyk, "Statistical inference of regulatory networks for circadian regulation," Statistical Applications in Genetics and Molecular Biology, vol. 13, pp. 227-273, 2014.