Disco v ering Graphical Granger Causalit y Using the T runcating Lasso P enalt y Ali Sho jaie and George Mic hailidis Departmen t of Statistics, Universit y of Mic higan Abstract Comp onen ts of biological systems interact with eac h other in order to carry out vital cell functions. Suc h information can b e used to impro v e estimation and inference, and to obtain b etter insights into the underlying cellular mec hanisms. Disco vering regulatory interactions among genes is therefore an im- p ortan t problem in systems biology . Whole-genome expression data ov er time pro vides an opportunity to determine how the expression levels of genes are affected b y changes in transcription levels of other genes, and can therefore b e used to discov er regulatory interactions among genes. In this pap er, we prop ose a no vel penalization metho d, called trunc ating lasso , for estimation of causal relationships from time-course gene expression data. The proposed penalty can correctly determine the order of the un- derlying time series, and impro ves the p erformance of the lasso-type estimators. Moreov er, the resulting estimate pro vides information on the time lag betw een activ ation of transcription factors and their effects on regulated genes. W e provide an efficient algorithm for estimation of model parameters, and show that the proposed method can consisten tly discov er causal relationships in the large p , small n setting. The p erformance of the prop osed mo del is ev aluated fa vorably in simulated, as w ell as real, data examples. The proposed truncating lasso method is implemented in the R-pack age grangerTlasso and is a v ailable at www.stat.lsa.umich.edu/ ∼ sho jaie . 1 In tro duction A critical problem in systems biology is to disco ver causal relationships among comp onents of biological sys- tems. Gene regulatory netw orks, metabolic net w orks and cell signalling net works capture causal relationships in cells. Discov ery of causal relationships ma y be only p ossible through carefully designed exp erimen ts, which can b e challenging. Ho w ever, gene regulation is carried out b y binding of protein pro ducts of transcription factors to cis -regulatory elemen ts of genes. Suc h regulatory mechanisms are evident if the expression lev els of gene X is affected by changes in expression levels of gene Y . Therefore, time course gene expression data can be used to disco ver causal relationships among genes and construct the gene regulatory net work. Differen t metho ds hav e b een developed to infer causal relationships from time series data, including dynamic Bay esian Netw orks ( Murphy , 2002 ) and Granger causalit y ( Granger , 1969 ). In dynamic Ba yesian Net works (DBNs) the state space of Bay esian Netw orks is expanded by replicating the set of v ariables in the netw ork b y the num b er of time p oints. Cyclic net works are then transformed to directed acyclic graphs (D AGs) b y breaking down cycles into in teractions b etw een v ariables at t wo differen t time p oints. Ong et al. ( 2002 ) and P errin et al. ( 2003 ) among others ha ve applied DBNs to infer causal relationships among comp onen ts of biological systems. On the other hand, the concept of Granger causality states that gene X is Granger-causal for gene Y if the autoregressive mo del of Y based on past v alues of both genes is significan tly more accurate than the mo del based on Y alone. This implies that c hanges in expression levels of genes could b e explained by expression lev els of their transcription factors. Therefore, statistical metho ds can b e applied to time-course gene expression observ ations to estimate Granger causality among genes. Exploring Granger causality is closely related to analysis of vector autoregressiv e (V AR) mo dels, whic h are widely used in econometrics. Y amaguchi et al. ( 2007 ) and Opgen-Rhein and Strimmer ( 2007 ) employ ed 1 V AR mo dels to learn gene regulatory netw orks, while F ujita et al. ( 2007 ) prop osed a sparse V AR mo del for b etter performance in cases when the num b er of genes, p is large compared to the sample size, n . Similar sparse models hav e also b een considered b y Mukhopadhy a y and Chatterjee ( 2007 ). Zou and F eng ( 2009 ) compared the p erformance of DBNs and Granger causality metho ds for estimation of causal relationships and concluded that the p erformance of the tw o approaches dep end on the length of the time series as w ell as the sample size. The findings of Zou and F eng ( 2009 ) emphasizes the need for sparse mo dels in cases where the sample size is small. In particular, when p n , p enalized methods often provide b etter prediction accuracy . Arnold et al. ( 2007 ) applied the lasso (or ` 1 ) p enalty to discov er the structure of graphical mo dels based on the concept of Granger causality and studied the relationship b etw een differen t k ey p erformance indicators in analysis of sto c k prices. Asymptotic and empirical p erformances of the lasso p enalt y for disco very of graphical mo dels ha ve b een studied b y many researchers and a num b er of extensions of the original p enalty hav e b een prop osed (we refer to these v arian ts of the lasso p enalty as “lasso-type” penalties). In particular, to reduce the bias in the lasso estimates, Zou ( 2006 ) proposed the adaptiv e lasso penalty , and show ed that for fixed p , if appropriate w eights are used, the adaptive lasso p enalty can achiev e v ariable selection consistency even if the so-called irr epr esentability assumption is violated. In fact, it can also b e shown that if initial weigh ts are derived from regular lasso estimates, the adaptive lasso penalty is also consisten t for v ariable selection in high dimensional sparse settings ( Sho jaie and Mic hailidis , 2010b ). The lasso estimate of the graphical Granger mo del may result in a mo del in which X is considered to influence Y in a num b er of different time lags. Suc h a mo del is hard to interpret and inclusion of additional co v ariates in the mo del may result in po or mo del selection performance. Lozano et al. ( 2009 ) ha ve recently prop osed to use a group lasso p enalt y in order to obtain a simpler Granger graphical mo del. The group lasso p enalt y takes the a v erage effect of X on Y o ver differen t time lags and considers X to be Granger-causal for Y if the av erage effect is significan t. How ever, this results in significant loss of information, as the time difference b et w een activ ation of X and its effect on Y is ignored. Moreov er, due to the av eraging effect, the sign of effects of the v ariables on eac h other can not b e determined from the group lasso estimate. Hence, whether X is an activ ator or a suppressor for Y and/or the magnitudes of its effect remain unknown. In this pap er, we prop ose a no v el trunc ating lasso p enalty for estimation of graphical Granger models. The prop osed penalty has tw o main features: (i) it automatically determines the order of the V AR model, i.e. the n umber of effective time lags and (ii) it p erforms mo del simplification by reducing the num b er of co v ariates in the mo del. W e prop ose an efficient iterative algorithm for estimation of model parameters, pro vide an error-based c hoice for the tuning parameter and prov e the consistency of the resulting estimate, b oth in terms of sign of the effects, as well as, v ariable selection prop erties. The prop osed metho d is applied to simulated and real data examples, and is shown to pro vide b etter estimates than alternativ e penalization metho ds. The remainder of the pap er is organized as follows. Section 2 , starts with a discussion of the use of lasso- t yp e p enalties for estimation of DA Gs as well as a review of the concept of graphical Granger causality . The prop osed truncating lasso p enalty and asymptotic prop erties of the estimator are discussed in section 2.3 , while the optimization algorithm is presented in section 2.5 . Results of sim ulation studies are presen ted in section 3.1 and applications of the prop osed mo del to time course gene expression data on E-coli and human cancer cell line (HeLa cells) are illustrated in sections 3.2 and 3.3 , resp ectively . A summary of findings and directions for future research are discussed in section 4 . 2 Mo del and Metho ds 2.1 Graphical Mo dels and Penalized Estimates of D A Gs Consider a graph G = ( V , E ), where V corresp onds to the set of no des with p elemen ts and E ⊂ V × V is the edge set. The no des of the graph represent random v ariables X 1 , . . . , X p and the edges capture associations amongst them. An edge is called directed if ( i, j ) ∈ E ⇒ ( j, i ) / ∈ E and undirected if ( i, j ) ∈ E ⇐ ⇒ ( j, i ) ∈ E . W e represent E through the adjacency matrix A of the graph, a p × p matrix whose ( j, i ) − th entry indicates 2 whether there is an edge (and its weigh t) betw een nodes j and i . Causal relationships among v ariables are represen ted b y directed graphs where E consists of only directed edges. Let pa i denote the set of p ar ents of no de i and for j ∈ pa i , write j → i . The causal effect of random v ariables in a DA G can b e explained using structur al e quation mo dels ( Pearl , 2000 ), where each v ariable is mo deled as a (nonlinear) function of its parents. The general form of these mo dels is given by: X i = f i (pa i , Z i ) , i = 1 , . . . , p (2.1) The random v ariables Z i are the laten t v ariables representing the unexplained v ariation in eac h node. T o mo del the asso ciation among nodes of a DA G, w e consider a simplification of ( 2.1 ) where f i is linear. More sp ecifically , let ρ ij represen t the effe ct of gene j on i for j ∈ pa i , then X i = X j ∈ pa i ρ ij X j + Z i , i = 1 , . . . , p (2.2) In the sp ecial case, where the random v ariables on the graph are Gaussian, equations ( 2.1 ) and ( 2.2 ) are equiv alen t in the sense that ρ ij are the coefficients of the linear regression mo del of X i on X j , j ∈ pa i . It is kno wn in the normal case that ρ ij = 0 , j / ∈ pa i . F or the case of DA Gs, it can be sho wn that when the v ariables inherit a natural ordering, the likelihoo d function can be directly written in terms of the adjacency matrix of the D AG. It then follo ws that the p enalized estimate of the adjacency matrix can b e found by solving p − 1 p enalized regression problems. T o see this, let X b e the n × p matrix of observ ations and S = n − 1 X T X b e the empirical co v ariance matrix. Then, the estimate of the adjacency matrix of DA Gs under the general w eighted lasso (or ` 1 ) p enalty , is found b y solving the following ` 1 -regularized least squares problems for i = 2 , . . . , p ˆ A i, 1: i − 1 = argmin θ ∈ R i − 1 n − 1 kX i − X 1: i − 1 θ k 2 2 + λ i i − 1 X j =1 | θ j | w ij (2.3) where A i, 1: i − 1 denotes the first i − 1 elemen t of the i th ro w of the adjacency matrix and w ij represen ts the w eights. F or the lasso p enalty w ij = 1 and in case of adaptive lasso w ij = 1 ∨ | ˜ A ij | − 1 where ˜ A are the initial estimates obtained with the regular lasso p enalty . 2.2 Graphical Granger Causality Let X 1: T = { X } T t =1 and Y 1: T = { Y } T t =1 , b e tra jectories of tw o sto chastic pro cesses X and Y up to time T and consider the following tw o regression mo dels: Y T = AY 1: T − 1 + B X 1: T − 1 + ε T (2.4) Y T = AY 1: T − 1 + ε T (2.5) Then X is said to be Granger-causal for Y if and only if the model 2.4 results in significan t improv ements o ver mo del 2.5 . Graphical Granger models extend the notion of Granger causalit y among tw o v ariables to p v ariables. In general, let X 1 , . . . , X p b e p sto chastic processes and denote b y X the rearrangement of these sto c hastic pro cesses in to a v ector time series, i.e. X t = ( X t 1 , . . . , X t p ) T . W e consider mo dels of the form X T = A 1 X T − 1 + . . . A T − 1 X 1 + ε T . (2.6) In the graphical Granger mo del, X t j is said to b e Granger-causal for X T i if the corresp onding coefficient, A t i,j is statistically significant. In that case, there exists an edge X t j → X T i in the graphical mo del with T × p no des. Suc h a mo del corresponds to a DA G with T × p v ariables, in which the ordering of the set of p -v ariate v ectors X 1 , . . . , X T is determined by the temp oral index and the ordering among the elemen ts of each v ector is arbitrary . Lasso-t yp e estimates of DA Gs can therefore be used in the con text of graphical Granger mo dels in order to estimate the effects of v ariables on each other. The mo del in ( 2.6 ) is also equiv ale n t to V ector Auto-Regressiv e (V AR) models ( L¨ utk ep ohl , 2005 , Chapter 2), whic h ha ve b een used for estimation of graphical Granger causality by a n umber of researc hers, including Arnold et al. ( 2007 ). 3 2.3 T runcating Lasso for Graphical Granger Mo dels Consider a graphical mo del with p v ariables, observed ov er T time p oints, and let d b e the order of the V AR mo del or the effectiv e num b er of time lags (in ( 2.6 ) d = T − 1). As in section 2.1 , let X t denote the design matrix corresponding to t -th time p oint, and X t i b e its i -th column. The truncating l asso estimate of the graphical Granger model is found b y solving the following estimation problem for i = 1 , . . . , p : argmin θ t ∈ R p n − 1 kX T i − d X t =1 X T − t θ t k 2 2 + λ d X t =1 Ψ t p X j =1 | θ t j | w t j (2.7) Ψ 1 = 1 , Ψ t = M I {k A ( t − 1) k 0
0 , p = p ( n ) = O ( n a ) and | pa i | = O(n b ) , wher e sn 2 b − 1 log n = o (1) as n → ∞ . Mor e over, supp ose that ther e exists ν > 0 such that for al l n ∈ N and al l i ∈ V , V ar X T i | X T − d : T − 1 1: p ≥ ν and ther e exists δ > 0 and some ξ > b such that for every i ∈ V and for every j ∈ pa i , | π ij | ≥ δ n − (1 − ξ ) / 2 , wher e π ij is the p artial c orr elation b etwe en X i and X j after r emoving the effe ct of the r emaining variables. Assume that λ dn − (1 − ζ ) / 2 for some b < ζ < ξ and d > 0 , and the initial weights ar e found using lasso estimates with a p enalty p ar ameter λ 0 that satisfies λ 0 = O ( p log p/n ) . Also, for some lar ge p ositive numb er g , let Ψ t = g exp ( nI {k A ( t − 1) k 0 < p 2 β / ( T − t ) } ) (i.e. M = g e n ). Then if true c ausal effe cts diminish over time, 10 (i) With pr ob ability c onver ging to 1, no additional Gr anger-c ausal effe cts ar e include d in the mo del and the signs of such effe cts ar e c orr e ctly estimate d. (ii) With pr ob ability asymptotic al ly lar ger than 1 − β , true Gr anger-c ausal effe cts and the or der of the V AR mo del ar e c orr e ctly determine d. Pr o of. If β = 0, inclusion of the true causal effect, exclusion of incorrect effects and consistency of signs of effects follow from Theorem 3 of Sho jaie and Michailidis ( 2010b ). Since β has no effect on the probability of false p ositive, this prov es (i). F or any giv en β > 0, suppose t 0 is the smallest t for whic h k A ( t − 1) k 0 < p 2 β / ( T − t ). Then for t < t 0 Ψ t = 1 and has no effect on the estimate. Let t ≥ t 0 . Then using the KKT conditions, a co efficient is included in the w eighted lasso estimate only if | 2 n − 1 ( X t j ) T ( X T i − X t θ t ) | > Ψ t λw t j . How ev er, ( X t j ) T ( X T i − X t θ t ) is sto c hastically smaller than ( X t j ) T X T i , which is in turn a p olynomial function of n . On the other hand, λ and w t j are also p olynomial functions of n , whereas Ψ t increases exp onentially as n → ∞ . Hence, for all j = 1 , . . . , p and t ≥ t 0 , there exists an n such that | 2 n − 1 ( X t j ) T ( X T i − X t θ t ) | < Ψ t λw t j and therefore, A t = 0 , t ≥ t 0 . How ever, since the num ber of true causal effects diminish ov er time, the total num b er of true edges in time lags t ≥ t 0 is less than β . This prov es the first part of (ii). Finally , to prov e that the order of V AR is correctly estimated, i.e. d = t 0 − 1, we consider tw o com- plemen tary even ts: d < t 0 − 1 and d > t 0 − 1. Prior to t 0 , false p ositives o ccur with exp onentially small probabilit y , hence, the probabilit y that d < t 0 − 1, is negligible. On the other hand, d > t 0 − 1 only if true edges are not included in ˆ A t 0 and as a result k ˆ A ( t 0 − 1) k 0 < p 2 β / ( T − t 0 ). But false negativ es o ccur if true edges v anish in the adaptive lasso estimate. Ho wev er, adaptive lasso finds the true edges with exponentially large probabilit y , hence, P ( d < t 0 − 1) ≥ 1 − β − O (exp( − cn d )) for constants c and d . This completes the pro of. References Arnold, A., Liu, Y., and Ab e, N. (2007). T emporal causal mo deling with graphical granger metho ds. In Pr o c e e dings of the 13th ACM SIGKDD international c onfer enc e on Know le dge disc overy and data mining , pages 66–75. ACM New Y ork, NY, USA. de Leeu w, J. (1994). Blo ck-relaxation algorithms in statistics. In Information System and Data Analysis , pages 308–325. F riedman, J., Hastie, T., and Tibshirani, R. (2008). Regularization Paths for Generalized Linear Mo dels via Co ordinate Descen t. Dep artment of Statistics, Stanfor d University, T e ch. R ep . F ujita, A., Sato, J., Gara y-Malpartida, H., Y amaguchi, R., Miyano, S., Sogay ar, M., and F erreira, C. (2007). Mo deling gene expression regulatory netw orks with the sparse vector autoregressiv e mo del. BMC Systems Biolo gy , 1 (1), 39. Granger, C. (1969). Inv estigating causal relations b y econometric mo dels and cross-spectral methods. Ec ono- metric a , page 424. Kao, K., Y ang, Y., Boscolo, R., Sabatti, C., Royc ho wdhury , V., and Liao, J. (2004). T ranscriptome-based determination of multiple transcription regulator activities in Esc herichia coli b y using net work comp onent analysis. Pr o c e e dings of the National A c ademy of Scienc es , 101 (2), 641–646. Lozano, A., Abe, N., Liu, Y., and Rosset, S. (2009). Grouped graphical Granger modeling for gene expression regulatory net works discov ery. Bioinformatics , 25 (12), i110. L ¨ utk ep ohl, H. (2005). New intr o duction to multiple time series analysis . Springer. Mukhopadh ya y , N. and Chatterjee, S. (2007). Causality and path wa y searc h in microarray time series exp erimen t. Bioinformatics , 23 (4), 442. 11 Murph y , K. (2002). Dynamic Bayesian networks: r epr esentation, infer enc e and le arning . Ph.D. thesis, Univ ersity Of California. Ong, I., Glasner, J., P age, D., et al. (2002). Mo delling regulatory pathw ays in E. coli from time series expression profiles. Bioinformatics , 18 (Suppl 1), S241–S248. Opgen-Rhein, R. and Strimmer, K. (2007). Learning causal net works from systems biology time course data: an effective model selection pro cedure for the v ector autoregressiv e pro cess. BMC bioinformatics , 8 (Suppl 2), S3. P earl, J. (2000). Causality: Mo dels, R e asoning, and Infer enc e . Cambridge Univ Press. P errin, B., Ralaiv ola, L., Mazurie, A., Bottani, S., Mallet, J., and d’Alche Buc, F. (2003). Gene net w orks inference using dynamic Bay esian net works. Bioinformatics , 19 (90002), 138–148. Sam b o, F., Di Camillo, B., and T offolo, G. (2008). CNET: an algorithm for rev erse engineering of causal gene net works. In NETT AB2008. V ar enna, Italy . Sho jaie, A. and Michailidis, G. (2009). Analysis of Gene Sets Based on the Underlying Regulatory Netw ork. Journal of Computational Biolo gy , 16 (3), 407–426. Sho jaie, A. and Michailidis, G. (2010a). Netw ork Enrichmen t Analysis in Complex Experiments. Stat. App. in Genetics and Mol. Biolo gy , 9 (1), Article 22. Sho jaie, A. and Michailidis, G. (2010b). Penalized Likelihoo d Metho ds for Estimation of sparse high dimen- sional directed acyclic graphs. Biometrika (in press). Tseng, P . (2001). Conv ergence of a block co ordinate descent metho d for nondifferentiable minimization. Journal of optimization the ory and applic ations , 109 (3), 475–494. Whitfield, M., Sherlock, G., Saldanha, A., Murray , J., Ball, C., Alexander, K., Matese, J., Perou, C., Hurt, M., Brown , P ., et al. (2002). Iden tification of genes p eriodically expressed in the human cell cycle and their expression in tumors. Mole cular Biolo gy of the Cel l , 13 (6), 1977. Y amaguchi, R., Y oshida, R., Imoto, S., Higuchi, T., and Miyano, S. (2007). Finding mo dule-based gene net works with state-space mo dels-Mining high-dimensional and short time-course gene expression data. IEEE Signal Pr o c essing Magazine , 24 (1), 37–46. Zou, C. and F eng, J. (2009). Granger causality vs. dynamic Ba yesian netw ork inference: a comparative study. BMC bioinformatics , 10 (1), 122. Zou, H. (2006). The Adaptive Lasso and Its Oracle Prop erties. J of the A meric an Statistic al Asso ciation , 101 (476), 1418–1429. 12
Comments & Academic Discussion
Loading comments...
Leave a Comment