Estimating cellular pathways from an ensemble of heterogeneous data sources

Alexander M. Franks†, Florian Markowetz‡ and Edoardo Airoldi†

† Department of Statistics, Harvard University
‡ Cancer Research UK Cambridge Institute

This work was supported, in part, by NIH grant R01 GM-096193, NSF CAREER grant IIS-1149662, and by MURI award W911NF-11-1-0036 to Harvard University. EMA is an Alfred P. Sloan Research Fellow. Address correspondence to: airoldi@fas.harvard.edu.

Abstract

Building better models of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of high-throughput studies. Moreover, the available data sources are heterogeneous and need to be combined in a way specific for the part of the pathway in which they are most informative. Here, we present a compartment-specific strategy to integrate edge, node and path data for the refinement of a network hypothesis. Specifically, we use a local-move Gibbs sampler for refining pathway hypotheses from a compendium of heterogeneous data sources, including novel methodology for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast S. cerevisiae.

Contents

1 Introduction
2 Integrating high-dimensional responses of a cellular pathway
2.1 A compartment map defines context-specific data contributions
2.2 Modeling high-dimensional data for nodes, edges and paths
3 Analysis of the pheromone response pathway in S. cerevisiae
3.1 Exploratory data analysis of individual data types
3.2 Validation of the integrated analysis
3.3 Inferring cross-talk with other pathways
3.4 Performance assessment on simulated data
4 Discussion
4.1 Concluding remarks
A Supplementary results

1 Introduction

Cellular mechanisms are driven by interactions between DNA, RNA, and proteins working together in cellular pathways. However, the current knowledge of information flow in the cell is still very incomplete (Kirouac et al., 2012). Even in well established signaling pathways studied for decades in model organisms, newer approaches can discover novel components (Müller et al., 2005) or cross-talk with other pathways (McClean et al., 2007). In cancer, finding pathways underlying disease development can lead to new drug targets (Balbin et al., 2013). This makes the dissection of cellular pathways one of the major challenges of systems biology and functional genomics.

One of the main obstacles to using high-throughput data in refining known pathway models is the gap between the relatively unbiased and hypothesis-free nature of generating genome-scale datasets and the need for very focused, hypothesis-driven research to test biological models in small or medium scale experiments (Hibbs et al., 2008). While researchers in computational biology usually start with a collection of data and reconstruct pathways from it, experimental biologists often start with a specific pathway hypothesis in mind and try to reconcile it with the evidence from high-throughput screens.

Here, we contribute to bridging this gap by introducing a comprehensive data integration strategy to refine a given pathway hypothesis. Our approach is characterized by three key features: First, we start with a specific pathway model and assess how well it is supported in a collection of complementary data sets. These data sets are heterogeneous and informative for distinct cellular locations.
Second, we exploit this fact by introducing a compartment-specific probabilistic model, where data types are only used for reconstructing the parts of a pathway they are informative about. Third, we explicitly include node properties in our model. This allows us to use data like protein phosphorylation states or protein domains, which have so far been under-utilized for pathway structure learning (Ryan et al., 2013).

In this paper we show that our modeling approach can assist experimentalists in planning future studies by assessing which parts of a biological model are not well supported by data, and by proposing testable extensions and refinements of a given pathway hypothesis. We demonstrate the power of our approach in a case study in the yeast S. cerevisiae.

Related work. Pathway reconstruction is a well established field in computational biology (Hyduke and Palsson, 2010; Markowetz and Spang, 2007). Several features distinguish our pathway refinement methodology from existing network reconstruction methods.

Comprehensive data integration strategies on large data collections were shown to be very successful in predicting protein function and interactions (Guan et al., 2012; Llewellyn and Eisenberg, 2008; Guan et al., 2008; Myers et al., 2005). These methods are very helpful for describing the global landscape of protein function, but offer less insight into individual molecular mechanisms and pathways. Our approach differs from methods to refine pathway hypotheses from expression profiles of down-stream regulated genes (Gat-Viks and Shamir, 2007), because we integrate heterogeneous data sources in a compartment-specific way.

We also differ from previous research on de-novo pathway reconstruction. These methods can be classified by how they use information about edges, paths and nodes in the pathway diagram for structure learning.
• Most approaches incorporate evidence for individual edges in the pathway diagram using phenotypic profiles (Mulder et al., 2012; Wang et al., 2012) or gene expression measurements (Li et al., 2013; Balbin et al., 2013; Schäfer and Strimmer, 2005a; Friedman, 2004; Segal et al., 2003), sometimes supplemented by additional data sources like transcription factor binding data (Bernard and Hartemink, 2005; Werhli and Husmeier, 2007) or protein-protein interactions (Gitter et al., 2013; Nariai et al., 2004; Segal et al., 2003). Other studies completely rely on protein-protein interactions to predict pathways (Mazza et al., 2013; Scott et al., 2006).
• Cause-effect relationships indicating paths from perturbed genes to observed effects are exploited in methods like SPINE (Ourfali et al., 2007), physical network models (Yeang et al., 2005), nested effects models (Wang et al., 2013; Markowetz et al., 2007; Tresch and Markowetz, 2008; Fröhlich et al., 2007, 2008) and others (Lo et al., 2012; Yip et al., 2010), with applications including DNA damage repair (Workman et al., 2006) and cancer signalling (Knapp and Kaderali, 2013; Stelniec-Klotz et al., 2012).
• Node information, i.e. features of individual proteins or genes, has been found useful for assigning proteins to pathways (Hahne et al., 2008; Fröhlich et al., 2008) but has so far been under-utilized in reconstructing pathway structure (Ryan et al., 2013).

Our method differs from de-novo pathway reconstruction in that we start with a hypothesis pathway and identify which hypothesized edges are supported by the data. We also differ from other methods which evaluate formal one and two sample network hypothesis tests (Yates and Mukhopadhyay, 2013).
Our goal is not to determine explicitly whether our initial hypothesis is "correct"; on the contrary, we assume a priori that any initial hypothesis can be further refined and improved upon. In the spirit of FDR, we provide a list of edge probabilities that can assist experimentalists in their future studies. We assess which parts of an existing biological model are not well supported by the data, and we suggest new edges which are supported by the data but which are not part of the original hypothesis. Further, we are the first to integrate data about edges and paths as well as nodes in the pathway diagram.

Overview. We describe a compartment-specific probabilistic graphical model for posterior inference on cellular pathways in section 2, which can extend and refine a given biological model and predict novel parts of the pathway graph. Our model comprehensively integrates the three general types of data on edges, paths, and nodes. We demonstrate the utility of our methods in a case study in S. cerevisiae (section 3) by first exploring the information content in different data sources individually (section 3.1) and then evaluating results of posterior draws using both full data and leave-one-out data (section 3.2).

2 Integrating high-dimensional responses of a cellular pathway

Given a set of gene products, i.e., putative pathway members, we infer an undirected network model using a local-move Gibbs sampler. The pathway model is defined in terms of $N$ nodes and the edges between pairs of nodes, $(n, m)$. Each edge is encoded by a binary random variable, $X_{nm}$. The collection of edge-specific random variables defines the adjacency matrix, $X$, of the pathway model.

Parameter estimation and posterior inference. The adjacency matrix $X$ corresponding to the pathway model is latent since we cannot directly observe the edges.
Thus, the primary goal of our analysis is to do posterior inference on the adjacency matrix, $X$, from a collection of $M$ data sets, $Y_{1:M}$. Although we treat $X$ as latent, we differ from de-novo pathway reconstruction by incorporating an informative hypothesis pathway which we use to train the models for data sets $Y_{1:M}$ (see Section 3). By Bayes' rule, the posterior distribution on a pathway model,

$$P(X \mid Y_{1:M}, \Theta) \propto P(X \mid \Theta) \cdot P(Y_{1:M} \mid X, \Theta), \tag{2.1}$$

is proportional to the product of the prior distribution on the pathway and the likelihood of the data. Here, $\Theta$ is a collection of prior parameters introduced below.

We use a local Gibbs sampling strategy to sample pathway models from the posterior distribution in Equation 2.1. The sampler explores the space of pathway models by adding or removing edges in turn, one at a time. Specifically, the edge $X_{nm}$ between gene products $(n, m)$ is sampled according to a Bernoulli distribution, with probability of success

$$P(X_{nm} = 1 \mid X_{(-nm)}, Y_{1:M}, \Theta), \tag{2.2}$$

where $X_{(-nm)}$ represents the set of edges without $X_{nm}$.

2.1 A compartment map defines context-specific data contributions

We use five complementary data types: physical binding of protein pairs (including yeast two-hybrid, mass spectrometry, and literature-curated data), transcription factor-DNA binding assays, gene knockout data, gene co-expression data, and node information (including protein domains and differential phosphorylation arrays). Importantly, different data sets can be very informative in specific cellular locations while completely uninformative in others. Thus, before we define the data likelihoods in section 2.2, it is essential to exploit this fact in our model.
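The single-edge update of Equation 2.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: `log_post` is a hypothetical user-supplied function returning the unnormalized log posterior of Equation 2.1 for a full adjacency matrix, and the names `gibbs_sweep`, `edge_posterior` and the toy log-odds matrix `W` are introduced here for illustration only. The sketch relies on the fact that the conditional success probability equals the posterior weight with the edge present divided by the sum of the weights with the edge present and absent.

```python
import numpy as np


def gibbs_sweep(X, log_post, rng):
    """Resample every edge of the symmetric 0/1 matrix X once, in place."""
    N = X.shape[0]
    for n in range(N):
        for m in range(n + 1, N):
            logp = np.empty(2)
            for value in (0, 1):
                X[n, m] = X[m, n] = value   # local move: set one edge
                logp[value] = log_post(X)   # unnormalized log posterior
            # P(X_nm = 1 | rest) = w1 / (w0 + w1), evaluated in log space;
            # the clip only guards np.exp against overflow.
            diff = np.clip(logp[1] - logp[0], -700.0, 700.0)
            p_edge = 1.0 / (1.0 + np.exp(-diff))
            X[n, m] = X[m, n] = int(rng.random() < p_edge)
    return X


def edge_posterior(X0, log_post, n_sweeps=1000, seed=0):
    """Average the sampled adjacency matrices to estimate edge probabilities."""
    rng = np.random.default_rng(seed)
    X = np.array(X0, dtype=int)
    total = np.zeros(X.shape)
    for _ in range(n_sweeps):
        total += gibbs_sweep(X, log_post, rng)
    return total / n_sweeps


# Toy target (hypothetical): independent edges with log-odds W[n, m].
W = np.array([[0.0, 50.0, -50.0],
              [50.0, 0.0, 0.0],
              [-50.0, 0.0, 0.0]])
toy_log_post = lambda X: float(np.sum(np.triu(X * W, k=1)))
probs = edge_posterior(np.zeros((3, 3)), toy_log_post, n_sweeps=200)
```

In this toy setting the strongly supported edge (0, 1) is sampled in essentially every sweep, the strongly disfavored edge (0, 2) essentially never, and the uninformed edge (1, 2) about half the time. Evaluating the full log posterior twice per edge keeps the sketch short; a practical sampler would presumably recompute only the terms that depend on the edge being updated.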
We translate the expected compartment localization of a pair of gene products $(n, m)$ into a binary importance vector $\vec{b}_{nm}$, which drives the inference process by selecting the most informative data types for the compartments involved.

To instantiate the notion that different data are informative in different cellular locations, we introduce an additional modeling element: the compartment map, which contains three conceptual pathway compartments directly based on the organisation of the cell: first, the cell membrane, where receptor proteins sense signals from outside the cell; second, the cytoplasm, where protein cascades relay these signals to transcription factor proteins that enter the third compartment, the nucleus, to regulate the activity of target genes. The compartment map, $\mathcal{C}$, is a $5 \times 3$ binary matrix that associates the three pathway compartments with the five data types to indicate which data type is informative about molecular interactions in which compartments (see Table 1).

In particular, each data set is described by a pair $(Y_i, T_i)$, where $Y_i$ denotes the collection of measurements, and $T_i$ is a five-level factor that denotes the specific data type used to collect the data and indexes the relevant row of $\mathcal{C}$. We can now revise the form of the conditional distributions in Equation 2.2,

$$P(X_{nm} = 1 \mid X_{(-nm)}, Y_{1:M}, T_{1:M}, \mathcal{C}, \Theta) = \frac{L(X_{nm} = 1, X_{(-nm)} \mid Y_{1:M}, \Theta)}{L(X_{nm} = 1, X_{(-nm)} \mid Y_{1:M}, \Theta) + L(X_{nm} = 0, X_{(-nm)} \mid Y_{1:M}, \Theta)}. \tag{2.3}$$

Overloading notation, we let $\mathcal{C}_t(n, m)$ be an indicator reflecting whether data type $t$ is informative for the protein pair $(n, m)$, based on the compartment map and the localizations of proteins $n$ and $m$.
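As a small illustration of the indicator just defined, the sketch below encodes the five data-type rows of the compartment map with the values listed in Table 1 and evaluates the indicator for a protein pair. The rule that turns two localizations into a single indicator is an assumption made here for illustration: a data type is treated as informative for a pair when it is marked for at least one compartment that both proteins share. The text does not spell out this rule, and the function name `informative` is hypothetical.

```python
import numpy as np

DATA_TYPES = ["PPI", "TF", "Expr", "Kout", "Node"]
COMPARTMENTS = ["membrane", "cytoplasm", "nucleus"]

# Rows follow DATA_TYPES, columns follow COMPARTMENTS (values from Table 1).
C = np.array([
    [1, 1, 0],   # PPI
    [0, 0, 1],   # TF
    [0, 0, 1],   # Expr
    [0, 1, 1],   # Kout
    [0, 1, 0],   # Node
])


def informative(data_type, loc_n, loc_m, C=C):
    """Indicator C_t(n, m) under the shared-compartment assumption.

    loc_n and loc_m are collections of compartment names for the two
    proteins; a protein may belong to more than one compartment.
    """
    t = DATA_TYPES.index(data_type)
    shared = [COMPARTMENTS.index(c) for c in set(loc_n) & set(loc_m)]
    return int(any(C[t, c] for c in shared))
```

For example, `informative("PPI", {"cytoplasm"}, {"cytoplasm"})` evaluates to 1, while `informative("PPI", {"nucleus"}, {"nucleus"})` evaluates to 0, so PPI data would not influence edges between two purely nuclear gene products under this assumption.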
This indicator leads to the following likelihood specification:

$$L(X_{nm}, X_{(-nm)} \mid Y_{1:M}, \Theta) \propto \prod_{k=1}^{M} \Big[ P(Y_k \mid X_{nm}, X_{(-nm)}, T_k = t, \Theta)^{\mathcal{C}_t(n,m)} \times P(Y_k \mid X_{(-nm)}, T_k = t, \Theta)^{1 - \mathcal{C}_t(n,m)} \Big], \tag{2.4--2.5}$$

where the role of the indicator is to discard data collections from data types that are expected to carry little information about the protein pair of interest, according to information in $\mathcal{C}$. That is, for any pair $(n, m)$, $\mathcal{C}_t(n, m) = 0$ implies that data set $Y_k$ is conditionally independent of $X_{nm}$ given the rest of the pathway. In this case, the data in $Y_k$ have no effect on the conditional posterior probability of $X_{nm}$.

2.2 Modeling high-dimensional data for nodes, edges and paths

Data of different types need to be modeled differently. We focus on modeling five main data types: protein interaction data, protein-DNA binding data, gene co-expression data, gene perturbation data, and node attribute data (differential phosphorylation and protein domains). Below, we describe the likelihood functions corresponding to the main data types of interest.

Likelihood for protein interaction data. Here, we consider a single data set $Y_{N \times N}$ obtained with data type $T$ aimed at measuring physical protein binding events (PPI). We reduce the likelihood of the data, $Y$, to a function of the false positive and false negative rates, $\alpha$ and $\beta$. Given the pathway, $X$, we evaluate

$$L_{\text{ppi}}(Y \mid X, \alpha, \beta) = \alpha^{S_{10}} (1 - \alpha)^{S_{11}} \beta^{S_{01}} (1 - \beta)^{S_{00}}, \tag{2.6}$$

where $S_{xy}$ counts the number of edges for which $X_{nm} = x$ and $Y_{nm} = y$. For instance, $S_{10}$ is the number of false positives.

Likelihood for protein-DNA binding data. Here, we consider a single data set $Y_{N \times K}$ obtained with data type $T$ aimed at measuring transcription factor-DNA binding events (TF).
Rather than hybridization levels (for ChIP-chip) or peaks (for ChIP-seq), we model the p-values corresponding to binding events, which makes our model independent of the technology used to detect the binding event. We develop a mixture model for the p-values directly. Given the pathway, $X$, we expect to see a small p-value for protein $n$ binding nucleotide sequence $m$ whenever the edge $X_{nm}$ is present. On the contrary, the p-values are uniformly distributed under the null hypothesis of no binding events, $X_{nm} = 0$. We evaluate

$$L_{\text{tf}}(Y \mid X, \gamma) = \prod_{n,m} \Big[ \text{Uniform}(Y_{nm}) \cdot \mathbb{1}(X_{nm} = 0) + \text{Beta}(Y_{nm} \mid \gamma, 1) \cdot \mathbb{1}(X_{nm} = 1) \Big], \tag{2.7}$$

where $0 < Y_{nm} < 1$ (p-value), and $0 < \gamma < 1$. See a related beta-uniform mixture model introduced by Pounds and Morris (2003) in the context of multiple testing for differential expression.

Likelihood for knock-out data. Here we consider a data set $Y_{M \times N}$, where $Y_{mn}$ is the log2 fold-change in expression of gene $n$ when gene $m$ is knocked out. Let $Z_{mn}$ be a binary variable representing the existence of a directed path from gene $n$ to gene $m$, through a transcription factor. While we consider the set of undirected pathway models, we temporarily impute directionality using the fact that the cellular signal should flow from the cytoplasm to the nucleus. We model the knockout data as a mixture of normals:

$$L_{\text{ko}}(Y \mid X, \sigma_0, \sigma_1) = \prod_{n,m} \Big[ \text{Normal}(Y_{mn} \mid 0, \sigma_1) \, \mathbb{1}(Z_{mn} = 1) + \text{Normal}(Y_{mn} \mid 0, \sigma_0) \, \mathbb{1}(Z_{mn} = 0) \Big]. \tag{2.8}$$

The standard deviations for change in expression are represented by $\sigma_0$ (when there is no path between the knockout and a target) and $\sigma_1$ (when there is a path). The assumption is that $\sigma_1 > \sigma_0$, since we expect a larger change in expression of $n$ for knockout $m$ when $n$ and $m$ are connected in the pathway.

Likelihood for gene co-expression data. Here, we consider a single data set $Y_{N \times N}$ aimed at measuring gene expression.
Rather than hybridization levels (for microarrays) or the number of reads (for mRNA sequencing), we model correlations among the profiles of pairs of genes, which again makes our model independent of the details of the measurement technology. We develop a mixture model for the correlations directly. Given the pathway, $X$, we expect to see correlation between the expression profiles of two genes whenever they are co-regulated. Similarly to Schäfer and Strimmer (2005b), we use a mixture model for the distribution of the sample correlation coefficient $\hat{\rho} = y$ of the form

$$L_{\text{expr}}(Y \mid X, \delta, \kappa) = \prod_{n} \cdots$$

[...]

Here, $\operatorname{logit}(P_{nk})$ is linearly related to the presence of domains in neighboring genes. In both the normal and logistic regression cases, we fit the regression coefficients, $\vec{\lambda}$, using our initial pathway hypothesis. In the logistic model, we use a weakly-informative Cauchy prior for the coefficients (Gelman, 2008). This controls for any overfitting and separation problems that may occur.

Prior distribution on the space of pathway models. In this study our focus lies on assessing the extent to which the data support a pathway model $X$. We choose a block model prior $P(X)$ over binary matrices of size $N \times N$ with edge density fixed by compartment. In general, any informative prior distribution on graphs could be used here to encode biological knowledge (Isci et al., 2013; Mukherjee and Speed, 2008).

3 Analysis of the pheromone response pathway in S. cerevisiae

To demonstrate the efficacy of our approach, we examine the pheromone response MAPK pathway in the yeast S. cerevisiae. It offers the opportunity to combine a large collection of datasets with a solid understanding of the pathway structure. The pheromone pathway is the subject of intense research efforts in computational biology as well as experimental biology (Hara et al., 2012; Scott et al.
, 2006; Kofahl and Klipp, 2004) and shows cross-talk to other MAPK pathways (Nagiec and Dohlman, 2012; McClean et al., 2007; Gat-Viks and Shamir, 2007).

Initial pathway construction. To start our analysis in a way relevant to refining and extending existing knowledge of signaling pathways, we extracted a model of the pheromone response pathway from the summary of MAPK pathways (sce04010) in the database KEGG (Kanehisa and Goto, 2000) and combined it with known transcription factor (TF) targets from two independent studies (Simon et al., 2001; Ren et al., 2000).

We split the pathway into three parts: the membrane compartment containing the receptor proteins, the cytoplasm compartment containing the MAPK cascade to activate the transcription factors (TF), and the nuclear compartment containing the TFs and their targets. Figure 1-A depicts the pathway hypothesis. Proteins mediating between two compartments (like TFs) are contained in two sub-graphs and marked by grey boxes. TF targets that are also members of other compartments are indicated in bold.

[Figure 1 image not reproduced. Panel titles: A. Compartment-specific hypothesis (with posterior probabilities). B. Edge data: 1. PPI data (Reguly et al., 2006), cytoplasm, with 22 of the 28 hypothesis edges and 23 of the 182 non-edges having a reported interaction (45 interactions in total); 2. Co-expression data predicting pathway structure (nucleus; true positive rate vs. false positive rate); 3. Co-expression between TFs and their targets; 4. TF-DNA binding data (nucleus; true positive rate vs. false positive rate). C. Path data (Roberts et al., 2000), log2 fold-change for Membrane, Cytoplasm, TFs and Nucleus. D. Node data: 1. Differential phosphorylation (Gruhler et al., 2005), log-ratio pheromone treated/untreated; 2. Protein domains overrepresentation (SMART and PFAM; complete pathway and cytoplasm); 3. Multivariate relational regression, listed as domain, AIC, p(G). PFAM domains: PX 22.5 0.478; SH3_1 22.5 0.478; SH3_2 22.5 0.478; DSPC 22.5 0.434; PBD 26.6 0.001; PKINASE_TYR 28.9 0.001; PKINASE 28.9 0.004. SMART domains: RING 21.7 0.658; DSPC 21.7 0.524; RHOGEF 21.7 0.515; PX 24.5 0.149; SMALL GTPASE 24.5 0.138; PBD 26.8 0.015; S_TKC 30.1 0.004.]

Figure 1: Compartment-specific pathway hypothesis, posterior probabilities, and evaluation of support in the data. A. Pathway hypothesis and posterior edge probabilities for the yeast pheromone response pathway. The numbers by each edge reflect the "posterior probability". B. Edge data: (1) protein-protein interactions in the cytoplasm, (2) gene co-expression in the nucleus, (3) co-expression of TFs with their targets and between targets for STE12 and MCM1, and (4) TF binding data. C. Cause-effect data. D. Node data: (1) Differential phosphorylation, (2) Overrepresentation of protein domains in different compartments, (3) goodness-of-fit of auto-logistic models on protein domains from PFAM and SMART.

3.1 Exploratory data analysis of individual data types

Before inferring the full model from all data, we explored the information content in each type of data individually.

Protein-protein interactions (PPI). We compared data from several complementary high-throughput assays, all available from BioGRID (Stark et al., 2006), as well as a literature-curated dataset (Reguly et al., 2006). We analyzed the overlap between the protein interactions and the pathway hypothesis of Figure 1-A. None of the datasets are informative for the membrane and nuclear compartments. Surprisingly, in the cytoplasm compartment we found that all of the high-throughput datasets show at most 3 interactions between any of the proteins in the pathway. The situation was very different for the literature-curated data. Here, 45 interactions in the cytoplasm compartment covered 22 out of the 28 edges there (sensitivity > 78%, specificity > 87%, see Figure 1-B1).

TF-DNA binding data. We used the transcription factor binding data of Harbison et al. (2004), which is independent of our definition of TF targets in the pathway hypothesis.
The ROC in Figure 1-B4 shows a very clear signal to distinguish the targets posited in the biological model from all other pathway genes.

Co-expression data. For gene expression data, we examined datasets in which the pathway genes showed a significant difference in correlation structure from all other yeast genes (using the SPELL algorithm of Hibbs et al. (2007)), resulting in 20 datasets from 15 publications (including Roberts et al., 2000; Gasch et al., 2000; Brem and Kruglyak, 2005). Figure 1-B2 shows ROCs for predicting edges in the nuclear compartment for all datasets (grey lines) and the concatenated data (black line). No curve improves much on random prediction (the main diagonal). The reason is biological: because expression data are a poor surrogate for protein activity, TFs are often less well correlated to their targets than the targets are with each other (Figure 1-B3). For STE12, which regulates itself, all correlation coefficients exhibit a strong trend towards high positive correlation, whereas MCM1, which is not self-regulating, is far less strongly correlated to its targets than the targets are with each other. Thus, in general it is more informative to use the correlation between targets for inference, which is consistently high whether or not a TF is transcriptionally regulated itself.

Gene perturbation data. Paths in the graph are visible in cause-effect datasets (Hughes et al., 2000; Roberts et al., 2000). We find only very small effects of perturbations in the pathway on the expression of members of the membrane and cytoplasm compartments, including TFs. Figure 1-C summarizes this result for the Roberts et al. (2000) data. Very similar results were found for the Hughes et al. (2000) data. The four boxes correspond to the three compartments plus TFs. In each box, a vertical line corresponds to a perturbation in the pathway (some replicated).
The dots show the fold-changes of the pathway genes in this compartment. Only in the nuclear compartment are wide-spread large fold-changes visible. This observation motivates the construction of our likelihood around the presence of paths between the knockout and genes in the nuclear compartment (see section 2). In this way, when the knockout is far enough upstream, there is information about edges in the cytoplasm as well, even if the proteins there show no effect on the transcriptional level.

Protein phosphorylation. A first example of node information is protein phosphorylation. The study of Gruhler et al. (2005) assessed differential phosphorylation of proteins in response to pheromone. Figure 1-D1 shows the log-ratios between the pheromone treated and untreated conditions. Almost all proteins of the pheromone pathway measured by Gruhler et al. (2005) are up-regulated, which makes sense for a kinase cascade. The phosphorylation we observe for proteins corresponding to genes only attributed to the nuclear compartment in our model must be due to other kinase pathways in the cell.

We further assessed to what extent the differential phosphorylation is correlated with the pathway model by fitting an auto-logistic regression. As a measure of correlation we computed the variance explained, $R^2 = 0.76$, using the bootstrap. The variance explained by the auto-logistic regression was found statistically significant, when compared to the correlation of differential phosphorylation with randomized pathway models, $p \approx 0.062$, and with randomized protein permutations on the true pathway model, $p \approx 0.059$.

Protein domains. A second example of node information is protein domains. We retrieved protein domains from PFAM (Punta et al., 2012) and SMART (Letunic et al., 2012).
First, we sought to quantify which domains, if any, were over-represented in the set of proteins involved in the complete pheromone response pathway as well as in each compartment, in turn. Figure 1-D2 lists the domains that were found to be over-represented in the complete pathway and in the cytoplasm; darker shades of gray indicate a more significant p-value for the over-representation test.

Second, we sought to quantify to what extent the presence or absence of specific protein domains in proteins interacting with a given protein, P, was informative about the presence or absence of the same domain in P. This analysis was carried out using auto-logistic models, which summarize the informativeness of protein domains between interacting proteins on average, across all proteins in a given pathway. We fit auto-logistic regressions using each protein P in the cytoplasm compartment of the pheromone response pathway as a data point, and the presence or absence of domains $D_{1:K}$ in any one protein among those interacting with P as covariates. We fit multivariate models, which assume that the presence or absence of either the same or complementary domains is a factor that facilitates physical protein interactions. The two tables in Figure 1-D3 summarize the goodness of fit of the multivariate models, and report bootstrap p-values to assess the significance of the AIC scores. Figure 1-D3 shows the p-values obtained by fitting the multivariate auto-logistic regression to randomized pathway models. The domains identified by the multivariate models as putatively carrying signal about the pheromone pathway in the cytoplasm overlap with the domains identified by the over-representation analysis above; namely, P21 rho-binding domains, S-TKc domains, and tyrosine-specific catalytic domains.
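The covariates of the auto-logistic regressions described above can be sketched as follows: for each protein P and each domain, the covariate records whether any protein interacting with P carries that domain. This is a minimal sketch under stated assumptions, not the code used for the analysis: the function name `neighbor_domain_covariates` and the matrix layout (a symmetric 0/1 adjacency matrix `A` and a 0/1 protein-by-domain matrix `D`) are chosen here for illustration, and self-interactions are ignored.

```python
import numpy as np


def neighbor_domain_covariates(A, D):
    """Return Z with Z[p, k] = 1 if any neighbor of protein p has domain k.

    A : (N, N) symmetric 0/1 adjacency matrix of the pathway hypothesis.
    D : (N, K) 0/1 matrix, D[p, k] = 1 if protein p carries domain k.
    """
    A = np.array(A, dtype=int)      # copy, so the caller's matrix is untouched
    np.fill_diagonal(A, 0)          # a protein is not its own neighbor
    D = np.asarray(D, dtype=int)
    # (A @ D)[p, k] counts the neighbors of p that carry domain k.
    return (A @ D > 0).astype(int)


# Toy example: a three-protein chain 0 - 1 - 2 with two domains.
A_toy = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
D_toy = [[1, 0],    # protein 0 carries domain 0
         [0, 0],    # protein 1 carries no domain
         [0, 1]]    # protein 2 carries domain 1
Z_toy = neighbor_domain_covariates(A_toy, D_toy)
```

In the toy chain only the middle protein has neighbors carrying domains, so its row of `Z_toy` is the only non-zero one. Each column of `D` could then be regressed on `Z` with a logistic regression routine of choice; the choice of routine and prior is left open here.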
In summary, node attributes of the proteins involved in the pheromone response pathway are informative about mechanistic elements of the kinase cascade, across cellular localizations and in the cytoplasm. These findings suggest that integrating node attributes such as protein domains and cellular localization should increase the likelihood of pathway models that encode real biological signal about the inner workings of a target pathway.

Data integration. The previous results suggest that some datasets are indeed more informative in certain cellular locations. For example, protein interactions can explain wide parts of the kinase cascade in the cytoplasm, while co-expression is very strong for TF targets. However, no dataset is informative in all compartments: neither protein interactions nor knockout data can explain a complete pathway. The pheromone response pathway is an archetypical MAPK pathway, so we expect these observations also to be valid for other MAPK and signaling pathways. These results suggest that the compartment-specific modeling approach we take here is sensible. As a proof of concept, we use the results of exploratory data analysis to heuristically construct the compartment map, $\mathcal{C}$ (Table 1). Ultimately, we hope to infer the compartment map in a statistically principled way.

3.2 Validation of the integrated analysis

We evaluated how well the joint model, which combines all the complementary data types discussed above, supports the pathway hypothesis in Section 3 by sampling 1000 possible pathways using MCMC and tabulating the posterior probabilities over the edges.

The logistic regression model for domain data may be subject to over-fitting and separation. This can occur since there are many different protein domains present, yet the frequency of any single domain is fairly low.
To mitigate this issue, we used a Cauchy prior on the coefficients of the auto-logistic regression, which is a sensible default prior for this model (Gelman, 2008). Since the domain information in the pheromone pathway is relatively sparse, we also collected protein domain data from other MAPK pathways and used the hypothesized structure of those pathways to help learn the regression coefficients. Figure 1A includes the posterior probabilities for the edges in our initial hypothesis.

Table 1: The compartment map, C, associates pathway compartments with those data types that are informative for such compartments. Prior information is informative for all compartments.

| Data type | Membrane | Cytoplasm | Nucleus |
|---|---|---|---|
| PPI | 1 | 1 | 0 |
| TF | 0 | 0 | 1 |
| Expr | 0 | 0 | 1 |
| Kout | 0 | 1 | 1 |
| Node | 0 | 1 | 0 |
| Prior | 1 | 1 | 1 |

We also used a leave-one-out strategy to evaluate the predictive power of our model. We ran 37 separate simulations in which each node was, in turn, left out of the training pathway. The edges connected to this node were propagated to the neighboring nodes of the left-out node. We left out nodes rather than edges, because specifically leaving out an edge is equivalent to assuming that we know there is no edge present. We needed to construct our model in a way that encodes ignorance about the presence of an edge. Leaving out the nodes, instead of the edges, is one way of being agnostic about the presence of edges attached to that node. Only the coefficients in the auto-logistic regression were learned from the pathway hypothesis, so only the node likelihoods were affected. Table 2 shows the posterior probabilities, under simulations in which a node was removed from the prior hypothesis pathway, for edges involved in knockout experiments.

Table 2: Posterior edge probabilities for leave-one-out trials involving edges in knockout experiments. Since we use a leave-node-out scheme, there are two posterior probabilities for an edge (corresponding to which of the two node endpoints was left out for that particular simulation).

| Edge | Real data: Min | Real data: Average | Real data: Max | In silico: Min | In silico: Average | In silico: Max |
|---|---|---|---|---|---|---|
| STE11/STE7 | 0.01 | 0.01 | 0.01 | 0.26 | 0.31 | 0.36 |
| MCM1/STE2 | 0.00 | 0.01 | 0.02 | 0.03 | 0.12 | 0.20 |
| MF(ALPHA)1/STE2 | 0.00 | 0.00 | 0.01 | 0.01 | 0.19 | 0.36 |
| FUS1/STE12 | 0.80 | 0.83 | 0.87 | 0.39 | 0.66 | 0.92 |
| CDC42/STE18 | 0.00 | 0.00 | 0.00 | 0.00 | 0.16 | 0.31 |
| FUS3/STE12 | 0.01 | 0.01 | 0.01 | 0.01 | 0.10 | 0.19 |
| STE5/STE7 | 0.13 | 0.13 | 0.13 | 0.00 | 0.14 | 0.27 |
| BNI1/CDC42 | 0.49 | 0.55 | 0.61 | 0.20 | 0.24 | 0.28 |
| FAR1/MCM1 | 0.00 | 0.00 | 0.00 | 0.24 | 0.26 | 0.27 |
| FAR1/STE12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.37 | 0.73 |
| STE12/CHS1 | 0.80 | 0.82 | 0.83 | 0.01 | 0.02 | 0.03 |
| STE12/FIG2 | 0.84 | 0.84 | 0.85 | 0.04 | 0.24 | 0.43 |
| MCM1/AGA1 | 0.10 | 0.23 | 0.37 | 0.07 | 0.17 | 0.27 |
| STE12/FIG1 | 0.00 | 0.00 | 0.00 | 0.42 | 0.70 | 0.98 |
| STE12/CIK1 | 0.83 | 0.84 | 0.85 | 0.94 | 0.96 | 0.98 |
| STE12/KAR5 | 0.83 | 0.83 | 0.84 | 0.23 | 0.30 | 0.37 |
| STE12/GIC2 | 0.83 | 0.83 | 0.84 | 0.12 | 0.54 | 0.95 |
| MCM1/SWI4 | 0.00 | 0.00 | 0.00 | 0.16 | 0.29 | 0.41 |

Lastly, Figure 2 shows the precision-recall curve for our model, by compartment. For the membrane compartment, only the PPI data are informative, and weakly so. Thus, it performs the most poorly, although there are also by far the fewest genes in this compartment. By contrast, the nucleus and cytoplasm compartments both have high precision and recall.

Figure 2: Precision/Recall curves overall and by compartment for the MAPK pathway (left) and simulated data (right). For the real data, the membrane compartment, which has the fewest genes, performs poorly because only the PPI dataset is (weakly) informative there. The simulated data curve reflects the average Precision/Recall over 30 simulated datasets.

3.3 Inferring cross-talk with other pathways

With our model, we are also able to identify possible cross-talk between pathways.
In this paper, we focus on the pheromone response pathway, but our model can easily be used on other pathways, as long as we specify the relevant genes and transcription factors, and their corresponding cellular locations. For instance, the MAPK pathway consists of the pheromone sub-pathway, as well as the hypotonic shock, osmolarity and starvation sub-pathways. The degree of interaction between components of these MAPK pathways is not currently known. To identify cross-talk between the pheromone pathway and other MAPK pathways, we can simply include a new set of genes from the other sub-pathways and fit the model as usual. The results of the cross-talk evaluations are displayed in Table 3.

Table 3: Number of inferred edges between the pheromone pathway and one of the other three sub-pathways with posterior probabilities above 0.3.

| Compartment pair | osmolarity | hypotonic | starvation |
|---|---|---|---|
| cytoplasm-cytoplasm | 16 | 25 | 11 |
| cytoplasm-membrane | 12 | 17 | 8 |
| cytoplasm-nucleus | 22 | 17 | 3 |
| cytoplasm-tf | 0 | 2 | 3 |
| membrane-membrane | 2 | 2 | 2 |
| membrane-nucleus | 19 | 13 | 3 |
| membrane-tf | 0 | 1 | 2 |
| nucleus-nucleus | 4 | 7 | 0 |
| nucleus-tf | 1 | 6 | 10 |
| tf-tf | 0 | 0 | 2 |

3.4 Performance assessment on simulated data

We also fit the model to in silico data. We constructed the "true pathway" to match the hypothesized MAPK pheromone pathway of Figure 1A. That is, we fixed a pathway with the matching nodes and edges. We then generated in silico datasets from the models specified in Section 2. The one exception is the data generation for the node data. Here, we generate the presence of domains in such a way that short chains in the pathway are more likely to share domains than are random non-neighboring nodes. Specifically, we randomly chose chains of length 1 to 4 and added a common "domain" to every node in that chain. In this way, the domain data realistically reflect the notion that genes sharing common protein domains are more likely to interact.
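The chain-based generation of synthetic domains can be sketched as follows. This is only an illustration under stated assumptions: chain length is taken to count nodes, chains are grown as random walks along pathway edges that never revisit a node, and the function name, the seed handling and the toy pathway are all hypothetical rather than taken from the simulations reported here.

```python
import random
from typing import Dict, Set


def simulate_chain_domains(
    neighbors: Dict[str, Set[str]],
    n_domains: int,
    max_len: int = 4,
    seed: int = 0,
) -> Dict[str, Set[int]]:
    """Assign each synthetic domain to every node on a short random chain."""
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    domains: Dict[str, Set[int]] = {v: set() for v in nodes}
    for d in range(n_domains):
        # Chain length in nodes, drawn uniformly from 1..max_len (assumption).
        length = rng.randint(1, max_len)
        chain = [rng.choice(nodes)]
        while len(chain) < length:
            # Extend along pathway edges without revisiting nodes.
            options = sorted(neighbors[chain[-1]] - set(chain))
            if not options:
                break  # dead end: keep the shorter chain
            chain.append(rng.choice(options))
        for v in chain:
            domains[v].add(d)
    return domains


# Toy linear pathway A-B-C-D-E (hypothetical), with 20 synthetic domains.
toy_pathway = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"},
}
simulated = simulate_chain_domains(toy_pathway, n_domains=20, max_len=4, seed=1)
```

Because every domain is placed on a connected chain, neighboring nodes share domains more often than non-neighboring ones, which is the property the simulated node data are meant to have.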
The leave-one-out results are given in Table 2 beside the results for the true data. Figure 2 shows the precision-recall curve averaged over 30 simulated datasets. As in the true data analysis, the results demonstrate high precision and recall, especially in the "nucleus" and "cytoplasm". The "membrane" shows the worst precision-recall because we have the fewest informative data types there, but when simulating from the true data generating process, we still do quite well.

4 Discussion

The proposed methodology achieves fairly strong predictive power by integrating data in a compartment-specific way. Importantly, we are able to evaluate how each data type contributes to the overall likelihood of any edge. Since each data type independently contributes to the probability of an edge, we can compute the fraction of the overall likelihood difference (between an edge and no edge) that is due to a particular data type. In this way our framework provides information about which parts of a pathway hypothesis are not well supported by available data (see Figure 3). In addition, our methodology can identify whether a particular data type tends to disagree with the other data types for sets of edges. This could indicate whether or not a data type is at all useful for modeling edges in a particular cellular location. Thus, it may be possible to do inference on the compartment map from Table 1, rather than fix it a priori. Alternatively, this information can be used to check the validity of the individual data models of Section 2.

Figure 3: Percentages of differential likelihood (presence vs. absence of an edge) due to specific data types, by compartment. Node data contribute the most in the cytoplasm (center), whereas TF-DNA binding data contribute the most in the nucleus (right).

There are some open statistical issues that could be addressed in future work. One problem with the node data is that the protein domains are diverse and sparse. While there is evidence of signal here, there is an over-fitting problem. With more domain data, or perhaps broader domain categories, we may be able to learn more from the prior pathway. If this were the case, the leave-one-out results in the cytoplasm might improve significantly. This is evident from our results, which show how borrowing domain information from other MAPK sub-pathways significantly improved the posterior probabilities of edges in the leave-one-out simulations.

We also noticed that most of the knockouts in the gene perturbation data set we used were generally downstream. If the knockouts were further upstream from perturbed genes in the nucleus, then we could learn about the possible presence of edges in a path between the knockout and other genes.

Lastly, we divided the pathway into its three main compartments: membrane, cytoplasm and nucleus. However, in future work, we hope to divide the pathway more finely into the over two dozen cellular components specified by the gene ontology (GO) for the yeast S. cerevisiae. By dividing the pathway into more compartments, we would also have a greater degree of control over which data types are used in various parts of the cell.

4.1 Concluding remarks

In this paper we introduced a technique for refining cellular pathway models by integrating heterogeneous data sources in a compartment-specific way, and we explicitly included node properties in our model. Our case-study results indicate that this model can be useful for discovering new components or cross-talk with other pathways. Our flexible pathway modeling framework can be easily extended and modified to include additional and novel datasets.
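As an illustration of how the compartment map and the per-data-type decomposition discussed in Section 4 fit together, the sketch below encodes Table 1 as a lookup and computes, for a single edge, the share of the edge-versus-no-edge log-likelihood difference attributable to each active data type. Several simplifications are assumptions of this sketch and not part of the model specification: each edge is assigned to a single compartment, shares are computed on absolute values, and the numerical inputs are invented for the example.

```python
from typing import Dict

# Compartment map C, transcribed from Table 1 (1 = data type used there).
COMPARTMENT_MAP: Dict[str, Dict[str, int]] = {
    "PPI":   {"membrane": 1, "cytoplasm": 1, "nucleus": 0},
    "TF":    {"membrane": 0, "cytoplasm": 0, "nucleus": 1},
    "Expr":  {"membrane": 0, "cytoplasm": 0, "nucleus": 1},
    "Kout":  {"membrane": 0, "cytoplasm": 1, "nucleus": 1},
    "Node":  {"membrane": 0, "cytoplasm": 1, "nucleus": 0},
    "Prior": {"membrane": 1, "cytoplasm": 1, "nucleus": 1},
}


def likelihood_fractions(
    delta_loglik: Dict[str, float], compartment: str
) -> Dict[str, float]:
    """Share of the edge-vs-no-edge log-likelihood difference per data type.

    delta_loglik[t] is log p(data_t | edge) - log p(data_t | no edge).
    Only data types switched on by the compartment map are counted.
    Shares use absolute values (an assumption) so that they are
    non-negative and sum to one even when data types disagree in sign.
    """
    active = {
        t: abs(v)
        for t, v in delta_loglik.items()
        if COMPARTMENT_MAP[t][compartment] == 1
    }
    total = sum(active.values())
    if total == 0:
        return {t: 0.0 for t in active}
    return {t: v / total for t, v in active.items()}


# Hypothetical per-data-type differences for one cytoplasmic edge.
example = {"PPI": 1.0, "TF": 5.0, "Expr": 2.0, "Kout": 0.5, "Node": 2.0, "Prior": 0.5}
shares = likelihood_fractions(example, "cytoplasm")
```

In this layout, adding a new dataset amounts to adding one row to the map and one entry to the per-edge differences, which is one way to read the extensibility claim above.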
A Supplementary results

In this appendix we present more details about the simulation results.

Figure 4: Log posterior probabilities for edges that were not in the hypothesis pathway. The vast majority of non-edges have small posterior probability (third quartile at 0.02). However, there are a few highly probable edges, which may indicate previously undiscovered interactions.

Table 4: Posterior edge probabilities.

| # | Gene 1 | Gene 2 | Prob |
|---|---|---|---|
| 1 | STE12 | DIG2 | 0.60 |
| 2 | STE12 | FUS1 | 0.39 |
| 3 | STE12 | FUS3 | 0.01 |
| 4 | STE12 | FAR1 | 0.00 |
| 5 | STE12 | MCM1 | 0.00 |
| 6 | STE12 | FIG2 | 0.43 |
| 7 | STE12 | FIG1 | 0.42 |
| 8 | STE12 | CIK1 | 0.98 |
| 9 | STE12 | GIC2 | 0.12 |
| 10 | STE12 | AFR1 | 0.01 |
| 11 | STE12 | KAR5 | 0.23 |
| 12 | STE12 | CHS1 | 0.03 |
| 13 | STE12 | AGA1 | 0.27 |
| 14 | DIG2 | STE12 | 0.68 |
| 15 | DIG2 | FUS3 | 0.00 |
| 16 | STE7 | STE11 | 0.26 |
| 17 | STE7 | STE5 | 0.21 |
| 18 | STE7 | FUS3 | 0.26 |
| 19 | STE11 | STE7 | 0.36 |
| 20 | STE11 | STE20 | 0.24 |
| 21 | STE11 | STE5 | 0.00 |
| 22 | STE20 | STE11 | 0.00 |
| 23 | STE20 | CDC42 | 0.31 |
| 24 | STE20 | BEM1 | 0.08 |
| 25 | STE20 | STE5 | 0.00 |
| 26 | CDC42 | STE20 | 0.00 |
| 27 | CDC42 | BNI1 | 0.28 |
| 28 | CDC42 | STE4 | 0.24 |
| 29 | CDC42 | STE18 | 0.31 |
| 30 | CDC42 | BEM1 | 0.47 |
| 31 | CDC42 | CDC24 | 0.50 |
| 32 | FUS1 | STE12 | 0.98 |
| 33 | BNI1 | CDC42 | 0.20 |
| 34 | MFA1 | STE3 | 0.34 |
| 35 | MFA1 | MCM1 | 0.07 |
| 36 | STE2 | MF(ALPHA)2 | 0.01 |
| 37 | STE2 | GPA1 | 0.35 |
| 38 | STE2 | MCM1 | 0.20 |
| 39 | STE3 | MFA1 | 0.30 |
| 40 | STE3 | GPA1 | 0.13 |
| 41 | MF(ALPHA)2 | STE2 | 0.36 |
| 42 | GPA1 | STE2 | 0.01 |
| 43 | GPA1 | STE3 | 0.14 |
| 44 | GPA1 | STE4 | 0.14 |
| 45 | GPA1 | STE18 | 0.12 |
| 46 | STE4 | CDC42 | 0.22 |
| 47 | STE4 | GPA1 | 0.14 |
| 48 | STE18 | CDC42 | 0.00 |
| 49 | STE18 | GPA1 | 0.13 |
| 50 | BEM1 | STE20 | 0.36 |
| 51 | BEM1 | CDC42 | 0.18 |
| 52 | CDC24 | CDC42 | 0.18 |
| 53 | STE5 | STE7 | 0.00 |
| 54 | STE5 | STE11 | 0.00 |
| 55 | STE5 | STE20 | 0.00 |
| 56 | STE5 | FUS3 | 0.00 |
| 57 | FUS3 | STE12 | 0.19 |
| 58 | FUS3 | DIG2 | 0.21 |
| 59 | FUS3 | STE7 | 0.22 |
| 60 | FUS3 | STE5 | 0.05 |
| 61 | FUS3 | MSG5 | 0.05 |
| 62 | FUS3 | FAR1 | 0.00 |
| 63 | MSG5 | FUS3 | 0.00 |
| 64 | FAR1 | STE12 | 0.73 |
| 65 | FAR1 | FUS3 | 0.27 |
| 66 | FAR1 | MCM1 | 0.27 |
| 67 | MCM1 | STE12 | 0.00 |
| 68 | MCM1 | MFA1 | 0.15 |
| 69 | MCM1 | STE2 | 0.03 |
| 70 | MCM1 | FAR1 | 0.24 |
| 71 | MCM1 | SWI4 | 0.41 |
| 72 | MCM1 | MFA2 | 0.20 |
| 73 | MCM1 | AGA1 | 0.27 |
| 74 | MCM1 | ALK1 | 0.15 |
| 75 | MCM1 | SWI5 | 0.38 |
| 76 | MCM1 | CDC20 | 0.34 |
| 77 | SWI4 | MCM1 | 0.16 |
| 78 | MFA2 | MCM1 | 0.19 |
| 79 | FIG2 | STE12 | 0.04 |
| 80 | FIG1 | STE12 | 0.98 |
| 81 | CIK1 | STE12 | 0.94 |
| 82 | GIC2 | STE12 | 0.95 |
| 83 | AFR1 | STE12 | 0.02 |
| 84 | KAR5 | STE12 | 0.37 |
| 85 | CHS1 | STE12 | 0.01 |
| 86 | AGA1 | STE12 | 0.00 |
| 87 | AGA1 | MCM1 | 0.07 |
| 88 | ALK1 | MCM1 | 0.24 |
| 89 | SWI5 | MCM1 | 0.13 |
| 90 | CDC20 | MCM1 | 0.18 |

References

Balbin, O. A., J. R. Prensner, A. Sahu, A. Yocum, S. Shankar, R. Malik, D. Fermin, S. M. Dhanasekaran, B. Chandler, D. Thomas, D. G. Beer, X. Cao, A. I. Nesvizhskii, and A. M. Chinnaiyan (2013, Oct). Reconstructing targetable pathways in lung cancer by integrating diverse omics data. Nat Commun 4, 2617.

Bernard, A. and A. J. Hartemink (2005). Informative structure priors: joint learning of dynamic regulatory networks from multiple types of data. Pac Symp Biocomput, 459–470.

Brem, R. B. and L. Kruglyak (2005, Feb). The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci U S A 102(5), 1572–1577.

Friedman, N. (2004, Feb). Inferring cellular networks using probabilistic graphical models. Science 303(5659), 799–805.

Fröhlich, H., T. Beissbarth, A. Tresch, D. Kostka, J. Jacob, R. Spang, and F. Markowetz (2008, Nov). Analyzing gene perturbation screens with nested effects models in R and Bioconductor. Bioinformatics 24(21), 2549–2550.

Fröhlich, H., M. Fellmann, H. Sültmann, A. Poustka, and T. Beißbarth (2007). Large scale statistical inference of signaling pathways from RNAi and microarray data. BMC Bioinformatics 8, 386.

Fröhlich, H., M. Fellmann, H. Sültmann, A. Poustka, and T. Beißbarth (2008, Oct). Predicting pathway membership via domain signatures. Bioinformatics 24(19), 2137–2142.

Gasch, A. P., P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O. Brown (2000, Dec). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11(12), 4241–4257.

Gat-Viks, I. and R. Shamir (2007, Mar). Refinement and expansion of signaling pathways: the osmotic response network in yeast. Genome Res 17(3), 358–367.

Gelman, A. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2(4), 1360–1383.

Gitter, A., M. Carmi, N. Barkai, and Z. Bar-Joseph (2013, Feb). Linking the signaling cascades and dynamic regulatory networks controlling stress responses. Genome Res 23(2), 365–376.

Gruhler, A., J. V. Olsen, S. Mohammed, P. Mortensen, N. J. Faergeman, M. Mann, and O. N. Jensen (2005, Mar). Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol Cell Proteomics 4(3), 310–327.

Guan, Y., D. Gorenshteyn, M. Burmeister, A. K. Wong, J. C. Schimenti, M. A. Handel, C. J. Bult, M. A. Hibbs, and O. G. Troyanskaya (2012). Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS Comput Biol 8(9), e1002694.

Guan, Y., C. L. Myers, D. C. Hess, Z. Barutcuoglu, A. A. Caudy, and O. G. Troyanskaya (2008). Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol 9 Suppl 1, S3.

Hahne, F., A. Mehrle, D. Arlt, A. Poustka, S. Wiemann, and T. Beißbarth (2008). Extending pathways based on gene lists using InterPro domain signatures. BMC Bioinformatics 9, 3.

Hara, K., T. Ono, K. Kuroda, and M. Ueda (2012, May). Membrane-displayed peptide ligand activates the pheromone response pathway in Saccharomyces cerevisiae. J Biochem 151(5), 551–557.

Harbison, C. T., D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J.-B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young (2004, Sep). Transcriptional regulatory code of a eukaryotic genome. Nature 431(7004), 99–104.

Hibbs, M. A., D. C. Hess, C. L. Myers, C. Huttenhower, K. Li, and O. G. Troyanskaya (2007, Oct). Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23(20), 2692–2699.

Hibbs, M. A., C. L. Myers, C. Huttenhower, D. C. Hess, K. Li, A. A. Caudy, and O. G. Troyanskaya (2008). Analysis of computational functional genomic approaches for directing experimental biology: a case study in mitochondrial inheritance. PLoS Comput Biol, in press.

Hughes, T. R., M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend (2000, Jul). Functional discovery via a compendium of expression profiles. Cell 102(1), 109–126.

Hyduke, D. R. and B. Ø. Palsson (2010, Apr). Towards genome-scale signalling network reconstructions. Nat Rev Genet 11(4), 297–307.

Isci, S., H. Dogan, C. Ozturk, and H. H. Otu (2013, Nov). Bayesian network prior: network analysis of biological data using external knowledge. Bioinformatics.

Kanehisa, M. and S. Goto (2000, Jan). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1), 27–30.

Kirouac, D. C., J. Saez-Rodriguez, J. Swantek, J. M. Burke, D. A. Lauffenburger, and P. K. Sorger (2012). Creating and analyzing pathway and protein interaction compendia for modelling signal transduction networks. BMC Syst Biol 6, 29.

Knapp, B. and L. Kaderali (2013). Reconstruction of cellular signal transduction networks using perturbation assays and linear programming. PLoS One 8(7), e69220.

Kofahl, B. and E. Klipp (2004, Jul). Modelling the dynamics of the yeast pheromone pathway. Yeast 21(10), 831–850.

Letunic, I., T. Doerks, and P. Bork (2012, Jan). SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res 40(Database issue), D302–D305.

Li, J., H. Wei, T. Liu, and P. X. Zhao (2013, Oct). GPLEXUS: enabling genome-scale gene association network reconstruction and analysis for very large-scale expression data. Nucleic Acids Res.

Llewellyn, R. and D. S. Eisenberg (2008, Nov). Annotating proteins with generalized functional linkages. Proc Natl Acad Sci U S A.

Lo, K., A. E. Raftery, K. M. Dombek, J. Zhu, E. E. Schadt, R. E. Bumgarner, and K. Y. Yeung (2012). Integrating external biological knowledge in the construction of regulatory networks from time-series expression data. BMC Syst Biol 6, 101.

Markowetz, F., D. Kostka, O. G. Troyanskaya, and R. Spang (2007, Jul). Nested effects models for high-dimensional phenotyping screens. Bioinformatics 23(13), i305–i312.

Markowetz, F. and R. Spang (2007). Inferring cellular networks – a review. BMC Bioinformatics 8 Suppl 6, S5.

Mazza, A., I. Gat-Viks, H. Farhan, and R. Sharan (2013, July). A minimum-labeling approach for reconstructing protein networks across multiple conditions.

McClean, M. N., A. Mody, J. R. Broach, and S. Ramanathan (2007, Mar). Cross-talk and decision making in MAP kinase pathways. Nat Genet 39(3), 409–414.

Mukherjee, S. and T. P. Speed (2008, Sep). Network inference using informative priors. Proc Natl Acad Sci U S A 105(38), 14313–14318.

Mulder, K. W., X. Wang, C. Escriu, Y. Ito, R. F. Schwarz, J. Gillis, G. Sirokmány, G. Donati, S. Uribe-Lewis, P. Pavlidis, A. Murrell, F. Markowetz, and F. M. Watt (2012, Jul). Diverse epigenetic strategies interact to control epidermal differentiation. Nat Cell Biol 14(7), 753–763.

Müller, P., D. Kuttenkeuler, V. Gesellchen, M. P. Zeidler, and M. Boutros (2005, Aug). Identification of JAK/STAT signalling components by genome-wide RNA interference. Nature 436(7052), 871–875.

Myers, C. L., D. Robson, A. Wible, M. A. Hibbs, C. Chiriac, C. L. Theesfeld, K. Dolinski, and O. G. Troyanskaya (2005). Discovery of biological networks from diverse functional genomic data. Genome Biol 6(13), R114.

Nagiec, M. J. and H. G. Dohlman (2012, Jan). Checkpoints in a yeast differentiation pathway coordinate signaling during hyperosmotic stress. PLoS Genet 8(1), e1002437.

Nariai, N., S. Kim, S. Imoto, and S. Miyano (2004). Using protein-protein interactions for refining gene networks estimated from microarray data by Bayesian networks. Pac Symp Biocomput, 336–347.

Ourfali, O., T. Shlomi, T. Ideker, E. Ruppin, and R. Sharan (2007, Jul). SPINE: a framework for signaling-regulatory pathway inference from cause-effect experiments. Bioinformatics 23(13), i359–i366.

Pounds, S. and S. W. Morris (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10), 1236–1242.

Punta, M., P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. L. Sonnhammer, S. R. Eddy, A. Bateman, and R. D. Finn (2012, Jan). The Pfam protein families database. Nucleic Acids Res 40(Database issue), D290–D301.

Reguly, T., A. Breitkreutz, L. Boucher, B.-J. Breitkreutz, G. C. Hon, C. L. Myers, A. Parsons, H. Friesen, R. Oughtred, A. Tong, C. Stark, Y. Ho, D. Botstein, B. Andrews, C. Boone, O. G. Troyanskaya, T. Ideker, K. Dolinski, N. N. Batada, and M. Tyers (2006). Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 5(4), 11.

Ren, B., F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell, and R. A. Young (2000, Dec). Genome-wide location and function of DNA binding proteins. Science 290(5500), 2306–2309.

Roberts, C. J., B. Nelson, M. J. Marton, R. Stoughton, M. R. Meyer, H. A. Bennett, Y. D. He, H. Dai, W. L. Walker, T. R. Hughes, M. Tyers, C. Boone, and S. H. Friend (2000, Feb). Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 287(5454), 873–880.

Ryan, C. J., P. Cimermančič, Z. A. Szpiech, A. Sali, R. D. Hernandez, and N. J. Krogan (2013, Dec). High-resolution network biology: connecting sequence with function. Nat Rev Genet 14(12), 865–879.

Schäfer, J. and K. Strimmer (2005a, Mar). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21(6), 754–764.

Schäfer, J. and K. Strimmer (2005b). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4, Article32.

Schultz, J., F. Milpetz, P. Bork, and C. P. Ponting (1998, May). SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 95(11), 5857–5864.

Scott, J., T. Ideker, R. M. Karp, and R. Sharan (2006, Mar). Efficient algorithms for detecting signaling pathways in protein interaction networks. J Comput Biol 13(2), 133–144.

Segal, E., M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman (2003, Jun). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34(2), 166–176.

Segal, E., H. Wang, and D. Koller (2003). Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19 Suppl 1, i264–i271.

Simon, I., J. Barnett, N. Hannett, C. T. Harbison, N. J. Rinaldi, T. L. Volkert, J. J. Wyrick, J. Zeitlinger, D. K. Gifford, T. S. Jaakkola, and R. A. Young (2001, Sep). Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 106(6), 697–708.

Stark, C., B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers (2006, Jan). BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34(Database issue), D535–D539.

Stelniec-Klotz, I., S. Legewie, O. Tchernitsa, F. Witzel, B. Klinger, C. Sers, H. Herzel, N. Blüthgen, and R. Schäfer (2012). Reverse engineering a hierarchical regulatory network downstream of oncogenic KRAS. Mol Syst Biol 8, 601.

Tresch, A. and F. Markowetz (2008). Structure learning in nested effects models. Stat Appl Genet Mol Biol 7, Article9.

Wang, X., M. A. Castro, K. W. Mulder, and F. Markowetz (2012). Posterior association networks and functional modules inferred from rich phenotypes of gene perturbations. PLoS Comput Biol 8(6), e1002566.

Wang, X., K. Yuan, C. Hellmayr, W. Liu, and F. Markowetz (2013). Reconstructing evolving signaling networks by hidden Markov nested effects models. Annals of Applied Statistics, accepted.

Werhli, A. V. and D. Husmeier (2007). Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol 6, Article15.

Workman, C. T., H. C. Mak, S. McCuine, J.-B. Tagne, M. Agarwal, O. Ozier, T. J. Begley, L. D. Samson, and T. Ideker (2006, May). A systems approach to mapping DNA damage response pathways. Science 312(5776), 1054–1059.

Yates, P. D. and N. D. Mukhopadhyay (2013). An inferential framework for biological network hypothesis tests. BMC Bioinformatics 14, 94.

Yeang, C.-H., H. C. Mak, S. McCuine, C. Workman, T. Jaakkola, and T. Ideker (2005). Validation and refinement of gene-regulatory pathways on a network of physical interactions. Genome Biol 6(7), R62.

Yip, K. Y., R. P. Alexander, K.-K. Yan, and M. Gerstein (2010). Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS One 5(1), e8121.
