Large Collection of Diverse Gene Set Search Queries Recapitulate Known Protein-Protein Interactions and Gene-Gene Functional Associations

Lar ge Collection of Di v erse Gene Set Search Queries Recapitulate Kno wn Protein-Protein Interactions and Gene-Gene Functional Associations Neil R. Clark 1 , 2 , 3 , A vi Ma’ayan 1 , 2 , 3 , ∗ 1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustav e L Levy Place, Box 1215, Ne w Y ork, NY , USA 2 Big Data to Knowledge (BD2K) Library of Inte grated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC) 3 Mount Sinai Knowledge Management Center (KMC) for Illuminating the Drugg able Genome (IDG) ∗ Corresponding author: avi.maayan@mssm.edu Abstract —Popular online enrichment analysis tools from the ﬁeld of molecular systems biology pro vide users with the ability to submit their experimental results as gene sets for individual anal- ysis. Such queries ar e kept priv ate, and hav e never bef ore been considered as a resource for integrative analysis. By harnessing gene set query submissions from thousands of users, we aim to discover biological knowledge beyond the scope of an individual study . In this work, we in vestigated a large collection of gene sets submitted to the tool Enrichr by thousands of users. Based on co-occurrence, we constructed a global gene-gene association network. W e interpr et this inferred network as providing a summary of the structure present in this crowdsour ced gene set library , and show that this network recapitulates known pr otein- protein interactions and functional associations between genes. This ﬁnding implies that this network also offers predictiv e value. Furthermore, we visualize this gene-gene association network using a new edge-pruning algorithm that retains both the local and global structures of lar ge-scale networks. Our ability to make predictions for currently unkno wn gene associations, that may not be captured by individual researchers and data sources, is a demonstration of the potential of harnessing collective knowledge from users of popular tools in the ﬁeld of molecular systems biology . Keyw ords - gene sets, big data, item set mining I . I N T RO D U C T I O N Systems approaches to genome-wide molecular data are increasingly using gene sets, as opposed to indi vidual genes, as the basic units of analysis. For example, in the ﬁeld of molec- ular diagnostics, biomarker sets, as opposed to single gene biomarkers, are increasingly being applied [1]. Beyond molec- ular diagnostics, genomics, transcriptomics and proteomics experiments often identify gene sets that are differentially ex- pressed, or gene sets that contain genetic v ariations associated with a phenotype. Using methods such as enrichment analyses [2]–[4], common biological functions can illuminate prior knowledge originating from the collection of experimentally identiﬁed gene sets. One approach to enrichment analysis is to take a set of experimentally identiﬁed genes and analyze these genes by comparing them to a library of gene sets with a known associated biological theme, e.g., members of cell signaling pathways, or targets of transcription factors. Enrichr [2] is a popular online tool we developed to enable users to submit gene sets for this type of analysis. Since its launch in 2013, ov er 900,000 gene sets have been submitted to Enrichr by over 25,000 unique users. These gene sets originate from di verse experimental platforms, proﬁling genes and proteins at v arious regulatory lev els of mammalian molecular data. W e hypothesize that the gene set submitted to Enrichr may contain patterns and structure that could rev eal novel associa- tions between genes, gene modules and molecular biological mechanisms. The large size of this data set and the diversity of the scientiﬁc community from which it originates potentially provide a unique perspectiv e that may rev eal new insights. In combinatorial mathematical terms, the nearly 1 million gene sets that were submitted to Enrichr comprise a family of sets, i.e., a collection of subsets from the set of all genes in the human, mouse and rat genomes. A common approach to the analysis of this form of data is to treat each set as an item-set in analogy to a market basket, and identify frequently occurring combinations of items [5]. An alternative has been proposed in which item-sets that are logically related are identiﬁed [6]. The identiﬁcation of logical item-sets is particularly rele vant for the gene sets submitted to Enrichr because this approach takes into account rare items, and because biological knowledge is incomplete, logically associated groups of items may only be partially represented in any giv en basket. These are important properties, as we aim to identify functional modules ev en for rarely occurring genes. W e expect that due to partial information contained within an individual query , not all members of a rele vant functional module will be present in any given gene set submitted for analysis with Enrichr . This partial information is expected to produce high level of false positiv e associations. Due to the expected high false positiv e rates, a degree of noise-ﬁltering may also improv e the accurac y of such analysis. Here we provide an interpretable picture of the global properties of a large collection of the lists submitted to Enrichr by thousands of users. T o accomplish this we employ the method of logical item-set to extract gene modules. As a con- sequence, we infer a network of relationships between genes and visualize this network using a novel edge-pruning method we devised. Next, we examine the degree of compatibility of our ﬁndings with annotated gene set libraries such as the Human Gene Atlas or the Gene Ontology (GO), as well as with known protein-protein interactions. I I . R E S U LT S By the middle of 2015, the Enrichr gene set queries con- sisted of 172,798 gene sets submitted from 5,114 unique in- ternet protocol (IP) addresses. Our analysis began by applying a number of ﬁlters. First, we removed any query that did not correspond to an ofﬁcial gene symbol. Next, we ensured that no individual users dominated in their contributions to the data set by removing all entries from users that contributed many gene sets (see methods for more details). After this strict cleaning criteria, the result was a family of 19,196 gene sets, composed of 27,770 genes supplied from 3,308 unique IP addresses of unique users of Enrichr . A. Enrichr gene set library metrics The distribution of gene frequencies is surprisingly complex (Fig. 1). The distribution approximately decays exponentially from frequencies greater than 300; howe ver , the lower fre- quencies are characterized by a bimodal distribution. The large mode at the lowest frequencies has an overrepresentation of uncharacterized genes, gene names with no functional associations or known protein-protein interactions; but e ven discounting those genes, this peak remains relati vely large. The lower frequencies are also characterized by a mode at around 200 occurrences. The frequency of occurrence of genes in submissions to Enrichr has a characteristic frequenc y that decays exponentially , as opposed to being characterized by a power law . Given the predominance of scale-free distributions, it is perhaps surprising that there is a deﬁnite scale for frequencies of gene occurrence in this data set. The distribution of the lengths of gene sets submitted to Enrichr approximately ﬁts a power-la w , with a notable peak at about 200 and 400 genes (Fig. 2). The distrib ution of the number of submissions to Enrichr per user IP address also follows a power -law , suggesting that most users submit only a few lists, but there is a substantial body of heavy users. B. Gene-gene and gene-set/gene-set network infer ence The Enrichr collection of gene set queries can be transposed, such that for each gene there is an associated set of Enrichr submission queries. T o explore the structure in this data, and also to potentially form the basis for an informativ e decomposition of the data, we can infer networks of similar- ities between gene pairs and, in addition, networks of Enricr queries. W e employ the methods developed in [6] to infer noise-ﬁltered networks of associations between these entities. 0 200 400 600 800 1000 1200 1400 1 10 100 1000 10 4 Enrichr gene freq Count. Distributio n of gene frequencies Fig. 1. The distribution of the occurrences of unique gene symbols in Enrichr gene set queries. 50 100 500 1000 10 - 6 10 - 5 10 - 4 10 - 3 Enrichr line length Count. Distributio n of query submissions lengths Fig. 2. The distribution of the size of each gene set query submitted to Enrichr . In order to visualize the network of gene associations, we employ an edge-pruning algorithm that results in an inter- pretable representation of the global network structure while preserving local features of the network topology (Fig. 3a). A selection of the largest local structures, preserved by the edge pruning algorithm, is shown in Fig. 3b . One potential use of this network is to make predictions of novel associations based on prior knowledge. As an example of this, we highlight genes known to be associated with Adult Onset Diabetes (KEGG) (Fig. 3c). Illustrating local network structures that include at least one of these genes suggests local structures as candidate nov el predictions for genes likely associated with the adult- onset diabetes pathway . One of the local structures contains the adult-onset diabetes Genes MAF A and NEUROG3 (KEGG) along with two other genes, UTS2R and FLJ45717, that are not identiﬁed in KEGG as being associated with diabetes. Howe ver , UTS2 and its receptor UTS2R have been reported [7] to be in volved in glucose metabolism and insulin resistance, which lead to the dev elopment of type-2 diabetes in humans. There is no known a b ONECUT1 NR5A2 MAF A NEUROG3 HNF4G MNX1 INS HNF4A FOXA2 FOXA3 BHLHB8 c Fig. 3. The edge-pruned network of gene associations inferred from Enrichr queries. (a) The global vie w of the network. (b) A selection of the largest local network structures. (c) As an example of one possible use of this network, we highlight genes known to be associated with adult-onset diabetes (KEGG) and illustrate the local network structure that includes at least one of these genes. association of FLJ45717 with diabetes; howe ver , as this gene forms the center of a star graph local structure, we consider it a nov el candidate for in volvement in adult-onset diabetes. W e can also apply the same approach to the network of associations between gene set queries submitted to Enrichr (Fig. 4). This network appears to show a degree of clustering by user; this perhaps reﬂects individual users special research interests, or a common method of data acquisition. C. Comparing the clustering of Enrichr networks to random scale-fr ee networks The clustering coefﬁcients of the gene and gene-set net- works are 0.43 and 0.17, respecti vely . In order to gauge the signiﬁcance of this, we compared these clustering coefﬁcients to the typical clustering coefﬁcients for a random scale-free network generated by the Barabasi-Albert random scale-free graph model. Artiﬁcial shuf ﬂed networks of a similar size and degree distribution generated by the Barabasi-Albert model hav e a typical clustering coefﬁcient of 0.0026 and 0.0054 respectiv ely . Hence, we see that the Enrichr queries gene, and gene-set, networks have a clustering coefﬁcient which is orders of magnitude greater than would be expected if the networks were randomly scale-free by the Barabasi-Albert model. D. Examining the gene network for r ecovery of known pr otein- pr otein interactions Next, we asked whether the gene-gene association network created from the Enrichr queries can be used to predict physical protein interactions. W e used the PSICQUIC database of protein-protein interactions (PPIs) and examined the statis- tical signiﬁcance of the o verlap between the PSICQUIC PPI network and the gene-gene network inferred from Enrichr sub- missions. After applying a threshold of 0.3 for the similarity of a normalized point-wise mutual information for the gene-gene association network, two binary matrices remained. In the ﬁrst test of the signiﬁcance of the similarity between these two binary matrices, we count the number of non- zero entries in each matrix and the number of elements that are nonzero in both matrices. Based on a null distribution whereby the matrices are randomly permuted, we can use the hypergeometric distrib ution to quantify the signiﬁcance of the ov erlap between the two matrices. There are n 1 = 13 , 586 genes that are in both the Enrichr queries network and the PSICQUIC PPI network. W ith a threshold on the normalized point-wise mutual information of 0.05, there are n 1 = 99 , 758 edges in the Enrichr network and n 2 = 156 , 388 known PPIs. There are n k = 2 , 763 edges that are present in both the Enrichr network and the PSICQUIC PPI network. In this regime, the Poisson approximation to the hypergeometric distribution is applicable. W ith a total number ( a ) ( b ) Fig. 4. The edge-pruned network of associations between query submissions to Enrichr . (a) The global view of the network. (b) A selection of the largest local structures. (c) The gene sets submitted by the ﬁve Enrichr users with the largest number of submissions are highlighted, each with a different color . of possible edges giv en by n g (1 − n g ) , the rate parameter for the Poisson distribution is giv en by: n 1 n 2 n g (1 − n g ) = 169 . 1 (1) Under the null hypothesis that edges are randomly assigned, we expect a mean number of shared edges to be 169.1 with the standard deviation of approximately 13.0. Consequently , the observed v alue of n k = 2 , 763 is about 200 standard de viations greater than the expected value. Therefore, under the null hypothesis, the number of edges present in both the Enrichr gene network and the PSICQUIC PPI network is extremely statistically signiﬁcant. It is concei vable that this result is due to users submitting gene set queries containing proteins known to have many interactors, and hence the null distribution we used for the calculation would be inappropriate and the result inv alid. One way to account for this is to condition the test on the frequency of each gene. T o do this, we perform a separate hypergeometric test for each gene, and correct for multiple hypothesis testing. There are 899 proteins that hav e at least one interaction that overlaps with the Enrichr network; when corrected for multiple hypotheses testing, with a False Dis- cov ery Rate of 2 Another possible explanation for the apparently signiﬁcant ov erlap between edges in the Enrichr gene network and the PSICQUIC PPI network is that there are a signiﬁcant number of user queries containing genes with known PPIs. T o address this, we recov ered the Enrichr lists that contribute to the prediction of at least one PPI and counted the number of proteins in each query list for which there is another protein in the query with which it interacts. If a user submitted a list of proteins with known protein interactions, then the ratio of the counted proteins to the length of the query will be large. W e note that the actual ratio is less than 1%, which suggests that users are not submitting gene sets with known PPIs to Enrichr to any signiﬁcant degree. This suggests that Fig. 5. The part of the network of the gene-gene associations inferred from Enrichr that is also supported by kno wn protein-protein interactions from PSICQUIC. the signiﬁcant overlap in the edges in the gene network inferred from Enrichr submissions is predictiv e of known and potentially nov el protein-protein interactions. W e sho w the network of edges between genes that are present in both the Enrichr-inferred gene network and the PSICQUIC PPI network (Fig. 5). 0.0 0.1 0.2 0.3 0.4 0.5 Mouse Gene Atlas Human Gene Atlas Reactome pathways ChEA CCLE OMIM diseas e genes KEGG pathways WikiPathway s Chromosom e location GeneOntolog y MF Structural Domains HMDB Metabolites NCI60 PPI Hub Proteins GeneOntolog y BP GeneOntolog y CC GeneSigDB CORUM NURSA - IPMS MGI MP top4 MGI MP top3 BioCarta pathways KEA TF PPIs microRNAs TF PWMs VirusMINT OMIM Expanded Fraction of gene sets significantl y clustered ( FDR 1 %) Fig. 6. The proportion of gene sets in each library that are signiﬁcantly clustered in the gene-gene association network inferred from queries to Enrichr . E. Evaluating the r ecapitulation of prior knowledge about gene functions with the gene-gene association network inferr ed fr om Enrichr queries W e took a collection of 28 annotated gene set libraries that represent prior knowledge of associations between genes and their functions from a range of biological themes, including Gene Ontology , mouse phenotypes, protein complexes, histone modiﬁcations and DN A binding. W e addressed the question of the extent to which the information in these gene set libraries is recapitulated by the gene-gene association networks inferred by Enrichr queries. In order to test this for a single gene set, we mapped the genes in the set to nodes in the Enrichr queries gene-gene association network, and then calculated the mean point-wise mutual information between each pair of genes. In order to assess the signiﬁcance of this collective measure of similarity , we numerically calculated a null distribution by randomly choosing a set of nodes in the Enrichr queries’ network with the same cardinality as the gene set in question. The random choice was weighted by the overall frequency of the gene in the gene set library from which it originates. This is equiv alent to generating a null distribution based on random permutation of the gene set library repeated many times. By comparing the actual collectiv e similarity to the null distribution, we calculate a signiﬁcance p value for the collectiv e similarity of the gene set in the Enrichr queries’ gene-gene association network. A small p v alue indicates that the members of the gene set in question are signiﬁcantly more similar to each other in the network inferred by Enrichr queries than w ould be expected if the gene set library from which the set was originated was randomly permuted. W e interpret a small p value as indicating that the network inffered by the Enrichr queries has recovered the gene set in question. The list of all gene set libraries examined (T able I). After correcting for multiple hypotheses testing and setting a false discovery rate threshold of 1%, we counted the number of gene sets in each library that are signiﬁcantly recovered in the Enrichr queries’ gene-gene network (Fig. 6). W e observed that for some libraries, over 30% of the gene T ABLE I G E NE S ET L I BR A R I ES U SE D T O A S S E SS R EC OV E RY O F F U N CT I O NA L A S SO C I A T I ON S B Y T HE N ET W O RK I N FE R R ED F RO M T H E E N RI C H R Q UE R I E S . Gene Set Library # of Sets Mean Set Size BioCarta pathways 249 18 Cancer Cell Line Encyclopedia 967 176 ChEA 240 1456 Chromosome location 386 85 COR UM 1673 5 Gene Ontology Biological Process 941 78 Gene Ontology Cellular Component 205 172 Gene Ontology Molecular Function 402 122 GeneSigDB 2139 127 Genome Browser PWMs 615 275 HMDB Metabolites 3906 47 Human Gene Atlas 84 450 KEA 474 37 KEGG pathways 200 48 MGI MP top 3 71 717 MGI MP top 4 476 202 microRN As 222 155 Mouse Gene Atlas 96 660 NCI60 93 343 NURSA-IP-MS 1796 158 OMIM disease genes 90 25 OMIM Expanded 187 89 Pfam-InterPro-domains 311 35 PPI Hub Proteins 385 247 Reactome pathways 78 73 TF PPIs 290 79 V irusMINT 85 15 W ikiPathways pathways 199 39 sets in the library are recoverable by the Enrichr queries’ gene-gene association network. The libraries containing data from genome-wide gene expression proﬁling hav e the most ov erlap, follo wed by ChEA, which has lists of genes associated with transcription factors from ChIP-seq proﬁling. Next are pathway libraries such as reactome, KEGG and W ikiPathw ays. This rank order of libraries likely reﬂects the nature of most submitted lists to analysis with Enrichr . These query submissions are likely dominated by dif ferentially expressed genes from genome-wide mRNA proﬁling. The recovery of annotated pathways suggests that the inferred gene-gene as- sociation network likely contains new and more complete pathway knowledge. I I I . D I S C U S S I O N A N D C O N C L U S I O N S In this study , we analyzed for the ﬁrst time, a large collec- tion of gene set queries submitted by the many users of the popular enrichment analysis tool Enrichr . W e examined the distribution of occurrence of genes, the length of submitted gene sets, and the distribution of submissions by individual users. W e show that the distribution of gene occurrence in queries is complex, with some genes appearing often in submitted lists. The length of gene set queries peaks at 200- 400 genes per query , and the distribution of submitters follows a power law . By constructing a gene-gene association network of co-occurrence, we were able to show that such a network captures kno wn protein-protein interactions and functional associations much more frequently than by chance. This means that the collective data from thousands of queries potentially holds new knowledge about the local and global structure of the human functional interactome. As more submissions accumulate, it is expected that such predictions will be reﬁned and improved. While experimental validation of these predic- tions is outside the scope of this present work, the gene-gene association network resource we de veloped here can be used to prioritize predictions for general and speciﬁc hypotheses that can be tested experimentally and systematically drive rational research explorations. The reuse of priv ately submitted gene sets by the Enrichr users should be handled with caution. The analysis that we conducted here and the gene-gene association networks we constructed keep the identities of users pri vate. Furthermore, we do not provide the gene sets used for our analysis publicly . I V . M E T H O D S A. Data Prepr ocessing As of June 25th 2015, 172,798 queries of gene sets were submitted to Enrichr from 5,114 unique IP addresses. In order to analyze the global structure of these gene sets, preprocessing was necessary because the raw data include multiple instances of the example data set supplied by the Enrichr website for demonstration purposes, special characters and strings that cannot be interpreted as gene lists caused by erroneous input from users, and large collections of queries from individual users who utilized the Enrichr API. These entries were removed so that the resulting dataset is repre- sentativ e of the entire community of Enrichr users and is not dominated by any individual user or small subset thereof. The ﬁrst ﬁlter was applied to retain only word strings that are members of a reference set of 39920 standard human, mouse and rat gene names. The resulting sets of genes corresponding to each user input list was then subjected to a sequence of subsequent ﬁlters. Use of the Enrichr API enables programmatic access to Enrichr . This resulted in a minority of users submitting a large number of gene lists. In order for the data set to be balanced and representativ e of the entire community of Enrichr users, and not dominated by a small minority of users, we ﬁltered out gene lists associated with IP addresses from which large numbers of lists were receiv ed. The results of 11,579 differential expression analyses were receiv ed from the GEO2Enrichr Chrome Extension, which submits three gene sets for every differential expression com- putation for analysis with Enrichr: upregulated, do wn re gulated and combined gene sets, respectiv ely . Only the combined gene sets were retained in this analysis. Gene sets that contained more than 2,000 genes were rejected on the basis that they were non-speciﬁc and incurred undue computational expense in subsequent analysis. Finally , all instances of the demonstration gene set from the Enrichr website were remov ed. The result was 19,196 gene sets composed of 27,770 genes supplied from 3,308 unique IP addresses. B. Network Inference W e employ the methods described in [6] to reconstruct networks from the ﬁltered gene sets. The most basic approach to construct a network from a large collection of gene sets is to deﬁne a distance matrix between the gene sets and examine the resulting distance matrix. The most obvious similarity measure between two gene sets A and B is the Jaccard index: J ( A, B ) = | A ∩ B | | A ∪ B | (2) Let the set of all elements be V , and the family of sets be M, then: M =  m ( n ) = n m ( n ) l o L n l =1 ⊂ V  (3) and the co-occurrence counts of pairs of genes is giv en by: φ ( α, β ) = N X n =1 δ ( α ∈ m ( n ) ) δ ( β ∈ m ( n ) ) (4) where δ ( bool ) is a Dirac delta function deﬁned such that: δ ( bool ) = ( 1 if bool = T rue, 0 Otherwise The marginal counts are then deﬁned in terms of the co- occurrence counts as: φ ( α ) = X β ∈ G,α 6 = β φ ( α, β ) (5) which is the number of all co-occurrences. Finally , the total number of co-occurrences is giv en by: φ 0 = 1 2 X α ∈ V φ ( α ) = 1 2 X α ∈ V X β ∈ V φ ( α, β ) (6) Then the co-occurrence probability is giv en by: P ( α, β ) = φ ( α, β ) φ 0 (7) and the marginal probability is giv en by: P ( α ) = φ ( α ) /phi 0 (8) The Jaccard index can now be deﬁned in these terms as: ψ j ac = P ( α, β ) P ( α ) + P ( β ) − P ( α, β )) (9) When the collection of gene set queries is transposed such that the genes are the labels and the queries are the members of sets, the Jaccard index can be applied. W e can also use the following distance measures: • Cosine Distance: ψ cos = P ( α, β ) / p P ( α ) P ( β ) • Point-Wise Mutual Information: ψ pmi = max { 0 , Log ( P ( α, β ) /P ( α ) P ( β )) } • Normalized Point-Wise Mutual Information: ψ nmi = − ψ pmi / ( Log ( P ( α, β ))) W e apply a noise ﬁlter by iterativ ely applying a threshold for similarity: the pair counts for gene pairs that do not pass the threshold are set to zero, and the similarity matrix is recalculated. Throughout the rest of this analysis, we only use the normalized point-wise mutual information as a measure of similarity between genes for the Enrichr queries data. C. Network visualization Because the resultant networks are densely connected, direct visualization of the graph is not interpretable. Edge pruning has been proposed for the simpliﬁcation of graphs to aid in generating interpretable visualizations of their structure [8]. W e ﬁnd that such edge pruning algorithms, applied to very large graphs, are able to represent the global structure at the expense of the local structure. T o potentially improve this, we employed an edge pruning approach that is able to capture both global and local structure in the graph. In the initial step, a set of local structures is deriv ed by varying the threshold for similarity and identifying connected graph components at the point at which they disconnect from the giant component. Such point is deﬁned as any connected component that contains more than 10% of all nodes or has an absolute number of nodes greater than 100 nodes; this number is chosen based on the visualizability properties of the resulting local structures. The local structures represent local regions of the network that are distinguishable from the rest of the network at an ap- propriate scale of similarity . As such, they depict the similarity relationships between the nodes and their neighborhood in the network. Once the local structures have been identiﬁed, we employ a naiv e edge-pruning algorithm. W e do not prune edges that are members of the local structures, thereby preserving them in the ﬁnal simpliﬁed network. The number of edges that are av ailable to be pruned is equal to the difference between the total number of edges in the original network, n e , and the number of edges that are part of the local structures, n e ; ls . The pruning process preserves the connectivity of the network, so after the maximum degree of pruning, the minimum number of edges remaining must be equal to: n ls + n e ; ls (10) Accordingly , the maximum number of edges that can be pruned while preserving the connectivity of the network is: n e − n e ; ls − n ls (11) The iterati ve process by which edges are pruned is as follows: 1) Set a counter i = 0 2) Identify the edge with the lowest similarity measure 3) If removing the edge does not cut the network, then remov e it. 4) i → i + 1 5) If i < γ ( n e − n e ; ls − n ls ) then return to step 1; otherwise, stop. A C K N O W L E D G M E N T W e would like to thank Dr . Kathleen M. Jagodnik for pro- viding useful comments and copyediting. Funding: This work was supported in part by grants from the NIH: R01GM098316, U54HG008230 and U54CA189201 to AM. R E F E R E N C E S [1] X. W ang, E. Dalkic, M. W u, and C. Chan, “Gene module le vel analysis: identiﬁcation to networks and dynamics, ” Curr ent opinion in biotechnol- ogy , vol. 19, no. 5, pp. 482–491, 2008. [2] E. Y . Chen, C. M. T an, Y . Kou, Q. Duan, Z. W ang, G. V . Meirelles, N. R. Clark, and A. Ma?ayan, “Enrichr: interacti ve and collaborati ve html5 gene list enrichment analysis tool, ” BMC bioinformatics , vol. 14, no. 1, p. 128, 2013. [3] D. W . Huang, B. T . Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, ” Nucleic acids resear ch , vol. 37, no. 1, pp. 1–13, 2009. [4] A. Subramanian, P . T amayo, V . K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy , T . R. Golub, E. S. Lander et al. , “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression proﬁles, ” Proceedings of the National Academy of Sciences of the United States of America , vol. 102, no. 43, pp. 15 545–15 550, 2005. [5] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic itemset counting and implication rules for market basket data, ” in ACM SIGMOD Recor d , vol. 26, no. 2. ACM, 1997, pp. 255–264. [6] S. Kumar , V . Chandrashekar, and C. Jawahar , “Logical itemset mining, ” in Data Mining W orkshops (ICDMW), 2012 IEEE 12th International Confer ence on . IEEE, 2012, pp. 603–610. [7] Z. Jiang, J. J. Michal, D. J. T obey , Z. W ang, M. D. MacNeil, and N. S. Magnuson, “Comparative understanding of uts2 and uts2r genes for their inv olvement in type 2 diabetes mellitus, ” International journal of biological sciences , vol. 4, no. 2, p. 96, 2008. [8] F . Zhou, S. Malher, and H. T oi vonen, “Network simpliﬁcation with minimal loss of connectivity , ” in Data Mining (ICDM), 2010 IEEE 10th International Conference on . IEEE, 2010, pp. 659–668.

Large Collection of Diverse Gene Set Search Queries Recapitulate Known Protein-Protein Interactions and Gene-Gene Functional Associations

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment