Subspace clustering has gained increasing popularity in the analysis of gene expression data. Among subspace cluster models, the recently introduced order-preserving sub-matrix (OPSM) has demonstrated high promise. An OPSM, essentially a pattern-based subspace cluster, is a subset of rows and columns in a data matrix for which all the rows induce the same linear ordering of columns. Existing OPSM discovery methods do not scale well to increasingly large expression datasets. In particular, twig clusters having few genes and many experiments incur explosive computational costs and are completely pruned off by existing methods. However, it is of particular interest to determine small groups of genes that are tightly coregulated across many conditions. In this paper, we present KiWi, an OPSM subspace clustering algorithm that is scalable to massive datasets, capable of discovering twig clusters and identifying negative as well as positive correlations. We extensively validate KiWi using relevant biological datasets and show that KiWi correctly assigns redundant probes to the same cluster, groups experiments with common clinical annotations, differentiates real promoter sequences from negative control sequences, and shows good association with cis-regulatory motif predictions.
Deep Dive into KiWi: A Scalable Subspace Clustering Algorithm for Gene Expression Analysis.
Numerous studies have used coexpression in large expression datasets to infer functional associations between genes [1], to identify groups of related genes that are important in specific cancers or represent common tumour progression mechanisms [2], to study evolutionary change [3], to integrate with other large-scale datasets [4] [5] [6], and to generate high-quality biological interaction networks [7] [8] [9] [10]. A number of studies have also attempted to use coexpression to identify coregulation, on the hypothesis that if two or more genes are expressed at the same time and location and at similar levels, then they may be regulated by the same transcription factors and regulatory elements. This approach has shown promise, particularly in simpler model organisms such as A. thaliana and S. cerevisiae [11] [12] [13] [14], and many groups are currently working on implementing this idea in mammalian systems.
However, traditional clustering methods have not worked particularly well on large datasets for this problem. Most methods assign each gene to only one cluster, whereas in reality many genes likely take part in multiple processes. In addition, global coexpression is measured across all conditions, whereas it is probable that most genes are tightly coregulated only under certain conditions or in certain locations.
In recent years, a new field of clustering analysis termed subspace clustering (or biclustering) has gained increasing popularity in the analysis of gene expression data and other biological data [15] [16] [17] [18] [19]. In contrast to traditional clustering methods such as hierarchical clustering, subspace clustering methods do not require expression to be correlated across all conditions for genes to be assigned to the same cluster. This has several advantages for data in which biologically relevant subsets exist (e.g. different tissue types) or where a few noisy experiments might significantly bias the results of the clustering algorithm. This also allows assignment of genes to multiple clusters for different subsets of experimental conditions.
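The contrast between global and subspace coexpression can be illustrated with a small sketch. The expression values below are invented for illustration, and Pearson correlation is used only as a familiar similarity measure; it is not necessarily the measure used by any of the cited methods.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression values for two genes over eight conditions.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 4.0, 2.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 5.0, 9.0, 1.0, 7.0]

# Hypothetical subset of conditions (e.g. one tissue type).
subset = [0, 1, 2, 3]

# Correlation across all conditions versus within the subset only.
global_r = pearson(gene_a, gene_b)
subspace_r = pearson([gene_a[i] for i in subset],
                     [gene_b[i] for i in subset])

print(f"global correlation:   {global_r:.3f}")
print(f"subspace correlation: {subspace_r:.3f}")
```

In this toy example the two genes look nearly unrelated across all eight conditions, yet they are perfectly correlated on the first four. A global method would likely keep them apart, while a subspace method could group them on that subset.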
More recently, the order-preserving sub-matrix (OPSM) has been introduced and demonstrated as a biologically meaningful subspace cluster model [15] [20]. An OPSM, essentially a pattern-based subspace cluster, is a subset of rows and columns in a data matrix for which all the rows induce the same linear ordering of columns. In terms of gene expression, an OPSM might represent a group of coregulated genes whose expression levels rise and fall synchronously in response to a series of environmental or cellular stimuli.
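The OPSM definition can be made concrete with a short check. This is a minimal sketch of the definition only, not of KiWi or of any published discovery algorithm. The function names and the example matrix are hypothetical, and the sketch assumes that values within a row are distinct on the selected columns (ties are broken by the order in which columns are listed).

```python
def column_order(row, columns):
    """Return the selected columns sorted by increasing value in this row."""
    return tuple(sorted(columns, key=lambda c: row[c]))

def is_opsm(matrix, rows, columns):
    """True if every selected row induces the same ordering of the selected columns."""
    orders = {column_order(matrix[r], columns) for r in rows}
    return len(orders) <= 1

# Hypothetical expression matrix: 3 genes (rows) x 4 conditions (columns).
matrix = [
    [0.1, 5.0, 2.0, 9.0],     # gene 0
    [3.0, 1.0, 4.0, 2.0],     # gene 1
    [10.0, 70.0, 30.0, 5.0],  # gene 2
]

# Genes 0 and 2 both rank conditions 0 < 2 < 1, despite very different scales.
print(is_opsm(matrix, rows=[0, 2], columns=[0, 1, 2]))
# Adding gene 1 breaks the shared ordering.
print(is_opsm(matrix, rows=[0, 1, 2], columns=[0, 1, 2]))
```

Only the relative order of values matters here, so genes 0 and 2 form an OPSM on conditions 0, 1 and 2 even though their absolute expression levels differ by two orders of magnitude.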
Obi L. Griffith†, Byron J. Gao‡, Mikhail Bilenky†, Yuliya Prychyna†, Martin Ester§ and Steven J.M. Jones†. †BC Cancer Agency, Canada; ‡Texas State University-San Marcos, USA; §Simon Fraser University, Canada. obig@bcgsc.ca, bgao@txstate.edu, mbilenky@bcgsc.ca, y_prychyna@yahoo.ca, ester@cs.sfu.ca, sjones@bcgsc.ca
A recent report reviewed several existing biclustering methods [17]. The authors found that, in general, biclustering methods outperform global methods such as hierarchical clustering. They also showed that OPSM had the highest proportion of clusters with significant enrichment of one or more Gene Ontology (GO) categories and had good correspondence with known pathways according to their analysis of A. thaliana metabolic pathways and S. cerevisiae protein-protein interaction networks. However, they state that there are considerable performance differences between the tested methods.
Performance is a significant factor because the subspace clustering search space (all possible subspaces) is enormous and grows exponentially with the size of the dataset to be analyzed. As costs of expression analysis continue to decrease, the numbers and sizes of expression datasets have grown at an ever-increasing rate. The Gene Expression Omnibus (GEO) currently holds over 70,000 samples for over 100 different organisms [21] and the Stanford Microarray Database (SMD) contains over 10,000 public experiments for over 20 organisms [22]. Furthermore, as array designs continue to improve, it has become possible to include probes for essentially all known genes for many species. The development of exon arrays, alternative splicing arrays, whole-genome tiling arrays, SAGE-type experiments, and high-throughput sequencing technologies increases the size of the problem even further. Thus, with expression data of potentially tens or hundreds of thousands of both rows and columns, we need algorithms that can handle not only ‘large datasets’ but ‘massive datasets’.
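To give a rough sense of how quickly this search space grows, the sketch below counts the candidate sub-matrices of a data matrix under a simplifying assumption: a candidate is any non-empty subset of rows paired with any non-empty subset of columns. The function name is hypothetical, and the count ignores column orderings, so for OPSMs it is an underestimate of the raw space. Practical algorithms prune most of these candidates rather than enumerating them.

```python
def count_submatrices(n_rows, n_cols):
    """Number of ways to pick a non-empty row subset and a non-empty column subset."""
    return (2 ** n_rows - 1) * (2 ** n_cols - 1)

# Matrix sizes chosen for illustration; real expression datasets are far larger.
for n_rows, n_cols in [(5, 5), (10, 10), (20, 20), (40, 20)]:
    total = count_submatrices(n_rows, n_cols)
    print(f"{n_rows:>3} x {n_cols:<3} -> {total:,} candidate sub-matrices")
```

Under this counting, every additional row or column roughly doubles the number of candidates, which is why exhaustive search becomes impractical long before a dataset reaches thousands of genes.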
In our experience, existing subspace clustering methods do not scale well to these larger expression datasets. We have attempted to analyze large datasets with a Gibbs-sampling-based biclustering algorithm called Gene Expression Mining Server (GEMS) [18]; an adaptive quality-based clustering algorithm [16]; an OPSM-based (OP-cluster) algorithm [20]; and Prefi
…(Full text truncated)…