Partition Decoupling for Multi-gene Analysis of Gene Expression Profiling Data

📝 Original Info

  • Title: Partition Decoupling for Multi-gene Analysis of Gene Expression Profiling Data
  • ArXiv ID: 1002.3946
  • Date: 2010-02-21
  • Authors: Rosemary Braun, Gregory Leibon, Scott Pauls, Daniel Rockmore

📝 Abstract

We present the extension and application of a new unsupervised statistical learning technique, the Partition Decoupling Method (PDM), to gene expression data. Because it has the ability to reveal non-linear and non-convex geometries present in the data, the PDM is an improvement over typical gene expression analysis algorithms, permitting a multi-gene analysis that can reveal phenotypic differences even when the individual genes do not exhibit differential expression. Here, we apply the PDM to publicly available gene expression data sets, and demonstrate that we are able to identify cell types and treatments with higher accuracy than is obtained through other approaches. By applying it in a pathway-by-pathway fashion, we demonstrate how the PDM may be used to find sets of mechanistically related genes that discriminate phenotypes.

📄 Full Content

arXiv:1002.3946v2 [q-bio.QM] 17 Mar 2011

Since their first use nearly fifteen years ago [1], microarray gene expression profiling experiments have become a ubiquitous tool in the study of disease. The vast number of gene transcripts assayed by modern microarrays (10^5-10^6) has driven forward our understanding of biological processes tremendously, both by elucidating mechanisms at play in specific phenotypes and by revealing previously unknown regulatory mechanisms at play in all cells. However, the high-dimensional data produced in these experiments, often comprising many more variables than samples and subject to noise, presents analytical challenges.

The analysis of gene expression data can be broadly grouped into two categories: the identification of differentially expressed genes (or gene-sets) between two or more known conditions, and the unsupervised identification (clustering) of samples or genes that exhibit similar profiles across the data set. In the former case, each gene is tested individually for association with the phenotype of interest, adjusting at the end for the vast number of genes probed. Pre-identified gene sets, such as those fulfilling a common biological function, may then be tested for an overabundance of differentially expressed genes (e.g., using gene set enrichment analysis [2]); this approach aids biological interpretability and improves the reproducibility of findings between microarray studies. In clustering, the hypothesis that functionally related genes and/or phenotypically similar samples will display correlated gene expression patterns motivates the search for groups of genes or samples with similar expression patterns. The most commonly used algorithms are hierarchical clustering [3], k-means clustering [4,5] and Self Organizing Maps [6]. A brief overview may be found in [7]. Of these, k-means appears to perform the best [7,8,9]. Relatedly, gene shaving [10] searches for clusters of genes showing both high variation across the samples and correlation across the genes, and several biclustering algorithms (such as [11]) search for class-conditional clusters of correlated genes. These methods are simple, visually appealing, and have identified a number of co-regulated genes and phenotype classes.
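The "adjusting at the end" step of single-gene testing can take several forms, and the text does not say which one is meant. As a minimal sketch, assuming the Benjamini-Hochberg false discovery rate procedure (a common choice, not one named above), a list of per-gene p-values could be adjusted as follows. The function name and the example p-values are illustrative assumptions, not values from the paper.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: BH is used here only as one common multiple-testing
    correction; the source text does not specify a procedure.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes (made up for illustration).
gene_pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(gene_pvalues)
print(adjusted)  # approximately [0.02, 0.04, 0.04, 0.02]
```

With tens of thousands of genes the multiplier `m / rank` grows large for the top-ranked genes, which is one reason modest single-gene effects can fail to survive correction.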

While these approaches have been fruitful, they also have the potential to miss causative mechanisms that can be affected by a change in any one of several genes (such that no single alteration reaches significance) as well as mechanisms that require the concerted activity of multiple genes to produce a specific phenotype.

It is well known that complex diseases, such as cancers, exhibit considerable molecular heterogeneity for the above reasons [12]. As a result, individual genes may fail to reach significance, and lists of differentially expressed genes or gene signatures may have poor concordance across studies. Additionally, pathway analyses that rely on single-gene association statistics (such as GSEA [2]) may fail to identify causative mechanisms. For the same reasons, clustering algorithms that rely on linearly-separable clusters (and hence upon differential expression between the clusters) may fail to partition the samples in a manner that reflects the true underlying biology.
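The limitation of linearly separable clusters can be seen on a toy example. The sketch below is not the PDM and does not use gene expression data: it builds two synthetic concentric rings (an assumed stand-in for a non-convex geometry), runs a plain Lloyd's-algorithm k-means with k = 2, and compares the result with the true ring membership. Because two centroids always split the plane along a straight line, the raw coordinates cannot recover the rings, whereas a hand-picked non-linear feature (distance from the origin) can. The helper names, ring radii, and the deterministic initialisation are all illustrative choices.

```python
import math


def kmeans(points, init, iters=50):
    """Plain Lloyd's algorithm; `init` lists the starting centroids."""
    centroids = [tuple(c) for c in init]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def agreement(labels, truth):
    """Fraction of samples where two 2-way labelings agree, up to label swap."""
    same = sum(a == b for a, b in zip(labels, truth)) / len(truth)
    return max(same, 1 - same)


# Synthetic data: two concentric rings of radius 1 and 3 (toy assumption).
n = 100
angles = [2 * math.pi * i / n for i in range(n)]
inner = [(math.cos(t), math.sin(t)) for t in angles]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in angles]
points = inner + outer
truth = [0] * n + [1] * n

# k-means on raw coordinates; initialised from the first and last points
# (a simplification for reproducibility, not k-means++).
raw_score = agreement(kmeans(points, init=[points[0], points[-1]]), truth)

# k-means on a non-linear feature: distance from the origin.
radii = [(math.hypot(x, y),) for x, y in points]
radial_score = agreement(kmeans(radii, init=[radii[0], radii[-1]]), truth)

print(f"raw coordinates: {raw_score:.2f}, radial feature: {radial_score:.2f}")
```

The radial feature works here only because the geometry is known in advance; the point of the sketch is that a straight-line partition of the raw space cannot separate groups that differ in a non-linear way.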

As an example of how causative genes can be missed in gene-centric analyses, consider a recent study in which gene expression profiles in the Wagyu cattle are compared to those of the double-muscled Piedmontese cattle [13]. The Piedmontese cattle’s muscular hypertrophy is attributable to a nonfunctional mutation of the myostatin gene (MSTN), but because MSTN itself is not differentially expressed between the two bovine models, its biological role cannot be inferred using traditional analyses of gene expression data. On the other hand, [13] showed that the functional MSTN variant was co-expressed with its regulatory target MYL2 in Wagyu cattle, whereas the nonfunctional variant in the Piedmontese cattle did not exhibit co-expression with MYL2. The correct identification of this system, in the absence of differential expression at the gene level in MSTN or MYL2, is crucial to understanding the molecular determinants of the double-muscled phenotype. This example underscores the pressing need for analysis methods that can reveal systems-level differences between cases and controls even when the constituent genes do not exhibit differential expression.
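The distinction between differential expression and differential co-expression can be made concrete with a small sketch. The numbers below are invented for illustration and are not taken from [13]: two groups have identical mean expression for both genes, so a per-gene comparison of means sees no difference, yet the Pearson correlation between the two genes is strong in one group and nearly absent in the other. The group and variable names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy expression values (made up): a regulator gene and its target gene
# in a group with a functional regulator and a group with a nonfunctional one.
functional_regulator = [1, 2, 3, 4, 5, 6]
functional_target = [2, 4, 6, 8, 10, 12]       # tracks the regulator

nonfunctional_regulator = [1, 2, 3, 4, 5, 6]
nonfunctional_target = [8, 2, 12, 4, 10, 6]    # same values, reordered

# Per-gene means are identical across groups: no differential expression.
same_regulator_mean = mean(functional_regulator) == mean(nonfunctional_regulator)
same_target_mean = mean(functional_target) == mean(nonfunctional_target)

# The gene-gene relationship differs sharply between the groups.
r_functional = pearson(functional_regulator, functional_target)
r_nonfunctional = pearson(nonfunctional_regulator, nonfunctional_target)

print(same_regulator_mean, same_target_mean)
print(round(r_functional, 3), round(r_nonfunctional, 3))
```

A method that examines only one gene at a time would treat these two groups as indistinguishable; the difference is visible only in the joint behaviour of the pair.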

As an alternative approach, we propose here an analysis technique that is designed to reveal relationships between samples based on multi-gene expression profiles without requiring that the genes be differentially expressed (i.e., without requiring the samples to be linearly separable in the gene-expression space), and that has the power to reveal relationships between samples at various scales, permitting the identification of phenotypic subtypes. Our approach adapts a new unsupervised machine-learning technique, the Partition Decoupling Method (PDM) [14,

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
