Testing significance of features by lassoed principal components

Reading time: 6 minutes

📝 Original Info

  • Title: Testing significance of features by lassoed principal components
  • ArXiv ID: 0811.1700
  • Date: 2008-11-12
  • Authors: Daniela M. Witten (Stanford University), Robert Tibshirani (Stanford University)

📝 Abstract

We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample $t$-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an $L_1$ penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
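The baseline the abstract mentions, a two-sample t-statistic computed independently for each gene, can be sketched in a few lines. This is a minimal illustration under our own assumptions, not code from the paper: the function name `two_sample_t` and the choice of a pooled-variance (equal-variance) t-statistic are ours.

```python
import numpy as np

def two_sample_t(X, y):
    """Per-gene two-sample t-statistics (pooled-variance form).

    X : (n_samples, n_genes) expression matrix
    y : (n_samples,) binary class labels (0/1)
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled variance estimate, one per gene
    s2 = (((X0 - m0) ** 2).sum(axis=0) + ((X1 - m1) ** 2).sum(axis=0)) / (n0 + n1 - 2)
    se = np.sqrt(s2 * (1.0 / n0 + 1.0 / n1))
    return (m1 - m0) / se

# simulated example: 40 samples, 100 genes, gene 0 truly differential
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 100))
X[y == 1, 0] += 3.0
t = two_sample_t(X, y)
```

With this setup the truly differential gene stands out with a large |t|, while the 99 null genes stay near zero; LPC's point is that such per-gene scores are noisy and can be de-noised jointly.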


📄 Full Content

arXiv:0811.1700v1 [stat.AP] 11 Nov 2008. The Annals of Applied Statistics 2008, Vol. 2, No. 3, 986–1012. DOI: 10.1214/08-AOAS182. © Institute of Mathematical Statistics, 2008.

Testing Significance of Features by Lassoed Principal Components

By Daniela M. Witten and Robert Tibshirani, Stanford University

We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.

1. Introduction. In recent years, new experimental technologies within the field of biology have led to data sets in which the number of features p greatly exceeds the number of observations n. Two such examples are gene expression data and data from genome-wide association studies.
In the case of gene expression (or microarray) data, it is often of interest to identify genes that are differentially-expressed across conditions (for instance, some patients may have cancer and others may not), or that are associated with some type of outcome, such as survival time. Such genes might be used as features in a model for prediction or classification of the outcome in a new patient, or they might be used as target genes for further experiments in order to better understand the biological processes that contribute to the outcome.

[Received February 2008; revised May 2008. Witten supported by a National Defense Science and Engineering Graduate Fellowship; Tibshirani supported in part by NSF Grant DMS-99-71405 and National Institutes of Health Contract N01-HV-28183. Key words and phrases: microarray, gene expression, multiple testing, feature selection.]

Over the years, a number of methods have been developed for the identification of differentially-expressed genes in a microarray experiment; for a review, see Cui and Churchill (2003) or Allison et al. (2006). Usually, the association between a given gene and the outcome is measured using a statistic that is a function of that gene only. Genes for which the statistic exceeds a given value are considered to be differentially-expressed. Many methods combine information, or borrow strength, across genes in order to make a more informed assessment of the significance of a given gene. In the case of a two-class outcome, such methods include the shrunken variance estimates of Cui et al.
(2005), the empirical Bayes approach of Lonnstedt and Speed (2002), the limma procedure of Smyth (2004) and the optimal discovery procedure (ODP) of Storey, Dai and Leek (2007). We elaborate on the latter two procedures, as we will compare them to our method throughout the paper in the case of a two-class outcome. Limma assesses differential expression between conditions by forming a moderated t-statistic in which posterior residual standard deviations are used instead of the usual standard deviation. The ODP approach is quite different; it involves a generalization of the likelihood ratio statistic such that an individual gene's significance is computed as a function of all of the genes in the data set. In the case of a survival outcome, a standard method to assess a gene's significance (and the one to which we will compare our proposed method in this paper) is the Cox score; see, for example, Beer et al. (2002) and Bair and Tibshirani (2004).

We propose a new method, called Lassoed Principal Components (LPC), for the identification of differentially-expressed genes. The LPC method is as follows. First, we compute scores for each gene using an existing method, such as those mentioned above. These scores are then regressed onto the eigenarrays of the data [Alter, Brown and Botstein (2000)], principal components of length equal to the numb

…(Full text truncated)…
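The two LPC steps described above (regress the per-gene scores onto the eigenarrays, then shrink with an L1 penalty) can be sketched as follows. This is our own illustration, not the authors' code: the name `lpc_scores` and the hand-picked penalty `lam` are assumptions, and we exploit the fact that for an orthonormal design (the eigenarrays) the lasso solution reduces to soft-thresholding of the projections; the paper chooses the penalty data-adaptively, which is omitted here.

```python
import numpy as np

def lpc_scores(X, t, lam):
    """Sketch of Lassoed Principal Components (LPC).

    X   : (n_samples, n_genes) expression matrix
    t   : (n_genes,) conventional per-gene scores (e.g. t-statistics)
    lam : L1 penalty level (fixed here by hand for illustration)
    """
    Xc = X - X.mean(axis=0)
    # eigenarrays: right singular vectors of the centered data,
    # each a vector of length n_genes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    beta = Vt @ t                 # projections of scores onto eigenarrays
    # lasso with orthonormal regressors = soft-thresholding of coefficients
    beta = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
    return Vt.T @ beta            # de-noised LPC gene scores

# toy usage: 20 samples, 50 genes, arbitrary scores
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))
t = rng.normal(size=50)
out = lpc_scores(X, t, lam=0.5)
```

Shrinking the projections can only reduce the overall magnitude of the score vector, which is what de-noising means here: components of the score vector lying along minor, noisy eigenarrays are driven to zero.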

Reference

This content is AI-processed based on ArXiv data.
