Testing significance of features by lassoed principal components

Reading time: 6 minutes

📝 Original Info

  • Title: Testing significance of features by lassoed principal components
  • ArXiv ID: 0811.1700
  • Date: 2008-11-12
  • Authors: Daniela M. Witten (Stanford University), Robert Tibshirani (Stanford University)

📝 Abstract

We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample $t$-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an $L_1$ penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
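The baseline the abstract mentions, a two-sample t-statistic computed independently for each gene, can be sketched in a few lines. This is a minimal illustration under our own assumptions, not code from the paper: the function name `two_sample_t` and the choice of a pooled-variance (equal-variance) t-statistic are ours.

```python
import numpy as np

def two_sample_t(X, y):
    """Per-gene two-sample t-statistics (pooled-variance form).

    X : (n_samples, n_genes) expression matrix
    y : (n_samples,) binary class labels (0/1)
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled variance estimate, one per gene
    s2 = (((X0 - m0) ** 2).sum(axis=0) + ((X1 - m1) ** 2).sum(axis=0)) / (n0 + n1 - 2)
    se = np.sqrt(s2 * (1.0 / n0 + 1.0 / n1))
    return (m1 - m0) / se

# simulated example: 40 samples, 100 genes, gene 0 truly differential
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 100))
X[y == 1, 0] += 3.0
t = two_sample_t(X, y)
```

With this setup the truly differential gene stands out with a large |t|, while the 99 null genes stay near zero; LPC's point is that such per-gene scores are noisy and can be de-noised jointly.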


📄 Full Content

arXiv:0811.1700v1 [stat.AP] 11 Nov 2008. The Annals of Applied Statistics 2008, Vol. 2, No. 3, 986–1012. DOI: 10.1214/08-AOAS182. © Institute of Mathematical Statistics, 2008.

Testing Significance of Features by Lassoed Principal Components

By Daniela M. Witten and Robert Tibshirani, Stanford University

We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.

1. Introduction. In recent years, new experimental technologies within the field of biology have led to data sets in which the number of features p greatly exceeds the number of observations n. Two such examples are gene expression data and data from genome-wide association studies.
In the case of gene expression (or microarray) data, it is often of interest to identify genes that are differentially-expressed across conditions (for instance, some patients may have cancer and others may not), or that are associated with some type of outcome, such as survival time. Such genes might be used as features in a model for prediction or classification of the outcome in a new patient, or they might be used as target genes for further experiments in order to better understand the biological processes that contribute to the outcome.

[Received February 2008; revised May 2008. Witten supported by a National Defense Science and Engineering Graduate Fellowship; Tibshirani supported in part by NSF Grant DMS-99-71405 and National Institutes of Health Contract N01-HV-28183. Key words and phrases: microarray, gene expression, multiple testing, feature selection.]

Over the years, a number of methods have been developed for the identification of differentially-expressed genes in a microarray experiment; for a review, see Cui and Churchill (2003) or Allison et al. (2006). Usually, the association between a given gene and the outcome is measured using a statistic that is a function of that gene only. Genes for which the statistic exceeds a given value are considered to be differentially-expressed. Many methods combine information, or borrow strength, across genes in order to make a more informed assessment of the significance of a given gene. In the case of a two-class outcome, such methods include the shrunken variance estimates of Cui et al.
(2005), the empirical Bayes approach of Lonnstedt and Speed (2002), the limma procedure of Smyth (2004) and the optimal discovery procedure (ODP) of Storey, Dai and Leek (2007). We elaborate on the latter two procedures, as we will compare them to our method throughout the paper in the case of a two-class outcome. Limma assesses differential expression between conditions by forming a moderated t-statistic in which posterior residual standard deviations are used instead of the usual standard deviation. The ODP approach is quite different; it involves a generalization of the likelihood ratio statistic such that an individual gene's significance is computed as a function of all of the genes in the data set. In the case of a survival outcome, a standard method to assess a gene's significance (and the one to which we will compare our proposed method in this paper) is the Cox score; see, for example, Beer et al. (2002) and Bair and Tibshirani (2004).

We propose a new method, called Lassoed Principal Components (LPC), for the identification of differentially-expressed genes. The LPC method is as follows. First, we compute scores for each gene using an existing method, such as those mentioned above. These scores are then regressed onto the eigenarrays of the data [Alter, Brown and Botstein (2000)], principal components of length equal to the numb

…(Full text truncated)…
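The two LPC steps described above (regress the per-gene scores onto the eigenarrays, then shrink with an L1 penalty) can be sketched as follows. This is our own illustration, not the authors' code: the name `lpc_scores` and the hand-picked penalty `lam` are assumptions, and we exploit the fact that for an orthonormal design (the eigenarrays) the lasso solution reduces to soft-thresholding of the projections; the paper chooses the penalty data-adaptively, which is omitted here.

```python
import numpy as np

def lpc_scores(X, t, lam):
    """Sketch of Lassoed Principal Components (LPC).

    X   : (n_samples, n_genes) expression matrix
    t   : (n_genes,) conventional per-gene scores (e.g. t-statistics)
    lam : L1 penalty level (fixed here by hand for illustration)
    """
    Xc = X - X.mean(axis=0)
    # eigenarrays: right singular vectors of the centered data,
    # each a vector of length n_genes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    beta = Vt @ t                 # projections of scores onto eigenarrays
    # lasso with orthonormal regressors = soft-thresholding of coefficients
    beta = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
    return Vt.T @ beta            # de-noised LPC gene scores

# toy usage: 20 samples, 50 genes, arbitrary scores
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))
t = rng.normal(size=50)
out = lpc_scores(X, t, lam=0.5)
```

Shrinking the projections can only reduce the overall magnitude of the score vector, which is what de-noising means here: components of the score vector lying along minor, noisy eigenarrays are driven to zero.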

Reference

This content is AI-processed based on ArXiv data.
