We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted $t$-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James--Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package ``sda'' available from the R repository CRAN.
Feature selection in omics prediction problems using cat scores and false nondiscovery rate control
1. Introduction. Class prediction of biological samples based on their genetic or proteomic profile is now a routine task in genomic studies. Accordingly, many classification methods have been developed to address the specific statistical challenges presented by these data; see, for example, Schwender, Ickstadt and Rahnenführer (2008) and Slawski, Daumer and Boulesteix (2008) for recent reviews. In particular, the small sample size $n$ makes training the classifier difficult, and the large number of variables $p$ makes it hard to select suitable features for prediction.
Perhaps surprisingly, despite the many recent innovations in the field of classification methodology, including the introduction of sophisticated algorithms for support vector machines and the proposal of ensemble methods such as random forests, the conceptually simple approach of linear discriminant analysis (LDA) and its sibling, diagonal discriminant analysis (DDA), remain among the most effective procedures, even in the domain of high-dimensional prediction [Efron (2008a); Hand (2006); Efron (1975)].
In order to be applicable to high-dimensional analysis, it was recognized early on that regularization is essential [Friedman (1989)]. Specifically, when training the classifier, that is, when estimating the parameters of the discriminant function from training data, particular care needs to be taken to accurately infer the (inverse) covariance matrix. A rather radical, yet highly effective way to regularize covariance estimation in high dimensions is to set all correlations equal to zero [Bickel and Levina (2004)]. Employing a diagonal covariance matrix reduces LDA to the special case of diagonal discriminant analysis (DDA), also known in the machine learning community as "naive Bayes" classification.
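The effect of the diagonal restriction can be seen directly in the discriminant function: with all correlations set to zero, the Mahalanobis term reduces to a variance-weighted Euclidean distance. The following sketch (illustrative only; the function name and argument layout are assumptions, not taken from the paper or the sda package) computes such diagonal discriminant scores:

```python
import numpy as np

def dda_discriminant(x, means, pooled_var, priors):
    """Diagonal discriminant ("naive Bayes") scores for a single sample x.

    means: (K, p) class centroids, pooled_var: (p,) pooled per-feature
    variances, priors: (K,) class prior probabilities.  The predicted
    class is the one with the largest score.  Because the covariance is
    diagonal, the Mahalanobis term is simply a variance-weighted
    Euclidean distance.
    """
    scores = []
    for k in range(len(priors)):
        d2 = np.sum((x - means[k]) ** 2 / pooled_var)  # diagonal Mahalanobis distance
        scores.append(-0.5 * d2 + np.log(priors[k]))
    return np.array(scores)
```

Full LDA would replace the diagonal term with $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$, which requires estimating and inverting the full $p \times p$ covariance matrix.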
In addition to facilitating high-dimensional estimation of the prediction function, DDA has one further key advantage: it is straightforward to conduct feature selection. In the DDA setting with two classes ($K = 2$), it can be shown that the optimal criterion for ranking features by their relevance for prediction is the $t$-score between the two group means [e.g., Fan and Fan (2008)]; in the multiclass setting, it is the $t$-score between each group mean and the overall centroid.
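For the two-class case, the per-feature ranking criterion is the ordinary two-sample $t$-score with pooled within-class variance. A minimal sketch (the function name and notation are assumptions for illustration, not from the paper):

```python
import numpy as np

def two_sample_tscores(X, y):
    """Per-feature two-sample t-scores for a binary class label.

    X: (n, p) data matrix, y: length-n array of 0/1 labels.
    Uses the pooled within-class standard deviation, as in standard
    two-class DDA feature ranking.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled within-class variance per feature
    s2 = (((X0 - m0) ** 2).sum(axis=0) + ((X1 - m1) ** 2).sum(axis=0)) / (n0 + n1 - 2)
    return (m1 - m0) / np.sqrt(s2 * (1.0 / n0 + 1.0 / n1))
```

Features are then ranked by the absolute value of their $t$-score; a discriminative feature with well-separated group means receives a large score, while a noise feature scores near zero.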
The nearest shrunken centroids (NSC) algorithm [Tibshirani et al. (2002, 2003)], commonly known by the name of "PAM" after its software implementation, is a regularized version of DDA with multiclass feature selection. The fact that PAM has established itself as one of the most popular methods for classification of gene expression data is ample proof that DDA-type procedures are indeed very effective for large-scale prediction problems; see also Bickel and Levina (2004) and Efron (2008a).
However, there are now many omics data sets where correlation among predictors is an essential feature of the data and hence cannot easily be ignored. For example, this includes proteomics, imaging and metabolomics data, where correlation among biomarkers is commonplace and is induced, for example, by spatial dependencies or by chemical similarities. Furthermore, in many transcriptome measurements there are correlations among genes within a functional group or pathway [Ackermann and Strimmer (2009)].
Consequently, there have been several suggestions to generalize PAM to account for correlation. This includes the SCRDA [Guo, Hastie and Tibshirani (2007)], Clanc [Dabney and Storey (2007)] and MLDA [Xu, Brock and Parrish (2009)] approaches. All these methods are regularized versions of LDA, and hence automatically account for gene-wise correlations. However, in contrast to PAM and DDA, they lack an efficient and elegant feature selection scheme, due to problems with multiple optima in the choice of regularization parameters (SCRDA) and the large search space for optimal feature subsets (Clanc).
In this paper we present a framework for efficient high-dimensional LDA analysis. This is based on three cornerstones. First, we employ James--Stein shrinkage rules for training the classifier. All regularization parameters are estimated from the data in an analytic fashion without resorting to computationally expensive resampling. Second, we use correlation-adjusted $t$-scores (cat scores) for feature selection. These scores emerge from a restructured version of the LDA equations and enable simple and effective ranking of genes even in the presence of correlation. Third, we employ false nondiscovery rate thresholding for selecting features for inclusion in the prediction rule. As we will show below, this is a highly effective method with similar performance to the recently proposed "higher criticism" approach.
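The core idea behind cat scores, as described in the abstract, is to decorrelate the vector of ordinary $t$-scores by premultiplying with the inverse matrix square root of the predictor correlation matrix. A minimal sketch of this transformation, assuming a plain eigendecomposition of a given correlation matrix (the actual procedure uses a James--Stein shrinkage estimate of the correlations; the function name here is an illustrative assumption):

```python
import numpy as np

def cat_scores(t, R):
    """Correlation-adjusted t-scores: cat = R^{-1/2} t.

    t: (p,) vector of ordinary t-scores, R: (p, p) correlation matrix
    of the predictors.  R^{-1/2} is computed via the eigendecomposition
    of the symmetric positive-definite matrix R.  In the actual sda
    implementation, R is a regularized shrinkage estimate; this sketch
    takes R as given.
    """
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return R_inv_sqrt @ t
```

When the predictors are uncorrelated ($R = I$), the cat scores reduce to the ordinary $t$-scores, so DDA-style ranking is recovered as a special case.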
The remainder of the paper is organized as follows. In Sections 2--5 we detail our framework for shrinkage discriminant analysis and variable selection. Subsequently, we demonstrate the effectiveness of our approach by application to a number of high-dimensional genomic data
…(Full text truncated)…