A study of pre-validation

Reading time: 6 minutes

📝 Original Info

  • Title: A study of pre-validation
  • ArXiv ID: 0807.4105
  • Date: 2008-07-28
  • Authors: Holger Höfling and Robert Tibshirani (Stanford University)

📝 Abstract

Given a predictor of outcome derived from a high-dimensional dataset, pre-validation is a useful technique for comparing it to competing predictors on the same dataset. For microarray data, it allows one to compare a newly derived predictor for disease outcome to standard clinical predictors on the same dataset. We study pre-validation analytically to determine if the inferences drawn from it are valid. We show that while pre-validation generally works well, the straightforward "one degree of freedom" analytical test from pre-validation can be biased and we propose a permutation test to remedy this problem. In simulation studies, we show that the permutation test has the nominal level and achieves roughly the same power as the analytical test.


📄 Full Content

arXiv:0807.4105v1 [stat.AP] 25 Jul 2008
The Annals of Applied Statistics 2008, Vol. 2, No. 2, 643–664. DOI: 10.1214/07-AOAS152
© Institute of Mathematical Statistics, 2008

A STUDY OF PRE-VALIDATION

By Holger Höfling and Robert Tibshirani, Stanford University

1. Introduction. Suppose that we have a prediction rule derived on a high-dimensional dataset. It is often of interest to compare the new prediction rule to competing rules in order to determine if the new rule provides any additional benefit. For example, the new prediction rule might be based on microarray expression values, while the competing predictors are clinical, nongenomic measurements. Doing the comparison between the new and competing rules on the same dataset (the "re-use" method) would favor the new rule, as it was derived on this same dataset. Another approach would be to split the data into separate training and test datasets, build the predictor on the training set, and then fit it along with competing predictors on the test set [see Chang et al. (2005) for an example]. However, with limited data, this may severely reduce the accuracy of the new prediction rule and/or the test set may be too small to have adequate power for the comparison.
Pre-validation (PV) [see Tibshirani and Efron (2002)] offers another approach to the problem of comparing a newly derived prediction rule to other predictors on the same dataset the new rule was derived on. Pre-validation is similar to cross-validation, but instead of directly estimating the prediction error, it constructs a "fairer" version of the predictions on the data. It uses a process similar to cross-validation to construct predictions for each sample, using training features for the other observations. Thus, the result of pre-validation is not an estimate of error (as in cross-validation), but rather a set of pre-validated predictions, one for each sample. These predictions do not have the inherent bias associated with the re-use method.

Received July 2007; revised November 2007. H. Höfling was supported by an Albion Walter Hewlett Stanford Graduate Fellowship; R. Tibshirani was supported in part by NSF Grant DMS-99-71405 and NIH Contract N01-HV-28183. Key words and phrases: cross-validation, hypothesis testing, point estimation, inference, microarray.

[Fig. 1. A schematic of the pre-validation process. The cases are divided up into (say) 10 equal-sized groups. Leaving out one of the groups, a prediction rule is derived from the data of the remaining 9 groups. This prediction rule is then applied to the left-out group, giving the pre-validated predictor ỹ for the cases in the left-out group. Repeating this process for every group yields the pre-validated predictor ỹ for all cases. Finally, ỹ is included in a logistic regression model together with the clinical predictors to assess its relative strength in predicting the outcome.]
Before going into more details, we explain how pre-validation works on an example (see also Figure 1). We have microarray data for n patients with breast cancer. On each array, measurements on p genes were taken. Also available are several non-microarray-based predictors, which are commonly used in clinical practice (e.g., age, tumor size, ...) to predict whether the patient's prognosis is poor or good. We want to use the microarray data in order to predict the prognosis of a patient.

In PV, the n patients are divided into K folds. Leaving out one fold, a prediction rule using the microarray data for the remaining K − 1 folds is fit (the internal model). Using this rule, the cancer types for the patients in the left-out fold are predicted. This way, the data of the left-out fold is not used in building the rule and therefore no overfitting occurs. Repeating this procedure for every fold yields a vector of predictions, which we call pre-validated. The predicted response for a given patient derives from that patient's covariates throug
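The K-fold procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the ridge-penalized logistic internal model, the 10-fold split, and the synthetic data standing in for the microarray and clinical measurements are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def prevalidate(X_genes, y, n_folds=10, seed=0):
    """Return pre-validated predictions y_tilde, one per patient.

    Each patient's prediction comes from an internal model fit on the
    other K-1 folds, so no patient's outcome helps predict itself.
    """
    y_tilde = np.empty(len(y))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X_genes):
        # Internal model fit only on the K-1 retained folds
        # (penalized logistic regression is an illustrative choice).
        internal = LogisticRegression(C=0.1, max_iter=1000)
        internal.fit(X_genes[train_idx], y[train_idx])
        # Predict the left-out fold: its data never entered the rule,
        # so the re-use bias is avoided.
        y_tilde[test_idx] = internal.predict_proba(X_genes[test_idx])[:, 1]
    return y_tilde

# Toy data: n = 100 patients, p = 500 genes, 3 clinical covariates.
rng = np.random.default_rng(0)
n, p = 100, 500
X_genes = rng.normal(size=(n, p))
X_clin = rng.normal(size=(n, 3))
y = (X_genes[:, 0] + X_clin[:, 0] + rng.normal(size=n) > 0).astype(int)

y_tilde = prevalidate(X_genes, y)
# External model: the pre-validated predictor competes with the
# clinical covariates on a (nearly) equal footing.
external = LogisticRegression(max_iter=1000)
external.fit(np.column_stack([y_tilde, X_clin]), y)
print(external.coef_.shape)  # (1, 4): one coefficient for y_tilde, three clinical
```

The coefficient (and significance) of `y_tilde` in the external model is what the paper's inference question is about; as the abstract notes, the naive one-degree-of-freedom test on it can be biased.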

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
