arXiv:0807.4105v1 [stat.AP] 25 Jul 2008
The Annals of Applied Statistics
2008, Vol. 2, No. 2, 643–664
DOI: 10.1214/07-AOAS152
© Institute of Mathematical Statistics, 2008
A STUDY OF PRE-VALIDATION
By Holger Höfling1 and Robert Tibshirani2
Stanford University
Given a predictor of outcome derived from a high-dimensional dataset, pre-validation is a useful technique for comparing it to competing predictors on the same dataset. For microarray data, it allows one to compare a newly derived predictor for disease outcome to standard clinical predictors on the same dataset. We study pre-validation analytically to determine if the inferences drawn from it are valid. We show that while pre-validation generally works well, the straightforward “one degree of freedom” analytical test from pre-validation can be biased, and we propose a permutation test to remedy this problem. In simulation studies, we show that the permutation test has the nominal level and achieves roughly the same power as the analytical test.
1. Introduction. Suppose that we have a prediction rule derived on a high-dimensional dataset. It is often of interest to compare the new prediction rule to competing rules in order to determine if the new rule provides any additional benefit. For example, the new prediction rule might be based on microarray expression values, while the competing predictors are clinical, nongenomic measurements. Doing the comparison between the new and competing rules on the same dataset (the “re-use” method) would favor the new rule, as it was derived on this same dataset. Another approach would be to split the data into separate training and test datasets, build the predictor on the training set and then fit it along with competing predictors on the test set [see Chang et al. (2005) for an example]. However, with limited data, this may severely reduce the accuracy of the new prediction rule and/or the test set may be too small to have adequate power for the comparison.
Received July 2007; revised November 2007.

1 Supported by an Albion Walter Hewlett Stanford Graduate Fellowship.
2 Supported in part by NSF Grant DMS-99-71405 and NIH Contract N01-HV-28183.

Key words and phrases. Cross-validation, hypothesis testing, point estimation, inference, microarray.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2008, Vol. 2, No. 2, 643–664. This reprint differs from the original in pagination and typographic detail.

Pre-validation (PV) [see Tibshirani and Efron (2002)] offers another approach to the problem of comparing a newly derived prediction rule to other predictors on the same dataset the new rule was derived on. Pre-validation is similar to cross-validation, but instead of directly estimating the prediction error, it constructs a “fairer” version of the predictions on the data. Using a process similar to cross-validation, it constructs the prediction for each sample from a rule trained on the other observations. Thus, the result of pre-validation is not an estimate of error (as in cross-validation), but rather a set of pre-validated predictions, one for each sample. These predictions do not have the inherent bias associated with the re-use method. Before going into more details, we explain how pre-validation works on an example (see also Figure 1).

Fig. 1. A schematic of the pre-validation process. The cases are divided up into (say) 10 equal-sized groups. Leaving out one of the groups, a prediction rule is derived from the data of the remaining 9 groups. This prediction rule is then applied to the left-out group, giving the pre-validated predictor ỹ for the cases in that group. Repeating this process for every group yields the pre-validated predictor ỹ for all cases. Finally, ỹ is included in a logistic regression model together with the clinical predictors to assess its relative strength in predicting the outcome.
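The fold-by-fold construction in the schematic can be sketched in code. This is a minimal sketch, assuming a toy ridge-regression internal model on simulated data; the names (`prevalidate`, `ridge_fit`) and all shapes are illustrative assumptions, not the authors' actual internal model:

```python
import numpy as np

def prevalidate(X, y, fit, predict, K=10, seed=0):
    """Pre-validated predictions: each case's prediction comes from a
    rule fit with that case's fold held out, as in cross-validation."""
    n = len(y)
    # Assign each case to one of K roughly equal-sized groups.
    folds = np.random.default_rng(seed).permutation(n) % K
    y_tilde = np.empty(n)
    for k in range(K):
        held_out = folds == k
        rule = fit(X[~held_out], y[~held_out])     # internal model on K-1 folds
        y_tilde[held_out] = predict(rule, X[held_out])
    return y_tilde

# Toy internal model: ridge regression (with p > n, plain least squares is ill-posed).
def ridge_fit(X, y, lam=10.0):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_predict(beta, X):
    return X @ beta

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 200))                 # n = 60 cases, p = 200 genes (toy)
y = (X[:, 0] + rng.standard_normal(60) > 0).astype(float)  # binary outcome
y_tilde = prevalidate(X, y, ridge_fit, ridge_predict)
```

Because each `y_tilde[i]` is produced by a rule that never saw case `i`, the vector can be compared against clinical predictors on the same cases without the re-use bias described above.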
We have microarray data for n patients with breast cancer. On each array, measurements on p genes were taken. Also available are several non-microarray-based predictors, which are commonly used in clinical practice (e.g., age, tumor size, . . . ) to predict if the patient’s prognosis is poor or good. We want to use the microarray data in order to predict the prognosis of a patient. In PV, the n patients are divided into K folds. Leaving out one fold, a prediction rule using the microarray data for the remaining K − 1 folds is fit (the internal model). Using this rule, the outcomes for the patients in the left-out fold are predicted. This way, the data of the left-out fold are not used in building the rule and therefore no overfitting occurs. Repeating this procedure for every fold yields a vector of predictions, which we call pre-validated. The predicted response for a given patient derives from that patient’s covariates through
…(Full text truncated)…