arXiv:0807.4658v1 [stat.AP] 29 Jul 2008
The Annals of Applied Statistics
2008, Vol. 2, No. 2, 714–735
DOI: 10.1214/08-AOAS158
© Institute of Mathematical Statistics, 2008
UNSUPERVISED EMPIRICAL BAYESIAN MULTIPLE TESTING
WITH EXTERNAL COVARIATES
By Egil Ferkingstad,1 Arnoldo Frigessi, Håvard Rue,
Gudmar Thorleifsson and Augustine Kong
University of Oslo and Centre for Integrative Genetics, (sfi)2—Statistics
for Innovation, Norwegian University of Science and Technology,
Decode Genetics and Decode Genetics
In an empirical Bayesian setting, we provide a new multiple testing method, useful when an additional covariate is available, that influences the probability of each null hypothesis being true. We measure the posterior significance of each test conditionally on the covariate and the data, leading to greater power. Using covariate-based prior information in an unsupervised fashion, we produce a list of significant hypotheses which differs in length and order from the list obtained by methods not taking covariate information into account. Covariate-modulated posterior probabilities of each null hypothesis are estimated using a fast approximate algorithm. The new method is applied to expression quantitative trait loci (eQTL) data.
1. Introduction.
Science, industry and business possess the technology to collect, store and distribute huge amounts of data efficiently and often at low cost. Sensors and instrumentation, data logging capacity and communication power have increased the breadth and depth of data. Systems are measured in more detail, giving a more complete but complex picture of processes and phenomena. Also, it is necessary to integrate many sources of data of different type and quality. In high-throughput genomics, large numbers of simultaneous comparisons are necessary to discover differentially expressed genes among thirty thousand measured ones. Similarly, in finance, one wishes to monitor prices of thousands of products and derivatives simultaneously to detect abnormal behavior, or in geophysics or brain imaging, questioning thousands of 3D voxels about their properties. Such tests are
Received March 2007; revised January 2008.
1 Supported by the National program for research in functional genomics in Norway from the Research Council of Norway.
Key words and phrases. Bioinformatics, multiple hypothesis testing, false discovery rates, data integration, empirical Bayes.
often dependent, and the dependency structure is ill specified, so that the effective number of independent tests is unknown. Sometimes, we expect that only a small subset of decisions will have a positive result: the solution is then sparse in the huge parameter space. To discover significant cases, it is necessary to develop new methods that either exploit available a priori knowledge on the structure of the solution, or merge different data sets, each adding information. Benjamini and Hochberg (1995) proposed the false discovery rate (FDR), which can adapt automatically to sparsity and has been shown to be asymptotically optimal in a certain minimax sense [Abramovich et al. (2006)]. FDR adjustments of p-values are nowadays routinely performed in large-scale multiple-testing studies in many sciences and applied areas, from astronomy [Miller et al. (2001)] to genomics [Tusher et al. (2001)]; from neuroimaging [Genovese, Lazar and Nichols (2002)] to industrial organization [Brown et al. (2005)]. Bayesian approaches are based on the estimation of the posterior probability of the null hypothesis. Efron et al. (2001) have developed the theory of the local false discovery rate, based on an estimation procedure originally developed by Anderson and Blair (1982). As the FDR provides a probability of misclassification for sets of tests called significant, the posterior probability that the null hypothesis is true provides a similar measure, but for a local set about the particular value of the test statistic.
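The two-groups view behind the local false discovery rate can be sketched in a few lines. The following is a minimal illustration, not the estimation procedure used in this paper: it takes z-scores, uses the theoretical N(0,1) density as the null component f0, a kernel estimate of the mixture density f, and the conservative choice pi0 = 1, giving fdr(z) = pi0 * f0(z) / f(z).

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def local_fdr(z, pi0=1.0):
    """Crude two-groups local fdr estimate: fdr(z) = pi0 * f0(z) / f(z).

    f0 is the theoretical N(0,1) null density; f is a kernel density
    estimate of the observed mixture of null and non-null z-scores.
    pi0 = 1 is the conservative default for the null proportion.
    """
    z = np.asarray(z, dtype=float)
    f = gaussian_kde(z)(z)   # estimated mixture density at each z
    f0 = norm.pdf(z)         # theoretical null density at each z
    return np.minimum(pi0 * f0 / f, 1.0)
```

On data where most z-scores are null, fdr(z) is close to 1 near zero and small in the extreme tail, matching the interpretation in the text: a local posterior probability of the null about each value of the test statistic.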
Instead of summarizing the data by a test statistic, hierarchical Bayesian approaches have been developed that model the full measured data parametrically [Baldi and Long (2001), Do, Müller and Tang (2005), Kendziorski et al. (2006), Lönnstedt and Speed (2002), Newton et al. (2004)], and Storey (2007) also makes full use of the data in a hypothesis-testing setting. Both approaches have their strengths and weaknesses, in terms of validity of the distributional assumptions under the alternative hypothesis, actual availability of the full data, computational speed and simplicity of the methodology. This paper assumes access to summary test statistics for every hypothesis to be tested.
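For contrast with these posterior quantities, the frequentist FDR adjustment of Benjamini and Hochberg cited above can be sketched as a short step-up procedure on p-values; the function below is an illustrative implementation of the standard threshold rule, not code from this paper.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    while controlling the FDR at level alpha.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest k with p_(k) <= (k/m) * alpha,
    # then reject the k hypotheses with the smallest p-values.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below.max()
        reject[order[: k + 1]] = True
    return reject
```

Because the cutoff adapts to how many small p-values are observed, the procedure automatically rejects more hypotheses when the signal is abundant and fewer when the solution is sparse, which is the adaptivity property referred to above.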
We propose a simple methodology which allows modulating the posterior
probability of
…(Full text truncated)…