Graphical-model Based Multiple Testing under Dependence, with Applications to Genome-wide Association Studies
Large-scale multiple testing tasks often exhibit dependence, and exploiting the dependence among individual tests remains a challenging and important problem in statistics. Recent advances in graphical models make it feasible to use them for multiple testing under dependence. We propose a multiple testing procedure based on a Markov-random-field-coupled mixture model. The ground truth of the hypotheses is represented by a latent binary Markov random field, and the observed test statistics appear as the coupled mixture variables. The parameters of our model can be learned automatically by a novel EM algorithm. We use an MCMC algorithm to infer the posterior probability that each hypothesis is null (termed the local index of significance), and the false discovery rate can be controlled accordingly. Simulations show that our procedure can substantially improve the numerical performance of multiple testing. We apply the procedure to a real-world genome-wide association study on breast cancer and identify several SNPs with strong association evidence.
💡 Research Summary
Large‑scale multiple testing problems, such as those encountered in genome‑wide association studies (GWAS), often involve strong dependence among test statistics. Traditional procedures, notably the Benjamini–Hochberg (BH) method, assume independence (or weak positive dependence) and therefore can suffer from reduced power when the underlying dependence structure is ignored. In this paper the authors introduce a novel statistical framework that explicitly models dependence using a Markov random field (MRF) coupled with a mixture distribution for the observed test statistics.
The latent binary field $H = (h_1,\dots,h_m)$ encodes the true status of each hypothesis ($h_i = 0$: null, $h_i = 1$: alternative). The field is defined on a graph $G=(V,E)$ that reflects prior knowledge about relationships among hypotheses (e.g., linkage disequilibrium among SNPs). An edge-potential parameter $\beta$ encourages neighboring nodes to share the same label, thereby capturing the clustering of true signals that is typical in genomic data. Conditional on the latent label, each observed test statistic $z_i$ follows a two-component Gaussian mixture: $z_i\mid h_i=0 \sim \mathcal N(\mu_0,\sigma_0^2)$ and $z_i\mid h_i=1 \sim \mathcal N(\mu_1,\sigma_1^2)$ with $\mu_1>\mu_0$.
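The per-site full conditional that drives inference in such a model can be sketched as follows. This is a minimal illustration assuming an Ising-style pairwise potential $\exp(\beta\,\mathbf{1}\{h_i=h_j\})$ with Gaussian emissions; the function name and exact parameterization are my own, not necessarily the paper's.

```python
import math


def norm_pdf(z, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def full_conditional_p1(z_i, nbr_labels, beta, mu0, s0, mu1, s1):
    """P(h_i = 1 | neighbors, z_i) under an Ising-style prior (a sketch).

    Each configuration is weighted by its emission density times
    exp(beta * #{neighbors sharing the same label}).
    """
    agree0 = sum(1 for h in nbr_labels if h == 0)
    agree1 = sum(1 for h in nbr_labels if h == 1)
    w0 = norm_pdf(z_i, mu0, s0) * math.exp(beta * agree0)
    w1 = norm_pdf(z_i, mu1, s1) * math.exp(beta * agree1)
    return w1 / (w0 + w1)
```

With a large $\beta$ and mostly-alternative neighbors, the conditional probability of $h_i = 1$ is pulled up even for moderate $z_i$, which is exactly how the field borrows strength across the graph.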
Parameter estimation proceeds via a customized Expectation-Maximization (EM) algorithm. In the E-step, the posterior distribution $p(H\mid Z,\Theta^{\text{old}})$ is approximated by Gibbs sampling, which iteratively updates each label given its neighbors and the observed statistic. The M-step updates the mixture parameters $(\mu_0,\sigma_0^2,\mu_1,\sigma_1^2)$ and the interaction strength $\beta$ using the sufficient statistics collected from the sampled configurations. This iterative scheme continues until convergence, yielding maximum-likelihood estimates $\hat\Theta$.
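Given Gibbs samples of the label field from the E-step, the Gaussian-mixture part of the M-step reduces to responsibility-weighted moment updates. The sketch below illustrates only those updates (the function name is my own, and the $\beta$ update, which involves the MRF normalizing constant, is omitted); it assumes both components appear in the samples.

```python
def m_step_gaussians(z, samples):
    """Update (mu0, var0, mu1, var1) from Gibbs samples of the labels.

    `z` is the list of observed statistics; `samples` is a list of 0/1 label
    vectors drawn in the E-step. Averaging the samples gives
    gamma_i ~= P(h_i = 1 | Z), the per-hypothesis responsibility.
    A sketch of the mixture updates only, not the full M-step.
    """
    m = len(z)
    gamma = [sum(s[i] for s in samples) / len(samples) for i in range(m)]
    n1 = sum(gamma)
    n0 = m - n1
    mu1 = sum(g * zi for g, zi in zip(gamma, z)) / n1
    mu0 = sum((1 - g) * zi for g, zi in zip(gamma, z)) / n0
    var1 = sum(g * (zi - mu1) ** 2 for g, zi in zip(gamma, z)) / n1
    var0 = sum((1 - g) * (zi - mu0) ** 2 for g, zi in zip(gamma, z)) / n0
    return mu0, var0, mu1, var1
```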
After learning $\hat\Theta$, the posterior probability that hypothesis $i$ is null, $q_i = P(h_i=0\mid Z,\hat\Theta)$, is computed from the MCMC samples. The authors term $q_i$ the "local index of significance" (LIS). By sorting the LIS values and selecting a threshold $\tau$ such that the average of the selected $q_i$'s does not exceed a pre-specified level $\alpha$, a rejection set $\mathcal R = \{i: q_i \le \tau\}$ is formed. This procedure guarantees control of the false discovery rate (FDR) at level $\alpha$ while directly exploiting the dependence encoded in the MRF.
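The thresholding rule above can be sketched directly: sort the posterior null probabilities in ascending order and take the largest prefix whose running mean stays below $\alpha$ (the function name is mine; this is a generic implementation of the described step-up rule, not code from the paper).

```python
def lis_rejections(q, alpha):
    """Reject hypotheses by the LIS step-up rule.

    Sorts the posterior null probabilities q ascending and keeps the
    largest k for which the mean of the k smallest values is <= alpha
    (an estimate of the FDR among the rejected set).
    Returns the set of rejected original indices.
    """
    order = sorted(range(len(q)), key=lambda i: q[i])
    total, k = 0.0, 0
    for j, i in enumerate(order, start=1):
        total += q[i]
        if total / j <= alpha:
            k = j  # prefix of size j still satisfies the FDR bound
    return set(order[:k])
```

The running average of the selected $q_i$'s estimates the FDR of the rejection set, which is why stopping at the last prefix satisfying the bound controls FDR at level $\alpha$.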
The methodology is evaluated through extensive simulations. The authors generate synthetic data on three types of graphs—regular lattices, small-world networks, and Erdős–Rényi random graphs—varying the signal proportion (5–20%), the signal strength $\mu_1-\mu_0$, and the interaction strength $\beta$. Across 1,000 replicates per setting, the MRF-based approach consistently maintains the target FDR and achieves substantially higher power than BH, especially when true signals form spatial or network clusters. When $\beta$ is large (strong dependence), power gains of 15–30% are observed; as $\beta$ approaches zero, the two methods converge, confirming that the proposed method reduces to the classical independent case.
The authors then apply the procedure to a real breast-cancer GWAS comprising roughly 500,000 single-nucleotide polymorphisms (SNPs) from 2,000 cases and controls. The standard single-test + BH pipeline (α = 0.05) yields 12 genome-wide significant SNPs. The MRF-based analysis identifies the same 12 SNPs plus an additional 7 loci. Four of these newly discovered SNPs have been reported in independent studies as associated with breast-cancer risk, suggesting that the method uncovers biologically meaningful signals missed by independence-based methods. The additional discoveries tend to reside in LD blocks where neighboring SNPs exhibit moderate association, illustrating the advantage of borrowing strength across correlated markers.
Despite its strengths, the approach has limitations. The Gibbs sampler required for the E‑step can be computationally intensive for hundreds of thousands of hypotheses, leading to long runtimes. The authors discuss possible accelerations, such as parallel tempering, variational approximations, or stochastic EM variants. Moreover, the graph structure must be supplied a priori; misspecification may degrade performance. Future work could incorporate data‑driven graph learning (e.g., graphical lasso, neural‑network‑based adjacency estimation) to make the method more robust to unknown dependence patterns.
In conclusion, this paper presents a principled, graph‑theoretic extension of multiple testing that integrates a latent MRF with a mixture model, learns all parameters automatically via EM, and controls FDR through posterior null probabilities. Simulation studies and a GWAS application demonstrate that accounting for dependence can markedly improve detection power while preserving error control. The framework opens avenues for more sophisticated dependence modeling in high‑dimensional inference, provided that computational scalability and graph specification challenges are addressed in subsequent research.