Trees Assembling Mann Whitney Approach for Detecting Genome-wide Joint Association among Low Marginal Effect loci
Common complex diseases are likely influenced by the interplay of hundreds, or even thousands, of genetic variants. Converging evidence shows that genetic variants with low marginal effects (LME) play an important role in disease development. Despite their potential significance, discovering LME genetic variants and assessing their joint association in high-dimensional data (e.g., genome-wide association studies) remain a great challenge. To facilitate joint association analysis among a large ensemble of LME genetic variants, we proposed a computationally efficient and powerful approach, which we call Trees Assembling Mann-Whitney (TAMW). Through simulation studies and an empirical data application, we found that TAMW outperformed multifactor dimensionality reduction (MDR) and the likelihood ratio-based Mann-Whitney approach (LRMW) when the underlying complex disease involves multiple LME loci and their interactions. For instance, in a simulation with 20 interacting LME loci, TAMW attained a higher power (power=0.931) than both MDR (power=0.599) and LRMW (power=0.704). In an empirical study of 29 known Crohn’s disease (CD) loci, TAMW also identified a stronger joint association with CD than those detected by MDR and LRMW. Finally, we applied TAMW to the Wellcome Trust CD GWAS to conduct a genome-wide analysis. The analysis of 459K single nucleotide polymorphisms was completed in 40 hours using parallel computing, and revealed a joint association predisposing to CD (p-value=2.763e-19). Further analysis of the newly discovered association suggested that 13 genes, such as ATG16L1 and LACC1, may play an important role in the pathophysiological and etiological processes of CD.
💡 Research Summary
Complex diseases such as Crohn’s disease (CD) are believed to arise from the combined influence of many genetic variants, most of which have only modest marginal effects (low‑marginal‑effect, LME). Traditional genome‑wide association studies (GWAS) focus on single‑SNP tests, which lack power to detect the joint contribution of numerous LME loci, especially when interactions are present. To address this gap, the authors introduce a novel computational framework called Trees Assembling Mann‑Whitney (TAMW).
The method proceeds in two main stages. First, a large number of decision trees are built on bootstrap‑sampled subsets of the data. Each tree is grown using a random selection of SNPs, thereby allowing LME variants to be incorporated without explicit pre‑screening. For every terminal node (leaf) of a tree, the Mann‑Whitney U statistic is computed to compare the rank distribution of the phenotype (case vs. control) between subjects falling into that leaf and those that do not. Because the Mann‑Whitney test is non‑parametric, it is robust to unknown effect shapes and does not assume a particular genetic model.
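The leaf-level comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bootstrap sampling and tree growing are omitted, a single threshold on one SNP stands in for a leaf, and the names `mann_whitney_u` and `leaf_u_statistic` as well as the toy data are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic of x versus y: each pair with x_i > y_j counts 1, each tie 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def leaf_u_statistic(phenotype, in_leaf):
    """Compare phenotypes of subjects inside a leaf with those outside it."""
    inside = [p for p, member in zip(phenotype, in_leaf) if member]
    outside = [p for p, member in zip(phenotype, in_leaf) if not member]
    return mann_whitney_u(inside, outside), len(inside) * len(outside)


# Toy data (assumed): 1 = case, 0 = control; genotype = risk-allele count at one SNP.
phenotype = [1, 1, 1, 0, 0, 0, 1, 0]
genotype = [2, 2, 1, 0, 0, 1, 2, 0]

# Hypothetical leaf: subjects carrying two risk alleles.
in_leaf = [g >= 2 for g in genotype]

u, n_pairs = leaf_u_statistic(phenotype, in_leaf)
normalized_u = u / n_pairs  # 0.5 means no separation; values near 1 mean the leaf is enriched for cases
```

Dividing U by the number of inside/outside pairs puts leaves of different sizes on a common 0 to 1 scale, which is one convenient choice for later aggregation rather than something the summary specifies.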
Second, the U‑statistics from all trees are combined into a single ensemble statistic. The authors weight each tree’s contribution according to its splitting quality (e.g., reduction in Gini impurity) and its stability across bootstrap replicates. This weighting scheme naturally down‑weights noisy trees that are driven by random SNPs, while amplifying the signal from trees that consistently separate cases from controls.
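A weighted combination of per-tree statistics, as described above, might look like the following sketch. The exact weighting formula is not given in the summary, so the product of impurity reduction and stability used here is an assumption, and `ensemble_statistic` and all numbers are hypothetical.

```python
def ensemble_statistic(tree_scores, weights):
    """Weighted average of per-tree scores; weights are normalized to sum to 1."""
    if len(tree_scores) != len(weights):
        raise ValueError("tree_scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(s * w for s, w in zip(tree_scores, weights)) / total


# Hypothetical per-tree normalized U statistics.
tree_scores = [0.9, 0.5, 0.7]

# Assumed weight ingredients: impurity reduction of the tree's splits and a
# stability score in [0, 1] across bootstrap replicates.
gini_reduction = [0.2, 0.0, 0.1]
stability = [1.0, 0.5, 0.5]
weights = [g * s for g, s in zip(gini_reduction, stability)]

statistic = ensemble_statistic(tree_scores, weights)
```

In this toy case the second tree has zero impurity reduction, so it contributes nothing, which mirrors the down-weighting of noise-driven trees described in the text.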
Statistical significance is assessed via permutation testing: case–control labels are shuffled many times, the entire TAMW pipeline is rerun on each permuted dataset, and the empirical null distribution of the ensemble statistic is constructed. The observed statistic is then compared to this null to obtain a p‑value. Importantly, the permutation step and the tree‑building step are embarrassingly parallel, enabling efficient execution on modern multi‑core clusters.
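The permutation procedure can be sketched generically in Python. Here a simple mean difference stands in for the full TAMW pipeline, which would be far more expensive to rerun; the function names, the toy data, and the add-one correction in the p-value (a common convention, not something stated in the summary) are all assumptions.

```python
import random


def mean_difference(labels, values):
    """Stand-in statistic: mean value among cases minus mean value among controls."""
    cases = [v for lab, v in zip(labels, values) if lab == 1]
    controls = [v for lab, v in zip(labels, values) if lab == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(statistic, labels, values, n_perm=999, seed=0):
    """One-sided permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = statistic(labels, values)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any label/genotype association
        if statistic(shuffled, values) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (assumed): 10 cases with two risk alleles, 10 controls with none.
labels = [1] * 10 + [0] * 10
risk_allele_counts = [2] * 10 + [0] * 10
p_value = permutation_p_value(mean_difference, labels, risk_allele_counts)
```

Because each permutation is independent of the others, the loop body is the unit of work that would be distributed across cores in a parallel run. Note that the smallest p-value this scheme can report is 1/(n_perm + 1), so resolving very small p-values requires either many permutations or an approximation to the null distribution.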
In simulation studies, the authors modeled a disease driven by 20 interacting LME SNPs, each with an odds ratio of about 1.1, with higher-order (≥3-way) interactions among them. TAMW achieved a power of 0.931 at a genome-wide significance level, markedly outperforming Multifactor Dimensionality Reduction (MDR, power = 0.599) and a likelihood-ratio based Mann-Whitney approach (LRMW, power = 0.704). The advantage was most pronounced when the underlying architecture was highly non-linear, suggesting that the tree-based ensemble can capture complex epistatic patterns that linear or exhaustive search methods miss.
The method was then applied to real data. Using a set of 29 previously reported CD risk SNPs, TAMW detected a joint association with a p‑value of 1.2 × 10⁻¹², substantially stronger than the p‑values obtained by MDR (3.5 × 10⁻⁸) and LRMW (9.1 × 10⁻⁹). Finally, the authors performed a genome‑wide TAMW analysis on the Wellcome Trust CD GWAS, comprising 459 000 SNPs and several thousand individuals. Leveraging parallel computing (32 cores), the full analysis completed in roughly 40 hours, yielding a highly significant joint association (p = 2.76 × 10⁻¹⁹). Post‑hoc inspection of the contributing SNPs highlighted 13 genes—including ATG16L1 and LACC1—that have been implicated in autophagy and innate immunity, pathways known to be relevant to CD pathogenesis.
From a computational standpoint, each tree requires O(N log N) time (N = sample size), and the overall cost scales linearly with the number of trees T. Because trees can be built independently, the method’s runtime decreases almost proportionally with the number of available cores, making it feasible for GWAS‑scale data sets. Memory usage is higher than that of single‑SNP tests due to the storage of bootstrap samples and permutation replicates, but this is manageable on contemporary high‑performance clusters.
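To make the scaling claim concrete, the following back-of-envelope sketch extrapolates runtime with Amdahl's law. It is only an illustration: the 40-hour and 32-core figures are taken from the summary above, while `predicted_hours`, the target core count, and the parallel fraction are hypothetical choices.

```python
def predicted_hours(observed_hours, observed_cores, target_cores, parallel_fraction=1.0):
    """Extrapolate wall-clock time to another core count using Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if observed_cores < 1 or target_cores < 1:
        raise ValueError("core counts must be at least 1")

    def relative_time(cores):
        # Serial part runs at full length; parallel part shrinks with the core count.
        return (1.0 - parallel_fraction) + parallel_fraction / cores

    single_core_hours = observed_hours / relative_time(observed_cores)
    return single_core_hours * relative_time(target_cores)


# Perfectly parallel workload: doubling the cores halves the runtime.
ideal = predicted_hours(40, 32, 64)

# Assumed 1% serial portion: the gain from doubling the cores is smaller.
with_serial_part = predicted_hours(40, 32, 64, parallel_fraction=0.99)
```

Under the ideal assumption, 40 hours on 32 cores corresponds to 1280 core-hours. Any serial portion, such as data loading or the final aggregation, pushes the real figure above the ideal prediction.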
Limitations noted by the authors include the need for careful tuning of tree depth and the number of bootstrap replicates to avoid over‑fitting, as well as the current focus on binary phenotypes. Extending TAMW to continuous traits or integrating other omics layers (e.g., epigenomics, transcriptomics) would require methodological refinements.
In summary, TAMW provides a powerful, non‑parametric, and computationally scalable framework for detecting the collective effect of many low‑marginal‑effect genetic variants, especially when they interact in complex, non‑linear ways. Its superior power relative to MDR and LRMW, combined with practical runtimes on whole‑genome data, positions it as a valuable addition to the toolbox of statistical genetics and a promising avenue for uncovering hidden heritability in complex diseases.