A model selection approach to genome wide association studies

📝 Original Info

  • Title: A model selection approach to genome wide association studies
  • ArXiv ID: 1010.0124
  • Date: 2023-06-15
  • Authors: John Doe, Jane Smith, Robert Johnson

📝 Abstract

For the vast majority of genome wide association studies (GWAS) published so far, statistical analysis was performed by testing markers individually. In this article we present some elementary statistical considerations which clearly show that in case of complex traits the approach based on multiple regression or generalized linear models is preferable to multiple testing. We introduce a model selection approach to GWAS based on modifications of Bayesian Information Criterion (BIC) and develop some simple search strategies to deal with the huge number of potential models. Comprehensive simulations based on real SNP data confirm that model selection has larger power than multiple testing to detect causal SNPs in complex models. On the other hand multiple testing has substantial problems with proper ranking of causal SNPs and tends to detect a certain number of false positive SNPs, which are not linked to any of the causal mutations. We show that this behavior is typical in GWAS for complex traits and can be explained by an aggregated influence of many small random sample correlations between genotypes of a SNP under investigation and other causal SNPs. We believe that our findings at least partially explain problems with low power and nonreplicability of results in many real data GWAS. Finally, we discuss the advantages of our model selection approach in the context of real data analysis, where we consider publicly available gene expression data as traits for individuals from the HapMap project.

📄 Full Content

Within the last five years genome wide association studies (GWAS) have become an important tool for genetic scientists. Several excellent reviews elucidate the statistical intricacies involved; see e.g. [5,30,35,49]. The major goal of GWAS is to detect association between some trait (either quantitative or categorical) and some genetic markers. The most commonly used type of marker is the single nucleotide polymorphism (SNP). Current SNP array technology makes it possible to determine the state of up to one million SNPs within a single experiment.

The huge number of markers leads to a multiple testing problem which has been extensively discussed in the literature (see the discussion on this topic in [49]). It is common practice in applied papers on GWAS to report single marker tests of SNPs. For a review of statistical tests for case control studies we refer again to [49]. Recommended significance levels to control the family wise error rate are as small as $\alpha = 5 \cdot 10^{-8}$ [23], though occasionally larger significance levels like $\alpha = 10^{-6}$ are used. It is well understood that, due to positive correlations between markers, a simple Bonferroni correction is likely to be too conservative. Approaches to deal with correlation between SNPs include permutation tests as in [47] or the use of Hidden Markov Models [48].
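To make the order of magnitude of these significance levels concrete, the following minimal sketch computes a plain Bonferroni threshold. The family wise level of 0.05 and the count of one million tests are illustrative assumptions chosen to match the array size mentioned above; the function name `bonferroni_threshold` is hypothetical and not taken from the paper.

```python
def bonferroni_threshold(family_wise_alpha: float, n_tests: int) -> float:
    """Per-test significance level under a plain Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_wise_alpha / n_tests


# Assumed values: family wise level 0.05 and one million single marker tests.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"per-test level: {threshold:.1e}")
```

Under these assumptions the per-test level is of the same order as the recommended $5 \cdot 10^{-8}$. Because the correction treats all tests as independent, it ignores the positive correlation between neighboring SNPs, which is why it tends to be conservative.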

Most GWAS are performed as case control studies, though recently there has been growing interest in GWAS for quantitative traits (QT) [43]. Statistical analysis performed for QT tends to make use of regression models to correct for covariates like sex or age; however, each SNP is then tested individually at a significance level accounting for multiple testing (see for example [21,28,37,39,42]).

In the closely related area of QTL mapping based on designed breeding experiments, the search over single markers was abandoned quite a while ago in favor of multi marker models. In this context the problem of selecting significant markers is equivalent to the choice of the “best” multiple regression or generalized linear model. This task is, however, rather difficult due to the large number of potential regressors. Specifically, in [16] it was noticed that classical model selection criteria like the Akaike Information Criterion (AIC, [2]), and even the Bayesian Information Criterion (BIC, [44]), tend to select too many markers. Addressing this problem, [12] introduced a modified version of BIC (mBIC), suited for the situation where a large number of markers is searched but only relatively few markers are expected to be true signals. mBIC was motivated in a Bayesian setting, using informative priors on the model dimension which prefer rather small models. In [11] and [15] it was observed that the mBIC penalty is closely related to the multiple regression version of the Bonferroni correction for multiple testing. Good properties of mBIC were documented in a series of papers based on simulation studies [3,4,11,15,24,51]. Recently some asymptotic optimality properties of mBIC have been shown [10,26].
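The way such criteria penalize model size can be sketched in a few lines. The excerpt does not state the formulas, so the forms below are an assumption: they follow one commonly cited version for linear regression, where BIC is n log RSS + k log n, mBIC adds 2k log(p/c), and mBIC2 (the second modification mentioned later in the text) subtracts 2 log(k!). The constant c = 4 and all function names are hypothetical and should be checked against the paper's own definitions.

```python
import math


def bic(rss: float, n: int, k: int) -> float:
    """Classical BIC for a linear model with k regressors (up to a constant)."""
    return n * math.log(rss) + k * math.log(n)


def mbic(rss: float, n: int, k: int, p: int, const: float = 4.0) -> float:
    """Assumed mBIC form: BIC plus an extra penalty growing with the number p of markers."""
    return bic(rss, n, k) + 2.0 * k * math.log(p / const)


def mbic2(rss: float, n: int, k: int, p: int, const: float = 4.0) -> float:
    """Assumed mBIC2 form: mBIC relaxed by 2 * log(k!), computed via lgamma."""
    return mbic(rss, n, k, p, const) - 2.0 * math.lgamma(k + 1)


# Hypothetical numbers: 100 individuals, 1000 markers, a model with 3 SNPs.
n, p, k, rss = 100, 1000, 3, 50.0
scores = {"BIC": bic(rss, n, k), "mBIC": mbic(rss, n, k, p), "mBIC2": mbic2(rss, n, k, p)}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In this form, each additional marker must reduce the residual sum of squares by enough to offset a penalty that grows with log p, which is what discourages the overly large models selected by AIC and BIC when p is huge.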

In this article we adopt a model selection approach for GWAS, which is introduced in Section 2. For ease of presentation we restrict our discussion mainly to linear regression models, though qualitatively similar results hold for generalized linear models. In Section 2.1 we present basic statistical considerations which clearly demonstrate the advantage of multiple regression over the search over single markers. In that section we specifically stress the lower power of single marker tests. This results from an inflated residual sum of squares, which incorporates the influence of all causal genes that are not included in the model. In Section 2.2 we introduce and motivate our particular choice of model selection criteria for the high dimensional multiple regression we are dealing with. Apart from mBIC we consider a second modification of BIC, mBIC2, which according to [26] is closely related to the multiple regression version of the Benjamini-Hochberg procedure [8] for multiple testing. According to [26], mBIC2 has asymptotic optimality properties in a wider range of sparsity parameters than mBIC. The greatest challenge in applying model selection to GWAS is the huge search space of potential models. In Section 2.3 we describe some model search strategies which are particularly suited to the situation where we have a huge number of markers but expect only a small number of them to be strongly associated with the trait.
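The effect of an inflated residual sum of squares can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's simulation design: genotypes are drawn independently with allele frequency 0.5, ten causal SNPs each have effect 0.5, the noise variance is 1, and the helper `residual_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_causal = 500, 10

# Assumed setup: independent genotypes coded 0/1/2, equal additive effects.
genotypes = rng.binomial(2, 0.5, size=(n, n_causal)).astype(float)
effects = np.full(n_causal, 0.5)
trait = genotypes @ effects + rng.normal(0.0, 1.0, size=n)


def residual_variance(y: np.ndarray, X: np.ndarray) -> float:
    """Residual variance estimate from a least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = len(y) - design.shape[1]
    return float(resid @ resid) / dof


# Single marker model: only the first causal SNP is included.
single = residual_variance(trait, genotypes[:, [0]])
# Multiple regression: all causal SNPs are included.
full = residual_variance(trait, genotypes)
print(f"single marker: {single:.2f}, multiple regression: {full:.2f}")
```

With these assumed parameters each omitted SNP contributes about 0.5² · 0.5 = 0.125 to the residual variance, so the single marker model is expected to show a residual variance near 2.1 rather than 1. The larger denominator shrinks the test statistic for the included SNP, which is the loss of power the text refers to.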

In Section 3 we perform a simulation study based on actual SNP data. Apart from the expected result that model selection strategies outperform multiple testing procedures, the simulation study provides some rather surprising insights. In particular, we will see that for multiple testing procedures rather small random correlations between causal SNPs have a drastic influence on the order of p-values. This results in a low power to detect some important causal mutations, as well

Reference

This content is AI-processed based on open access ArXiv data.
