The distribution of genetic polymorphisms in a population contains information about the mutation rate and the strength of natural selection at a locus. Here, we show that the Poisson Random Field (PRF) method of population-genetic inference suffers from systematic biases that tend to underestimate selection pressures and mutation rates, and that erroneously infer positive selection. These problems arise from the infinite-sites approximation inherent in the PRF method. We introduce three new inference techniques that correct these problems. We present a finite-site modification of the PRF method, as well as two new methods for inferring selection pressures and mutation rates based on diffusion models. Our methods can be used to infer not only a "weighted average" of selection pressures acting on a gene sequence, but also the distribution of selection pressures across sites. We evaluate the accuracy of our methods, as well as that of the original PRF approach, by comparison with Wright-Fisher simulations.
Deep Dive into Detecting Directional Selection from the Polymorphism Frequency Spectrum.
The mutation rate and selection pressures operating on genes are of central importance in shaping their evolution. The number and frequency distribution of genetic polymorphisms within a population carry information about these fundamental processes. Polymorphisms at higher frequencies reflect weaker selective pressures (or positive selection), and vice versa.
Similarly, a larger number of polymorphisms indicates a higher mutation rate. Thus we can use the polymorphism frequency spectrum observed in genetic sequences sampled from a population in order to infer the mutation rate and the strength and direction of selection acting on the sequence. This intuition can be formalized into a rigorous method for estimating selection pressures and mutation rates by calculating the likelihood of sampled polymorphism data as a function of these parameters. The Poisson Random Field (PRF) model provides an important and widely used method of doing so. The PRF model assumes a panmictic population of constant size, free recombination, infinite sites, no dominance or epistasis, and equal selection pressures at all sites. Under these assumptions, Sawyer and Hartl (1992) showed that the distribution of frequencies of mutant lineages in a population forms a Poisson random field whose properties depend on the selection pressure and the mutation rate. Hartl et al. (1994) and Bustamante et al. (2001) developed a maximum likelihood method of estimating these parameters from data on the polymorphism frequency spectrum. This method has been widely used to study, for example, purifying selection on synonymous (Akashi, 1999; Akashi and Schaeffer, 1997; Hartl et al., 1994) and nonsynonymous (Akashi, 1999; Hartl et al., 1994) variation, and the evolution of base composition (Galtier et al., 2006; Lercher et al., 2002).
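As an illustration of how such a likelihood can be evaluated, the following is a minimal sketch, not the implementation used by any of the cited studies. It assumes the commonly stated PRF expectation for the number of sites at which i of n sampled sequences carry the mutant allele, with the scaled selection coefficient written as gamma and the mutation parameter theta scaled so that the neutral expectation is theta/i (conventions for theta differ between papers). The function names, the midpoint integration rule, and the step count are choices made for this example only.

```python
import math


def selection_weight(x, gamma):
    """Ratio (1 - exp(-2*gamma*(1-x))) / (1 - exp(-2*gamma)).

    Tends to 1 - x in the neutral limit gamma -> 0.
    Strongly negative gamma can overflow; this sketch targets moderate values.
    """
    if abs(gamma) < 1e-8:
        return 1.0 - x
    return math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)


def expected_sfs(n, theta, gamma, steps=20000):
    """Expected number of sites with i = 1..n-1 mutant copies in a sample of n.

    Integrates binomial sampling against the assumed stationary density
    of mutant frequencies x, using a midpoint rule on (0, 1).
    """
    h = 1.0 / steps
    spectrum = []
    for i in range(1, n):
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) * h  # midpoints keep x strictly inside (0, 1)
            # x^i (1-x)^(n-i) / (x (1-x)) written without the division
            total += x ** (i - 1) * (1.0 - x) ** (n - i - 1) * selection_weight(x, gamma)
        spectrum.append(theta * math.comb(n, i) * total * h)
    return spectrum


def poisson_log_likelihood(counts, n, theta, gamma):
    """Log-likelihood of observed counts, treating each frequency class as
    an independent Poisson variable with the PRF expectation as its mean."""
    expected = expected_sfs(n, theta, gamma)
    return sum(
        k * math.log(f) - f - math.lgamma(k + 1)
        for k, f in zip(counts, expected)
    )
```

Maximizing `poisson_log_likelihood` over `theta` and `gamma`, for example on a grid, gives point estimates under these assumptions. With `gamma = 0` the expected spectrum reduces to `theta / i`, and negative `gamma` shifts the spectrum toward low-frequency classes.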
Closely related to these analyses of polymorphism data are methods that calculate, based on the PRF model, the ratio of the expected number of polymorphisms within species to divergence between species for synonymous and nonsynonymous sites (using the idea behind the McDonald-Kreitman test (McDonald and Kreitman, 1991)). These methods discard some of the available data, as they depend only on the number of polymorphisms and not their full frequency spectrum. However, they are also less sensitive to assumptions (Loewe et al., 2006; Sawyer and Hartl, 1992). Such methods have been applied to estimate selection pressures on synonymous variation (Akashi, 1995), on nonsynonymous mutations in mitochondrial genomes (Nachman, 1998; Rand and Kann, 1998; Weinreich and Rand, 2000), and on nonsynonymous variation in a variety of nuclear genomes (Bartolome et al., 2005; Bustamante et al., 2002; Sawyer et al., 2003), including humans (Bustamante et al., 2005).
Much recent theoretical work has focused on relaxing various assumptions of the original PRF method. These include allowing for dominance (Williamson et al., 2004), population subdivision (Wakeley, 2003), changing population size (Williamson et al., 2005), and linkage between sites (Zhu and Bustamante, 2005). Several methods for studying the properties of the distribution of selection pressures across sites based on the PRF model have also been developed, both using the polymorphism frequency spectrum (Bustamante et al., 2003), and using the ratio of polymorphism to divergence (Bustamante et al., 2003; Loewe et al., 2006; Piganeau and Eyre-Walker, 2003; Sawyer et al., 2003).
All of the methods summarized above fall within the PRF framework, and therefore depend on the infinite-sites approximation. Rather than calculating the evolutionary dynamics at each site, these methods consider the overall steady-state distribution of mutant lineage frequencies across all sites. The PRF method is applied to data by assuming that each lineage segregates at a different site. The infinite-sites assumption is made for purely technical, as opposed to biological, reasons. In this paper, we show that in biologically relevant parameter regimes the infinite-sites assumption causes the PRF method to underestimate selection pressures and mutation rates. This problem arises both for inferences based on the polymorphism frequency spectrum and for inferences based on the ratio of within-species polymorphism to between-species divergence, but in this paper we focus exclusively on the former. As we demonstrate below, the PRF method often underestimates the selection pressure and the mutation rate by as much as an order of magnitude. In addition, and perhaps of greater concern, the PRF method frequently infers that a gene is under strong positive selection when in fact the gene is experiencing weak negative selection (Fig. 1).
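To make the contrast with the infinite-sites picture concrete, the sketch below simulates the dynamics at a single finite site, where recurrent and back mutation can act on the same position. It is an illustrative toy, not the simulation setup used in this paper: it assumes a haploid Wright-Fisher population, two alleles, symmetric mutation at rate `mu` per generation, and a relative fitness of `1 + s` for the mutant allele (with `s > -1`). All names and parameter values are hypothetical.

```python
import random


def wright_fisher_site(pop_size, mu, s, generations, seed=0):
    """Simulate one biallelic site in a haploid Wright-Fisher population.

    Returns the mutant allele count after each generation, starting from
    a population fixed for the non-mutant allele.
    """
    rng = random.Random(seed)
    count = 0
    trajectory = []
    for _ in range(generations):
        x = count / pop_size
        # Selection: mutant allele has relative fitness 1 + s.
        x_sel = x * (1.0 + s) / (x * (1.0 + s) + (1.0 - x))
        # Symmetric mutation in both directions at the same site.
        p = x_sel * (1.0 - mu) + (1.0 - x_sel) * mu
        # Binomial resampling of the next generation, one draw per individual.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        trajectory.append(count)
    return trajectory


def fraction_polymorphic(trajectory, pop_size):
    """Fraction of generations in which both alleles are present."""
    polymorphic = sum(1 for c in trajectory if 0 < c < pop_size)
    return polymorphic / len(trajectory)
```

Running `wright_fisher_site` independently for many sites and sampling a few individuals from each final population yields a simulated frequency spectrum against which an inference method can be checked.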
In this paper, we present three methods to correct the systematic biases of the PRF method, each with its own advantages and drawbacks. Rather than study mutant lineages across a sequence, our methods all focus on explicit models of the evolutionary dynamics at individual sites. We first pre
…(Full text truncated)…