Detecting Directional Selection from the Polymorphism Frequency Spectrum
The distribution of genetic polymorphisms in a population contains information about the mutation rate and the strength of natural selection at a locus. Here, we show that the Poisson Random Field (PRF) method of population-genetic inference suffers from systematic biases that tend to underestimate selection pressures and mutation rates, and that erroneously infer positive selection. These problems arise from the infinite-sites approximation inherent in the PRF method. We introduce three new inference techniques that correct these problems. We present a finite-site modification of the PRF method, as well as two new methods for inferring selection pressures and mutation rates based on diffusion models. Our methods can be used to infer not only a “weighted average” of selection pressures acting on a gene sequence, but also the distribution of selection pressures across sites. We evaluate the accuracy of our methods, as well as that of the original PRF approach, by comparison with Wright-Fisher simulations.
💡 Research Summary
The paper critically evaluates the widely used Poisson Random Field (PRF) framework for inferring mutation rates and selection coefficients from the site‑frequency spectrum of polymorphisms. The authors demonstrate that the PRF’s reliance on the infinite‑sites assumption, under which every new mutation arises at a site that has never mutated before, introduces systematic biases. Because real genomes have a finite number of sites, the same position can be hit more than once, and high‑frequency alleles often result from recurrent mutations. Ignoring this leads to underestimation of both the strength of selection and the underlying mutation rate and, paradoxically, can generate spurious signals of positive selection when none exists.
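For orientation, the infinite‑sites expectation that the PRF fits can be sketched numerically. The block below is a minimal illustration, not the authors' code: it assumes the standard Sawyer–Hartl form of the expected spectrum, in which the expected number of sites carrying i derived copies in a sample of n is θ times the integral of C(n,i) x^(i−1) (1−x)^(n−i−1) (1−e^(−2γ(1−x)))/(1−e^(−2γ)) over x in (0,1). The function name `expected_sfs`, the scaling convention for θ and γ, and the midpoint‑rule grid size are all assumptions made for illustration.

```python
import math

def expected_sfs(n, theta, gamma, grid=20000):
    """Expected count of sites with i derived copies (i = 1..n-1) in a sample
    of n, under the infinite-sites PRF with scaled selection gamma.
    Scaling of theta and gamma is an assumed convention (neutral case: theta/i)."""
    def weight(x):
        # Selection term; it tends to (1 - x) in the neutral limit gamma -> 0.
        if abs(gamma) < 1e-9:
            return 1.0 - x
        return math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)

    h = 1.0 / grid
    sfs = {}
    for i in range(1, n):
        total = 0.0
        for k in range(grid):
            x = (k + 0.5) * h  # midpoint rule never evaluates x = 0 or x = 1
            total += x ** (i - 1) * (1.0 - x) ** (n - i - 1) * weight(x)
        sfs[i] = theta * math.comb(n, i) * total * h
    return sfs

neutral = expected_sfs(10, 5.0, 0.0)   # neutral case reduces to theta / i
positive = expected_sfs(10, 5.0, 3.0)  # positive selection enriches high frequencies
```

Because this expectation assumes that no site is ever hit twice, any excess of high‑frequency variants produced by recurrent mutation can only be absorbed by the fit as a shift in γ, which is consistent with the spurious positive‑selection signal described above.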
To quantify these biases, the authors performed extensive Wright–Fisher forward simulations across a range of selection intensities, mutation rates, and sample sizes. The simulations confirmed that PRF consistently under‑estimates selection coefficients, especially for moderately common alleles, and that the bias grows with the total number of segregating sites. In scenarios with strong positive selection, PRF’s estimates collapse toward neutrality, and in some cases the method incorrectly infers a positive selection coefficient for loci that are actually neutral or under weak purifying selection.
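A forward simulation of this kind can be sketched in a few lines. The version below is an illustrative toy, not the authors' simulator: it assumes a haploid population, independent biallelic sites, symmetric and reversible mutation at per‑site rate `mu`, and a selection coefficient `s` (greater than −1) favouring the derived allele. The function name `simulate_sfs` and all parameter names are hypothetical.

```python
import random

def simulate_sfs(num_sites, pop_size, mu, s, generations, sample_size, seed=0):
    """Finite-sites Wright-Fisher simulation; returns the sample frequency
    spectrum as a list where entry k counts sites with k derived copies."""
    rng = random.Random(seed)
    counts = [0] * num_sites  # derived-allele copies per site, all ancestral at start
    for _ in range(generations):
        for j in range(num_sites):
            x = counts[j] / pop_size
            x = x * (1.0 + s) / (1.0 + s * x)    # selection on the derived allele
            x = x * (1.0 - mu) + (1.0 - x) * mu  # recurrent, reversible mutation
            # Drift: binomial resampling of pop_size offspring.
            counts[j] = sum(rng.random() < x for _ in range(pop_size))
    sfs = [0] * (sample_size + 1)
    for c in counts:
        x = c / pop_size
        # Sample with replacement (binomial), a simplification of drawing
        # sample_size distinct individuals.
        k = sum(rng.random() < x for _ in range(sample_size))
        sfs[k] += 1
    return sfs

spectrum = simulate_sfs(num_sites=50, pop_size=40, mu=0.005, s=0.0,
                        generations=100, sample_size=8, seed=1)
```

Because mutation here can hit the same site repeatedly and in both directions, spectra generated this way depart from the infinite‑sites expectation as `mu` grows, which is the kind of discrepancy the simulations in the paper are used to measure.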
In response, the authors propose three alternative inference strategies. The first is a finite‑sites modification of the original PRF, which explicitly incorporates the probability that a new mutation lands on a site that has already mutated. This correction adjusts the expected allele‑frequency distribution by subtracting the contribution of recurrent mutations, thereby restoring unbiased estimates of the mutation parameter θ and the selection coefficient γ.
The second and third approaches are diffusion‑based methods that treat allele frequencies as continuous stochastic processes governed by the Wright–Fisher diffusion equation. The “single‑parameter diffusion” model assumes a homogeneous selection coefficient across all sites and derives a likelihood function from the analytical solution of the diffusion with selection. The “multi‑parameter diffusion” model relaxes this homogeneity assumption, allowing each site to have its own selection coefficient drawn from a distribution; the method jointly estimates the hyper‑parameters of that distribution and the mutation rate. Both diffusion methods are embedded in a Bayesian framework, using non‑informative priors and Markov chain Monte Carlo sampling to obtain posterior distributions for the parameters of interest.
Mathematically, the diffusion models modify the standard boundary conditions to reflect finite‑site constraints, and they incorporate a term for the probability of recurrent mutation. The likelihood is constructed by integrating the diffusion’s transition density over the observed allele‑frequency bins, and the posterior is sampled using a Metropolis‑Hastings algorithm. The authors also provide analytic approximations for the expected frequency spectrum under the finite‑sites model, which facilitate rapid computation.
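The sampling step can be illustrated with a random‑walk Metropolis–Hastings sketch. To keep it short, the block below substitutes a PRF‑style Poisson likelihood over the frequency spectrum for the diffusion‑based likelihood described above, so it shows the sampler and not the authors' model. The flat priors on a bounded box, the proposal widths, the starting point, and every function name are assumptions chosen for illustration.

```python
import math
import random

def expected_sfs(n, theta, gamma, grid=200):
    """Infinite-sites PRF expectation for i = 1..n-1 (assumed stand-in model)."""
    def weight(x):
        if abs(gamma) < 1e-9:
            return 1.0 - x  # neutral limit
        return math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)
    h = 1.0 / grid
    out = []
    for i in range(1, n):
        total = sum(((k + 0.5) * h) ** (i - 1)
                    * (1.0 - (k + 0.5) * h) ** (n - i - 1)
                    * weight((k + 0.5) * h) for k in range(grid))
        out.append(theta * math.comb(n, i) * total * h)
    return out

def log_likelihood(observed, theta, gamma):
    """Independent Poisson counts per frequency class; observed[i-1] is class i."""
    expected = expected_sfs(len(observed) + 1, theta, gamma)
    return sum(k * math.log(m) - m - math.lgamma(k + 1)
               for k, m in zip(observed, expected))

def metropolis_hastings(observed, steps=2000, seed=0,
                        theta_max=200.0, gamma_max=20.0):
    """Random-walk sampler with flat priors on 0 < theta < theta_max and
    |gamma| < gamma_max; returns the chain of (theta, gamma) states."""
    rng = random.Random(seed)
    theta, gamma = 10.0, 0.0  # arbitrary starting point inside the prior box
    current = log_likelihood(observed, theta, gamma)
    samples = []
    for _ in range(steps):
        theta_new = theta + rng.gauss(0.0, 2.0)
        gamma_new = gamma + rng.gauss(0.0, 0.5)
        # Proposals outside the prior support have zero posterior mass: reject.
        if 0.0 < theta_new < theta_max and abs(gamma_new) < gamma_max:
            proposed = log_likelihood(observed, theta_new, gamma_new)
            # Symmetric proposal, so the acceptance ratio is the posterior ratio.
            if rng.random() < math.exp(min(0.0, proposed - current)):
                theta, gamma, current = theta_new, gamma_new, proposed
        samples.append((theta, gamma))
    return samples

# Synthetic data: rounded expected counts for theta = 40, gamma = -1.
observed = [round(m) for m in expected_sfs(8, 40.0, -1.0)]
chain = metropolis_hastings(observed, steps=300, seed=1)
```

In practice the early part of the chain would be discarded as burn‑in and the proposal widths tuned; swapping in a different likelihood, such as one built from a diffusion transition density, only requires replacing `log_likelihood`.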
Performance assessment shows that all three new methods dramatically reduce mean‑squared error relative to the classic PRF across the simulated parameter space. The finite‑sites PRF correction eliminates most of the bias for moderate to high mutation rates, while the diffusion approaches excel when the selection landscape is heterogeneous. In particular, the multi‑parameter diffusion model accurately recovers both the mean and variance of the underlying selection‑coefficient distribution, enabling researchers to infer not just a single “average” selection pressure but the full distribution of selective effects across a gene.
The authors also apply their methods to empirical human genomic data, focusing on genes previously implicated in adaptive evolution. The finite‑sites PRF and diffusion models identify a subset of loci with strong evidence for positive selection that were missed or underestimated by the original PRF. Moreover, the multi‑parameter diffusion analysis reveals a broad distribution of selection coefficients, suggesting that many sites experience weak purifying selection while a minority are subject to strong adaptive pressures.
In summary, the study exposes a fundamental flaw in the infinite‑sites PRF approach, provides rigorous theoretical and computational corrections, and demonstrates that finite‑sites and diffusion‑based inference frameworks yield substantially more accurate and biologically informative estimates of mutation rates and selection strengths. These advances have immediate implications for population‑genetic analyses of natural selection, the detection of adaptive sweeps, and the interpretation of disease‑associated variation in genomic studies.