Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach

February 23, 2026

📝 Original Info

Title: Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach
ArXiv ID: 0712.1943
Date: 2007-12-12
Authors: Marc Coram, Hua Tang

📝 Abstract

Estimation of the allele frequency at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data become increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than that of either pooling or not pooling, and sometimes substantially improves over both extremes. The bias introduced is small, as is shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.

📄 Full Content

1. Introduction. Allele frequency at a genetic marker is one of the most important elements in studies of genetic diversity, as well as in population-based disease association studies. It plays a pivotal role in linkage studies, which model the probability that alleles are identical by descent, and in association studies, which directly compare the allele frequency between the affected cases and unaffected controls. Moreover, once a disease variant has been identified, accurate assessments of the allele frequency of the variant enable us to evaluate the proportion of the disease burden in a specific population that is attributable to the variant. Fueled by the recent developments in high-throughput genotyping technologies, various efforts are underway to characterize allele frequencies at a genome-wide scale in diverse populations. However, because of the still significant costs associated with these high-throughput platforms, current large-scale genomic projects often assay a large number of markers in a small number of individuals. For example, the International HapMap Project has genotyped more than four million single nucleotide polymorphisms (SNPs) in 90 Africans from Nigeria (60 of whom are unrelated), 90 U.S. residents with northern and western European ancestry (60 of whom are unrelated), 45 Han Chinese from Beijing and 45 Japanese from Tokyo [International HapMap Consortium (2005)]. In another effort, Perlegen Sciences genotyped 71 Americans of European, African or Han Chinese ancestry [Hinds et al. (2005)]. The maximum likelihood estimate (MLE) of allele frequency, in this case just the observed proportion of one allele, has a binomial sampling error, which can be substantial for small samples. Small sample sizes remain a concern, even as more individuals are being genotyped, because there is a simultaneously growing concern about population stratification [Lander and Schork (1994)].
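To make the size of the binomial sampling error concrete, here is a minimal Python sketch. It assumes a diploid autosomal SNP in unrelated individuals, so that a sample of n individuals contributes 2n independent allele draws. The function name and the example count of 36 are hypothetical and chosen only for illustration.

```python
import math

def allele_freq_mle(allele_count: int, n_individuals: int) -> tuple[float, float]:
    """Return the MLE of the allele frequency and its binomial standard error.

    Assumption (not from the paper): diploid autosomal marker, unrelated
    individuals, so the sample holds 2 * n_individuals independent alleles.
    """
    n_chrom = 2 * n_individuals
    p_hat = allele_count / n_chrom                  # observed allele proportion
    se = math.sqrt(p_hat * (1.0 - p_hat) / n_chrom)  # binomial standard error
    return p_hat, se

# Hypothetical example: 36 copies of the allele among 60 unrelated individuals.
p_hat, se = allele_freq_mle(allele_count=36, n_individuals=60)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}")
```

With 60 unrelated individuals, the standard error under these assumptions is roughly 0.04 for a frequency near 0.3, which is not negligible relative to frequency differences between closely related populations.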

When genotypes are available from individuals representing the same populations, the allele frequency estimates can be improved by combining genotype data. On the other hand, injudicious combining of samples representing distinct populations can lead to biased estimates, as population stratification and genetic drift lead to divergence in allele frequencies among populations [Fisher (1922) and Wright (1931)]. Unfortunately, deciding whether two samples represent a homogeneous population, and hence are combinable, is a delicate and subjective decision. Do the Han Chinese from Beijing (HapMap sample) and those from Los Angeles (Perlegen sample) represent the same population? Can we use the HapMap African genotypes to improve frequency estimates of Perlegen African Americans? One possible way to address such ambiguity is a two-stage approach: one first tests whether the two samples are combinable, using a random set of markers and a procedure such as Devlin and Roeder (1999) or Pritchard and Rosenberg (1999), and then, in a second stage, combines the samples or not depending on the outcome of the first-stage test. This two-stage procedure, however, suffers from two potential problems. First, when only a small number of individuals have been genotyped, the first-stage test may not have sufficient power to detect the difference; on the other hand, with a sufficiently large sample size, even a trivial incongruence leads the test to reject, which rules out combining the samples. Second, the first-stage test can introduce a bias, since only similar allele frequencies are allowed to be combined.
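The trade-off between variance reduction and stratification bias can be worked out in closed form for a single marker. The sketch below is an illustration under simple assumptions, not the authors' analysis: each sample is an independent binomial draw, sample sizes are counted in chromosomes, and the frequencies 0.30 and 0.45 are hypothetical.

```python
def mse_unpooled(p1: float, n1: int) -> float:
    """MSE of the primary-sample MLE: unbiased, so MSE equals the variance."""
    return p1 * (1.0 - p1) / n1

def mse_pooled(p1: float, n1: int, p2: float, n2: int) -> float:
    """MSE of the naive pooled estimate (x1 + x2) / (n1 + n2) as an
    estimate of p1, when the supplemental sample has true frequency p2."""
    n = n1 + n2
    bias = n2 * (p2 - p1) / n                                    # stratification bias
    var = (n1 * p1 * (1.0 - p1) + n2 * p2 * (1.0 - p2)) / n**2   # sampling variance
    return var + bias**2

# Hypothetical sizes in chromosomes: 90 primary, 120 supplemental.
same_pop = mse_pooled(0.30, 90, 0.30, 120)   # supplemental sample matches
diff_pop = mse_pooled(0.30, 90, 0.45, 120)   # supplemental sample diverges
alone = mse_unpooled(0.30, 90)
print(f"pooled (same) = {same_pop:.5f}, alone = {alone:.5f}, pooled (diff) = {diff_pop:.5f}")
```

Under these assumed numbers, pooling helps when the two frequencies agree and hurts when they differ by 0.15, because the squared bias does not shrink as the supplemental sample grows.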

Bayesian and empirical Bayes approaches offer flexible avenues for combining multiple sources of information. Lange pioneered an empirical Bayes approach for estimating allele frequencies of a single marker using data at the same marker from multiple populations [Lange (1995)]. Lockwood, Roeder and Devlin (2001) extended this approach to incorporate multilocus genotype information via a Bayesian hierarchical model. Both methods employ a Dirichlet(α) distribution to describe the dispersion of frequencies between the different populations. The two approaches differ in how α is estimated: Lange’s method estimates α by maximum likelihood at each locus separately, whereas Lockwood, Roeder and Devlin (2001) borrow strength across loci. These two methods are described in greater detail in Section 2.6.5. Additionally, there is a rich literature on modeling population structure and divergence using genetic polymorphism data, although the primary interests there are inferences about population history and estimates of parameters such as genetic distance and population size [Kitada, Hayashi and Kishino (2000), Nicholson et al. (2002) and Wilson, Weale and Balding (2003)].
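For a biallelic SNP, a Dirichlet prior reduces to a Beta prior, and the resulting shrinkage can be sketched in a few lines. This is a generic Beta-binomial illustration, not the estimator of Lange (1995), of Lockwood, Roeder and Devlin (2001), or of the present paper. It assumes a Beta(α·π, α·(1−π)) parametrization with a prior mean π and a concentration α that is fixed by hand rather than estimated; the function name and all numbers are hypothetical.

```python
def shrunken_freq(x: int, n: int, prior_mean: float, alpha: float) -> float:
    """Posterior mean of the allele frequency under a Beta prior.

    Assumed prior (for illustration only): Beta(alpha * prior_mean,
    alpha * (1 - prior_mean)). alpha acts like a count of pseudo-alleles:
    alpha = 0 returns the MLE x / n, and a large alpha pulls the estimate
    toward prior_mean.
    """
    return (x + alpha * prior_mean) / (n + alpha)

# Hypothetical data: 27 copies of the allele among 90 chromosomes (MLE 0.30),
# with a frequency of 0.40 suggested by related populations.
for alpha in (0.0, 90.0, 9000.0):
    est = shrunken_freq(27, 90, prior_mean=0.40, alpha=alpha)
    print(f"alpha = {alpha:>6.0f}: estimate = {est:.3f}")
```

The concentration α controls how strongly the other populations influence the estimate, which is why the choice of how to estimate it (per locus or across loci) distinguishes the methods described above.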

In this paper we propose a new empirical Bayes approach, which offers an adaptive procedure to combine multiple samples. This method avoids the problems associated with the two-stage procedure by introducing an affinity measure, ν, which is based on the glob

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
