The perennial problem of "how many clusters?" remains an issue of substantial interest in the data mining and machine learning communities, and becomes particularly salient in large data sets such as population genomic data, where the number of clusters needs to be relatively large and open-ended. The problem is further complicated in a co-clustering scenario, in which one needs to solve multiple clustering problems simultaneously because common centroids (e.g., ancestors) are shared by clusters (e.g., possible descendants of a given ancestor) from different multiple-cluster samples (e.g., different human subpopulations). In this paper we present a hierarchical nonparametric Bayesian model to address this problem in the context of multi-population haplotype inference. Uncovering the haplotypes of single nucleotide polymorphisms (SNPs) is essential for many biological and medical applications. While it is not uncommon for genotype data to be pooled from multiple ethnically distinct populations, few existing programs have explicitly leveraged individual ethnic information for haplotype inference. We present a new haplotype inference program, Haploi, which makes use of such information and is readily applicable to genotype sequences with thousands of SNPs from heterogeneous populations, with competitive and sometimes superior speed and accuracy compared to state-of-the-art programs. Underlying Haploi is a new haplotype distribution model based on a nonparametric Bayesian formalism known as the hierarchical Dirichlet process, which represents a tractable surrogate to the coalescent process. The proposed model is exchangeable, unbounded, and capable of coupling demographic information of different populations.
Deep Dive into A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data.
1. Introduction. Recent experimental advances have led to an explosion of data that document genetic variations within and between populations. For example, the International SNP Map Working Group (2001) has reported the identification and mapping of 1.4 million single nucleotide polymorphisms (SNPs) from the genomes of four different human populations. These data pose challenging inference problems whose solutions could shed light on the evolutionary history of human populations and the genetic basis of disease propensities [Chakravarti (2001), Clark (2003)].
SNPs represent the largest class of individual differences in DNA. A SNP refers to the existence of two possible nucleotide bases from {A, C, G, T} at a chromosomal locus in a population; each variant, denoted as 1 or 0, is called an allele. A haplotype refers to the joint allelic identities of a contiguous list of polymorphic loci within a study region on a given chromosome. Diploid organisms such as human beings have two haplotypes in each individual, one maternal copy and one paternal copy. Because the parental chromosomes come in pairs, the two haplotypes together make up a genotype, which consists of the list of allele pairs at every locus. More precisely, a genotype results from a pair of haplotypes by omitting, at every locus, the information about which of the two chromosomes carries each allele (its phase). The problem of haplotype inference, which is the focus of this paper, concerns determining which phase reconstruction among many alternatives is most plausible. Common biological methods for assaying genotypes typically do not provide phase information for individuals with heterozygous genotypes at multiple loci. Although phase can be obtained at a considerably higher cost via molecular haplotyping [Patil et al. (2001)], or sometimes from analysis of trios [Hodge, Boehnke and Spence (1999)], it is desirable to develop automatic and robust in silico methods for reconstructing haplotypes from the inexpensive genotype data.
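The loss of phase described above can be illustrated with a minimal Python sketch. The encoding is an assumption made here for illustration and is not taken from the paper: haplotypes are tuples of 0/1 alleles, and a genotype stores 0 or 1 at homozygous loci and the hypothetical code 2 at heterozygous loci. The function names are likewise hypothetical.

```python
from itertools import product

HET = 2  # assumed code for a heterozygous locus (illustrative, not from the paper)


def genotype_from_haplotypes(h1, h2):
    """Collapse an ordered haplotype pair into an unphased genotype."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same loci")
    # Homozygous loci keep their allele; heterozygous loci lose their phase.
    return tuple(a if a == b else HET for a, b in zip(h1, h2))


def compatible_haplotype_pairs(genotype):
    """Enumerate every unordered haplotype pair consistent with a genotype."""
    het_loci = [i for i, g in enumerate(genotype) if g == HET]
    if not het_loci:
        h = tuple(genotype)
        return [(h, h)]
    # Pin allele 0 of the first heterozygous locus to h1 so that
    # (h1, h2) and (h2, h1) are not counted as two different pairs.
    first, rest = het_loci[0], het_loci[1:]
    pairs = []
    for alleles in product((0, 1), repeat=len(rest)):
        h1, h2 = list(genotype), list(genotype)
        h1[first], h2[first] = 0, 1
        for locus, allele in zip(rest, alleles):
            h1[locus], h2[locus] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    g = genotype_from_haplotypes((0, 1, 1), (1, 1, 0))
    print(g)
    for pair in compatible_haplotype_pairs(g):
        print(pair)
```

Under this encoding, a genotype with k heterozygous loci (k at least 1) is compatible with 2^(k-1) unordered haplotype pairs, which is why phase becomes ambiguous once an individual is heterozygous at two or more loci.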
Key to the inference of individual haplotypes based on a given genotype sample is the formulation and tractability of the marginal distribution of the haplotypes of the study population. Consider the set of haplotypes, denoted as H = {h_1, h_2, ..., h_{2n}}, of a random sample of 2n chromosomes of n individuals. Under common genetic arguments, the ancestral relationships among the sample back to its most recent common ancestor (MRCA) can be described by a genealogical tree known as the coalescent; computing P(H) involves a marginalization over all possible coalescent trees compatible with the sample, which is widely known to be intractable. Li and Stephens (2003) suggested approximating P(H) by a Product of Approximate Conditionals (PAC). The PAC model tries to incorporate a desirable evolutionary assumption known as parental-dependent mutation (PDM) by modeling each h_i as the progeny of a randomly chosen existing haplotype, and it forms the basis of the PHASE program, which has set the state-of-the-art benchmark in haplotype inference. However, the PAC model implicitly assumes the existence of an ordering in the haplotype sample; therefore, the resulting likelihood is not exchangeable, as one would expect of the true P(H). Moreover, since PAC involves no explicit ancestral genealogy over existing haplotypes, certain latent demographic information, such as founding haplotypes and their mutation rates, is not directly captured in the model.
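The exchangeability issue can be made concrete by writing out the form of the PAC approximation. The notation below (the approximate conditional \hat{\pi} and the permutation \sigma) is chosen here for illustration and is only a sketch of the general shape of the approximation, not a reproduction of the formulas in Li and Stephens (2003).

```latex
% Exact chain-rule factorization of the joint haplotype distribution:
P(H) = P(h_1)\, P(h_2 \mid h_1) \cdots P(h_{2n} \mid h_1, \ldots, h_{2n-1})

% PAC replaces each exact conditional by a tractable approximation \hat{\pi}:
P(H) \approx \hat{\pi}(h_1)\, \hat{\pi}(h_2 \mid h_1) \cdots
             \hat{\pi}(h_{2n} \mid h_1, \ldots, h_{2n-1})

% Exchangeability of the true P(H) means that, for every permutation \sigma
% of \{1, \ldots, 2n\}:
P(h_1, \ldots, h_{2n}) = P(h_{\sigma(1)}, \ldots, h_{\sigma(2n)})

% The product of approximate conditionals generally fails this identity,
% because each \hat{\pi} is not an exact conditional of a single joint law.
```

The exact factorization gives the same value for any ordering of the sample, whereas the approximate product can change when the haplotypes are reordered, which is the order dependence noted above.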
Finite mixture models represent another class of haplotype models that rely very little on demographic and genetic assumptions about the sample [Excoffier and Slatkin (1995), Niu et al. (2002), Kimmel and Shamir (2004), Zhang, Niu and Liu (2006)]. Under such a model, haplotypes are treated as latent variables associated with specific frequencies, and the haplotype inference problem can be viewed as a missing value inference and parameter estimation problem. Numerous statistical inference approaches have been developed for this problem, including maximum likelihood approaches via the EM algorithm [Excoffier and Slatkin (1995), Hawley and Kidd (1995), Long, Williams and Urbanek (1995), Fallin and Schork (2000)] and a number of parametric Bayesian inference methods based on Markov chain Monte Carlo (MCMC) sampling [Niu et al. (2002), Zhang, Niu and Liu (2006)]. However, this class of methods has rather severe computational requirements in that a probability distribution must be maintained on a (large) set of possible haplotypes. Indeed, the size of the haplotype pool, K, which reflects the diversity of the genome, is unknown for any given population data and needs to be inferred. There is a plethora of combinatorial algorithms based on various hypotheses, such as the “parsimony” principles, that offer control over the complexity of the inference problem [see Gusfield (2004) for an excellent survey]. On the other hand, most current methods based on statistic
…(Full text truncated)…