A hierarchical Bayesian approach for estimating the origin of a mixed population

IMS Collectio ns Pushing the Limits of Con temp orary Statist ics: Contributions in Honor of Jay an ta K. Ghosh V ol. 3 ( 2008) 237–250 c  Institute of Mathe matical Statistics , 2008 DOI: 10.1214/ 07492170 80000001 74 A hierarc hical Ba y es ian a pproac h for estimating the origin of a mixed p opulation ∗ F eng Guo 1 , Dipak K. Dey 2 and Ken t E. Holsinger 3 Vir g inia Polyte chnic Institute , University of Conne cticut and University of Conne ctic ut Abstract: W e propose a hierarc hical Ba y esian mo del to estimate the prop or- tional con tribution of source p opulations to a newly f ounded colon y . Samples are derived from the ﬁrst generation oﬀspring in the colon y , but mating ma y occur preferent ially among migrants f r om the same source population. Geno- t ypes of the newly founded colony and source p opulations are used to estimate the mixture prop ortions, and the mixture prop ortions are related to environ- men tal and demographic factors that might aﬀect the colonizing pro cess. W e estimate an assortativ e mating coeﬃcient, mixture prop ortions, and regression relationships b et we en environmen tal factors and the mixture prop ortions in a single hi erarch ical m odel. The ﬁrst-stage like liho o d for genotypes in the newly founded colony is a mixture multinomial distribution reﬂecting the colonizing process. The environmen tal and demographic data are incorp orated int o the model through a hi er arc hical pr ior structure. A simulation study is conducted to inv estigate the p erformance of the mo del by usi ng diﬀerent leve ls of p opu- lation divergence and num ber of genetic markers i ncluded in the analysis. W e use Mark o v chain M on te Carl o (MCMC) simulation to conduct infer ence for the p osterior distri butions of mo del parameters. W e apply the mo del to a data set derived from grey seals in the Orkney Islands, Scotland. W e compare our model with a similar model previously used to analyze these data. The r esults from both the simulation and application to real data indicate that our mo del prov ides b etter estimates f or the cov ari ate eﬀects. Con ten ts 1 Int ro duction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238 2 Mo dels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 9 3 Sim ulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3 4 Application to the gr ey s eal data set . . . . . . . . . . . . . . . . . . . . . 246 5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 Appendix: A ge ne r al approach for up dating a prop ortiona l vector . . . . . . . 249 Ac knowledgmen ts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 ∗ Supported by NIH Gran t 1R01-GM068449-01A1. 1 Departmen t of Statistics, Virginia Polytec hnic Institute, Blac ksburg, V A 24060, USA, e-mail: feng.guo @vt.edu 2 Departmen t of Statistics, Univ ersit y of Connect icut, Storrs, CT 06269, USA, e-mail: dey@stat .uconn.e du 3 Departmen t of Ecology and Ev olutionary Biology , University of Connecticut, Storrs, CT 06269, USA, e-mail: kent@dar win.eeb. uconn.edu AMS 2000 subje c t classiﬁc ations: Pr imary 60K35, 60K35; secondary 60K35. Keywor ds and phr ases: hier ar c hical Bay es, MCMC, multinomial. 237 238 F. Guo, D . K. De y and K. E. Holsinger 1. Introduction Fisheries scientists and marine biologis ts are often faced with the problem of iden- tifying prop ortions of individuals in a single catch tha t come from diﬀerent sto cks. Estimating these prop ortions is necessary for e v aluating the eﬀect of commercia l ﬁsheries o n individua l ﬁsher ies sto cks and for understanding the ecologica l factors that inﬂuence the r elative co n tributions of diﬀerent stocks. Similarly , those who study marine mammals ar e often in terested in identifying the so urce p opulations for newly founded colonies as well a s environmental o r demographic factors that inﬂuence the r e lative contributions of diﬀeren t so urces. The increa sing eas e with which genetic data are collected a nd the tendency for p o pulations o f spec ie s to bec ome genetically diﬀerentiated ov er time has led to the increase in using ge- netic markers to estimate the prop ortio na l contribution of source populatio ns to mixed sto cks. The rationale is simple: alle le frequencies are likely to diﬀer among source p opulations, and genot yp e freq uencies in the ha r vest site/new habitat ar e determined by the prop or tional contributions of the source p opulatio ns. Both the diﬀerences among source populatio ns and the mixture pr op ortions ca n b e detected by a ppr opriate s tatistical mo dels. Several metho ds have b een developed for the inference o f the pr op ortional co n- tribution, m , wher e m i is the p ercentage of individuals in the mixed p opulation originating from so ur ce i . Conditional Ma x im um Likelihoo d Estimates (MLEs ) hav e bee n widely used [ 8 , 9 ]. The conditional MLE ass umes the sampled source p o pu- lations are ex haustive lists o f all p ossible sources and the allele frequency of the sources ar e known without error. Neither a ssumption is s atisﬁed for real samples. Smouse et a l. [ 12 ] extended conditional MLEs to unconditional MLEs in which the source a llele fr equencies are trea ted as es timated par ameters a nd unknown so urces are allow ed to b e presented in the mo del. In r ecent years, several authors have applied Bay esian metho ds to the sto ck mixture problem [ 2 , 3 , 4 , 9 , 11 ]. One adv antage of a Bay esian mo del is that the inﬂuence of the genotype data from the mixture population on the estimation of source allele freq ue nc y is fully incorpo rated thr o ugh the joint likeliho o d a nd is reﬂected in the p o sterior distribution of allele freq ue nc ie s. Another adv an tage is that we can include non-g enetic infor mation in the model through appropriate priors. W e ca n, for e x ample, s et the prior distribution of mix tur e pr op ortions as a function of ecolo gical or demographic parameter s . Un til r ecently interest has larg ely fo cused on infer ence o f the prop ortional sto ck contributions, but ther e is incre a sing interest in understanding the ecologica l fac- tors that inﬂuence those prop ortions , e.g., source p opulation size and the distance betw een the so urce and mixture habitat [ 3 , 9 ]. While including these re la tionships is diﬃcult to implement in classic a l mo dels , a hier archical Bay esian mo del can easily incorp ora te these relations hips into the prio r for the vector of pr o p ortional contri- butions, m . Existing approa ches for inference on m either use an additive log istic transformatio n with parameter s assumed to have normal prio rs on the logistic sca le [ 9 ] or mo del m directly on the simplex using a Dirichlet with para meters assumed to be lognorma l [ 3 ]. In this pap er, we prop ose an alternative form ulation of Diric hlet prior structure that is b oth more eﬃcie nt in separa ting mean and v ariance eﬀects and also dir ectly interpretable. W e conduct a simulation study to demonstrate our approach and inv estigate its perfo rmance by v arying the level of diﬀeren tiation among sour c e p opulations and num b er o f genetic markers. W e apply our mo del to the data derived from grey seals in the Orkney Islands and compar e o ur r esults with those obtained with a Dirichlet–lognormal pr ior [ 3 ] and with a model using a Origin of a mixe d p opulation 239 uniform prior for m . 2. M o dels W e conduct our a nalysis in a Bayesian fra mework. The pa r ameters, such as (rela - tive) allele fr equency , P , and pro p o rtional contribution, m , are considered as r an- dom v ariables and the statistical inference is ba sed on the p os terior distributions o f parameters . In this analys is, the genetic data from sour ce a nd mixture po pula tions are included in the likeliho o d function and the cov ariate information is included through a hierar chical prior structure. The likelihoo d o f the data is derived from genetic theory . The genetic da ta consis t of tw o parts: the alle le counts fro m so urce p opulatio ns and genotype counts from the mixed p o pula tion. Gagg io tti’s model [ 3 ] deals with the situation where there is o ne ne w co lony a nd se veral sour ce p opulations that might contribute to the founding gro up of the new colony . The data are collected from the ﬁrst genera tio n descendants of migrants, but the mo del a llows for non-random, assor tative mating, i.e., individuals from the sa me source po pulation ar e more likely to mate with o ne another than thos e fro m diﬀerent sour ce p o pulations. Consider a ﬁrst g eneration descent individual k in the new co lony whose mother is fr om p opulation i a nd father is from p opula tion j . Denote P ( y k | ij ) as the proba- bilit y tha t this individual has genotype y k , which includes L lo ci. Denote ( a 1 lk , a 2 lk ) as the ge notype of individual k a t lo cus l . First c onsider individuals with both pa r- ent s from the same p opula tion, i.e., i = j . Ass ume mating is random among those individuals and Hardy–W einberg E quilibrium (HWE), whic h states that the fre- quency of the heterozygo us g enotype is t wice that of the homozygous genotype, holds. The proba bilit y of genotype y k is, (2.1) P ( y k | ii ) = L Y l =1 δ lk p a 1 lk ; li p a 2 lk ; li , where p a 1 lk ; li is the allele fre q uency of a 1 lk at lo cus l in p opulatio n i , and δ lk is an indicator v a riable deﬁned as δ lk = ( 1 , if a 1 lk = a 2 lk , 2 , if a 1 lk 6 = a 2 lk . When the parents are from diﬀerent p o pulations, i.e., i 6 = j , P ( y k | ij ) is given b y (2.2) P ( y k | ij ) = L Y l =1 ( p a 1 lk ; li p a 2 lk ; lj + γ lk p a 2 lk ; li p a 1 lk ; lj ) , where γ lk =  0 , if a 1 lk = a 2 lk , 1 , if a 1 lk 6 = a 2 lk . Parameter γ lk indicates that when alleles at a lo cus a re diﬀerent, there are tw o diﬀerent ways of assig ning them to parents in diﬀerent so urce popula tions. When mating happ ens ass ortatively , i.e., individuals tend to mate w ith those from the same s ource p o pula tion, HWE is not v alid. The assor tative mating ca n b e mo deled b y an assor tative mating co eﬃcient ω ∈ (0 , 1). Sp eciﬁca lly , a prop ortion 240 F. Guo, D . K. De y and K. E. Holsinger ω of ﬁrs t generatio n descendants arise from ass ortative mating among individuals from the same s ource and a pr op ortion 1 − ω a rise from r andom mating among all migrants. Conseq uent ly , the likeliho o d of ﬁnding the geno type y k in a sample from the new colo ny is as follows: P ( y k | ω , P , m ) = ω I X i =1 m i P ( y k | ii )+ (1 − ω )   I X i =1 m 2 i P ( y k | ii ) + I X i =1 X j 6 = i m i m j P ( y k | ij )   , (2.3) where P ( y k | ii ) and P ( y k | ij ) is as in ( 2.1 ) and ( 2.2 ). Since HWE is as sumed for source p opulatio ns, the genotype frequency is de- termined by the allele fr equency . It is easy to show that the likeliho o d asso ciated with the genotype frequencies is equiv alen t to a multinomial with para meters cor- resp onding to the a lle le frequencies and r esp onse v ariables as allele co un ts from source po pulations. If we assume indep e ndenc e among the genotype counts across lo ci and p opula tio ns, the likelihoo d function for s ource allele counts is a pro duct m ultinomial: P ( N | P ) ∼ I Y i =1 L Y l =1 A l Y j =1 p N jli j li , where N j li is the allele count for source po pulation i at lo cus l for a llele a j and A l is the num ber of alleles at lo cus l . The prior distributions for ω and P r eﬂect prior beliefs ab out plausible v a lue s for these para meters. W e choose v ag ue/noninformative prior s to reﬂect the fact that we hav e no reason to exp ect particular v alues for thes e par ameters. In particular , we a ssume an unifor m pr io r on (0 , 1) for the a ssortative mating co eﬃcient ω . F or allele frequencies in the source p opulations , P , we assume a Dirichlet prior π ( p | α ) ∝ I Y i =1 L Y l =1 A l Y j =1 p α jl − 1 j li . As there is no previous data or preference for P , it is rea sonable to take α ’s all equal to 1, leading to a symmetric Dirichlet prior with para meter 1. The key to this analy sis is how to incorp orate the demogra phic/environment al factors into the es timation of the prop ortio nal contribution m . Information on m is obtained only indirectly thro ugh its inﬂuence on genotype frequencies in the mixed po pulation. Thus, ther e is no simple data likelihoo d connecting m and the cov ari- ates. In a Bay esian framework, howev er, we ca n assig n an infor mative prior for m containing the informa tion from the cov aria tes. The challenge in doing a r egress ion t yp e analysis is that the sum of the comp onents of m must be equal to 1. Since cov a riates hav e to b e considere d for every source, an ordinar y linear mode l or logit transformatio n do es not ﬁt here . Okuyama a nd Bo lker [ 9 ] overcome this problem by using an a dditive log ra tio tr ansformation bas ed on a n additive logistic normal distribution. How ever, a ba seline p opulation has to b e s e lected and the cov ariates need to be adjusted accor ding to the baseline p opula tio n, which makes it ha rd to accommo date multiple cov aria tes and the in terpretation o f the co eﬃcients is no t straightforward. Gagg iotti et al. [ 3 ] use a hiera rchical Dirichlet prio r to addr ess this problem. In Ga ggiotti’s setup, the ﬁrst level prior distribution for m is a Dirichlet Origin of a mixe d p opulation 241 distribution in whic h the individual para meters follow a lognor mal distribution, i.e., m ∼ D ( ψ ) , (2.4) log( ψ i ) ∼ N ( µ i , σ 2 ) , µ i = α 0 + p X r =1 α r G r i , where G r i is the v alue of the r th factor for so ur ce p opulatio n i and α = ( α 0 , . . . , α p ) is the vector of regress ion co eﬃcients. The v alue of the cov aria tes are standa rdized and the prior for α r is N (0 , σ 2 p ). The cov a riates aﬀect the prior throug h the para m- eter ψ . W e introduce a new hierarchical prior structure for the mix tur e prop ortions m . The ﬁrs t level prior for m is a Dirichlet dis tribution with para meters ((1 − ρ ) /ρ ) ϕ , wher e ρ ∈ (0 , 1 ) and ϕ are the hyperpara meters of the prior sub ject to the constraint P I i =1 ϕ i = 1. This form of prior is widely used in p o pulation genetics for its relationship with the measure of popula tion diﬀerentiation, e.g., W right’s F S T [ 1 , 5 , 6 ]. Due to the fac t that the cov ar iates ar e also observed, another hierarchical level is added to inco r p orate the randomness. A Dir ichlet prio r with par ameter η is a ssigned to ϕ . The cov ariates ar e included in the mo del by setting the logarithm of η to b e a function of a linear combination of the cov a r iates, i.e., m ∼ D ( 1 − ρ ρ ϕ ) , ϕ ∼ D ( η ) , (2.5) log( η i ) = α 0 + p X r =1 α r G r i . Since the cov ariates are no rmalized, the regressio n co eﬃcients α r ’s are assumed to be indep endent of each other. Nor mal prio r s with mean zero and a large v ariance, σ 2 p = 10 , ar e assigned to para meter α . The full mo del is: π ( P , ω , m , ρ , ϕ , α | Y , N ) ∝ p ( Y | P , ω , m ) π ( m | ρ, ϕ ) π ( ϕ | α ) π ( α ) π ( ω ) p ( N | P ) π ( P ) , (2.6) where Y is the genotype data of the new co lo ny , N is the alle le coun t in source po pulations, and P is the a lle le fr e q uency . The Directed Acyclic Gr a phs (D A Gs) of the Dirichlet–Dirichlet and Dirichlet–Lognormal mo dels ar e pr esented in Figure 1 and Figure 2 . The prior structure of o ur mo de l puts the supp ort of ρ betw een (0 , 1) and the v alue of ((1 − ρ ) /ρ ) ϕ i on the entire p ositive line. This setup brings several adv an- tages. First, a natural v ague prior for ρ is simply a unifor m distribution betw een 0 and 1, U (0 , 1). Seco nd, when we use U (0 , 1) as a pr ior, the p oster ior mean of ρ can be used as an indicator of the disp ersion of the regres sion of m on de- mograpic/ environmen tal factors. The v a riance ass o ciated with comp onent m i is ϕ i (1 − ϕ i ) ρ , and 1 − ρ is roughly the prop or tion of v ar iance in m explained by the regre s sion (cf. [ 6 ]). Third, an informative prior on ρ ca n b e use d to inﬂuence the v ariance o f the prior and the relative weigh t of e nvironmental cov ar iates and 242 F. Guo, D . K. De y and K. E. Holsinger Fig 1 . Dirichlet–Dirichlet mo del. Fig 2 . Diric hlet–L o gnormal mo del. genetic da ta on the po sterior of m . T o see this, observe that (1 − ρ ) /ρ increases when ρ decreases a nd the v aria nce o f D (((1 − ρ ) /ρ ) ϕ ) is de c ided by the abs o lute v alue of ((1 − ρ ) /ρ ) ϕ . Thus a small v a lue of ρ corre s po nds to a small v ariance. As the v ariance of the prior usually decides the relative weight of the pr ior information on the p o sterior dis tr ibution, ρ indicates the r elative weigh t of the cov ariates in the po sterior distribution o f m . In the mo dels of [ 3 ], [ 9 ], and the mode l prop os ed in this pap er, the eﬀect of the cov ariates is incor po rated through the prior for m . The parameter o f interest, m , is determined by both genetic information through the likeliho o d and demo- graphic/e nvironmental information through the prio r. An impor tant ques tion is, what is the r elative inﬂuence of these t w o so urces of information o n pos terior in- ference? The inﬂuence of the prio r is usually directly related to its v ariance: with a large v ar iance the poster ior is dominated b y the lik elihoo d while with a small v aria nce the pos terior is dominated by the prior. F or the Dirichlet–lognormal prior in ( 2.4 ), the mean o f m is controlled by the v alue o f ψ . At the s ame time the v a ri- ance of m is aﬀected by b oth the magnitude of ψ a nd the distribution of σ 2 . The int eraction a mong these t w o pa rameters tends to increa se the uncertaint y in the po sterior distribution. In c ontrast, the prior pro p o sed in ( 2.5 ) clear ly separ ates the roles of the par ameters: the mean of m is deter mined by ϕ and the v ariance of m is c ontrolled b y ρ . T his separa tion is due to the constraint that ϕ is on the s implex. In the simulation study and the applicatio n to a r eal life da taset, we illus trate that the Diric hlet–Dirichlet prio r shows les s v ariation in es timating the cov ariate eﬀects while providing co mparable cov erage of int erv a l estimates. Origin of a mixe d p opulation 243 3. Si m ulation study W e conduct a simulation study to inv estigate the p erformance o f our propo sed mo del by simulating da ta from po pulations with diﬀerent levels of genetic dif- ferentiation as well as diﬀerent num bers of genetic markers. As discussed above, the estimation of the mixtur e pro p ortions r elies on the div ergence among s ource po pulations. W e ar e interested in how diﬀeren t level of div ergence among source po pulations w ould aﬀect the p osterior distributions of the par a meters of interest. F rom a practical p oint of v iew, p opulation divergence level ca nnot b e cont rolled by resear chers. Ins tead, resear chers can deter mine how many genetic markers are to b e assigned and included in the analysis. Hence, we a re also in terested in the relatio n- ship b etw een the n um ber of lo ci and the p oster ior distribution of the parameters . W e co nsider three simulation scenario s. Under the ﬁrst scena r io, the level o f po pulation diﬀerentiation is mo derate and there are a rela tively small num ber of genetic markers, e.g., 8 lo ci are av a ila ble. Under the second s cenario, the num ber of lo ci is the s ame as the ﬁrst scenar io but the level o f ge ne tic diﬀeren tiation a mong source p opulations is higher . Under the last scenario, the genetic v ar iation is the same as that of the ﬁrst scenar io but the num b er of g enetic ma rkers is doubled, i.e., 16 genetic markers a re a v ailable. The num ber of source p opulations, num ber of individuals in the mixed po pulation, and allele counts in the source p opulations are compar able with those in the gr ey seal data we a nalyze later. In the ﬁrst part of the simulation we genera te allele c ounts in the so urce p opu- lations, whic h should reﬂect the level of genetic diﬀerentiation a mong them. This is r ealized through a hierar chical po pulation structure. W e assume that the al- lele freque nc ie s of the source p opula tions are fr om a common hyper-p opula tion, which has ﬁxed allele frequencies ψ , a L × A matrix with L being the num ber of lo ci and A being the num ber of alleles at each lo cus. (Without loss of generality , we assume all lo ci have the same num ber of alleles.) The allele frequenc ies p li , a 1 × A vector, for s ource i a nd lo cus l ar e r andom samples fro m a Dirichlet distri- bution, D (((1 − θ ) /θ ) ψ l ), where ψ l , a 1 × A vector, is the alle le frequency of lo cus l for the hyper-p opulatio n, and θ is a p opula tio n divergence measure used widely in populatio n genetic studies, namely W right’s F st . No te that E [ p j li ] = ψ j l and V a r [ p j li ] = θ ψ j l (1 − ψ j l ), where ψ j l is the allele frequency of lo cus l , allele j for the hyper-p opulation. W e cho ose θ = 0 . 0 5 a nd θ = 0 . 2 for small and large divergence scenarios , re sp ectively . The deta iled s im ulation is descr ib ed a s follows. Step 1: generate allele fr equencies of the h yper p opulation, ψ , by generating a r andom sample L times from an A di- mensional s ymmetric Dirichlet distribution with parameter 1. Step 2: gene r ate allele frequencies, p li , from the Dirichlet distributio n D (((1 − θ ) /θ ) ψ l ) with predeﬁned θ . Step 3: genera te a llele c o unts, N li , for sourc e i a nd locus l , from a multinomial distribution with tota l a llele counts N = 4 00, and proba bility p li (from step 2). In the second par t of the simulation we genera te genotypes of indiv iduals from the mix e d p o pulation, which requir es the prop ortio nal contributions m and the probability of ea ch genotype. W e adopt ﬁxed prop or tional co ntributions, which ar e a function of the tw o cov a riates. Note that in b o th the Dirichlet–Dirichlet ( 2.5 ) and the Dirichlet–lognorma l ( 2.4 ) mo dels, the conditional exp ecta tio ns of the prior for m ar e the same, namely , E [ m i | α ] = e α · G i P I i =1 e α · G i , where α is the vector of regres sion co e ﬃcient s and G i is the vector of cov ar iates for 244 F. Guo, D . K. De y and K. E. Holsinger T a ble 1 Normalize d c ovariates Source Distance Pro ductivi ty 1 − 0 . 295 1 . 298 2 − 0 . 849 1 . 285 3 − 0 . 822 − 0 . 238 4 − 0 . 562 − 1 . 256 5 − 0 . 326 − 0 . 729 6 1 . 533 0 . 286 7 1 . 320 − 0 . 646 source i . W e use t w o cov ar iates with the v alues shown in T able 1 and the co eﬃcients are set to α 1 = − 0 . 5 a nd α 2 = 0 . 5 . The genotype of a n individual k is generated b y the following steps. First, we decide whether its parents are from the same so urce by co mparing a unifor m random nu m ber on [0 ,1] with a pres et assor tative co eﬃcient w = 0 . 0 5 . The second step is to generate the g enotype frequency at each lo cus. If the parents are from the same source popula tion, the probability of genotype y k is P I i =1 m i P ( y k | ii ), where P ( y k | ii ) is a s in ( 2.1 ). If the parents co me randomly from the s ource p opula tions then the probability of genotype y k is P I i =1 m 2 i P ( y k | ii )+ P I i =1 P j 6 = i m i m j P ( y k | ij ), where P ( y k | ij ) is the probability o f parents from diﬀerent sourc e p opula tions as in ( 2.2 ). Once w e hav e the probability of each genotype for individual k at lo cus l , we can easily gener a te the genotype fr o m this probability . Step 2 is re p ea ted for e a ch lo cus of the individual to get the co mplete genotype of individual k . The a bove steps are r e pea ted 1 60 times to g et the genotypes of 160 individuals in the mixed po pulation. W e generate 50 data sets fo r each o f the three scenar ios, and we ﬁt b oth the Dirichlet–Diric hlet pr io r ( 2.5 ) prop osed in this pap er and the Dirichlet –lognor mal prior ( 2.4 ) to each data se t using a MCMC metho d. Since most of the parameter s are vectors on a simplex, w e use a multi-dimensional logit tr ansformatio n to put the suppo rt of the transfo rmed parameters on the real line and remove the simplex con- straint. A normal prop osa l density is then used to conduct a Metrop olis– Ha stings upda te nested in the Gibbs s ampling. Details of the MCMC up date pr o cedure are presented in the Appe ndix. F or mo st of the da ta set, we conduct 30,0 00 iterations in the simulation with 5 ,000 burn-in and thin the MCMC output by 5. F or chains showing suspicious c o nv ergence b ehavior, longer itera tions and ﬁne tuning ar e used to ensure co nvergence. T able 2 pres ent s a summar y of the p osterio r analysis , including the av erage o f the po sterior means , pos terior standar d devia tions, ro ot mea n sq ua re erro r (RMSEs), and the lengths of the 95% highest pr o bability density (HPD) interv als. In general, the p osterior means of m are reasonably c lose to the true v alues in all scenarios . The eﬀects of p o pula tion divergence and n um ber of lo ci are reﬂected ma inly in the po sterior disper sion of m . As s hown in the T able, the lengths of the 95% HPD int erv als , the po sterior standard devia tions, and the RMSEs, all indicate that the po sterior disp ersion of m decr e ases with the increase of p opulation diﬀerentiation. Given the sa me level of po pulation diﬀerentiation, increa s ing the num ber o f genetic markers also signiﬁcantly impr ov es the precision o f p oster ior es timation fo r m . These results sug gest that although in pra ctice the p o pulation divergence is a lways ﬁxed, collecting and including more genetic mar kers in the analys is can signiﬁcantly improv e the estimatio n of the prop or tional cont ribution parameter s m . F or the regression co eﬃcient α , b oth models provide rea sonable estimates for the p oster ior means. How ever, the p o s terior v ar ia tion is la rge and the 9 5% HPD Origin of a mixe d p opulation 245 T a ble 2 Posterior summary of simulation study Dirichlet–Dirichlet Dirichlet–lognormal TRUE θ = 0 . 05 θ = 0 . 20 θ = 0 . 05 θ = 0 . 05 θ = 0 . 2 θ = 0 . 05 L=8 L=8 L=16 L=8 L=8 L=16 m 1 0 . 249 ∗ 0 . 256 0 . 254 0 . 239 0 . 258 0 . 255 0 . 239 ∗∗ 0 . 044 0 . 033 0 . 036 0 . 045 0 . 032 0 . 036 ∗∗∗ 0 . 063 0 . 052 0 . 053 0 . 065 0 . 052 0 . 054 ∗∗∗∗ 0 . 170 0 . 127 0 . 138 0 . 173 0 . 126 0 . 139 m 2 0 . 327 0 . 326 0 . 377 0 . 350 0 . 328 0 . 379 0 . 351 0 . 045 0 . 036 0 . 039 0 . 046 0 . 036 0 . 039 0 . 068 0 . 076 0 . 058 0 . 069 0 . 079 0 . 059 0 . 175 0 . 140 0 . 150 0 . 178 0 . 141 0 . 151 m 3 0 . 151 0 . 134 0 . 125 0 . 137 0 . 135 0 . 125 0 . 138 0 . 038 0 . 026 0 . 031 0 . 038 0 . 026 0 . 031 0 . 052 0 . 053 0 . 048 0 . 052 0 . 053 0 . 049 0 . 145 0 . 099 0 . 121 0 . 149 0 . 100 0 . 121 m 4 0 . 079 0 . 073 0 . 065 0 . 077 0 . 072 0 . 064 0 . 075 0 . 033 0 . 020 0 . 027 0 . 033 0 . 020 0 . 026 0 . 042 0 . 038 0 . 039 0 . 044 0 . 038 0 . 039 0 . 119 0 . 075 0 . 103 0 . 121 0 . 074 0 . 100 m 5 0 . 092 0 . 094 0 . 078 0 . 090 0 . 094 0 . 077 0 . 090 0 . 033 0 . 022 0 . 027 0 . 034 0 . 021 0 . 028 0 . 051 0 . 036 0 . 044 0 . 051 0 . 037 0 . 044 0 . 1259 0 . 082 0 . 103 0 . 126 0 . 082 0 . 105 m 6 0 . 060 0 . 069 0 . 060 0 . 060 0 . 066 0 . 059 0 . 059 0 . 033 0 . 020 0 . 025 0 . 033 0 . 020 0 . 024 0 . 045 0 . 032 0 . 038 0 . 046 0 . 033 0 . 038 0 . 115 0 . 074 0 . 090 0 . 114 0 . 074 0 . 089 m 7 0 . 042 0 . 047 0 . 041 0 . 048 0 . 047 0 . 040 0 . 048 0 . 025 0 . 015 0 . 022 0 . 026 0 . 016 0 . 0232 0 . 038 0 . 029 0 . 037 0 . 039 0 . 029 0 . 036 0 . 088 0 . 057 0 . 078 0 . 089 0 . 056 0 . 079 α 1 − 0 . 500 − 0 . 398 − 0 . 429 − 0 . 430 − 0 . 485 − 0 . 493 − 0 . 493 0 . 477 0 . 469 0 . 479 0 . 613 0 . 578 0 . 587 0 . 516 0 . 496 0 . 503 0 . 659 0 . 615 0 . 625 1 . 849 1 . 815 1 . 849 2 . 383 2 . 268 2 . 305 α 2 0 . 500 0 . 449 0 . 520 0 . 433 0 . 538 0 . 618 0 . 515 0 . 416 0 . 407 0 . 408 0 . 535 0 . 519 0 . 523 0 . 437 0 . 425 0 . 422 0 . 555 0 . 551 0 . 535 1 . 637 1 . 600 1 . 602 2 . 097 2 . 041 2 . 056 ω 0 . 050 0 . 037 0 . 014 0 . 014 0 . 037 0 . 014 0 . 014 0 . 035 0 . 013 0 . 014 0 . 035 0 . 013 0 . 014 0 . 040 0 . 039 0 . 038 0 . 040 0 . 039 0 . 038 0 . 106 0 . 040 0 . 042 0 . 106 0 . 040 0 . 042 *: a verage of p osterior means; **: a verage of p osterior standard dev iations; ***: av erage of RMSEs; ****: av erage length of 95% HPD interv als. int erv als all co ntain zero . Results from the sim ulation study indicate that neither level of p opula tion diﬀer e n tiation nor num ber of lo ci ha s signiﬁca nt eﬀects on the precision of α estimates. W e consider this as a reasonable r esult since the cov aria te co eﬃcients ar e essentially a regressio n ov er 7 data p oints, i.e ., the 7 source p opu- lations. The level o f divergence and num b er of lo ci improve the precisions of the po sterior v ariance sfor m , which only aﬀect α indir ectly . With only 7 data p oints, few simulated data sets will b e able to provide strong supp ort for a r egressio n re- 246 F. Guo, D . K. De y and K. E. Holsinger lationship. Thus, incre a sing the n um ber of lo ci or studying highly diﬀerentiated po pulations will do little to improve po sterior estimates of α . A larg er num ber of po pulations would b e r equired to provide statistically supp ortable evidence of the eﬀects. The a dv antages of the Dirichlet–Diric hlet prior prop os ed in this pap er are se en primarily in the reduced p osterior v ariation of the reg ression co e ﬃcient s, α 1 and α 2 . Under the Dirichlet–Dirichlet prior , the po sterior standard devia tions and RMSEs are uniformly smaller than that of the Diric hlet–lognor mal mo del even though the prior v a riances for α are all set to the same v alue, i.e., σ 2 p = 10. W e consider this as mainly due to the confounding of the eﬀects that b oth ψ and σ 2 hav e on the v aria nces of m . Another p oss ible reaso n is the e ﬀects of the pr ior for τ = 1 / σ 2 . In any cas e, the Diric hlet–Dirichlet prio r has the adv an tage o f leading to mor e precise estimation of regres sion co eﬃcients and eas e in picking a non-informative prior without sacriﬁc ing nominal cov erage of cre dible interv als. 4. Appl ication to the grey seal data set T o illustrate the usefulness of our appr oach, w e apply it to data fro m grey seal, Helic o erus grypus , p opulatio ns in the Orkney Islands, whic h were als o analy zed by [ 3 ] and [ 4 ]. The da ta c o nsist of allele frequencies of 8 lo ci for seven source colonies and the ge notype frequencies fo r a newly esta blished colony on Stronsay island. There ar e t w o explanato ry v ariables a s so ciated with ea ch sourc e p opula tio n: dis- tance b etw een the sourc e isla nd and Strons ay island ( α 1 ), and the ’productivity’ index, which is related to the p opula tion density and siz e of the so urce po pula- tion, ( α 2 ). The g enetic data w ere collected from the ﬁrst generation descenda nt s of migr ants to Stronsay . W e use the likelihoo d in equation ( 2.3 ) to allow for the po ssibility that migr ants ar e more lik ely to mate with other individuals fro m the island from which they migr ated. Since there is no closed for m for the p osterior distribution, we us e MCMC metho ds for p oster ior inference. W e compare results from three mo dels with diﬀeren t pr iors: the Dirichlet– Dirichlet prior, the Dir ichlet–Lognormal prior, and a mo de l with the symmetric Dirichlet prior with parameter 1 for m , which corresp onds to a mo del in which cov a riate eﬀects ar e not incor p o rated. Our results reveal that diﬀerences a mong the models rare ly lead to substantial diﬀerences in the mean po sterior lik eliho o d, which is intuitiv ely reaso nable. The part of the likelihoo d function concer ned with source po pulation allele freque ncie s is identical acr oss all mo dels, and the part of the lik eliho o d conce rned with colony allele frequencies is tightly tied to the observed genotypes. The three mo dels diﬀer only throug h the prio r for the prop or tional co n- tributions m , which ha s limited impact o n the likeliho o d unless the so urce p opula- tion diﬀers s ubstantially . A direct c o nsequence of these prop erties is the similarity among metrics for mo del ev alua tio n mea sures that use only the likelihoo d, e.g ., DIC [ 13 ] and the lo garithm of the pseudo marginal likelihoo d (LPML) [ 7 ]. As shown in T able 3 , neither DIC nor LPML provides strong s upp or t for a ny of the mo dels relative to the others. The p os ter ior densities of the mode l parameter s a re given in Figure 3 . T able 4 gives the p os ter ior means and 95% HPD interv als of the pa rameters of interest: m , τ , α and ρ . It ca n b e s een tha t the p os terior means of m are quite diﬀeren t fo r the mo del with symmetr ic Dirichlet prio r (with no cov ariates) and the mo dels us- ing cov aria te infor mation. Spec iﬁc a lly , mo dels us ing cov aria te infor mation suggest a larg er prop or tio n fro m sources 2 a nd 3 than the uniform model, whic h is a r ea- sonable result since sour c e s 2 and 3 a re the clo sest source islands to the new colony Origin of a mixe d p opulation 247 T a ble 3 Mo del evaluation Mo dels Dbar pD DIC LPML Dirichlet–Diric hlet 8044 336 8380 − 3023 Dirichlet–logno rmal 8042 337 8379 − 3023 Uniform 8045 336 8381 − 3023 Fig 3 . The p osterior densit ies of p ar ameters in the Dirichlet–Di ri chlet mo del. T a ble 4 Posterior me ans and 95% HPD intervals Dirichlet–Dirichlet Dirichlet–lognormal Uniform Mean 95% HPD Mean 95%HPD Mean 95% HPD m 1 0 . 097 (0,0.280) 0 . 101 (0,0.282) 0 . 099 (0,0.250) m 2 0 . 297 (0.086,0.526) 0 . 305 (0.085,0.542) 0 . 243 (0.046,0.436) m 3 0 . 3 (0,0.514) 0 . 324 (0.045,0.586) 0 . 258 (0.035,0.475) m 4 0 . 113 (0,0.300) 0 . 11 ( 0,0.305) 0 . 104 (0,0.256) m 5 0 . 061 (0,0.196) 0 . 07 ( 0,0.195) 0 . 081 (0,0.198) m 6 0 . 052 (0,0.214) 0 . 037 (0,0.179) 0 . 092 (0,0.234) m 7 0 . 079 (0,0.270) 0 . 053 (0,0.230) 0 . 123 (0,0.289) α 1 − 0 . 494 ( − 1.808,0.668) − 1 . 03 ( − 3.005,0.961) 0 (0,0) α 2 0 . 113 ( − 0.864,1.084) 0 . 182 ( − 1.191,1.680) 0 (0,0) ω 0 . 609 (0.105 ,1.000) 0 . 613 (0.110,1.000) 0 . 616 (0.056,0.986) ρ 0 . 118 (0,0.343) τ 1 . 453 (0.183,4.377) and po sterior analysis indicates distance has a mo der ate eﬀect on the prop ortion contribution. The 95% HP D interv als of all regr ession co eﬃcients in all models include zero , which is not surprising from the ana lysis of the simulation results. The large v aria- 248 F. Guo, D . K. De y and K. E. Holsinger tion in α is presumably due to the small num ber of s ource p opulations . Nonetheless, there is some supp ort for the notion that the co eﬃcient a s so ciated with dista nce, α 1 , is negative. The posterio r probability that α 1 is neg ative is more than 0 .785 for the Dirichlet–Diric hlet mo del and 0 .850 for the Dirichlet–lognor mal mo del. The po sterior probability that α 2 is po sitive in our mo del is 0.593 and 0.596 in the Dirichlet–lognormal mo del. The Dirichlet–Dirichlet mo del shows a shorter HP D int erv al co mpared to the Dirichlet–lognor mal mo del, which is also c o nsistent with the simulation results. In short, distance has a negative eﬀect o n the propo rtional contributions and p opulation s izes hav e minor po sitive eﬀects o n the pro p o rtional contributions. As discussed ab ov e, the pa rameter ρ in our hierarchical mo del is ana logous to a ‘go o dness of ﬁt’ measure for the relatio ns hip b etw een the cov ar iates a nd m . Spec iﬁc a lly , 1 − ρ is roughly the prop ortion of v aria nce in m explained by the regres s ion. As the results in T a ble 4 show, the p osterior mean for 1 − ρ is near 0.9, whic h indicates a fairly tigh t reg r ession in spite of the uncertainties asso ciated with α . In summary , we conclude that there is mo derate supp o r t for the hypothesis that increasing distances betw een the source and colony po pulations decr ease the prop ortiona l co n tributions of the source s to the colony . 5. C onclusions The primary g oal o f this analy sis is to incorpo rate environmental/demographic information into the estima tion of the pr o p ortional contributions of so urce p opula- tions to a new colony through appropriate informative priors . Two other mo dels are av a ilable which sa tisfy the co ns traint that the sum of the pr op ortional co nt ributions m ust equal one, i.e., additive logistic transformation [ 9 ] and Dirichlet–lognormal mo del [ 3 ]. W e intro duce a pa rametrizatio n for the Dirichlet prior der ived from p op- ulation genetics in which we s p ecify the mean, ϕ i , and v ariance, ρ (1 − ϕ i ) ϕ i , of the mixture para meter s and a linear mo del for the parameters of a seco nd Diric hlet that determines ϕ i . The Diric hlet–Dirichlet prior has several adv antages ov e r the alternatives. First, the parameter ρ has a natural v ague prior distribution, a uni- form distribution [0 ,1]. Second, ρ co nt rols the v ariance o f the Dirichlet prior and 1 − ρ has a natural interpretation as the pr op ortion of v ariance expla ined by re- gressio n. Finally , the mean o f the prop ortiona l co ntributions is not aﬀected by the parameter ρ a nd the regre s sion co eﬃcients hav e a direct interpretation a s regressio n eﬀects o n pro p o rtional repr esentation. The separatio n of mean eﬀect and v ariance eﬀect is a ma jor adv antage of the propo sed formulation co mpa red to alternative mo dels where the prop or tio nal c o ntribution for any given p opula tion dep ends on the re la tive ma gnitude o f co eﬃcients asso c ia ted with other regr ession comp onents and their rando m eﬀects. The simulation study indica tes that large r po pulation divergence would lead to more precis e estimation of the prop or tional contributions m . Given a par ticula r level o f popula tion divergence, b etter estimates of m can also b e achiev ed b y in- cluding more lo ci in the analysis . The simulations show that the Dirichlet–Diric hlet prior ha s better p er formance in es tima ting the r egress ion co eﬃcients in term o f po sterior v a r iation than a Dirichlet–lognor ma l prior . When we a pply our mo del to the grey sea l data we ﬁnd that the distance be- t ween a sour ce island and the new c olony play a mo derate role in its pro po rtional contribution but that the eﬀect of s ource po pula tion pro ductivity is weak. These results a re consistent w ith those presented in [ 3 ], but the po sterior v a riability o f the Origin of a mixe d p opulation 249 regres s ion co eﬃcients is s maller, as in the simulation study . The adv antages of the formulation pr esented here seem likely to b e gener ally av ailable in the ana lysis of comp ositional data. In pa rticular, a fo r mulation similar to the one used here ma y be generally useful in mo deling situatio ns wher e a dditiv e logistic transformations hav e b een the norm, b oth b ecause o f direct in terpretability of reg ression co eﬃcient s and the natur a l in terpretation of 1 − ρ as a go o dness of ﬁt measur e. App endix: A general approac h for up dating a prop ortio nal vector The co nditional distributions o f mo del parameters are non-standa r d distr ibutions; hence, w e use a Metrop olis–Hastings a lgorithm nested within Gibbs sampling to conduct e a ch MCMC up date. Several vector parameters, P , m ,and ϕ , are sub ject to the co nstraint that the s upp or t of their compone nts is on [0,1 ] and the summation equals to one. W e use a multidimensional lo git tr ansformatio n to ‘de-co nstrain’ the parameters and p er form Metrop olis – Hastings updating using a Norma l prop osal density . Let θ be a vector of dimension p +1 with cons tr aints θ i > 0 and P p +1 i =1 θ i = 1. Let θ i = exp( ξ i ) 1 + P p j =1 exp( ξ j ) . The Jaco bian ma trix ∂ f ( θ ) /∂ ξ is the matrix with entries x ij =      e ξ i + e ξ i ( P p j =1 e ξ j ) − e 2 ξ i (1+ P p j =1 e ξ j ) 2 , for i = j , − e ξ i + ξ j (1+ P p j =1 e ξ j ) 2 , for i 6 = j . It can b e shown that the determinant of the Jacobian matrix is e P p j =1 ξ j (1 + P p j =1 e ξ j ) p +1 . The full conditiona l distr ibution of ξ is f ( ξ | D ) = f ( θ | D ) e P p j =1 ξ j (1 + P p j =1 e ξ j ) p +1 . Instead of sampling θ , we co nduct a Metro p o lis-Hastings update for ξ using a normal pr op osal density N ( ˆ ξ , ˆ σ 2 ˆ ξ ), wher e ˆ ξ is the maximizer of π ( ξ | D ) and ˆ σ 2 ˆ ξ is the estimated v a riance, which could b e a ﬁxed v alue based on a pilot run or the inv erse of the scor e matrix. Alternatively , we can use a Normal pro po sal density centered at the curr ent v a lue . The algor ithm op era tes as follows: Step 1. Let ξ be the current v alue. Find the maximum likeliho o d estimate of ξ , ˆ ξ . Step 2. Gener a te a prop osal v alue ξ ∗ from N ( ˆ ξ , ˆ σ 2 ˆ ξ ). Step 3. A move from ξ to ξ ∗ is made with pro bability min          f ( ξ ∗ | D )Φ( ξ − ˆ ξ ˆ σ ˆ ξ ) f ( ξ | D )Φ( ξ ∗ − ˆ ξ ˆ σ ˆ ξ ) , 1          where Φ is the standard normal proba bility density function. The ξ is then conv erted back to its expression in terms of θ . 250 F. Guo, D . K. De y and K. E. Holsinger Ac kno wle dgment s. W e a re indebted to William Amos fo r sharing the data o n grey sea ls that mo tiv ated this study . Collection of these data was suppor ted by NERC g rant GR3/116 62 to W. Amos . References [1] Balding, D. J. and Nichols, R. A. (199 5). A metho d for quantifying dif- ferentiation betw een po pulations at m ulti-allelic lo ci and its implications for inv e s tigating iden tit y and paternity . Genetic a 96 3–12 . [2] Bolker, B. , Okuy ama, T. , Bjornd al, K. and Bol ten, A. (2003 ). Sto ck estimation for sea turtle p o pulations using genetic markers: accounting for sampling erro r o f rare genotypes. Ec olo gic al Applic ations 13 76 3–77 5. [3] Gaggiotti, O. E. , Brooks, S. P. , A mos, W. and Har wood, J. (200 4). Combining demographic, environmental and genetic data to test hypotheses ab out colonization ev ent s in metap opulations. Mole cu lar Ec olo gy 13 81 1–825 . [4] Gaggiotti, O. E. , Jones, F., Lee, W. M., Amos, W. , Har wood, J. and Nichols, R. A . (2002). Patterns of colonization in a metap opula tion of grey seals. Natur e 13 424– 427. [5] Holsinger, K. E. (1999). Analysis of genetic diversity in hierarchically str uc- tured p opulations: a Bayesian p ersp ective. Her e ditas 130 24 5–25 5. [6] Holsinger, K. E. and W allace, L. E. (2004). Bay esian approaches for the ana lysis of p o pula tion structure: an exa mple from Platanthera leucophaea (Orchidaceae). Mole cular Ec olo gy 13 88 7 –894 . [7] Ibrahim, J. G. , Chen, M.-H. and Sinha, D . ( 2001). Bayesi an Survival Analy sis . Springer, Ber lin. MR18765 98 [8] Millar, R. B. (1 9 87). Maximum likelihoo d estimation of mixed s to ck ﬁshery comp osition. Can. J. Fish. A quat. Sci. 44 5 83–59 0. [9] Milner, G. B. , Teel, D. J. , Utter, F. M. and Winans, G. A. (2 005). A genetic metho d o f sto ck identiﬁcation in mixed p opulations of paciﬁc salmon, oncorhync h us spp. Mar. Fish. R ev. 47 1– 8. [10] Okuy ama, T. an d Bol ker, B. M. (2005 ). Combining genetic and eco logical data to estimate sea tur tle origins . Ec olo gic al Applic ations 15 315– 325. [11] Pella, J. and Masuda, M. (2001 ). Bayesian metho ds for analys is of sto ck mixtures from ge netic c haracters . Fish. Bul l. 99 15 1–16 7. [12] Smouse, P. E . , W aples, R. S . a nd Towrek, J. A. (199 0). A genetic mix- ture analysis for use with incomplete so urce populatio n data. Can. J. Fish. A quat. Sci. 47 62 0 –634 . [13] Spiegelhal ter, D. J. , Best, N. G. , Carlin, B. P. and v an der Linde, A. (2002). Bay esian measures of mo del complexity and ﬁt. J. R oy. Statist. So c. Ser. B 64 58 3–639 . MR19793 80

A hierarchical Bayesian approach for estimating the origin of a mixed population

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment