Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European
Nicholls and Gray (2008) describe a phylogenetic model for trait data. They use their model to estimate branching times on Indo-European language trees from lexical data. Alekseyenko et al. (2008) extended the model and give applications in genetics.…
Authors: ** Robin J. Ryder, Geoffrey K. Nicholls **
Missing data in a sto hasti Dollo mo del for ognate data, and its appliation to the dating of Proto-Indo-Europ ean Robin J. Ryder and Geo K. Ni holls Departmen t of Statistis, Univ ersit y of Oxford, UK Septem b er 18, 2018 Abstrat Ni holls and Gra y (2008 ) desrib e a ph ylogeneti mo del for trait data. They use their mo del to estimate bran hing times on Indo-Europ ean language trees from lexial data. Aleksey enk o et al. (2008 ) extended the mo del and giv e appliations in genetis. In this pap er w e extend the inferene to handle data missing at random. When trait data are gathered, traits are thinned in a w a y that dep ends on b oth the trait and missing-data on ten t. Ni holls and Gra y (2008 ) treat missing reords as absen t traits. Hittite has 12% missing trait reords. Its age is p o orly predited in their ross- v alidation. Our predition is onsisten t with the historial reord. Ni holls and Gra y (2008 ) dropp ed sev en languages with to o m u h missing data. W e t all t w en t y four languages in the lexial data of Ringe et al. (2002 ). In order to mo del spatial-temp oral rate heterogeneit y w e add a atastrophe pro ess to the mo del. When a language passes through a atastrophe, man y traits hange at the same time. W e t the full mo del in a Ba y esian setting, via MCMC. W e v alidate our t using Ba y es fators to test kno wn age onstrain ts. W e rejet three of thirt y historially attested onstrain ts. Our main result is a unimo del p osterior distribution for the age of Proto-indo-Europ ean en tered at 8400 y ears BP with 95% HPD equal 7100-9800 y ears BP . The Indo-Europ ean languages desend from a ommon anestor alled Proto-Indo-Europ ean. Lexial data sho w the patterns of relatedness among Indo-Europ ean languages. These data are ognay lasses: a pair of w ords in the same lass desend, through a pro ess of sound hange, from a ommon anestor. F or example, English se a and German Se e are ognate to one another, but not to the F ren h mer . Gra y and A tkinson (2003 ) o ded data of this kind in a matrix in whi h ro ws orresp ond to languages and olumns to distint ognay lasses, and en tries are zero or one as the language p ossessed or la k ed a term in the olumn lass. They analysed these data using ph ylogeneti algorithms similar to those used for geneti data. Our analysis has the same ob jetiv es, but w e t a mo del designed for lexial trait data. W e w ork with data ompiled b y Ringe et al. (2002 ), reording the distribution of some 872 distint ognay lasses in t w en t y four mo dern and anien t Indo-Europ ean languages. In setion 7, w e giv e estimates for the unkno wn top ology and bran hing times of the ph ylogen y of the ore v o abulary of these languages. Ni holls and Gra y (2008 ) analyse the same data using a losely related sto hasti Dollo-mo del for binary trait ev olution. Ho w ev er, those authors w ere unable to deal with missing trait reords. Missing data arise when w e are unable to answ er the question do es language X p ossess a ognate in ognay lass Y?. Ni holls and Gra y (2008 ) dropp ed sev en languages whi h had man y missing en tries, and treated missing trait reords in the remainder as absen t traits. This is unsatisfatory . Ho w ev er, it is not straigh tforw ard to giv e a mo del-based in tegration of missing data for the trait ev olution mo del of Ni holls and Gra y (2008 ). In this pap er w e in tegrate the missing trait data, and this te hnial adv ane allo ws us to t all t w en t y 1 four of the languages in the original data. The binary trait mo del of Ni holls and Gra y (2008 ), has b een extended b y Aleksey enk o et al. (2008 ) to m utliple-lev el traits, and is nding appliations in biology . A prop er treatmen t of missing data will b e of use in other appliations. W e are sp eially in terested in ph ylogeneti dating. Beause w e are w orking with lexial, and not syn tati data, it is the age of the bran hing of the ore v o abulary of Proto-Indo-Europ ean that w e estimate. This is a on tro v ersial matter. W ork ers in historial linguistis ha v e evidene from linguisti paleon tology that the most reen t ommon anestor of all kno wn Indo-Europ ean languages bran hed no earlier than ab out 60006500 y ears Before the Presen t (BP) (Mallory , 1989 ). F or a reen t review of the argumen t from linguisti paleon tology , and a ritiism of ph ylogeneti dating, see Garrett (2006 ) and MMahon and MMahon (2005 ). An alternativ e h yp othesis suggests that the spread b egan around 8500 BP when the Anatolians mastered farming (Renfrew , 1987 ) in the early neolithi. Reen t eorts to apply quan titativ e ph ylogeneti metho ds to dating Proto-Indo-Europ ean giv e a time depth of 8000 to 9500 y ears BP (Ni holls and Gra y , 2008 ; Gra y and A tkinson , 2003 ), supp orting the link to farming. In this w ork w e obtain a unimo dal p osterior distribution for the age of Proto-Indo-Europ ean en tered at 8400 y ears BP with 95% HPD equal 7100-9800 y ears BP One p oin t of view is that the argumen t from linguisti paleon tology is orret, and the ph ylogeneti dates are inorret, due to rate heterogeneit y , or some other mo del failing. In this resp et w e adv ane the sear h started in Ni holls and Gra y (2008 ) for a mo del misp eiation whi h migh t explain the 20% dierene the en tral age estimates from our tting, and the status quo. The dierene is large enough that w e are hop eful of nding a single oheren t error, if an y su h error exists. A oun ting no w for missing data, w e are able to publish a m u h wider ross-v alidation test (up from 10 to 30 tested alibrations). Also, w e t a mo del for expliit rate heterogeneit y in time and spae. In view of the spatial-temp oral homogeneit y w e nd here and in other data, the ase for the earlier date seems no w fairly strong. The most probable alternativ e seems to b e a step hange in the rate of lexial div ersiation ating in a o ordinated fashion aross the Indo-Europ ean territory some 3000 to 5000 y ears ago. In Setions 1 and 2.1 w e desrib e the data and sp eify a sub jetiv e prior for the ph ylogen y of v o abu- laries. In setions 2.2 and 3, w e giv e a generativ e mo del for the data and the orresp onding lik eliho o d. W e inlude, in Setion 3, a reursiv e algorithm whi h mak es the sum o v er missing data tratable. In setions 4 and 5 w e giv e the p osterior distribution on tree and parameter spae, and briey desrib e our MCMC sampler. In setion 6 , w e disuss lik ely mo del misp eiation senarios, test for robustness b y tting syn theti data sim ulated under su h onditions, and ross-v alidate our preditions. W e t the mo del to a data set of Indo-Europ ean languages in setion 7. This pap er has a supplemen t giving an analysis of a seond data set, olleted b y Dy en et al. (1997 ). Ph ylogeneti metho ds ha v e b een used to mak e inferene for tree ( Ringe et al., 2002 ) and net w ork stru- ture (MMahon and MMahon , 2005 ) in Historial Linguistis. W arno w et al. (2004 ) write do wn a more realisti mo del for the div ersiation of v o abulary , aoun ting for w ord-b orro wing b et w een languages (so that their history need not b e tree-lik e) but there is to date no tting. 1 Desription of the data The data group w ords from 328 meaning ategories and 24 Indo-Europ ean languages in to 872 homology lasses. The data w ere olleted and o ded b y Ringe et al. (2002 ). Meaning ategories o v er the ore v o abulary and are assumed relev an t to all languages in the study . T w o w ords of losely similar meaning, desended from a ommon anestor but sub jet to v ariation in phonology , are o gnate terms. A o gnay lass is a homology lass of w ords all b elonging to a single meaning ategory . F or example, for the meaning head", the Italian testa and the F ren h tête b elong to the same ognay lass, while the English he ad 2 Old English stierfþ Old High German stirbit, touwit A v estan miriiete Old Ch ur h Sla v oni um r et u Latin moritur Osan ? Old English 1 0 0 Old High German 1 1 0 A v estan 0 0 1 Old Ch ur h Sla v oni 0 0 1 Latin 0 0 1 Osan ? ? ? ( a ) ( b ) T able 1: An example of data o ding: (a), the w ord he dies in six Indo-Europ ean languages; (b), the o ding of this data as a binary matrix with ?'s for missing data. In the nota- tion of Setion 1, the rst olumn in (b) is a ognay lass M dies ∈ Ω dies with Ω dies = {{ Old English, Old High German } , { Old English, Old High German, Osan }} and the Sw edish huvud b elong to another ognay lass. An elemen t of a ognay lass is th us a w ord in a partiular language, a sound-meaning pair. These elemen ts are alled o gnates . The v o abulary of a single language is represen ted as a set of distint ognates. In our analysis a ognate has just t w o prop erties: its language and its ognay lass. If there are N distint ognay lasses in data for L languages, then the a 'th lass M a ⊆ { 1 , 2 , ..., L } is a list of the indies of languages whi h p ossess a ognate in that lass. The data are o ded as a binary matrix D . A ro w orresp onds to a language and a olumn to a ognay lass, so that D i,a = 1 if the a 'th ognay lass has an instane in the i 'th language, and D i,a = 0 otherwise. See T able 1 for an example. This o ding allo ws a language to ha v e sev eral w ords for one meaning (su h as Old High German stirbit and touwit for he dies", an instane of p olymorphism), or no w ord at all. Missing matrix elemen ts arise b eause the reonstruted v o abularies of some anien t languages are inomplete. If w e are unable to answ er the question do es language i p ossess a ognate in ognay lass a then w e set D i,a =? . W e need notation for b oth matrix and set represen tations with missing data. Denote b y B a olumn a of L × N matrix B . F or a = 1 , 2 , ..., N let D a b e the set of all olumn v etors d ∗ allo w ed b y the data D a in olumn a of D , D a = d ∗ ∈ { 0 , 1 } L : D i,a ∈ { 0 , 1 } ⇒ d ∗ a = D i,a , i = 1 , 2 , ..., L . F or d ∗ ∈ D a let m ( d ∗ ) = { i : d ∗ i = 1 } . Denote b y Ω a the set of ognay lasses onsisten t with the data D a , so that Ω a = { ω ⊆ { 1 , 2 , . . . , N } : ω = m ( d ∗ ) , d ∗ ∈ D a } . The data D are then equiv alen tly Ω = (Ω 1 , Ω 2 , ..., Ω N ) . The Ω a -notation generalizes the M a -notation to handle missing data. It is illustrated in T able 1. Ringe et al. (2002 ) list t w en t y four mostly anien t languages. F or elev en of these languages (Latin, Mo dern Latvian, Old Norse...), all the matrix en tries are reorded. F or the rest, the prop ortion of missing en tries v aries b et w een 1% (for Old Irish) and 91% (for Lyian, an anien t language of Anatolia). Note that data are usually missing in small blo ks orresp onding to the ognay lasses for a giv en meaning ategory , as in T able 1. W e do not mo del this asp et of the missing data. This is related to the mo del-error Ni holls and Gra y (2008 ) all `the empt y-eld appro ximation', under whi h ognay lasses in the same meaning ategory are assumed to ev olv e indep enden tly . 3 F or these Indo-Europ ean ph ylogenies, the top ology of some subtrees are kno wn from historial reords. W e ha v e lo w er and upp er b ounds for the age of the ro ot no de of some subtrees. F or example, the Sla vi languages are kno wn to form a subtree, and the most reen t ommon anestor of the Sla vi languages in the data is kno wn to b e at least 1300 y ears old. In this w a y in ternal no des of the ph ylogen y are onstrained. W e ha v e age b ounds for leaf no des as w ell, sine w e are told when the v o abulary of the anien t languages in these data w as in use. W e om bine these alibr ation onstrain ts with the ognay lass data in setion 2.1 . Jumping ahead to our results, Figure 5 is a sample from the p osterior distribution w e nd for ph ylogenies. Calibration onstrain ts are represen ted b y the bla k bars aross no des in this tree. 2 Mo dels W e sp eify a sub jetiv e prior for ph ylogenies, represen ting a state of kno wledge of in terest to us. The mo del w e giv e for the div ersiation of v o abulary in Setion 2.2 is, in on trast, generativ e. W e mo del the div ersiation of v o abulary as a bran hing pro ess of sets of ognates, ev olving on the ph ylogen y . Lea v es are lab eled b y languages and a bran h represen ts the anestral lineage of a v o abulary . 2.1 Prior distribution on trees The material in this subsetion follo ws Ni holls and Gra y (2008 ). W e onsider a ro oted binary tree g with 2 L no des: L lea v es, L − 2 in ternal no des, a ro ot no de r = 2 L − 1 and an A dam no de A = 2 L , whi h is link ed to r b y a edge of innite length. Ea h no de i = 1 , 2 , ..., 2 L is assigned an age t i and t = ( t 1 , t 2 , ..., t A ) ; the units of age are y ears b efore the presen t; for the A dam no de, t A = + ∞ . The edge b et w een paren t no de j and hild no de i is a direted bran h h i, j i of the ph ylogen y , with the orderings i < j and t i < t j . Let E b e the set of all edges, inluding the edge h r , A i , let V b e the set of all no des and V L = { 1 , 2 , ..., L } the set of all leaf no des. Our base tree spae Γ is the set of all ro oted direted binary trees g = ( E , V , t ) , with distinguishable lea v es and, for i = 1 , 2 , ..., 2 L , no de ages t i > 0 assigned so that the direted path from a leaf i ∈ V L to no de A passes through no des of stritly inreasing age. With this n um b ering on v en tion, ( E , V ) is alled the ordered history of a ro oted direted binary tree. W e allo w for C alibration onstrain ts on tree-top ology and seleted no de ages. These are desrib ed at the end of Setion 1 . Ea h su h onstrain t c allo ws trees in just some subspae Γ ( c ) ⊂ Γ of tree spae. W e add to these onstrain ts an upp er b ound on the ro ot time at some age T > 0 , and set Γ (0) = { ( E , V , t ) ∈ Γ : t r ≤ T } . The spae of alibrated ph ylogenies with atastrophes is then Γ C = C \ c =0 Γ ( c ) . The ro ot age is a sensitiv e statisti in this inferene. A prior distribution on trees, with the prop ert y that the marginal distribution of the ro ot age is uniform o v er a xed prior range t L ≤ t r ≤ T , is of in terest. F or no de i ∈ V , let t + i = sup g ∈ Γ C t i and t − i = inf g ∈ Γ C t i b e the greatest and least admissible ages for no de i , and let S = { i ∈ V : t + i = T } , so that S is the set of no des ha ving ages not b ounded ab o v e b y a alibration (there are 12 su h no des in Figure (5 )). Ni holls and Gra y (2008 ) sho w that, b efore alibration (when C = 0 ) and for tree spaes in whi h the lea v es ha v e xed equal ages, the prior probabilit y distribution with densit y f G ( g | T ) ∝ Y i ∈ S ( t r − t − i ) − 1 4 giv es a marginal densit y for t r whi h is exatly uniform in t L < t r < T . Ni holls and Gra y (2008 ) do not ommen t on the distribution determined b y f G o v er tree top ologies. The prior f G has in this ase ( C = 0 ) a uniform marginal distribution on ordered histories exatly equal to the orresp onding distribution for the Y ule mo del. This is not uniform, but fa v ors balaned leaf-lab eled top ologies. F or C > 0 , and on atastrophe-free tree spaes in whi h leaf ages are onstrained b y prior kno wledge to lie in an in terv al only , Ni holls and Gra y (2008 ) argue from sim ulation studies that this prior giv es a reasonably at marginal distribution for t r if in addition T ≫ max i ∈ V \ S t + i (the greatest upp er b ound among the alibration onstrain ts is not to o lose to the upp er b ound on the ro ot age). W e giv e a sample from the prior in the Supplemen tary Material; the prior do es not represen t an y reasonable prior b elief b efore ab out 4500 BP , but this is immaterial as the lik eliho o d rules these v alues out (an instane of an appliation of the priniple of suien t reason). 2.2 Div ersiation of ognay lasses In this subsetion w e extend the sto hasti Dollo mo del of Ni holls and Gra y (2008 ) to inorp orate rate heterogeneit y in time and spae, via a atastrophe pro ess. Consider the ev olution of ognates do wn the anestral lineage of the v o abulary of a single language. An example is giv en in Figure 1 . A new ognay lass is b orn when its rst ognate is b orn. This new w ord is not ognate with other w ords in the mo deled pro ess. Loan w ords from outside the ore meaning ategories of an y language in the study , or from a language outside the study , ma y b e go o d for w ord birth without violating this ondition. A ognate dies in a partiular v o abulary when it eases to b e used for the meaning ommon to its ognay lass: its meaning ma y hange, or it ma y fall out of use. Cognates ev olving in a single language ( ie do wn a single bran h of a language ph ylogen y) are b orn indep enden tly at rate λ , die indep enden tly at p er apita rate µ , and are sub jet to p oin t-lik e atastrophes, whi h they enoun ter at rate ρ along a bran h. A t a atastrophe, ea h ognate dies indep enden tly with probabilit y κ , and a P oisson n um b er of ognates with mean ν are b orn. A t a bran hing ev en t of the ph ylogen y , the set of ognates represen ting the bran hing v o abulary is opied in to ea h of the daugh ter languages. See Figure 1 . The pro ess w e ha v e desrib ed is not rev ersible, and this greatly ompliates the analysis. It seems aeptable, from a data-mo deling p ersp etiv e, to imp ose the ondition ν = κλ/µ , whi h is neessary and suien t for rev ersibilit y (see Supplemen tary Material for a pro of ). Under this ondition, adding a atastrophe to an edge is equiv alen t to lengthening that edge b y T C ( κ, µ ) = − lo g(1 − κ ) /µ y ears. This follo ws b eause the n um b er of ognates generated b y the anageni part of the pro ess in an in terv al of length T C is P oisson distributed with mean λ µ (1 − e − µT C ) equal to κλ/µ , and the probabilit y that a ognate en tering an in terv al of length T C dies during that in terv al is 1 − e − µT C , whi h equals κ . Beause a atastrophe simply extends its edge b y a blo k of virtual time, the lik eliho o d dep ends only on the n um b er of atastrophes on an edge, and not their lo ation in time. Let k i b e the n um b er of atastrophes on edge h i, j i , and k = ( k 1 , . . . , k 2 L − 2 ) b e the atastrophe state v etor. W e reord no atastrophes on the h R, A i edge (its length is already innite). The tree g = ( V , E , t, k ) is sp eied b y its top ology , no de ages and atastrophe state. Calibrated tree spae extended for atastrophes is Γ C K = { ( V , E , t, k ) : ( V , E , t ) ∈ Γ C , k ∈ N 2 L − 2 0 } . W e drop the atastrophe pro ess from the alulation in Setion 3. It is straigh tforw ard to restore it, and w e do this in the expression for the p osterior distribution in Setion 4 . 5 {1,2} −2 +12,13 −1 +9,10 −8,9 +11 +14 −2 +6 −12 −12 +5 −1 +3 −3 +8 +7 {1} {1,3,14} {10,11} {5} {5} {1,6,7,13} {1,6,13} {1,13} Figure 1: Desription of the mo del: births and deaths of traits are mark ed. The dots orresp ond to atastrophes, at whi h m ultiple births and deaths ma y o ur sim ultaneously . The ognate sets this generates at lea v es are sho wn on the righ t. Calendar time o ws from left to righ t and the age v ariables t i in the text inrease from righ t to left. 2.3 The registration pro ess Let D ∗ denote a notional ful l random binary data matrix, represen ting the outome of the div ersiation pro ess of Setion 2.2 . The n um b er of olumns in D ∗ is random, and equal to N ∗ . F or the realization depited in Figure 1 , D ∗ = D ∗ with D ∗ displa y ed in Figure 2. Note that D ∗ has a olumn for ea h ognay lass presen t at the ro ot no de, or b orn b elo w it, whether or not the ognay lass has an y ognates represen ted in an y leaf languages. W e all the random mapping of the unkno wn full represen tation D ∗ to the registered data D , whi h is a random matrix with N olumns, the r e gistr ation pro ess. There are t w o stages to this pro ess: the masking of matrix elemen ts of the fully realised data D ∗ with ?'s to form an in termediate data matrix ˜ D , and the seletion of olumns from ˜ D to determine the realised data D . Let I ∗ b e a random L × N ∗ indiator matrix of indep enden t Bernoulli random v ariables for observ ed elemen ts. The zeros of I ∗ mark matrix en tries in D ∗ whi h will b e unobserv able. F or i = 1 , 2 , ..., L and giv en a = 1 , 2 , ..., N ∗ , let ξ i = Pr( I ∗ i,a = 1) . The probabilit y ξ i that w e an answ er the question, do es language i p ossess a w ord in the a 'th ognay lass?, is assumed to b e a funtion of the language index i 6 [ D* ℄ [ I* ℄ [ D~ ℄ [ I ℄ [ D ℄ 10000000000000 11111111111111 10000000000000 1 111 1 11 1 000 0 00 10100000000001 01011111111111 ?0?00000000001 0 111 1 11 ? 000 0 01 00000000011000 10111001110110 0?000??001?00? 1 100 1 10 0 0?? 1 0? 00001000000000 11111111111111 00001000000000 1 111 1 11 0 100 0 00 00001000000000 11110111111111 0000?000000000 1 011 1 11 0 ?00 0 00 10000110000010 10111111111111 1?000110000010 1 111 1 11 1 011 0 10 10000100000010 11111111111111 10000100000010 1 111 1 11 1 010 0 10 10000000000010 11111111111100 100000000000?? 1 111 1 00 1 000 0 ?? Figure 2: Registration of the v o abulary realized in Figure 1 supp osing a masking matrix I ∗ , as ab o v e. D ∗ is the unobserv ed full data with a olumn for ea h ognay lass and a ro w for ea h language of Figure 1 (th us ognate 1 is presen t in ro ws 1,2,6,7 and 8); zeros in the masking matrix I ∗ indiate missing matrix elemen ts. Some ognay lasses are then thinned (in this example, the registration rule k eeps ognay lasses with instanes displa y ed in one or more language) to giv e the registered data D . only . If w e get an answ er, it is assumed orret. Let ξ = ( ξ 1 , . . . , ξ L ) and denote b y I ∗ a realisation of I ∗ . Denote b y ˜ D = ˜ D ( D ∗ , I ∗ ) the mask ed v ersion of the full random data matrix: if I ∗ i,a = 1 then ˜ D i,a = D ∗ i,a and if I ∗ i,a = 0 then ˜ D i,a =? . Matrix olumns ma y b e missing to o, so that N ≤ N ∗ . W e get missing olumns ev en when there are no missing data. The matrix ˜ D in Figure 2 inludes some olumns with only zeros and question marks, orresp onding to ognay lasses whi h existed in the past but are observ ed in none of the leaf-languages from whi h the data w ere ompiled. These ognay lasses are not inluded in the registered data. Denote b y R the registration rule D = R ( ˜ D ) mapping the full data to registered data. Rule R ma y thin additional olumn t yp es. Let Y and Q b e funtions of the olumns of D ∗ and I ∗ oun ting the visible 1's and ?'s resp etiv ely , Y ( D ∗ a , I ∗ a ) = L X i =1 I ∗ i,a D ∗ i,a , Q ( I ∗ a ) = L X i =1 (1 − I ∗ i,a ) . Giv en a = 1 , 2 , ..., N ∗ , let Y a = Y ( D ∗ a , I ∗ a ) and Q a = Q ( I ∗ a ) . In App endix A w e giv e an eien t algorithm for omputing the lik eliho o d for rules formed b y om- p ounding the follo wing elemen tary thinning op erations: (1) R 1 ( ˜ D ) = ( ˜ D a : Y a > 0) (disard lasses with no instanes at the lea v es); (2) R 2 ( ˜ D ) = ( ˜ D a : Y a > 1) (disard lasses - singletons - observ ed at a single leaf ); (3) R 3 ( ˜ D ) = ( ˜ D a : Y a < L ) (disard lasses whi h are observ ed at all lea v es); (4) R 4 ( ˜ D ) = ( ˜ D a : Y a < L − 1) (disard lasses whi h are observ ed at all lea v es or at all lea v es but one); (5) R 5 ( ˜ D ) = ( ˜ D a : Y a + Q a < L ) (disard lasses whi h are p oten tially presen t at all lea v es); 7 (6) R 6 ( ˜ D ) = ( ˜ D a : Y a + Q a < L − 1) (disard lasses whi h are p oten tially presen t at all lea v es or at all lea v es but one). W e assume the hosen rule inludes Condition (1). The rule D = R ( ˜ D ) with R ( ˜ D ) = R 4 ◦ R 2 ( ˜ D ) ollets parsimon y informativ e ognay lasses. Ronquist et al. (2005 ) giv e the lik eliho o d for the nite- sites trait ev olution mo del of Lewis (2001 ) for registration rules lik e (1-6). In the example in Figure 2 , and in Setions 6.2 and 7) w e t data registered with R ( ˜ D ) = R 1 ( ˜ D ) . The seletion of olumns is something w e ha v e in general no on trol o v er: the olumn seletion rule simply desrib es what happ ened at registration. Results in the Supplemen tary Material for the Dy en et al. (1997 ) data use the rule R ( ˜ D ) = R 2 ( ˜ D ) , sine singleton olumns w ere not inluded in that data. Ho w ev er, ertain olumn t yp es ma y inlude data whi h is hard to mo del w ell, and so w e ma y ho ose to mak e further thinning using the other rules. Reursions for the other rules are giv en in an App endix. Column indies a = 1 , 2 , ..., N ∗ are ex hangeable. It is on v enien t to ren um b er the olumns of D ∗ , I ∗ and ˜ D after registration, so that ˜ D a = D a and I ∗ a = I a for a = 1 , 2 , ..., N . The information needed to ev aluate Y a and Q a is a v ailable in the olumn D a and set Ω a represen tations. W e write Y ( D a ) = Y (Ω a ) = Y a and Q ( D a ) = Q (Ω a ) = Q a . 2.4 P oin t pro ess of births for registered ognay lasses Fix a atastrophe-free ph ylogen y g ∈ Γ C K , with k = (0 , 0 , ..., 0) , and let an edge h i, j i and a time τ ∈ [ t i , t j ) b e giv en. Denote b y [ g ] the set of all p oin ts ( τ , i ) on the ph ylogen y , inluding p oin ts ( τ , r ) with τ ≥ t r in the edge h r , A i . The lo ations z D = { z 1 , z 2 , ..., z N } of the birth ev en ts of the N r e gister e d ognay lasses are a realization of an inhomogeneous P oisson p oin t pro ess Z D on [ g ] . Let Z ∈ [ g ] b e the birth lo ation of a generi (and p ossibly unregistered) ognay lass M ⊆ { 1 , 2 , ..., L } , orresp onding to a olumn of ˜ D with Y observ ed 1 's and Q ? 's, and let E Z b e the ev en t that this lass generates a olumn of the registered data. F or the a 'th ognay lass M a , b orn at Z a , this ev en t is E Z a = { D a = R ( ˜ D a ) } sine R ( ˜ D a ) is empt y for ˜ D a a olumn dropp ed at registration. Cognay lasses are b orn at onstan t rate on the bran hes of g , but are thinned b y registration. Ho w ev er, onditional on the birth lo ation Z a = z a , our mo deling assumes that the outome E Z a for the a 'th ognay lass is deided indep enden tly of all ev en ts in all other ognay lasses. The p oin t pro ess Z D of birth lo ations of registered ognay lasses has in tensit y ˜ λ ( z ) = λ Pr( E Z | g , µ, λ, ξ , Z = z ) at z ∈ [ g ] and probabilit y densit y f Z D ( z D ) = 1 N ! e − Λ([ g ]) N Y a =1 ˜ λ ( z a ) with resp et to the elemen t of v olume dz D = dz 1 dz 2 ...dz N in [ g ] N , where Λ([ g ]) = Z [ g ] ˜ λ ( z ) dz = X h i,j i∈ E Z t j t i ˜ λ (( τ , i )) dτ . The n um b er N of registered ognay lasses is N ∼ P oisson (Λ([ g ])) . 8 3 Lik eliho o d alulations W e giv e the lik eliho o d for g , µ , λ , κ , ρ and ξ giv en the data, along with an eien t algorithm to ompute the sum o v er all missing data. The atastrophe pro ess is left out, and reinorp orated in the next setion. Sine w e only ev er see registered data, the lik eliho o d for g , λ, µ and ξ is the probabilit y P [ D = D | g , µ, λ, ξ , D = R ( ˜ D )] , to get data D giv en the parameters and onditional on the data ha ving passed registration. W e restore the birth lo ations (and so omit λ from the onditioning), and fatorize using the join t indep endene of D a , a = 1 , 2 , ..., N under the giv en onditions: P [ D = D | g , µ, λ, ξ , D = R ( ˜ D )] = Z f Z D ( z D ) P [ D = D | g , µ, ξ , Z D = z D , D = R ( ˜ D )] dz D , = e − Λ([ g ]) N ! N Y a =1 λ Z [ g ] P [ E Z a | g , µ, ξ , Z a = z a ] P [ D a = D a | g , µ, ξ , Z a = z a , E Z a ] dz a , = e − Λ([ g ]) N ! N Y a =1 λ Z [ g ] P [ D a = D a | g , µ, ξ , Z a = z a ] dz a . The last line follo ws b eause P [ E Z a | D a = D a , ..., Z a = z a ] = 1 for D a a olumn of registered data: if the outome of the birth at z a w as the registerable data D a then the ev en t E Z a ertainly o urs. The lik eliho o d dep ends on the a wkw ard ondition D = R ( ˜ D ) through the mean n um b er λ ([ g ]) of registered ognay lasses only , while P [ D a = D a | g , µ, ξ , Z a = z a ] is the probabilit y to realise the data v etor D a in the unonditioned div ersiation/missing elemen t pro ess. The alulation has so far extended Ni holls and Gra y (2008 ) to giv e the lik eliho o d for a greater v ariet y of olumn thinning rules. W e no w add the missing elemen t omp onen t of the registration pro ess. W e sum o v er p ossible v alues of the missing matrix elemen ts in the registered data. Sine P [ D a = D a | g , µ, λ, ξ , Z a = z a ] is not onditioned on the requiremen t that the olumn D a gets registered, the en tries of the orresp onding olumn I a are determined b y the unonditioned Bernoulli pro ess, and w e ha v e P [ D a = D a | g , µ, ξ , Z a = z a ] = X d ∗ ∈D a P [ I ∗ a , D ∗ a = d ∗ | g , µ, ξ , Z a = z a ] = L Y i =1 ξ I i,a i (1 − ξ i ) 1 − I i,a X d ∗ ∈D a P [ D ∗ a = d ∗ | g , µ, ξ , Z a = z a ] . The lik eliho o d is P [ D = D | g , µ, λ, ξ , D = R ( ˜ D )] = e − Λ([ g ]) N ! N Y a =1 ( L Y i =1 ξ I i,a i (1 − ξ i ) 1 − I i,a ) λ Z [ g ] X ω ∈ Ω a P [ M = ω | g , µ, ξ , Z = z a ] dz a , (1) where w e ha v e swit hed no w from summing d ∗ ∈ D a to the equiv alen t set represen tation ω ∈ Ω a . F or the t w o in tegrated quan tities in Equation (1) w e ha v e tratable reursiv e form ulae. W e are using a pruning pro edure akin to F elsenstein (1981 ). W e b egin with Λ([ g ]) . W e assume the registration rule inludes at least Condition (1). It follo ws that a ognay lass b orn at Z = ( τ , i ) in [ g ] m ust surviv e do wn to the no de b elo w, at Z = ( t i , i ) , in order to b e registered, and so P [ E Z | Z = ( τ , i ) , g , µ, ξ ] = P [ E Z | Z = ( t i , i ) , g , µ, ξ ] e − µ ( τ − t i ) . 9 W e an substitute this in to the expression for Λ([ g ]) , and in tegrate, to get Λ([ g ]) = λ µ X h i,j i∈ E P [ E Z | Z = ( t i , i ) , g , µ, ξ ] 1 − e − µ ( t j − t i ) . (2) Giv en a no de i , let V ( i ) L b e the set of leaf no des desended from i , inluding i itself if i is a leaf. Let s i = ard ( V ( i ) L ) . Denote b y u ( n ) i = P [ Y = n | Z = ( t i , i ) , g , µ, ξ ] and v ( n ) i = P [ Y + Q = n | Z = ( t i , i ) , g , µ, ξ ] . W e an ompute Λ for rules made up of om binations of Condition (1) with an y om bination of Conditions (2-6), from u (0) i , u (1) i , u ( s i − 1) i , u ( s i ) i , v ( s i − 1) i and v ( s i ) i . F or example, P [ E Z | Z = ( t i , i ) , g , µ, ξ ] = 1 − u (0) i R = R 1 , 1 − u (0) i − u (1) i R = R 2 , 1 − u (0) i − u (1) i − u ( L − 1) i − u ( L ) i R = R 3 ◦ R 2 . } (3) Notie that u ( n ) i = 0 unless s i ≥ n , so for example u ( L ) i is non-zero at i the ro ot no de only . Sine our main appliation is for data registered under Condition (1), w e giv e, in the b o dy of this pap er, reursions for u (0) i and v (0) i only . See App endix A for the reursions needed to ev aluate the lik eliho o d under rules in v olving Conditions (2-6). F or no des i and j , let δ i,j = e − µ ( t j − t i ) . Consider a pair of edges h c 1 , i i , h c 2 , i i in E . u (0) i = (1 − δ i,c 1 ) + δ i,c 1 u (0) c 1 (1 − δ i,c 2 ) + δ i,c 2 u (0) c 2 (4) v (0) i = δ i,c 1 v (0) c 1 + (1 − δ i,c 1 ) Y j ∈ V c 1 L ξ j δ i,c 2 v (0) c 2 + (1 − δ i,c 2 ) Y j ∈ V c 2 L ξ j (5) The reursion is ev aluated from the lea v es i ∈ V L , at whi h u (0) i = 1 − ξ i (6) v (0) i = 0 (7) W e no w giv e the equiv alen t reursions for λ R [ g ] P ω ∈ Ω a P [ M = ω a | Z = z a , g , µ ] dz a . Consider the set m a = T ω ∈ Ω a ω of lea v es known to ha v e a ognate in the a th registered ognay lass. Let E a b e the set of bran hes on the path from the most reen t ommon anestor of the lea v es in m a up to the A dam-no de A ab o v e the ro ot. Cognay lass a m ust ha v e b een b orn on an edge in E a . F or a = 1 , 2 , ..., N , lass M a is non-empt y and w e an shift the birth lo ation to the no de b elo w and on v ert the in tegral to a sum, λ Z [ g ] X ω ∈ Ω a P [ M = ω | Z = z a , g , µ ] dz a = λ µ X h i,j i∈ E a X ω ∈ Ω a P [ M = ω | Z = ( t i , i ) , g , µ ](1 − δ i,j ) . F or ea h a = 1 , 2 , ..., N and ω ∈ Ω a , let ω ( i ) = ω ∩ V ( i ) L and Ω ( i ) a = { ω ( i ) : ω ( i ) = ω ∩ V ( i ) L , ω ∈ Ω a } denote the set of all subsets ω ( i ) of the lea v es V ( i ) L whi h are ognay lasses onsisten t with the data a v ailable for those lea v es. Consider t w o hild bran hes h c 1 , i i and h c 2 , i i at no de i . Sine Ω a = Ω ( c 1 ) a × Ω ( c 2 ) a , 10 and ev en ts are indep enden t along the t w o bran hes, X ω ∈ Ω a P [ M = ω | Z = ( t i , i ) , g , µ, ξ ] = X ω ( c 1 ) ∈ Ω ( c 1 ) a P [ M = ω ( c 1 ) | Z = ( t i , c 1 ) , g , µ ] × X ω ( c 2 ) ∈ Ω ( c 2 ) a P [ M = ω ( c 2 ) | Z = ( t i , c 2 ) , g , µ ] . Ha ving mo v ed the birth ev en t at ( t i , i ) to ( t i , c 1 ) and ( t i , c 2 ) (o the no de and on to its hild edges) w e no w mo v e the birth ev en t at ( t i , c ) to ( t c , c ) (do wn an edge) as follo ws: X ω ∈ Ω ( c ) a P [ M = ω | Z = ( t i , c ) , g , µ ] = δ i,c × X ω ∈ Ω ( c ) a P [ M = ω | Z = ( t c , c ) , g , µ ] if Y (Ω ( c ) a ) ≥ 1 (1 − δ i,c ) + δ i,c × X ω ∈ Ω ( c ) a P [ M = ω | Z = ( t c , c ) , g , µ ] if Y (Ω ( c ) a ) = 0 and Q (Ω ( c ) a ) ≥ 1 (1 − δ i,c ) + δ i,c v (0) c if Y (Ω ( c ) a ) + Q (Ω ( c ) a ) = 0 ( i.e. Ω ( c ) a = {∅ } ) The reursion is ev aluated from the lea v es. If c is a leaf, then X ω ∈ Ω ( c ) a P [ M = ω | Z = ( t c , c ) , g , µ ] = ( 1 if Ω ( c ) a = {{ c } , ∅} or {{ c }} ( i.e. D c,a ∈ { ? , 1 } ) 0 if Ω ( c ) a = {∅ } ( i.e. D c,a = 0 ) . In order to restore atastrophes to this alulation, and giv en g ∈ Γ K , with k i atastrophes on edge h i, j i ∈ E , replae t j − t i with t j − t i + k i T C ( κ, µ ) throughout. 4 P osterior distribution Our prior on the birth rate λ , death rate µ and atastrophe rate ρ is p ( λ, µ, ρ ) ∝ 1 λµρ and w e tak e a uniform prior o v er [0 , 1] for the death probabilit y at a atastrophe κ and ea h missing data parameter ξ i . Substituting using equations (2)-(3 ) in to equation ( 1) and m ultiplying b y the prior f G ( g | T ) p ( λ, µ, ρ ) , w e obtain the p osterior distribution p ( g , µ, λ, κ, ρ, ξ | D = D ) = 1 N ! λ µ N exp − λ µ X h i,j i∈ E P [ E Z | Z = ( t i , i ) , g , µ, κ, ξ ](1 − e − µ ( t j − t i + k i T C ) ) × N Y a =1 X h i,j i∈ E a X ω ∈ Ω a P [ M = ω | Z = ( t i , i ) , g , µ ](1 − e − µ ( t j − t i + k i T C ) ) × 1 µλρ f G ( g | T ) e − ρ | g | ( ρ | g | ) k T k T ! L Y i =1 (1 − ξ i ) Q i ξ N − Q i i (8) 11 for parameters µ, λ, κ, ρ > 0 , 0 < ξ i < 1 and trees g ∈ Γ ( C ) K . The p osterior is improp er without b ounds on ρ sine k T = 0 is allo w ed. W e plae v ery onserv ativ e b ounds on ρ . Results are not sensitiv e to this hoie. W e an sho w that, for 't ypial' data sets D , and in partiular the data analysed b elo w, the p osterior is then prop er. Details of the relationship b et w een ognate lasses and alibration onstrain ts pla y a role in the onditions for the p osterior distribution to b e prop er. 5 Mark o v Chain Mon te Carlo W e use Mark o v Chain Mon te Carlo to sample the p osterior distribution and estimate summary statistis. If the prior on the ognay lass birth rate parameter λ has the onjugate form λ u exp( − v λ ) then the onditional distribution of λ in the p osterior distribution ab o v e has the form λ N + u exp( − λ ( X + v )) . W e to ok the improp er prior u = − 1 , v = 0 for λ and in tegrated. The MCMC state is then x = (( E , V , t, k ) , µ, κ, ρ, ξ ) and the target distribution p ( x | D ) is the densit y obtained b y in tegrating the densit y in Equation 8 o v er λ . The MCMC sampler desrib ed in Ni holls and Gra y (2008 ) has state x = (( E , V , t ) , µ ) . W e add to the ( E , V , t ) - and µ -up dates of Ni holls and Gra y (2008 ) further MCMC up dates ating on the atastrophe v etor k = ( k 1 , ..., k L − 2 ) , on the atastrophe parameters κ, ρ and on ξ , the probabilit y parameter for observ able data-matrix elemen ts. The atastrophe rate parameter ρ is added to time-saling up dates in whi h subsets of parameters are sim ultaneously saled b y a ommon random fator s : if θ is a parameter in the saled subset ha ving units [ time ] u , then θ → s u θ . The probabilit y 0 < ξ i < 1 for an elemen t of the registered data matrix to b e observ able is, for man y leaf-languages, lose to one, so w e up date those parameters b y saling 1 − ξ i . W e inorp orate up dates adding and deleting atastrophes (the lled dots mark ed on the bran hes of Figure 1 ) plus an up date whi h mo v es a atastrophe from an edge to a paren t, hild or sibling edge. F or the addition and deletion of atastrophes, w e do not need to use rev ersible jump Mark o v Chain Mon te Carlo, as the state v etor sp eies the n um b ers, and not the lo ations, of atastrophes on edges. W e omit the details of these mo v es but giv e, as an example, the up date that mo v es a atastrophe from an edge to a paren t, hild or sibling edge. Let k T = P L − 2 i =1 k i giv e the total n um b er of atastrophes. Giv en a state x = ( g , µ, κ, ρ, ξ ) with g = ( V , E , t, k ) , w e pi k edge h i, j i ∈ E with probabilit y k i /k T . Let E h i,j i b e the set of edges neighb ouring edge h i, j i ( hild, sibling and paren t edges, but exluding the edge h R, A i ) and let q i = ard ( E h i,j i ) . W e ha v e in general q i = 4 . Ho w ev er, for i the index of a leaf no de, q i = 2 (1 paren t, 1 sibling, no hildren). If j is the ro ot and i is non-leaf, then q i = 3 (1 sibling, 2 hildren) and if j is the ro ot and i is a leaf w e ha v e q i = 1 (a sibling edge). Cho ose a neigh b ouring edge h ˜ i, ˜ j i uniformly at random from E h i,j i and mo v e one atastrophe from h i, j i to h ˜ i, ˜ j i . The andidate state is x ′ = (( V , E , t, k ′ ) , µ, κ, ρ ) , with k ′ i = k i − 1 and k ′ ˜ i = k ˜ i + 1 and k ′ j = k j for j 6 = i, ˜ i . This mo v e is aepted with probabilit y α ( x ′ | x ) = min 1 , q i k ′ ˜ i p ( x ′ | D ) q ˜ i k i p ( x | D ) . W e assessed on v ergene with the asymptoti b eha viour of the auto orrelation for the parameters µ , κ , ρ and t R , as suggested b y Gey er (1992 ). This metho d indiated that w e ould use runs of ab out 10 million samples; w e also let the MCMC run for 100 million samples and he k ed that the omputed statistis did not v ary . 12 6 V alidation W e made a n um b er of tests using syn theti data. Fitting the mo del to syn theti data sim ulated aording to the lik eliho o d P ( D = D | g , µ, λ, ρ, κ, ξ , D = R ( ˜ D )) , (in-mo del data), sho ws us just ho w informativ e the data is for atastrophe plaemen t, as w ell as making a debug- he k on our implemen tation. W e t out-of-mo del data also. These are syn theti data sim ulated under lik ely mo del-violation senarios, and are used to iden tify soures of systemati bias. W e summarise results for syn theti data sim ulated using the parameter v alues w e estimate in setion 7 on the data. F or in-mo del data w e orretly reonstrut top ology , ro ot age and the n um b er and p osition of atastrophes. F urther details are giv en in the Supplemen tary Material. 6.1 Mo del mis-sp eiation Ni holls and Gra y (2008 ) use out-of-mo del data represen ting un-mo deled loan-w ords (alled b orr owing ), rate-heterogeneit y in time and spae, rate heterogeneit y aross ognay lasses, and the empt y-eld ap- pro ximation. They disuss also mo del mis-sp eiation due to missing data and the inorret iden tiation of ognay lasses (in partiular, a hazard for deeply ro oted lasses to b e split). Rate-heterogeneit y in time and spae, and missing data are no w part of the in-mo del analysis. W e syn thesized out-of-mo del data with the b orro wing mo del of Ni holls and Gra y (2008 ), in order to see if un-mo deled b orro wing biased our results. F or what Ni holls and Gra y (2008 ) all lo w to mo derate lev els of b orro wing, w e w ere able to reonstrut true parameter v alues w ell. See the Supplemen tary Material for details. With high lev els of b orro wing, w e under-estimate the ro ot age and o v er-estimate rate parameters. Ho w ev er, uniden tied loan w ords in the registered data do not generate mo del misp eiation unless they are opies of ognates whi h o ur in the observ ed meaning ategories of languages anestral to the leaf-languages. It is not plausible that uniden tied b orro wing of this kind is presen t at high lev els. W e ha v e not rep eated the Ni holls and Gra y (2008 ) out-of-mo del analysis of rate heterogeneit y aross ognay lasses. In a real v o abulary , distint ognay lasses that share a meaning ategory w ould not ev olv e indep en- den tly . Also, a real language migh t b e exp eted to a p ossess a w ord in ea h of the ore meaning ategories. This onstrains the n um b er of ognay lasses in ea h meaning ategory to b e non-zero. Our mo del allo ws empt y meaning ategories. Ni holls and Gra y (2008 ) nd no substan tial bias in a t to out-of-mo del data resp eting the onstrain t. Our treatmen t of missing data in tro dues a new mis-sp eiation. W e mo deled matrix elemen ts as missing indep enden tly . Ho w ev er, w e get missing data when w e do not kno w the w ord used in a giv en language to o v er a giv en meaning ategory . Matrix elemen ts are as a onsequene t ypially missing in blo ks orresp onding to all the ognay lasses for the giv en meaning. Beause the ages of p o orly reonstruted languages are w ell predited in the ross-v alidation study b elo w, w e ha v e not lo ok ed further at this issue. 6.2 Predition tests for alibration In the next setion w e estimate the age of a tree no de (the ro ot). In this setio n w e test to see if the unertain ties w e estimate are reliable. The alibration data desrib ed in Setion ( 1) x the top ologies and ro ot ages of the subtrees mark ed with bars in Figure 5). W e remo v e ea h alibration onstrain t in turn and use the data and the remaining onstrain ts to estimate the age of the onstrained no de, and the probabilit y for the onstrained subtree top ology . The top ologial onstrain ts w ere all p erfetly reonstruted. In t w en t y three of the t w en t y eigh t 13 tests the onstrained age in terv al o v erlaps the 95% HPD in terv al, as sho wn in the b ottom half of Figure 3 . Ho w go o d or bad are ea h of these preditions? Lo oking at the Hittite predition in Figure 3, the large predition unertain t y only just allo ws the alibration in terv al. Is this bad? W e quan tify the go o dness- of-t, for ea h alibration, using Ba y es fators to replae p -v alues as indies of mist. F or ea h onstrain t c = 1 , 2 , ..., C w e ompute a Ba y es fator measuring the supp ort for the fully onstrained mo del ompared to a mo del with just the c 'th onstrain t remo v ed. F or c = 1 , 2 , ..., C let Γ C − c = C \ c ′ =0 c ′ 6 = c Γ ( c ′ ) . denote the enlarged tree spae with the c 'th onstrain t remo v ed, and let Γ C − c K b e the enlarged tree spae, extended to inlude atastrophes (as in Setion ( 2.2 )). F or ea h onstrain t c = 1 , 2 , ..., C w e mak e a mo del omparison b et w een a ommon n ull mo del with all the onstrain ts, H 0 : g ∈ Γ C K , and an alternativ e mo del H 1 : g ∈ Γ C − c K with the onstrain t remo v ed. The Ba y es fator B C,C − c for the mo del omparison is the ratio of the p osterior probabilities for these mo dels with mo del prior P ( H 0 ) = P ( H 1 ) , B − c = P ( D | g ∈ Γ C K ) P ( D | g ∈ Γ C − c K ) = P ( D | g ∈ Γ C K ∩ Γ C − c K ) P ( D | g ∈ Γ C − c K ) = P ( g ∈ Γ C K | D , g ∈ Γ C − c K ) P ( g ∈ Γ C K | g ∈ Γ C − c K ) . where the seond line follo ws sine Γ C K ⊂ Γ C − c K and the third from the denition of the onditional probabilities. The n umerator P ( g ∈ Γ C K | D , g ∈ Γ C − c K ) is the p osterior probabilit y for the c 'th onstrain t to b e satised giv en the data and the other onstrain ts. The denominator, P ( g ∈ Γ C K | g ∈ Γ C K ) , is the prior probabilit y for the c 'th onstrain t to b e satised giv en the other onstrain ts. W e estimate these probabilities using sim ulation of the p osterior and prior distributions with onstrain t c remo v ed. The Ba y es fators are estimated with negligible unertain t y and 2 log ( B C,C − c ) is plotted for c = 1 , 2 , ..., C in the top half of Figure 3. Strong evidene against a alibration is failure at predition. T aking a Ba y es fator exeeding 12 (that is, 2 log ( B C,C − c ) & 5 in Figure 3) as strong evidene against the onstrain t, follo wing Raftery (1996 ), w e ha v e onit for three of the thirt y onstrain ts: the ages of Old Irish and A v estan, and for the age of the Balto-Sla v lade. As our analysis in Setion 7 sho ws, there is a high p osterior probabilit y that a atastrophe ev en t o urred on the bran h b et w een Old Irish and W elsh, and another b et w een Old P ersian and A v estan. The evidene for rate heterogeneit y in rest of the tree is so sligh t, that when w e try to predit these alibrations w e are prediting at ypial ev en ts. Our missing-data analysis has help ed here. The alibration in terv al for the Hittite v o abulary in these data is 32003700BP . A reonstrution of the age of Hittite whi h ignores missing data predits 602010BP , w ell outside of the onstrain ts. The 95% HPD in terv al for the age of Hittite in our mo del is 4303250BP , whi h just o v erlaps the onstrain t. The Ba y es fator giv es o dds less than 3:1 against, so the evidene against the onstrain t is `hardly w orth men tioning'. 14 8000 6000 4000 2000 0 −100 −10 −5 −2 0 2 5 10 100 HI TA TB LU LY OI UM OS LA GK AR GO ON OE OG OS PR AV PE VE CE IT GE WG NW BS BA IR II TG Figure 3: R e onstrution of known no de ages: top, lo garithm of Bayes fators log( B C,C − c ) for c = 1 , 2 , ..., C ; b ottom, thin lines show age onstr aints for dier ent no des, thik lines show 95% p osterior HPD interval for the r e onstrute d dates. The no des ar e displaye d in the same or der as the onstr aint in Fig. 5 15 armenian albanian oldirish welsh luvian oldnorse oldenglish oldhighgerman gothic lycian oldcslavonic latvian lithuanian oldprussian tocharian_a tocharian_b hittite greek vedic avestan oldpersian latin umbrian oscan 62 78 66 85 58 0 1000 2000 3000 4000 5000 6000 7000 8000 Figure 4: Consensus tr e e for the R inge et al. (2002 ) dataset. R e d dots show atastr ophes supp orte d with pr ob ability ab ove one half. 7 Results In this setion w e presen t results for our MCMC sim ulation of the full p osterior, Equation ( 8). An upp er limit T = 16000 w as used in the tree prior of Setion 2.1. An y v alue for T exeeding around T = 100 00 w ould lead to the same results. W e sho w a onsensus tree in gure 4. In a tree, an edge orresp onds to a split partitioning the lea v es in to t w o sets. A onsensus tree displa ys just those splits presen t in at least 50% of the p osterior sample. Splits whi h reeiv e less than 95% supp ort are lab elled. Where no split is presen t in 50% of the p osterior sample, the onsensus tree is m ultifurating. The date sho wn for a no de is the a v erage p osterior date giv en the existene of the split; similarly , the n um b er of atastrophes sho wn on an edge is the a v erage p osterior n um b er of atastrophes on that edge giv en the existene of the split, rounded to the nearest in teger. Our estimates for the parameters are as follo ws: µ = 1 . 8 6 · 10 − 4 ± 1 . 47 · 10 − 5 deaths/y ear; κ = 0 . 361 ± 0 . 0 5 5 ; ρ = 6 . 4 · 10 − 5 ± 3 . 3 · 10 − 5 atastrophes/y ear (orresp onding to large but rare atastrophes: ab out 1 atastrophe ev ery 15,000 y ears, or an a v erage of 3.4 on the tree, with ea h atastrophe orresp onding to 2400 y ears of hange). W e displa y in Figure 5 the alibration onstrain ts on a tree sampled from the p osterior. The onstrain ts annot b e sho wn on a onsensus tree, as slies aross a onsensus tree are not iso hronous. The analysis reonstruts some w ell-kno wn features of the Indo-Europ ean tree. The Germani, Celti 16 oldcslavonic lithuanian latvian oldprussian vedic oldpersian avestan latin oscan umbrian welsh oldirish gothic oldhighgerman oldenglish oldnorse greek armenian lycian luvian hittite tocharian_b tocharian_a albanian 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Figure 5: A typi al sample fr om the p osterior. A l l the onstr aints on the no de ages ar e shown, ex ept those for the Itali, Indo-Ir anian and Ir anian gr oups, for whih we do not have an upp er b ound, and the r o ot, whih has an upp er b ound at T = 16000 ye ars BP. 17 and Itali families are group ed together, but no partiular onguration of their relativ e p ositions is fa v ored. The Indo-P ersian group an fall outside the Balto-Sla v group but the relativ e p osition of these t w o is unertain. The deep top ology of the tree is left quite unertain b y these data, esp eially the p osition of Albanian. W e nd evidene for atastrophi rate heterogeneit y in three p ositions: on the edges leading to Old Irish, Old P ersian, and in the Um brian-Osan lade. Our estimate for the ro ot age of the Indo-Europ ean family is 8430 ± 1320 y ears BP . The distribution of this k ey statisti is lose to normal. 8 Conlusions Our results giv e a ro ot age for the most reen t ommon anestor of the Indo-Europ ean family of language v o abularies in agreemen t with earlier ph ylogeneti studies. Our results are in agreemen t with mo dels whi h put this date around 8500 BP , and in onit with mo dels whi h require it to b e less than 6500 y ears BP . Our studies of syn theti out-of-mo del data, and reonstrution tests for kno wn historial data supp ort our view that this main result is robust to mo del error. It w ould not b e robust to a step hange in the rate of lexial div ersiation ating in a o ordinated fashion aross the Indo-Europ ean languages extan t some 3000 to 5000 y ears ago. The metho ds outlined here for handling missing data and rate heterogeneit y in the div ersiation of languages, as seen through lexial data, will nd appliations to generi trait data. A Reursions for other registration pro esses This setion omplemen ts Setion 3 : w e giv e iterations for u (0) i , u (1) i , u ( s i − 1) i , u ( s i ) i , v ( s i − 1) i and v ( s i ) i . These are the quan tities needed (as in Equation 3 ) to ev aluate the sum in Equation 2, for registration rules whi h use Condition (1) in om bination with other onditions from Setion 2.3 . Consider a pair of edges h c 1 , i i , h c 2 , i i in E . In the notation of the text, u (0) i = (1 − δ i,c 1 ) + δ i,c 1 u (0) c 1 (1 − δ i,c 2 ) + δ i,c 2 u (0) c 2 u (1) i = δ i,c 1 (1 − δ i,c 2 ) u (1) c 1 + δ i,c 2 (1 − δ i,c 1 ) u (1) c 2 + δ i,c 1 δ i,c 2 ( u (1) c 1 u (0) c 2 + u (0) c 1 u (1) c 2 ) u ( s i ) i = δ i,c 1 u ( s c 1 ) c 1 δ i,c 2 u ( s c 2 ) c 2 u ( s i − 1) i = δ i,c 1 u ( s c 1 − 1) c 1 + I { s c 1 =1 } (1 − δ i,c 1 ) δ i,c 2 u ( s c 2 ) c 2 + δ i,c 1 u ( s c 1 ) c 1 δ i,c 2 u s c 2 − 1 c 2 + I { s c 2 =1 } (1 − δ i,c 2 ) v (0) i = δ i,c 1 v (0) c 1 + (1 − δ i,c 1 ) Y j ∈ V c 1 L ξ j δ i,c 2 v (0) c 2 + (1 − δ i,c 2 ) Y j ∈ V c 2 L ξ j v ( s i ) i = δ i,c 1 v ( s c 1 ) c 1 + (1 − δ i,c 1 ) Y j ∈ V c 1 L (1 − ξ j ) δ i,c 2 v ( s c 2 ) c 2 + (1 − δ i,c 2 ) Y j ∈ V c 2 L (1 − ξ j ) v ( s i − 1) i = δ i,c 1 v ( s c 1 − 1) c 1 + (1 − δ i,c 1 ) s c 1 (1 − ξ ) s c 1 − 1 ξ δ i,c 2 v ( s c 2 ) c 2 + (1 − δ i,c 2 )(1 − ξ ) s c 2 + δ i,c 1 v ( s c 1 ) c 1 + (1 − δ i,c 1 )(1 − ξ ) s c 1 δ i,c 2 v ( s c 2 − 1) c 2 + (1 − δ i,c 2 ) s c 2 (1 − ξ ) s c 2 − 1 ξ 18 The reursion is ev aluated from the lea v es i ∈ V L , at whi h u (0) i = u ( s i − 1) i = 1 − ξ i u (1) i = u ( s i ) i = ξ i v (0) i = v ( s i − 1) i = 0 v ( s i ) i = 1 Referenes Aleksey enk o, A., C. Lee, and M. Su hard (2008). W agner and Dollo: A Sto hasti Duet b y Comp osing T w o P arsimonious Solos. Systemati Biolo gy 57 (5), 772784. Dy en, I., J. Krusk al, and B. Bla k (1997). FILE IE-D A T A1. Ra w data a v ailable from h ttp://www.n tu.edu.au/eduation/langs/ielex/IE-D A T A1. Binary data a v ailable from h ttp://www.psy h.au kland.a.nz/psy h/resear h/ RusselsData .h tm. F elsenstein, J. (1981). Inferring phylo genies . Sinauer Asso iates Sunderland, Mass., USA. Garrett, A. (2006). Con v ergene in the formation of Indo-Europ ean subgroups: Ph ylogen y and hronology. Phylo geneti metho ds and the pr ehistory of languages , 139. Gey er, C. (1992). Pratial Mark o v Chain Mon te Carlo. Statisti al Sien e 7 (4), 473483. Gra y , R. and Q. A tkinson (2003). Language-tree div ergene times supp ort the Anatolian theory of Indo- Europ ean origin. Natur e 426 (6965), 435439. Lewis, P . (2001). A Lik eliho o d Approa h to Estimating Ph ylogen y from Disrete Morphologial Charater Data. Systemati Biolo gy 50 (6), 913925. Mallory , J. (1989). In se ar h of the Indo-Eur op e ans: language, ar hae olo gy and myth . Thames and Hudson. MMahon, A. and R. MMahon (2005). L anguage Classi ation by Numb ers . Oxford Univ ersit y Press. Ni holls, G. K. and R. D. Gra y (2008). Dated anestral trees from binary trait data and its appliation to the div ersiation of languages. Journal of the R oyal Statisti al So iety, series B 70 (3), 545566. Raftery , A. (1996). Hyp othesis testing and mo del seletion. In W. Gilks, S. Ri hardson, and D. Spiegel- halter (Eds.), Markov Chain Monte Carlo in Pr ati e . Chapman & Hall / CR C. Renfrew, C. (1987). Ar haeology and Language. The Puzzle of Indo-Europ ean Origins. Curr ent A nthr o- p olo gy 29 , 437441. Ringe, D., T. W arno w, and A. T a ylor (2002). Indo-Europ ean and Computational Cladistis. T r ansations of the Philolo gi al So iety 100 (1), 59129. Ronquist, F., J. Huelsen b e k, and P . v an der Mark (2005). MrBa y es 3.1 Man ual. Sho ol of Computational Sien e, Florida State University . W arno w, T., S. Ev ans, D. Ringe, and L. Nakhleh (2004). A Sto hasti mo del of language ev olution that inorp orates homoplasy and b orro wing. Phylo geneti Metho ds and the Pr ehistory of L anguages . 19
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment