Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European
📝 Abstract
Nicholls and Gray (2008) describe a phylogenetic model for trait data. They use their model to estimate branching times on Indo-European language trees from lexical data. Alekseyenko et al. (2008) extended the model and gave applications in genetics. In this paper we extend the inference to handle data missing at random. When trait data are gathered, traits are thinned in a way that depends on both the trait and missing-data content. Nicholls and Gray (2008) treat missing records as absent traits. Hittite has 12% missing trait records. Its age is poorly predicted in their cross-validation. Our prediction is consistent with the historical record. Nicholls and Gray (2008) dropped seven languages with too much missing data. We fit all twenty-four languages in the lexical data of Ringe et al. (2002). In order to model spatial-temporal rate heterogeneity we add a catastrophe process to the model. When a language passes through a catastrophe, many traits change at the same time. We fit the full model in a Bayesian setting, via MCMC. We validate our fit using Bayes factors to test known age constraints. We reject three of thirty historically attested constraints. Our main result is a unimodal posterior distribution for the age of Proto-Indo-European centered at 8400 years BP, with 95% HPD interval equal to 7100-9800 years BP.
📄 Content
arXiv:0908.1735v1 [stat.AP] 12 Aug 2009

Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European

Robin J. Ryder and Geoff K. Nicholls
Department of Statistics, University of Oxford, UK
September 18, 2018

Abstract

Nicholls and Gray (2008) describe a phylogenetic model for trait data. They use their model to estimate branching times on Indo-European language trees from lexical data. Alekseyenko et al. (2008) extended the model and gave applications in genetics. In this paper we extend the inference to handle data missing at random. When trait data are gathered, traits are thinned in a way that depends on both the trait and missing-data content. Nicholls and Gray (2008) treat missing records as absent traits. Hittite has 12% missing trait records. Its age is poorly predicted in their cross-validation. Our prediction is consistent with the historical record. Nicholls and Gray (2008) dropped seven languages with too much missing data. We fit all twenty-four languages in the lexical data of Ringe et al. (2002). In order to model spatial-temporal rate heterogeneity we add a catastrophe process to the model. When a language passes through a catastrophe, many traits change at the same time. We fit the full model in a Bayesian setting, via MCMC. We validate our fit using Bayes factors to test known age constraints. We reject three of thirty historically attested constraints. Our main result is a unimodal posterior distribution for the age of Proto-Indo-European centered at 8400 years BP, with 95% HPD interval equal to 7100-9800 years BP.

The Indo-European languages descend from a common ancestor called Proto-Indo-European. Lexical data show the patterns of relatedness among Indo-European languages. These data are cognacy classes: two words in the same class descend, through a process of sound change, from a common ancestor. For example, English "sea" and German "See" are cognate to one another, but not to the French "mer". Gray and Atkinson (2003) coded data of this kind in a matrix in which rows correspond to languages and columns to distinct cognacy classes, and entries are one or zero according to whether the language possessed or lacked a term in the column class. They analysed these data using phylogenetic algorithms similar to those used for genetic data. Our analysis has the same objectives, but we fit a model designed for lexical trait data.

We work with data compiled by Ringe et al. (2002), recording the distribution of some 872 distinct cognacy classes in twenty-four modern and ancient Indo-European languages. In Section 7, we give estimates for the unknown topology and branching times of the phylogeny of the core vocabulary of these languages. Nicholls and Gray (2008) analyse the same data using a closely related stochastic Dollo model for binary trait evolution. However, those authors were unable to deal with missing trait records. Missing data arise when we are unable to answer the question "does language X possess a cognate in cognacy class Y?". Nicholls and Gray (2008) dropped seven languages which had many missing entries, and treated missing trait records in the remainder as absent traits. This is unsatisfactory. However, it is not straightforward to give a model-based integration of missing data for the trait evolution model of Nicholls and Gray (2008). In this paper we integrate the missing trait data, and this technical advance allows us to fit all twenty-four of the languages in the original data.

The binary trait model of Nicholls and Gray (2008) has been extended by Alekseyenko et al. (2008) to multiple-level traits, and is finding applications in biology. A proper treatment of missing data will be of use in other applications.

We are specifically interested in phylogenetic dating. Because we are working with lexical, and not syntactic, data, it is the age of the branching of the core vocabulary of Proto-Indo-European that we estimate. This is a controversial matter. Workers in historical linguistics have evidence from linguistic paleontology that the most recent common ancestor of all known Indo-European languages branched no earlier than about 6000-6500 years Before the Present (BP) (Mallory, 1989). For a recent review of the argument from linguistic paleontology, and a criticism of phylogenetic dating, see Garrett (2006) and McMahon and McMahon (2005). An alternative hypothesis suggests that the spread began around 8500 BP when the Anatolians mastered farming (Renfrew, 1987) in the early Neolithic. Recent efforts to apply quantitative phylogenetic methods to dating Proto-Indo-European give a time depth of 8000 to 9500 years BP (Nicholls and Gray, 2008; Gray and Atkinson, 2003), supporting the link to farming. In this
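
The binary coding of cognacy classes described in the introduction, and the difference between recoding a missing record as an absent trait and keeping it as missing, can be illustrated with a small sketch. Everything here is hypothetical: the toy matrix, the language "LanguageX", and the helper names are assumptions made for illustration and are not taken from Ringe et al. (2002) or from the authors' implementation. The last helper only enumerates the fully observed patterns compatible with a column containing missing entries, which is the set a model-based treatment would have to account for; it is not the likelihood used in the paper.

```python
from itertools import product

# Toy cognacy matrix (hypothetical, for illustration only).
# 1 = language has a term in the class, 0 = it lacks one, None = missing record.
LANGUAGES = ["English", "German", "French", "LanguageX"]
CLASSES = ["sea/See", "mer"]

DATA = {
    "English": [1, 0],
    "German": [1, 0],
    "French": [0, 1],
    "LanguageX": [None, 1],  # hypothetical language with one missing record
}


def missing_as_absent(row):
    """Recode missing records as absent traits (the treatment the text calls unsatisfactory)."""
    return [0 if x is None else x for x in row]


def missing_fraction(row):
    """Fraction of a language's trait records that are missing."""
    return sum(x is None for x in row) / len(row)


def compatible_patterns(column):
    """List every fully observed 0/1 pattern consistent with a column that has missing entries."""
    missing = [i for i, x in enumerate(column) if x is None]
    patterns = []
    for fill in product([0, 1], repeat=len(missing)):
        full = list(column)
        for i, value in zip(missing, fill):
            full[i] = value
        patterns.append(full)
    return patterns


if __name__ == "__main__":
    column = [DATA[lang][0] for lang in LANGUAGES]  # the "sea/See" class
    print(missing_as_absent(DATA["LanguageX"]))
    print(missing_fraction(DATA["LanguageX"]))
    print(compatible_patterns(column))
```

Recoding the missing entry as 0 commits to a single one of the compatible patterns, whereas keeping it as missing leaves both open.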