Measuring Fit of Sequence Data to Phylogenetic Model: Gain of Power using Marginal Tests
Testing fit of data to model is fundamentally important to any science, but publications in the field of phylogenetics rarely do this. Such analyses discard fundamental aspects of science as prescribed by Karl Popper. Indeed, not without cause, Popper (1978) once argued that evolutionary biology was unscientific as its hypotheses were untestable. Here we trace developments in assessing fit from Penny et al. (1982) to the present. We compare the general log-likelihood ratio statistic (the G or G² statistic) between the evolutionary tree model and the multinomial model with that of marginalized tests applied to an alignment (using placental mammal coding sequence data). It is seen that the most general test does not reject the fit of data to model (p ≈ 0.5), but the marginalized tests do. Tests on pairwise frequency (F) matrices strongly (p < 0.001) reject the most general phylogenetic (GTR) models commonly in use. It is also clear (p < 0.01) that the sequences are not stationary in their nucleotide composition. Deviations from stationarity and homogeneity seem to be unevenly distributed amongst taxa, and not necessarily in those taxa expected from examining other regions of the genome. By marginalizing the 4^t patterns of the i.i.d. model to observed and expected parsimony counts, that is, from constant sites, to singletons, to parsimony informative characters of a minimum possible length, the likelihood ratio test regains power, and it too rejects the evolutionary model with p ≪ 0.001. Given such behavior over relatively recent evolutionary time, readers in general should maintain a healthy skepticism of results, as the scale of the systematic errors in published analyses may really be far larger than the analytical methods (e.g., bootstrap) report.
💡 Research Summary
The paper addresses a fundamental yet neglected aspect of phylogenetic inference: testing whether sequence data fit the evolutionary model employed. While Popper famously criticized evolutionary biology for lacking falsifiable hypotheses, modern phylogenetic studies rarely perform explicit model‑fit assessments. The authors trace methodological developments from the early work of Penny et al. (1982) to contemporary practices, focusing on a comparative analysis of the general log‑likelihood ratio (G or G²) test against a suite of marginalized tests applied to a coding‑sequence alignment of placental mammals.
The conventional G-test compares the fit of the evolutionary tree model with that of the unconstrained multinomial model over all 4^t possible site patterns (t being the number of taxa), with sites treated as i.i.d. On this test the authors obtain a non-significant result (p ≈ 0.5), which suggests, misleadingly, that the standard General Time Reversible (GTR) model adequately captures the data. Because 4^t grows so quickly with t, it can far exceed the number of sites in an alignment, leaving most pattern counts at zero and the test with little power. To uncover hidden model violations, the authors "marginalize" the data in three ways. First, they construct pairwise frequency (F) matrices, which tabulate for each pair of taxa how often each combination of nucleotides occurs at the same site, and compare observed with expected counts. These tests are highly significant (p < 0.001), indicating that even the most general reversible models in common use fail to reproduce the observed pairwise patterns.
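The two quantities named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the helper names `g_statistic` and `pairwise_f_matrix` and the toy sequences are hypothetical, and the expected table is a stand-in obtained by symmetrizing the observed F matrix (a stationary, reversible model predicts a symmetric expected F matrix), whereas in a real analysis the expected counts would come from the fitted tree model.

```python
import math
from collections import Counter

NUCS = "ACGT"

def g_statistic(observed, expected):
    """Log-likelihood ratio statistic G = 2 * sum(O * ln(O / E)).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

def pairwise_f_matrix(seq_a, seq_b):
    """4x4 table: entry [i][j] counts sites with NUCS[i] in seq_a and NUCS[j] in seq_b.

    Sites where either sequence has a gap or ambiguity code are ignored.
    """
    counts = Counter(zip(seq_a, seq_b))
    return [[counts[(x, y)] for y in NUCS] for x in NUCS]

# Toy sequences (hypothetical, not taken from the paper's data).
seq_a = "ACGTACGTAA"
seq_b = "ACGTACGAAC"
f_obs = pairwise_f_matrix(seq_a, seq_b)

# Stand-in expectation (an assumption): the symmetrized observed table.
f_exp = [[(f_obs[i][j] + f_obs[j][i]) / 2 for j in range(4)] for i in range(4)]

flat_obs = [c for row in f_obs for c in row]
flat_exp = [c for row in f_exp for c in row]
g_sym = g_statistic(flat_obs, flat_exp)

print(f_obs)
print(g_sym)
```

With real data, the p-value would be read from a chi-square distribution with the appropriate degrees of freedom; with counts as small as in this toy example that approximation would not be reliable.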
Second, they collapse the 4^t site patterns into parsimony classes, running from constant sites, to singletons, to parsimony-informative characters of a given minimum possible length, and compare the observed count in each class with the count expected under the model. A likelihood-ratio test on these marginalized categories regains statistical power and again rejects the phylogenetic model with an extreme p-value (p ≪ 0.001). Third, they test for stationarity and homogeneity by examining nucleotide composition across lineages. The sequences are not stationary in composition (p < 0.01), and the deviations are unevenly distributed among taxa, not necessarily falling on the taxa one would expect from other regions of the genome.
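The parsimony-class marginalization and a composition check can be sketched as follows. This is a rough illustration under stated assumptions: the function names, the class labels, and the toy alignment are hypothetical; the class `variable_uninformative` lumps singletons together with any other variable site that is not parsimony-informative; and the composition statistic is an ordinary G-test of independence on a taxa-by-nucleotide table, which treats the sequences as independent samples and so ignores shared ancestry. It is not claimed to be the test used in the paper.

```python
import math
from collections import Counter

NUCS = "ACGT"

def classify_site(column):
    """Assign one alignment column (no gaps assumed) to a parsimony class."""
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # Parsimony-informative: at least two states, each present in >= 2 taxa.
    shared = sum(1 for c in counts.values() if c >= 2)
    if shared >= 2:
        # Minimum possible length of a character = number of states - 1.
        return f"informative_minlen_{len(counts) - 1}"
    return "variable_uninformative"

def marginal_counts(alignment):
    """Collapse all site patterns of an alignment into class counts."""
    return Counter(classify_site(col) for col in zip(*alignment))

def composition_g(alignment):
    """G-test of independence on the taxa x nucleotide count table.

    Returns (G, degrees of freedom). Illustrative only: it ignores the
    phylogenetic dependence between sequences.
    """
    table = [[seq.count(n) for n in NUCS] for seq in alignment]
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            if o > 0:
                g += o * math.log(o * total / (row_tot[i] * col_tot[j]))
    df = (len(table) - 1) * (sum(1 for c in col_tot if c > 0) - 1)
    return 2.0 * g, df

# Toy alignment (hypothetical): four taxa, five sites.
alignment = ["AAAAA", "AAGAC", "AAGCC", "ATGCG"]
classes = marginal_counts(alignment)
g_comp, df_comp = composition_g(alignment)

print(dict(classes))
print(g_comp, df_comp)
```

Collapsing thousands of sparse pattern cells into a handful of classes gives each cell a much larger expected count, which is the sense in which the marginalized test can regain power.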
These findings demonstrate that the most general G-test can be blind to substantial model misspecification, whereas marginalization restores sensitivity. The authors argue that such misspecification, arising even over relatively recent evolutionary time, may produce systematic errors far larger than those reported by common analytical measures of support such as the bootstrap. Consequently, phylogenetic results should be interpreted with caution, and routine model-fit testing, especially with marginalized tests, should become a standard component of phylogenetic pipelines. The paper thus calls for a cultural shift toward more rigorous falsification in evolutionary biology, aligning the field more closely with Popperian scientific standards.