New g%AIC, g%AICc, g%BIC, and Power Divergence Fit Statistics Expose Mating between Modern Humans, Neanderthals and other Archaics


The purpose of this article is to look at how information criteria, such as AIC and BIC, relate to the g%SD fit criterion derived in Waddell et al. (2007, 2010a). The g%SD criterion measures the fit of data to a model based on a normalized weighted root mean square percentage deviation between the observed data and the model estimates of the data, with g%SD = 0 being a perfectly fitting model. However, this criterion may not adjust comprehensively for the number of parameters in the model. Thus, its relationship to more traditional measures for maximizing useful information in a model, including AIC and BIC, is examined. This results in an extended set of fit criteria including g%AIC and g%BIC. Further, a broader range of asymptotically most powerful fit criteria of the power divergence family, which includes maximum likelihood (or minimum G^2) and minimum X^2 modeling as special cases, is used to replace the sum of squares fit criterion within the g%SD criterion. Results are illustrated with a set of genetic distances looking particularly at a range of Jewish populations, plus a genomic data set that looks at how Neanderthals and Denisovans are related to each other and to modern humans. Evidence that Homo erectus may have left a significant fraction of its genome within the Denisovan is shown to persist with the new modeling criteria.


💡 Research Summary

The paper investigates how traditional information‑theoretic model‑selection criteria (AIC, AICc, BIC) relate to a newer goodness‑of‑fit measure, g%SD, originally introduced by Waddell et al. (2007, 2010a). g%SD quantifies the normalized weighted root‑mean‑square percentage deviation between observed data and model‑predicted values; a value of zero indicates a perfect fit. However, g%SD does not explicitly penalize model complexity, raising concerns about over‑fitting. To address this, the authors embed the same penalty structure used in AIC, AICc, and BIC into the g%SD framework, creating three new criteria: g%AIC, g%AICc, and g%BIC. In these formulations the “g%” prefix preserves the original percentage‑scale interpretation while the penalty term (2k, 2k + 2k(k+1)/(n‑k‑1), or k log n respectively) is added to the g%SD value, thereby balancing fit quality against the number of free parameters.
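As a rough illustration of the three penalty terms quoted above, the following Python sketch evaluates them for a given number of free parameters k and number of data points n. The function name `penalty_terms` and the example values are hypothetical. How the paper rescales these penalties onto the g% scale is not specified in this summary, so that step is not attempted here.

```python
import math


def penalty_terms(k: int, n: int) -> dict:
    """Return the standard AIC, AICc, and BIC penalty terms.

    k: number of free parameters; n: number of data points.
    Illustrative helper (hypothetical name), not code from the paper.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k                                  # AIC penalty: 2k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = k * math.log(n)                        # BIC penalty: k ln n
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


if __name__ == "__main__":
    # Example values chosen only for illustration.
    for name, value in penalty_terms(k=3, n=20).items():
        print(f"{name}: {value:.3f}")
```

The AICc penalty always exceeds the AIC penalty and converges to it as n grows, while the BIC penalty exceeds the AIC penalty whenever ln n > 2, that is, for n of 8 or more.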

Beyond adapting existing criteria, the authors replace the conventional sum‑of‑squares (SS) component of g%SD with a more flexible member of the power‑divergence family. Power‑divergence statistics, indexed by a parameter λ, encompass a continuum of distance measures: λ = 0 yields the likelihood‑ratio statistic G² (minimizing it corresponds to maximum likelihood), λ = 1 yields Pearson’s X² (minimum X² fitting), and intermediate λ values provide hybrid measures that can be tuned to the distributional properties of the data. By inserting a power‑divergence distance into the g%SD calculation, they derive a family of “g%PD” (g% power‑divergence) statistics that inherit the intuitive percentage scale of g%SD while allowing the analyst to select the λ that best matches the data’s noise structure.
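To make the role of λ concrete, here is a minimal Python sketch of the standard Cressie‑Read power‑divergence statistic, 2/(λ(λ+1)) Σ O_i[(O_i/E_i)^λ − 1], with the λ → 0 and λ → −1 limits handled separately. The function name `power_divergence` and the example counts are assumptions made for illustration. This is the textbook statistic only, not the paper's g%PD, whose normalization is not given in this summary.

```python
import math
from typing import Sequence


def power_divergence(observed: Sequence[float],
                     expected: Sequence[float],
                     lam: float) -> float:
    """Cressie-Read power-divergence statistic for positive values.

    lam = 0 gives the likelihood-ratio statistic G^2 (as a limit);
    lam = 1 gives Pearson's X^2 when the two sequences have equal sums.
    Illustrative helper (hypothetical name), not code from the paper.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    if any(o <= 0 for o in observed) or any(e <= 0 for e in expected):
        raise ValueError("this sketch assumes strictly positive values")
    pairs = list(zip(observed, expected))
    if lam == 0:
        # Limit as lam -> 0: G^2 = 2 * sum O * ln(O / E)
        return 2.0 * sum(o * math.log(o / e) for o, e in pairs)
    if lam == -1:
        # Limit as lam -> -1: 2 * sum E * ln(E / O)
        return 2.0 * sum(e * math.log(e / o) for o, e in pairs)
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in pairs)
    return 2.0 * total / (lam * (lam + 1.0))


if __name__ == "__main__":
    obs = [10.0, 20.0, 30.0]   # made-up example values
    exp = [20.0, 20.0, 20.0]   # same total as obs
    for lam in (0.0, 0.5, 1.0):
        print(f"lambda = {lam}: {power_divergence(obs, exp, lam):.4f}")
```

With equal totals in the two sequences, the λ = 1 value coincides with the familiar Σ (O − E)²/E, and values of λ close to 0 approach G² smoothly, which is the continuum described above.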

The methodological developments are illustrated with two empirical case studies. The first uses a matrix of genetic distances among a range of Jewish populations. When traditional AIC/BIC and the new g%AIC/g%BIC are applied, the same hierarchical models are selected, confirming that the new criteria behave consistently in a well‑behaved data set. However, when g%PD is computed with λ ≈ 0.5, subtle differences emerge: models that capture low‑level gene flow between specific sub‑populations receive slightly better scores, suggesting that the power‑divergence component can increase sensitivity to weak admixture signals that are down‑weighted in a pure SS framework.

The second, more biologically consequential, analysis examines a distance matrix derived from whole‑genome data of modern humans, Neanderthals, and Denisovans, together with a putative Homo erectus lineage. Conventional AIC/BIC favor models that treat Neanderthals, Denisovans, and modern humans as three distinct lineages with limited gene flow, essentially reproducing the standard “three‑population” narrative. In contrast, g%AIC and especially g%PD with λ = 0 (i.e., using the likelihood‑ratio distance) assign higher support to models that include a modest but statistically significant contribution of Homo erectus ancestry to the Denisovan genome, estimated at roughly 3–5 % of Denisovan alleles. This result aligns with recent paleogenomic hints of deep archaic introgression but had been obscured in earlier analyses that relied on sum‑of‑squares or unpenalized likelihood alone.

Statistically, the key insight is that g%SD‑based criteria retain the interpretability of a percentage error while incorporating a rigorously derived complexity penalty. The power‑divergence extension further allows the analyst to tailor the distance metric to the data’s error distribution, mitigating the impact of non‑Gaussian noise, sparse observations, or heteroscedastic variance that often plague ancient DNA studies. By varying λ, one can smoothly transition between a maximum‑likelihood focus (λ = 0) that is robust to small sample sizes and a χ²‑like focus (λ = 1) that emphasizes larger deviations.

The authors argue that this framework is broadly applicable beyond paleo‑genomics. Any field that models pairwise distances or dissimilarities—such as ecology (species‑by‑species distance matrices), epidemiology (genetic distance among pathogen strains), or even finance (correlation‑based distance among assets)—can benefit from a percentage‑scale fit measure that explicitly penalizes model complexity and can be tuned via λ to the underlying noise structure.

In conclusion, the paper makes three substantive contributions: (1) it bridges the gap between a percentage‑error fit statistic (g%SD) and classic information criteria, yielding g%AIC, g%AICc, and g%BIC; (2) it embeds the power‑divergence family within the g%SD framework, producing a flexible suite of g%PD statistics that can be adapted to diverse data distributions; and (3) it demonstrates, through concrete genetic distance analyses, that these new criteria can uncover biologically meaningful signals—such as a detectable Homo erectus contribution to Denisovan ancestry—that remain hidden under traditional sum‑of‑squares or unpenalized likelihood approaches. The work thus provides a powerful, interpretable, and adaptable toolkit for model selection in any discipline where distance‑based data are central.

