Analysis of comparative data with hierarchical autocorrelation
The asymptotic behavior of estimates and information criteria in linear models is studied in the context of hierarchically correlated sampling units. The work is motivated by biological data collected on species, where autocorrelation is based on the species’ genealogical tree. Hierarchical autocorrelation is also found in many other kinds of data, such as data from microarray experiments or on human languages. Similar correlation also arises in ANOVA models with nested effects. I show that the best linear unbiased estimators are almost surely convergent but may not be consistent for some parameters, such as the intercept and lineage effects, in the context of Brownian motion evolution on the genealogical tree. For the purpose of model selection, I show that the usual BIC does not provide an appropriate approximation to the posterior probability of a model. To correct for this, an effective sample size is introduced for parameters that are inconsistently estimated. For biological studies, this work implies that tree-aware sampling design is desirable; adding more sampling units may not help ancestral reconstruction, and only strong lineage effects may be detected with high power.
💡 Research Summary
The paper investigates the asymptotic properties of estimators and information criteria for linear models when the sampling units are hierarchically correlated through a known phylogenetic tree. Under the Brownian motion (BM) model of trait evolution, the observed trait vector Y follows a multivariate normal distribution with mean μ (the ancestral state at the root) and covariance σ²Vtree, where Vtree encodes shared evolutionary time between each pair of tips. The linear model Y = Xβ + ε with ε ∼ N(0, σ²Vtree) is examined, assuming the design matrix X has full rank.
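As a rough illustration of how Vtree encodes shared evolutionary time, the sketch below builds the matrix for a small tree. The tree, its branch lengths, the parent-map representation, and the function name `bm_covariance` are all hypothetical choices made for this example, not taken from the paper.

```python
import numpy as np

def bm_covariance(parent, length, tips):
    """Shared-path-length matrix for the tips of a rooted tree.

    parent[v] is the parent of node v (None for the root) and length[v] is
    the length of the branch leading to v. Entry (i, j) is the total branch
    length shared by the root-to-tip paths of tips i and j.
    """
    def branches_above(v):
        nodes = set()
        while parent[v] is not None:
            nodes.add(v)          # node v stands for the branch above it
            v = parent[v]
        return nodes

    paths = [branches_above(t) for t in tips]
    n = len(tips)
    V = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            V[i, j] = sum(length[v] for v in paths[i] & paths[j])
    return V

# Hypothetical 4-tip tree: root -> u, w; u -> a, b; w -> c, d.
parent = {"root": None, "u": "root", "w": "root",
          "a": "u", "b": "u", "c": "w", "d": "w"}
length = {"u": 0.5, "w": 0.75, "a": 0.5, "b": 0.5, "c": 0.25, "d": 0.25}

V_tree = bm_covariance(parent, length, ["a", "b", "c", "d"])
print(V_tree)
```

Tips in different root clades share no history, so their covariance entry is zero; tips a and b share only the branch above u, and tips c and d share only the branch above w.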
The first major result (Theorem 1) shows that the best linear unbiased estimator (BLUE) β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹Y converges almost surely and in L² as the number of tips n grows, provided the root of the tree is fixed while new tips are added. However, a component of β̂ converges to the true parameter value only if the asymptotic variance of that component is zero. For the BM tree, the intercept (the root state) and any lineage‑specific effects have non‑vanishing asymptotic variance; consequently their estimators converge to random limits rather than to the true values. In other words, the estimators of these parameters are inconsistent despite the increasing sample size.
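The BLUE formula and the non-vanishing variance of the intercept can be sketched numerically. The example assumes a hypothetical tree with two root branches of length 0.5, each leading to m tips on pendant branches of length 0.5; the helper names `blue` and `two_clade_V` are illustrative only.

```python
import numpy as np

def blue(X, V, Y):
    """Generalized least squares estimate (X' V^-1 X)^-1 X' V^-1 Y.

    Also returns (X' V^-1 X)^-1, i.e. Var(beta_hat) divided by sigma^2.
    """
    Vinv_X = np.linalg.solve(V, X)            # V^-1 X without forming V^-1
    XtVinvX = X.T @ Vinv_X
    beta_hat = np.linalg.solve(XtVinvX, Vinv_X.T @ Y)  # V is symmetric
    cov_unscaled = np.linalg.inv(XtVinvX)
    return beta_hat, cov_unscaled

def two_clade_V(m, stem=0.5, pendant=0.5):
    """Covariance for two root clades, each a stem branch followed by m tips."""
    block = stem * np.ones((m, m)) + pendant * np.eye(m)
    zero = np.zeros((m, m))
    return np.block([[block, zero], [zero, block]])

for m in (1, 10, 100):
    V = two_clade_V(m)
    X = np.ones((2 * m, 1))                   # intercept-only model
    _, cov = blue(X, V, np.zeros(2 * m))
    print(2 * m, cov[0, 0])                   # Var(intercept estimate) / sigma^2
```

For this particular tree the scaled variance equals (1 + m)/(4m), which tends to 0.25 rather than 0 as m grows: adding tips inside the two clades cannot remove the uncertainty contributed by the two root branches.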
To quantify the loss of information, the author introduces an “effective sample size” ne for each inconsistently estimated parameter. The value of ne depends on the tree topology and branch lengths: it is bounded above by k·T/t, where k is the number of branches emanating from the root, T is the average distance from the root to the tips, and t is the length of the shortest root branch. Empirical calculations give ne≈5.5 on a 25‑species plant dataset and ne≈6.1 on a 49‑species mammal dataset, a 4‑ to 8‑fold reduction relative to the raw number of species. This reduced ne widens confidence intervals for ancestral state estimates and sharply lowers the power to detect lineage shifts unless the shifts are large.
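The summary does not give a formula for ne. The sketch below assumes, for the intercept only, the definition ne = T·1ᵀV⁻¹1, chosen so that n independent tips (a star tree) give ne = n; this definition, the example tree, and the function names are assumptions made for illustration. Under that assumption the bound k·T/t can be checked directly.

```python
import numpy as np

def effective_sample_size(V):
    """Intercept n_e, ASSUMED here to be T * 1' V^-1 1.

    T is taken as the mean root-to-tip distance, i.e. the mean of diag(V).
    """
    ones = np.ones(V.shape[0])
    T = np.mean(np.diag(V))
    return T * (ones @ np.linalg.solve(V, ones))

def upper_bound(k, T, t):
    """Bound k * T / t quoted in the summary above."""
    return k * T / t

# Hypothetical tree: k = 2 root branches of length t = 0.5, each leading to
# m = 50 tips on pendant branches of length 0.5, so T = 1.
m, t = 50, 0.5
block = t * np.ones((m, m)) + 0.5 * np.eye(m)
zero = np.zeros((m, m))
V = np.block([[block, zero], [zero, block]])

ne = effective_sample_size(V)
bound = upper_bound(k=2, T=1.0, t=t)
print(ne, bound)
```

With 100 tips in this tree, ne stays below the bound of 4, which mirrors the gap between the nominal and effective sample sizes reported for the plant and mammal datasets.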
The paper then critiques the conventional Bayesian Information Criterion (BIC), which penalizes each parameter by log n. Because the intercept and lineage effects are estimated with far fewer effective observations, the log n penalty is far too harsh and does not approximate the model posterior probability. The author proposes replacing log n with log(1 + ne) for each inconsistent parameter, yielding a bounded penalty that more accurately reflects the true information content. For the plant example, the intercept receives a penalty of log(1 + 5.54), and a shift on a specific lineage receives log(1 + 2.72).
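The size of the adjustment can be illustrated with the plant-example values quoted above. The function `bic_penalty` and the two-parameter comparison are a hypothetical sketch of the penalty term only, not the paper's full criterion.

```python
import math

def bic_penalty(n, n_consistent, ne_values):
    """Penalty term of a BIC-type criterion.

    Each consistently estimated parameter contributes log(n); each
    inconsistently estimated parameter contributes log(1 + ne) instead.
    """
    return n_consistent * math.log(n) + sum(math.log1p(ne) for ne in ne_values)

n = 25                                        # species in the plant example
standard = 2 * math.log(n)                    # usual penalty for two parameters
adjusted = bic_penalty(n, 0, [5.54, 2.72])    # intercept and one lineage shift

print(f"standard: {standard:.3f}  adjusted: {adjusted:.3f}")
```

Both penalties are on the same scale as −2 times the log-likelihood, so the smaller adjusted penalty makes it easier for a model with an intercept or lineage shift to be preferred than under the usual log n penalty.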
A substantial portion of the work is devoted to sampling design. Since tips that are close to the root receive larger weights in the BLUE, including fossil taxa, early viral isolates, or any early‑branching lineages can substantially increase ne. Simulations demonstrate that, for both the plant and mammal trees, selecting about 15 well‑chosen tips (out of 25 or 49) yields ne values near the theoretical maximum, whereas random subsets achieve much lower ne. The optimal design therefore retains the k root branches and minimizes the lengths of those branches by favoring early‑branching lineages.
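The design argument can be illustrated on a toy tree. The sketch assumes the intercept effective sample size is ne = T·1ᵀV⁻¹1 with T the mean root-to-tip distance (an assumption, as the summary gives no formula), and compares adding one tip inside an existing clade with adding one early-branching tip attached to the root by a short branch. The tree and all names are hypothetical.

```python
import numpy as np

def effective_sample_size(V):
    """ASSUMED definition: mean root-to-tip distance times 1' V^-1 1."""
    ones = np.ones(V.shape[0])
    return np.mean(np.diag(V)) * (ones @ np.linalg.solve(V, ones))

def clade(m, stem=0.5, pendant=0.5):
    """Covariance block for a root branch (stem) followed by m tips."""
    return stem * np.ones((m, m)) + pendant * np.eye(m)

def block_diag(*blocks):
    """Stack covariance blocks of clades that share no history."""
    n = sum(b.shape[0] for b in blocks)
    V = np.zeros((n, n))
    i = 0
    for b in blocks:
        m = b.shape[0]
        V[i:i + m, i:i + m] = b
        i += m
    return V

base = block_diag(clade(10), clade(10))                        # 20 tips
within = block_diag(clade(11), clade(10))                      # extra tip in a clade
early = block_diag(clade(10), clade(10), np.array([[0.25]]))   # extra early tip

for name, V in [("base", base), ("within", within), ("early", early)]:
    print(name, V.shape[0], effective_sample_size(V))
```

In this toy setting the tip added within a clade barely changes ne, while the early-branching tip adds a third root branch and roughly doubles it, consistent with the recommendation to favor early-branching lineages.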
Although the analysis is carried out primarily under the BM assumption, the author notes that the results extend to Ornstein–Uhlenbeck (OU) models and other evolutionary processes that generate similar covariance structures. Moreover, the same hierarchical autocorrelation framework applies to nested ANOVA models, indicating a broad relevance beyond phylogenetics.
In conclusion, the paper demonstrates that hierarchical autocorrelation induced by a phylogenetic tree fundamentally alters the asymptotic behavior of linear model estimators. Some parameters become inconsistent, effective sample sizes are dramatically smaller than the nominal number of taxa, and standard model‑selection criteria must be adjusted accordingly. The introduction of ne and the revised BIC penalty provide practical tools for researchers, while the emphasis on tree‑aware sampling designs offers a pathway to more powerful ancestral reconstructions and detection of evolutionary shifts. This work thus bridges a gap between evolutionary biology and rigorous statistical theory, offering guidance for future comparative studies across diverse fields.