Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods
With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, “gene tree heterogeneity”, in which different genomic regions have different evolutionary histories, is a common phenomenon in multi-locus datasets, and it reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be “statistically consistent”). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.
💡 Research Summary
The paper investigates a fundamental limitation of widely used species‑tree inference methods when the number of sites per locus is bounded—a realistic scenario in genomic studies where recombination‑free loci are short. The authors work under the MSC+CFN model, which couples the multispecies coalescent (MSC) governing gene‑tree variation with the symmetric two‑state Cavender‑Farris‑Neyman (CFN) substitution model (a binary analogue of Jukes‑Cantor). This framework captures the essential sources of gene‑tree heterogeneity (incomplete lineage sorting) while remaining mathematically tractable.
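As a rough illustration of the MSC half of the model, the sketch below computes the distribution of true unrooted gene-tree topologies for four taxa. It assumes the standard MSC quartet formula, in which the topology matching the species tree has probability 1 − (2/3)e^(−T) for an internal branch of length T in coalescent units. The function name and the example value of T are illustrative choices, not taken from the paper.

```python
import math

def msc_quartet_probs(internal_length):
    """Gene-tree quartet probabilities under the MSC for species tree ab|cd.

    internal_length: length T of the internal branch, in coalescent units.
    The function name is hypothetical; the formula is the standard
    four-taxon MSC result, assumed here rather than quoted from the paper.
    """
    if internal_length < 0:
        raise ValueError("branch length must be non-negative")
    # Each of the two discordant quartets has probability exp(-T) / 3.
    discordant = math.exp(-internal_length) / 3.0
    return {
        "ab|cd": 1.0 - 2.0 * discordant,  # matches the species tree
        "ac|bd": discordant,
        "ad|bc": discordant,
    }

probs = msc_quartet_probs(0.5)  # T = 0.5 is an arbitrary example value
```

For any T > 0 the matching quartet is the most probable true gene tree, which is what summary methods exploit. The negative results summarized here concern gene trees estimated from short sequences, not the true gene trees.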
Three families of methods are examined:
- Fully partitioned maximum likelihood (ML) – each locus receives its own set of branch‑length parameters, but a single tree topology is estimated across all loci.
- Topology‑based summary methods – gene trees are first estimated (typically by ML) and then combined using a coalescent‑aware summary algorithm (e.g., ASTRAL, MP‑EST, NJst). The authors define “reasonable” summary methods as those that, for four taxa, return the most frequent quartet topology among the input gene trees.
- Weighted statistical binning (WSB) pipelines – gene trees are grouped into bins based on bootstrap support, supergene trees are inferred on each bin (using fully partitioned ML), and the resulting trees are fed to a summary method. A variant, WSB*, discards genes lacking any high‑support edge before binning.
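The “reasonable” summary-method condition for four taxa can be sketched as a plurality vote over gene-tree quartets. This is a minimal sketch, not ASTRAL, MP-EST, or NJst; the function name, the string encoding of topologies, and the choice to return `None` on a tie are assumptions made for illustration.

```python
from collections import Counter

def plurality_quartet(gene_tree_quartets):
    """Return the most frequent quartet topology among the input gene trees.

    gene_tree_quartets: iterable of topology labels such as "ab|cd".
    Returns None when the top count is tied (tie handling is an assumption;
    the summary only says the most frequent topology is returned).
    """
    counts = Counter(gene_tree_quartets)
    if not counts:
        raise ValueError("need at least one gene tree")
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no strictly most frequent topology
    return ranked[0][0]
```

With hypothetical input counts of 40 `ab|cd`, 45 `ac|bd`, and 15 `ad|bc`, the function returns `ac|bd`. The output is only as good as the estimated gene trees it is given.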
The negative results are driven by long‑branch attraction (LBA). Using a four‑taxon “Felsenstein zone” tree (ab|cd), in which two non‑sister taxa (here a and c) sit on long branches while the remaining branches are short, the authors show that, even when every locus evolves under the same underlying species tree, the limited sequence length per locus biases the ML estimate of each gene tree toward the incorrect topology (ac|bd). This bias persists regardless of how many loci are sampled.
The main theorems are:
- Theorem 1 (Partitioned ML inconsistency) – For any fixed site count L, there exists a species tree such that, as the number of loci m → ∞, the fully partitioned ML estimator converges with probability 1 to a topology different from the true one. Thus, partitioned ML is not statistically consistent under bounded L and can be positively misleading.
- Theorem 2 (Summary‑method inconsistency) – Any reasonable summary method that takes ML gene trees as input also fails to be consistent under the same bounded‑L regime. The dominant (most frequent) quartet among the biased gene trees is the wrong one, leading the summary method to output an incorrect species tree with probability approaching 1.
- Theorem 3 (WSB inconsistency) – When each locus contains a single site (L = 1), the WSB pipeline produces a flat distribution of supergene trees for a suitable bootstrap threshold B < 1. Consequently, any reasonable summary method applied to this distribution does not converge to the true tree.
- Theorem 4 (WSB* positively misleading) – Even after discarding genes lacking high‑support edges (WSB*), there exist choices of B and species‑tree parameters such that the pipeline followed by a reasonable summary method converges almost surely to the wrong topology.
The proofs rely on explicit calculations of site‑pattern probabilities under the CFN model, exploiting the fact that with short sequences the variance of the ML estimator is large enough that the LBA bias dominates the signal. By a continuity argument, the authors extend the results from the extreme case of a single site per locus to any finite L.
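To give a feel for the kind of site-pattern calculation involved, the following sketch enumerates exact CFN probabilities of the three 2-2 split patterns on a four-taxon tree ab|cd. It is an illustration under assumed parameters, not the authors' proof: the function name and the per-edge change probabilities (0.4 on the long branches to a and c, 0.05 elsewhere) are hypothetical.

```python
from itertools import product

def cfn_split_pattern_probs(p_a, p_b, p_c, p_d, p_int):
    """Exact CFN probabilities of the three 2-2 site patterns on tree ab|cd.

    Each argument is the probability of a state change along one edge
    (0 < p < 0.5). Internal node u joins a and b; internal node v joins
    c and d. The function name and parameterization are hypothetical.
    """
    def edge(p, x, y):
        # Probability of ending in state y given state x across an edge.
        return 1.0 - p if x == y else p

    probs = {"ab|cd": 0.0, "ac|bd": 0.0, "ad|bc": 0.0}
    for a, b, c, d in product((0, 1), repeat=4):
        # Sum over the unobserved states at the two internal nodes,
        # with a uniform distribution on the state at u.
        pr = sum(
            0.5 * edge(p_a, u, a) * edge(p_b, u, b) * edge(p_int, u, v)
            * edge(p_c, v, c) * edge(p_d, v, d)
            for u, v in product((0, 1), repeat=2)
        )
        if a == b and c == d and a != c:
            probs["ab|cd"] += pr   # pattern supporting the true split
        elif a == c and b == d and a != b:
            probs["ac|bd"] += pr   # pattern grouping the two long branches
        elif a == d and b == c and a != b:
            probs["ad|bc"] += pr
    return probs

# Assumed Felsenstein-zone parameters: long branches to a and c.
zone = cfn_split_pattern_probs(0.4, 0.05, 0.4, 0.05, 0.05)
# Control: all branches short.
control = cfn_split_pattern_probs(0.05, 0.05, 0.05, 0.05, 0.05)
```

Under the assumed Felsenstein-zone parameters, the pattern grouping the two long-branch taxa (`ac|bd`) is more probable than the pattern supporting the true split (`ab|cd`). In the control tree the true split pattern is the more probable one. With only a few sites per locus, an estimator sees mostly the misleading pattern, which is the intuition behind the bounded-L results.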
Implications are profound:
- Statistical consistency proofs for species‑tree methods typically assume both the number of loci and the number of sites per locus tend to infinity. This paper shows that dropping the latter assumption invalidates those guarantees, even for the simplest homogeneous models.
- In practice, genomic datasets consist of many short, recombination‑free loci. The findings suggest that current pipelines (partitioned ML, coalescent‑based summary, weighted binning) may systematically infer the wrong species tree, regardless of how many loci are included.
- Long‑branch attraction is not merely a nuisance for gene‑tree inference; it propagates through every downstream step when data per locus are limited. Consequently, methods that rely on accurate gene‑tree estimation (including binning strategies) inherit this bias.
- Future method development must explicitly account for bounded sequence length. Potential directions include models that incorporate branch‑length uncertainty, Bayesian integration over gene‑tree posteriors, or algorithms that directly infer species trees from site patterns without intermediate gene‑tree reconstruction.
In summary, the paper gives a rigorous, model‑based demonstration that the most common species‑tree inference strategies become statistically inconsistent, and even positively misleading, when each locus provides only a bounded amount of sequence data. This challenges the prevailing confidence in large‑scale phylogenomic analyses and calls for new approaches that remain reliable under realistic data constraints.