An analytical comparison of coalescent-based multilocus methods: The three-taxon case
Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths.
💡 Research Summary
The paper tackles a fundamental problem in phylogenomics: how to infer a species tree when gene trees are discordant because of incomplete lineage sorting (ILS). ILS occurs when ancestral polymorphism persists across successive speciation events, causing different loci to trace back to different coalescent histories. Over the past decade, a plethora of coalescent‑based methods have been proposed to address this issue, yet their theoretical properties remain incompletely understood, especially under realistic assumptions about gene‑tree reconstruction.
To obtain tractable analytical results, the authors focus on the simplest non‑trivial case: a three‑taxon species tree of topology ((A,B),C). They assume that each gene tree, together with its branch lengths, is reconstructed without error, thereby isolating the effect of the coalescent process itself. Under the standard neutral coalescent, the probability that the two lineages from taxa A and B fail to coalesce in the ancestral population of length τ (measured in coalescent units, τ = t/(2Nₑ) for t generations in a diploid population of effective size Nₑ) is e^(−τ). When that happens, all three lineages enter the root population, where each of the three pairs is equally likely to coalesce first. Consequently, the probability that the gene‑tree topology matches the species tree is 1 − (2/3)e^(−τ), and each of the two discordant topologies has probability (1/3)e^(−τ). These simple expressions become the cornerstone for deriving the performance of each method.
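As a small worked example of the three‑taxon coalescent probabilities, the sketch below starts from the e^(−τ) chance that the A and B lineages fail to coalesce on the internal branch and splits that mass evenly over the three possible first coalescences in the root population. The function name `topology_probabilities` is hypothetical and not taken from the paper.

```python
import math


def topology_probabilities(tau: float) -> dict:
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    tau is the internal branch length in coalescent units.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    # Probability that the A and B lineages do NOT coalesce on the internal branch.
    no_coalescence = math.exp(-tau)
    # In that case three lineages enter the root population and the first
    # pair to coalesce is uniform over the three possible pairs.
    discordant_each = no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant_each,  # concordant topology
        "((A,C),B)": discordant_each,
        "((B,C),A)": discordant_each,
    }


for tau in (0.1, 0.5, 1.0, 2.0):
    probs = topology_probabilities(tau)
    print(f"tau={tau:<4} concordant={probs['((A,B),C)']:.3f} "
          f"each discordant={probs['((A,C),B)']:.3f}")
```

At τ = 0 the three topologies are equally likely, and the concordant topology is strictly the most probable one for every τ > 0.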
The authors categorize the methods into four families:
- Distance‑based summary methods (STAR, STEAC, NJst). These compute an average pairwise distance matrix across loci and then apply a distance‑based tree algorithm. The analysis shows that the expected distance matrix is a linear combination of the true species distances weighted by the coalescent probabilities. When τ is large (≫ 1), the weight on the correct topology dominates, and the methods converge rapidly. However, for small τ the weight on discordant histories is substantial, inflating variance and leading to a high probability of selecting an incorrect tree. The authors derive an explicit error bound that decays roughly as exp(−L·τ²), where L is the number of loci.
- Maximum‑likelihood / pseudo‑likelihood methods (MP‑EST). These treat each gene tree as an independent observation from the multispecies coalescent and maximize a likelihood (or pseudo‑likelihood) function. By expanding the log‑likelihood around the true species tree, the authors obtain a quadratic form whose curvature depends on τ. The curvature is maximal when τ ≈ ln 2 (≈ 0.69 coalescent units), implying that the methods are most informative at intermediate branch lengths. For τ → 0 or τ → ∞ the curvature vanishes, and the likelihood surface becomes flat, causing slower convergence. Closed‑form expressions for the probability of correct inference, P_correct ≈ 1 − exp(−L·f(τ)), are provided, where f(τ) is a τ‑dependent information function.
- Time‑based methods (GLASS). GLASS uses estimated coalescence times directly, constructing a species tree by clustering the earliest coalescence events across loci. The analysis demonstrates that GLASS is robust to very short τ because the timing information can still discriminate between the two possible discordant histories, provided that the variance of time estimates is low. However, the method’s accuracy deteriorates sharply when the variance of branch‑length estimates grows, which is typical for short loci or low sequencing depth.
- Quartet‑based methods (ASTRAL‑III, SVDquartets). These score four‑taxon subtrees (quartets) and search for the species tree that maximizes quartet support; with only three taxa, the unit of support reduces to the rooted triplet displayed by each gene tree. For the three‑taxon case, the count therefore reduces to a simple plurality vote among the three topologies. Writing q for the probability that a single gene tree is discordant with the species tree, the error probability is bounded above by a binomial tail, because the vote can fail only if at least half of the loci are discordant: P_error ≤ ∑_{k=⌈L/2⌉}^{L} (L choose k) q^k (1 − q)^{L−k}. Whenever q < 1/2 this bound declines exponentially with L, and the vote itself remains consistent for any τ > 0 because the concordant topology is then always the single most probable one.
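To make the vote‑counting argument concrete, the following sketch compares a binomial‑tail bound with the exact failure probability of a plurality vote over the three topologies. It assumes the standard three‑taxon result that a gene tree is discordant with probability q = (2/3)e^(−τ), split evenly between the two discordant topologies, and it counts ties as failures. The function names are hypothetical and this is an illustration, not the authors' derivation.

```python
import math


def discordance_prob(tau: float) -> float:
    """Probability that one gene tree is discordant with ((A,B),C)."""
    return (2.0 / 3.0) * math.exp(-tau)


def binomial_tail_bound(tau: float, n_loci: int) -> float:
    """P(at least half of the loci are discordant): an upper bound on failure."""
    q = discordance_prob(tau)
    start = math.ceil(n_loci / 2)
    return sum(
        math.comb(n_loci, k) * q**k * (1.0 - q) ** (n_loci - k)
        for k in range(start, n_loci + 1)
    )


def plurality_failure_prob(tau: float, n_loci: int) -> float:
    """Exact probability that the concordant topology is not the strict winner."""
    q = discordance_prob(tau)
    p_conc, p_disc = 1.0 - q, q / 2.0
    total = 0.0
    # Enumerate the multinomial counts (c, d1, d2) over the three topologies.
    for d1 in range(n_loci + 1):
        for d2 in range(n_loci - d1 + 1):
            c = n_loci - d1 - d2
            if max(d1, d2) >= c:  # a discordant topology ties or wins
                ways = math.comb(n_loci, d1) * math.comb(n_loci - d1, d2)
                total += ways * p_disc ** (d1 + d2) * p_conc**c
    return total


for n_loci in (10, 50, 200):
    exact = plurality_failure_prob(0.5, n_loci)
    bound = binomial_tail_bound(0.5, n_loci)
    print(f"L={n_loci:<4} exact={exact:.3e} bound={bound:.3e}")
```

The bound is loose when τ is small, since q can then exceed 1/2 even though the concordant topology is still the most frequent one in expectation; the exact enumeration does not have that limitation.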
Across all families, the paper derives a critical branch length τ* that separates the regime where a method recovers the species tree reliably from the regime where it may still return the wrong tree at a finite number of loci L. For distance‑based methods τ* ≈ 1.5 coalescent units, for pseudo‑likelihood methods τ* ≈ 0.8, and for quartet methods τ* ≈ 0 (i.e., they are consistent for any positive τ).
The authors complement the analytical work with extensive Monte‑Carlo simulations that vary τ, L, and the amount of branch‑length noise. The simulated error rates match the theoretical predictions closely, confirming the validity of the approximations. Notably, when τ is small (≤0.5) and L is modest (≤50), all methods exhibit substantial error, but quartet‑based approaches retain the lowest error, while distance‑based methods perform worst. As L increases to 200–500 loci, quartet and pseudo‑likelihood methods converge to near‑perfect accuracy, whereas distance‑based methods still lag behind unless τ is large.
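A simulation of this general kind can be sketched in a few lines. The code below is a simplified illustration rather than the authors' simulation setup: it draws exact (noise‑free) pairwise coalescence times under the three‑taxon coalescent, then compares a plurality vote over gene‑tree topologies with a GLASS‑like rule that picks the pair with the smallest minimum coalescence time across loci. The tip branch length `t_ab`, the replicate count, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

PAIRS = ("AB", "AC", "BC")


def simulate_locus(tau, t_ab, rng):
    """Pairwise coalescence times (coalescent units) for one locus on ((A,B),C)."""
    root = t_ab + tau  # time at which the root population begins
    wait = rng.expovariate(1.0)  # two lineages coalesce at rate 1
    if wait < tau:
        # A and B coalesce on the internal branch; C joins later in the root.
        first_pair, t_first = "AB", t_ab + wait
        t_second = root + rng.expovariate(1.0)
    else:
        # Three lineages reach the root: rate 3, then a uniformly chosen pair.
        first_pair = rng.choice(PAIRS)
        t_first = root + rng.expovariate(3.0)
        t_second = t_first + rng.expovariate(1.0)
    return {p: (t_first if p == first_pair else t_second) for p in PAIRS}


def plurality_pair(loci):
    """Sister pair supported by the most gene trees (ties go to first seen)."""
    votes = Counter(min(times, key=times.get) for times in loci)
    return votes.most_common(1)[0][0]


def glass_pair(loci):
    """GLASS-like rule: pair with the smallest minimum coalescence time."""
    min_times = {p: min(times[p] for times in loci) for p in PAIRS}
    return min(min_times, key=min_times.get)


def error_rates(tau, n_loci, t_ab=1.0, reps=500, seed=0):
    """Fraction of replicates in which each rule misses the true pair AB."""
    rng = random.Random(seed)
    wrong = Counter()
    for _ in range(reps):
        loci = [simulate_locus(tau, t_ab, rng) for _ in range(n_loci)]
        wrong["plurality"] += plurality_pair(loci) != "AB"
        wrong["glass"] += glass_pair(loci) != "AB"
    return {name: wrong[name] / reps for name in ("plurality", "glass")}


for tau in (0.1, 0.5, 2.0):
    for n_loci in (10, 100):
        print(tau, n_loci, error_rates(tau, n_loci))
```

Because the coalescence times here are exact, the sketch does not reproduce the effect of branch‑length noise; adding a perturbation to the simulated times would be one way to explore that.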
The discussion emphasizes practical implications. First, researchers should estimate the expected internal branch length (e.g., via fossil calibrations or preliminary analyses) before selecting a method. If τ is suspected to be short, quartet‑based or pseudo‑likelihood methods are advisable. Second, the assumption of perfectly reconstructed gene trees is unrealistic; the authors outline how gene‑tree estimation error can be incorporated into the analytical framework, predicting a shift of τ* toward larger values. Third, while the three‑taxon case is analytically tractable, extending the results to larger trees will require approximations or simulation‑based calibration, but the qualitative insights (e.g., the advantage of quartet support) are expected to hold.
In summary, the paper provides the first rigorous, closed‑form comparison of several leading coalescent‑based species‑tree methods under a common, analytically solvable scenario. By linking method performance directly to the coalescent branch length τ and the number of loci L, it offers clear guidance for method choice and highlights the inherent trade‑offs between using distance summaries, likelihood frameworks, timing information, or quartet aggregation. The work lays a solid theoretical foundation for future extensions to more complex species trees and for incorporating realistic sources of error such as gene‑tree uncertainty and model misspecification.