Likelihood-free inference of phylogenetic tree posterior distributions
Phylogenetic inference, the task of reconstructing how related sequences evolved from common ancestors, is a central objective in evolutionary genomics. The current state-of-the-art methods exploit probabilistic models of sequence evolution along phylogenetic trees, either by searching for the tree that maximizes the likelihood of the observed sequences, or by estimating the posterior of the tree given the sequences in a Bayesian framework. Both approaches typically require computing likelihoods, which is only feasible under simplifying assumptions such as independent evolution at the different positions of the sequence, and even then remains a costly operation. Here we present the first likelihood-free inference method for posterior distributions over phylogenies. It exploits a novel expressive encoding for pairs of sequences, and a parameterized probability distribution factorized over a succession of subtree merges. The resulting network provides well-calibrated estimates of the posterior distribution, leading to more accurate tree topologies than existing methods, even under models amenable to likelihood computation. We further show that its edge against likelihood-based methods dramatically increases under models of sequence evolution with intractable likelihoods.
💡 Research Summary
The paper introduces Phyloformer 2, a novel likelihood‑free Bayesian inference framework for phylogenetic tree posterior distributions. Traditional phylogenetic methods rely on explicit likelihood calculations under continuous‑time Markov models of sequence evolution. Computing these likelihoods via Felsenstein’s pruning algorithm is computationally intensive, especially as the number of taxa and sequence length increase, and it forces simplifying assumptions such as site independence and homogeneous substitution processes. Phyloformer 2 circumvents the need for any likelihood evaluation by employing Neural Posterior Estimation (NPE), training a neural network to directly approximate the posterior distribution p(θ | x) where θ denotes a tree (topology τ and branch lengths ℓ) and x is a multiple sequence alignment (MSA).
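The NPE training objective can be written out explicitly. The following is a sketch of the standard NPE argument, not a transcription of the paper's equations: q_ψ denotes the network's approximate posterior, p(θ) the prior over trees, and M (a symbol introduced here for illustration) the number of simulated training pairs.

```latex
\begin{aligned}
\mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p(\theta \mid x)\,\|\,q_\psi(\theta \mid x)\big)\Big]
  &= \mathbb{E}_{p(\theta, x)}\big[\log p(\theta \mid x)\big]
   - \mathbb{E}_{p(\theta, x)}\big[\log q_\psi(\theta \mid x)\big] \\
  &= \mathrm{const} - \mathbb{E}_{p(\theta, x)}\big[\log q_\psi(\theta \mid x)\big], \\[4pt]
\widehat{\mathcal{L}}(\psi)
  &= -\frac{1}{M}\sum_{m=1}^{M} \log q_\psi(\theta_m \mid x_m),
  \qquad \theta_m \sim p(\theta),\quad x_m \sim p(x \mid \theta_m).
\end{aligned}
```

The first term does not depend on the network, so minimizing the expected KL divergence amounts to maximizing the average log-probability of simulated (x, θ) pairs. Only samples from p(x | θ) appear in the estimate; the likelihood itself is never evaluated.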
The architecture consists of two key modules: evoPF and BayesNJ. evoPF is a transformer‑inspired encoder derived from AlphaFold 2’s EvoFormer, but adapted for phylogenetics. It processes the MSA through two parallel stacks. The MSA stack maintains per‑position embeddings for each sequence and applies alternating column‑wise and row‑wise gated self‑attention, allowing information to flow both within a sequence (across sites) and across sequences (at the same site). Simultaneously, a pair stack holds embeddings for every unordered pair of sequences and updates them via self‑attention across all pairs. The two stacks interact: the outer‑product mean of sequence embeddings is added to pair embeddings, and a linear bias derived from pair embeddings is injected into the column‑wise attention scores. After twelve such blocks, the per‑position embeddings are averaged across sites, yielding a single vector per sequence. Collectively these vectors constitute ψ(x), a compact representation of the entire alignment.
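The two operations that tie the stacks together, the outer-product mean and the final averaging over sites, can be illustrated with a minimal NumPy sketch. It is an assumption-laden simplification rather than the paper's implementation: the function names, the projection matrix `w`, the tensor sizes, and the use of a full N × N pair tensor (instead of one entry per unordered pair) are all hypothetical, and attention, gating, and normalization are omitted.

```python
import numpy as np

def outer_product_mean(msa_emb, w):
    """Turn per-site sequence embeddings into a pair update.

    msa_emb: (n_seq, n_sites, d) per-position embeddings.
    w:       (d * d, d_pair) hypothetical projection matrix.
    Returns an array of shape (n_seq, n_seq, d_pair).
    """
    n_seq, n_sites, d = msa_emb.shape
    # Outer product of the embeddings of sequences i and j at each site,
    # averaged over sites.
    outer = np.einsum("isa,jsb->ijab", msa_emb, msa_emb) / n_sites
    return outer.reshape(n_seq, n_seq, d * d) @ w

def summarize_sequences(msa_emb):
    """Average per-position embeddings over sites: one vector per sequence."""
    return msa_emb.mean(axis=1)

rng = np.random.default_rng(0)
n_seq, n_sites, d, d_pair = 4, 10, 3, 5  # illustrative sizes only
msa_emb = rng.normal(size=(n_seq, n_sites, d))
pair_emb = np.zeros((n_seq, n_seq, d_pair))
w = rng.normal(size=(d * d, d_pair))

# The pair stack receives the outer-product mean as an additive update.
pair_emb = pair_emb + outer_product_mean(msa_emb, w)
# After the last block, each sequence is reduced to a single vector.
seq_vectors = summarize_sequences(msa_emb)
print(pair_emb.shape, seq_vectors.shape)
```

Averaging over sites makes the size of the summary independent of alignment length, so the downstream tree model sees the same shape for short and long alignments.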
BayesNJ takes ψ(x) as input and defines a probabilistic model over tree construction as a sequence of merges. Starting with N leaves, at each step k a candidate pair m(k) is selected from a set C(k) of “mergeable” nodes. The selection probability is parameterized by a softmax whose logits are linear functions of ψ(x). Once a pair is merged, a new internal node u(k) is created and the representation of the merged cluster is updated. Branch lengths associated with the two new edges are modeled by Beta/Gamma distributions whose parameters are also derived from ψ(x). The product of all merge‑step probabilities and branch‑length densities defines qψ(θ | x), an explicit, tractable distribution over full tree topologies and branch lengths. Because this density can be evaluated exactly and the model is fully differentiable, training proceeds by minimizing the expected KL divergence between the true posterior and qψ, which reduces to maximizing the average log‑probability of sampled (x, θ) pairs under qψ. Crucially, the training data are generated by forward simulation from a prior over trees and a chosen evolutionary model; no evaluation of p(x | θ) is required, making the approach applicable to models where the likelihood is intractable (e.g., site‑dependent rates, selection, recombination).
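The merge-based factorization can be made concrete with a small sketch that evaluates the log-density of one tree as a sum over merge steps. Everything specific in it is an assumption made for illustration: every pair of current nodes is treated as mergeable, the logit of a pair is a linear function of the sum of the two node vectors, a merged cluster is represented by the mean of its children, and each branch length gets a Gamma density. The names `tree_log_prob`, `w_topo`, and `w_len` are hypothetical and do not come from the paper.

```python
import math
from itertools import combinations

import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def gamma_logpdf(x, shape, rate):
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)

def tree_log_prob(leaf_vecs, merges, w_topo, w_len):
    """Log-density of a tree given as a list of merge steps.

    leaf_vecs: (N, d) one vector per leaf (stand-in for the encoder output).
    merges:    list of ((i, j), (len_i, len_j)); leaves are numbered 0..N-1
               and each merge creates the next node id N, N+1, ...
    w_topo:    (d,) weights of the linear pair logit (assumption).
    w_len:     (2, d) weights giving log-shape and log-rate (assumption).
    """
    nodes = {i: v for i, v in enumerate(leaf_vecs)}
    next_id = len(leaf_vecs)
    total = 0.0
    for (i, j), lengths in merges:
        # Softmax over all candidate pairs among the current nodes.
        cands = list(combinations(sorted(nodes), 2))
        logits = np.array([w_topo @ (nodes[a] + nodes[b]) for a, b in cands])
        total += log_softmax(logits)[cands.index((min(i, j), max(i, j)))]
        # Density of the two new branch lengths.
        for child, length in zip((i, j), lengths):
            shape, rate = np.exp(w_len @ nodes[child])
            total += gamma_logpdf(length, shape, rate)
        # Replace the two merged nodes by their parent.
        nodes[next_id] = 0.5 * (nodes.pop(i) + nodes.pop(j))
        next_id += 1
    return total

rng = np.random.default_rng(0)
d = 4
leaf_vecs = rng.normal(size=(3, d))
w_topo = rng.normal(size=d)
w_len = 0.1 * rng.normal(size=(2, d))
# Merge leaves 0 and 1 into node 3, then merge leaf 2 with node 3.
merges = [((0, 1), (0.1, 0.2)), ((2, 3), (0.3, 0.4))]
log_q = tree_log_prob(leaf_vecs, merges, w_topo, w_len)
print(log_q)
```

Because every factor is a normalized probability or density, the sum is an exact log-density of the tree, which is what makes the maximum-log-probability training objective usable without any likelihood evaluation.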
Empirical evaluation covers two regimes. In the first, data are simulated under standard GTR + Γ models where the likelihood is tractable. Phyloformer 2 outperforms state‑of‑the‑art likelihood‑based methods (IQ‑TREE, RAxML) and earlier likelihood‑free approaches (original Phyloformer, variational Bayes) in terms of Robinson‑Foulds distance and the calibration of posterior credible intervals. In the second regime, the authors introduce complex, non‑standard evolutionary models that render the likelihood computationally prohibitive. Here, likelihood‑based baselines suffer dramatic accuracy loss, whereas Phyloformer 2 maintains low topological error, typically two to three times smaller than that of misspecified likelihood methods. Moreover, because inference is amortized, a trained Phyloformer 2 model can generate posterior samples for a new MSA in seconds on a single V100 GPU, representing a 10‑ to 100‑fold speedup over traditional MCMC or maximum‑likelihood pipelines.
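For readers unfamiliar with the topological metric, the Robinson‑Foulds distance counts the bipartitions of the leaf set that appear in one tree but not in the other. The sketch below is an illustrative implementation, not the evaluation code used in the paper; it assumes trees are given as nested tuples of leaf names and that both trees share the same leaf set, and the helper names are hypothetical.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree`, recording every internal clade in `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves

def bipartitions(tree):
    """Non-trivial splits of the unrooted tree, each stored as the side
    that does not contain a fixed reference leaf."""
    clades = set()
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)
    splits = set()
    for clade in clades:
        side = clade if ref not in clade else all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits

def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))

true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(true_tree, inferred))  # prints 2
```

In this example the two trees agree on the split separating D and E from the rest and disagree on one split each, giving a distance of 2.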
The paper’s contributions are threefold: (1) a fully expressive, likelihood‑free parameterization of tree posteriors via a merge‑based factorization; (2) the evoPF encoder, which efficiently captures both per‑site and pairwise evolutionary signals while scaling to hundreds of sequences on modest GPU memory; and (3) a demonstration that amortized inference yields both higher accuracy and orders‑of‑magnitude faster predictions, especially under complex models where likelihood calculations break down. Limitations include reliance on simulated training data, which may not capture all nuances of real biological datasets, and the current restriction to binary trees with a fixed merge order. Future work should explore domain‑adaptation techniques, extensions to multifurcating or network‑like phylogenies, and extensive benchmarking on empirical genomic data.