Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots
Although taxonomy is often used informally to evaluate the results of phylogenetic inference and find the root of phylogenetic trees, algorithmic methods to do so are lacking. In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a “subcoloring” problem for expressing the difference between the taxonomy and phylogeny at a given rank. This algorithm improves upon the current best algorithm in terms of asymptotic complexity for the parameter regime of interest; we also describe a branch-and-bound algorithm that saves orders of magnitude in computation on real data sets. Finally, we develop a formalism and an algorithm for rooting phylogenetic trees according to a taxonomy. All of these algorithms are implemented in freely available software.
💡 Research Summary
This paper addresses two practical problems that biologists routinely face when comparing taxonomic classifications with phylogenetic trees: (1) quantifying the discordance at a given taxonomic rank, and (2) rooting a phylogenetic tree in a way that respects the taxonomy. The authors formalize these tasks using the notion of convex colorings. A leaf‑coloring assigns each leaf a “color” corresponding to its taxonomic label at a chosen rank; the coloring is convex if the minimal subtrees connecting the leaves of each color are pairwise disjoint. Non‑convexity indicates discordance between the taxonomy and the phylogeny at that rank.
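The convexity condition can be illustrated with a minimal sketch. It assumes the tree is given as an undirected adjacency dictionary and that two color subtrees count as disjoint when they share no node; the function names and the four-leaf example tree are hypothetical and are not taken from the authors' software.

```python
from collections import deque

def spanning_nodes(adj, leaves):
    """Return the nodes of the minimal subtree connecting `leaves`."""
    start = leaves[0]
    parent = {start: None}
    queue = deque([start])
    while queue:  # breadth-first search records a parent for every node
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    nodes = set()
    for leaf in leaves:  # union of the paths from each leaf back to `start`
        node = leaf
        while node is not None and node not in nodes:
            nodes.add(node)
            node = parent[node]
    return nodes

def is_convex(adj, leaf_color):
    """True if the per-color spanning subtrees share no node."""
    by_color = {}
    for leaf, color in leaf_color.items():
        by_color.setdefault(color, []).append(leaf)
    owner = {}
    for color, leaves in by_color.items():
        for node in spanning_nodes(adj, leaves):
            if node in owner:  # node already claimed by another color
                return False
            owner[node] = color
    return True

# Hypothetical unrooted quartet ((a,b),(c,d)) with internal nodes u and v.
adj = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["c", "d", "u"],
}
concordant = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
discordant = {"a": "X", "b": "Y", "c": "X", "d": "Y"}
print(is_convex(adj, concordant))  # True
print(is_convex(adj, discordant))  # False
```

In the discordant coloring both colors need the internal edge between u and v, so their subtrees overlap and the coloring is not convex.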
Previous work (Moran & Snir, 2008‑2011) showed that finding a minimal recoloring (or equivalently a maximal convex sub‑coloring) is NP‑hard but fixed‑parameter tractable (FPT) when parameterized by τ, the total number of “bad” colors. Their algorithms have a runtime exponential in τ, which can be prohibitive because τ may be large even when the actual conflict is localized.
The key contribution of this paper is to replace τ with a more realistic local parameter β, defined as the maximum number of colors cut by any single edge. Empirical data on 16S rRNA and functional‑gene trees show that β is typically far smaller than τ, reflecting the fact that most taxonomic labels form large, mostly coherent clades with only a few outliers (e.g., mis‑labelled species or horizontally transferred genes).
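As a rough illustration of the parameter β, the sketch below computes the maximum number of colors cut by any edge, assuming that a color is cut by an edge when it has leaves on both sides of that edge. The function name, the adjacency-dictionary representation, and the example tree are assumptions made for this sketch, not the authors' implementation.

```python
from collections import Counter

def max_cut_colors(adj, leaf_color):
    """Maximum, over all edges, of the number of colors seen on both sides."""
    total = Counter(leaf_color.values())
    root = next(iter(adj))  # arbitrary rooting; the result does not depend on it
    parent = {root: None}
    order = []
    stack = [root]
    while stack:  # iterative depth-first traversal, parents before children
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    below = {u: Counter() for u in adj}  # color counts in the subtree under u
    beta = 0
    for u in reversed(order):  # children are processed before their parents
        if u in leaf_color:
            below[u][leaf_color[u]] += 1
        if parent[u] is None:
            continue
        # A color is cut by edge (u, parent) if only some of its leaves lie below u.
        cut = sum(1 for c, k in below[u].items() if 0 < k < total[c])
        beta = max(beta, cut)
        below[parent[u]].update(below[u])
    return beta

# Hypothetical unrooted quartet ((a,b),(c,d)) with internal nodes u and v.
adj = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["c", "d", "u"],
}
print(max_cut_colors(adj, {"a": "X", "b": "X", "c": "Y", "d": "Y"}))  # 1
print(max_cut_colors(adj, {"a": "X", "b": "Y", "c": "X", "d": "Y"}))  # 2
```

Under this definition even a concordant coloring has β ≥ 1 whenever a color has two or more leaves, because each pendant edge separates one leaf from the others of its color.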
The authors develop a recursive dynamic‑programming algorithm that, at each internal node, decides how to allocate the colors that are cut by the incident edge. By restricting attention to the β cut colors, the state space shrinks to O(2^β) per node, yielding an overall runtime of O(n·β·2^β) for a tree with n leaves. To further accelerate the search, they embed a branch‑and‑bound scheme: upper bounds on the size of a sub‑coloring are computed from leaf‑color counts in each subtree, allowing early pruning of branches that cannot improve the current best solution. Benchmarks on real datasets demonstrate orders‑of‑magnitude speed‑ups (often >10×) compared with the earlier FPT algorithms, while always returning the optimal maximal convex sub‑coloring.
The second part of the work critiques the “obvious” taxonomic rooting definition, which simply selects the edge that minimizes the number of discordant colors. This definition is shown to be unstable because it depends heavily on the placement of the root and can be misled by a single highly discordant clade. The authors therefore propose a stronger notion of convexity: a coloring is strongly convex if each color occupies a single rooted subtree, not merely a connected induced subtree. Under this definition, the rooting problem reduces to finding the minimal set of colors whose removal makes the tree strongly convex. The same sub‑coloring algorithm can be applied, producing either a unique taxonomic root or a small set of candidate roots when multiple optimal solutions exist.
All algorithms are implemented in an open‑source software package (available on GitHub) that accepts standard phylogenetic tree formats (Newick) and taxonomic annotation files (e.g., NCBI taxonomy dumps). Users can specify the taxonomic rank of interest, choose between the weak and strong convexity criteria, and enable the branch‑and‑bound optimizer. The software outputs the size of the maximal convex sub‑coloring, the list of leaves that must be excluded to achieve convexity, and the inferred taxonomic root(s) together with visualizations.
In discussion, the authors note that while the paper focuses on leaf‑only colorings, the framework naturally extends to internal node colorings and to weighted recoloring where different taxa have different mis‑labeling costs. They also suggest future work on handling non‑binary trees, incorporating uncertainty in taxonomic assignments, and integrating the method into larger phylogenomic pipelines. Overall, the paper provides a rigorous, computationally efficient solution to two longstanding, informal practices in phylogenetics, bridging the gap between taxonomy and tree inference.