The Complexity of Rooted Phylogeny Problems
Several computational problems in phylogenetic reconstruction can be formulated as restrictions of the following general problem: given a formula in conjunctive normal form where the literals are rooted triples, is there a rooted binary tree that satisfies the formula? If the formulas do not contain disjunctions, the problem becomes the famous rooted triple consistency problem, which can be solved in polynomial time by an algorithm of Aho, Sagiv, Szymanski, and Ullman. If the clauses in the formulas are restricted to disjunctions of negated triples, Ng, Steel, and Wormald showed that the problem remains NP-complete. We systematically study the computational complexity of the problem for all such restrictions of the clauses in the input formula. For certain restricted disjunctions of triples we present an algorithm that has sub-quadratic running time and is asymptotically as fast as the fastest known algorithm for the rooted triple consistency problem. We also show that any restriction of the general rooted phylogeny problem that does not fall into our tractable class is NP-complete, using known results about the complexity of Boolean constraint satisfaction problems. Finally, we present a pebble game argument that shows that the rooted triple consistency problem (and also all generalizations studied in this paper) cannot be solved by Datalog.
Research Summary
The paper investigates a broad class of computational problems that arise in phylogenetic reconstruction, all of which can be expressed as the satisfiability of a conjunctive-normal-form (CNF) formula whose literals are rooted triples of the form “ab | c”. A rooted triple ab | c holds in a rooted binary tree T when the youngest common ancestor (yca) of leaves a and b lies strictly below the yca of a and c. The classic rooted-triple consistency problem, where the input formula contains no disjunctions, is known to be solvable in polynomial time by the algorithm of Aho, Sagiv, Szymanski, and Ullman (ASSU). Earlier work by Ng, Steel, and Wormald showed that the problem is NP-complete if each clause is a disjunction of negated triples.
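The disjunction-free case can be illustrated with a naive version of the recursive partitioning idea usually attributed to Aho, Sagiv, Szymanski, and Ullman. The sketch below is an assumption-laden illustration, not the paper's algorithm: the function name `build`, the encoding of a triple ab | c as the Python tuple `(a, b, c)`, and the nested-tuple tree representation (with string leaf labels) are all chosen here for the example. It joins a and b for every triple ab | c, fails if all leaves end up in one block, and otherwise recurses on each block.

```python
from functools import reduce

def build(leaves, triples):
    """Return a rooted binary tree (nested 2-tuples of leaf labels) in which
    every triple (a, b, c), read as ab|c, holds, or None if none exists.
    Assumes a non-empty leaf set and three distinct leaves per triple."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only triples whose three leaves are all in the current leaf set.
    local = [t for t in triples if set(t) <= leaves]

    # Union-find over the leaves: join a and b for each triple ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in local:
        parent[find(a)] = find(b)

    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), []).append(x)
    if len(blocks) == 1:
        return None  # a and b can never be separated from c: inconsistent

    subtrees = []
    for block in blocks.values():
        sub = build(block, local)
        if sub is None:
            return None
        subtrees.append(sub)
    # Any binary arrangement of the blocks under the root keeps the triples.
    return reduce(lambda left, right: (left, right), subtrees)
```

For example, `build("abc", [("a", "b", "c")])` returns a tree that groups a and b below the root, while adding the conflicting triple `("b", "c", "a")` makes it return `None`. This naive version recomputes the blocks at every level; it does not attempt the faster running times discussed in the summary.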
The authors introduce a unifying framework: for any finite set C of allowed triple‑clauses, the “rooted phylogeny problem for clauses from C” asks whether a given CNF formula whose clauses belong to C is satisfiable. Their goal is to classify the computational complexity of this problem for every possible C.
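To make the general problem concrete, the following minimal sketch checks whether a given candidate tree satisfies a CNF formula over rooted triples. Checking a candidate is the easy direction; finding a satisfying tree is the hard part. The representation is an assumption made for this example: trees are nested 2-tuples with string leaf labels, each variable is identified with a distinct leaf, and a literal is a pair `(positive, (a, b, c))` where `positive=False` denotes a negated triple. The function names are hypothetical.

```python
def leaf_paths(tree, path=()):
    """Map each leaf label to its root-to-leaf path of 0/1 child choices."""
    if not isinstance(tree, tuple):
        return {tree: path}
    left, right = tree
    paths = leaf_paths(left, path + (0,))
    paths.update(leaf_paths(right, path + (1,)))
    return paths

def yca_depth(paths, x, y):
    """Depth of the youngest common ancestor of x and y, computed as the
    length of the common prefix of their root-to-leaf paths."""
    depth = 0
    for p, q in zip(paths[x], paths[y]):
        if p != q:
            break
        depth += 1
    return depth

def triple_holds(paths, a, b, c):
    """ab|c holds when yca(a, b) lies strictly below yca(a, c)."""
    return yca_depth(paths, a, b) > yca_depth(paths, a, c)

def satisfies(tree, formula):
    """formula: list of clauses; each clause is a list of literals
    (positive, (a, b, c)). A clause needs at least one true literal."""
    paths = leaf_paths(tree)
    return all(
        any(triple_holds(paths, *triple) == positive
            for positive, triple in clause)
        for clause in formula
    )
```

With the tree `(("a", "b"), "c")`, the clause "bc | a or ab | c" is satisfied through its second literal, whereas the single-literal clause "not ab | c" is not.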
Key technical contributions
- Complexity dichotomy – Using a reduction based on Lemma 2.10, the authors show that any non-trivial clause (i.e., a clause that is not satisfied by every injective leaf-to-variable mapping) can be transformed into the basic triple ab | c. Consequently, if C contains at least one non-trivial clause, the rooted phylogeny problem for C is at least as hard as the rooted-triple consistency problem. By connecting the problem to a Boolean “split” CSP (where a solution must assign both true and false values), they invoke the known classification of split problems (Creignou-Krokhin-Schaefer 2001). This yields a clean dichotomy: for every C the problem is either in P or NP-complete; there are no intermediate complexities.
- Algorithmic tractability for a large subclass – For those C that lead to polynomial-time solvable split problems, the authors design an explicit algorithm that runs in sub-quadratic time. The algorithm adapts the decremental connectivity techniques of Henzinger, King, and Warnow (1996) to the structure of rooted triples. When each clause contains at most two literals and each literal involves a distinct pair of variables (e.g., xy | z ∨ yz | x), the algorithm solves the instance in O(m · √n) time, where m is the number of clauses and n the number of variables. This matches the best known bound for the pure consistency problem and improves on the naïve O(m · n) approach.
- Datalog inexpressibility – The paper proves that neither the rooted-triple consistency problem nor any of its non-trivial extensions can be solved by a Datalog program. To this end, the authors encode the problem as a primitive-positive sentence over an infinite relational structure Δ whose domain consists of all infinite binary strings. The ternary relation f g | h holds when the longest common prefix of f and g is longer than that of f and h. Using the existential pebble game (a tool from finite-model theory), they show that Δ does not admit a Datalog definition of the required property. Hence, any Datalog program would fail to distinguish satisfiable from unsatisfiable instances, establishing a strong expressive-power limitation.
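The ternary relation on binary strings described in the last item can be illustrated directly. The sketch below is only an approximation under a stated assumption: the structure Δ is defined over infinite strings, whereas this code works on finite prefixes, which is adequate only when the prefixes are long enough to tell the three strings apart. The function names are hypothetical.

```python
def common_prefix_len(f, g):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(f, g):
        if x != y:
            break
        n += 1
    return n

def delta_triple(f, g, h):
    """fg|h: the common prefix of f and g is longer than that of f and h."""
    return common_prefix_len(f, g) > common_prefix_len(f, h)
```

For instance, `delta_triple("0010", "0011", "0100")` is true because the first two strings agree on three symbols while the first and third agree on only one. Reading a binary string as a sequence of left/right choices from the root shows why this relation mirrors the youngest-common-ancestor condition on trees.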
Broader implications
- The dichotomy provides a complete map for researchers designing phylogenetic inference algorithms: by checking whether their clause set C falls into the tractable class, they can immediately know whether a polynomial‑time algorithm exists.
- The sub‑quadratic algorithm demonstrates that, for many practically relevant clause families (e.g., those arising from subtree‑avoidance or forbidden‑triple constraints), large phylogenetic datasets can be processed efficiently.
- The Datalog inexpressibility result connects phylogenetic reasoning to deep results in logic and database theory, indicating that more powerful formalisms (e.g., full first‑order logic with recursion) are necessary for exact inference in these settings.
In summary, Bodirsky and Mueller deliver a comprehensive theoretical treatment of rooted phylogeny problems: a full P/NP‑complete classification for all clause restrictions, an efficient algorithm for the tractable cases, and a proof that Datalog cannot capture even the simplest non‑trivial variants. Their work bridges phylogenetics, constraint satisfaction, and descriptive complexity, offering both practical algorithmic guidance and fundamental insights into the limits of logical inference for tree‑based data.