A 3-approximation algorithm for computing a parsimonious first speciation in the gene duplication model
We consider the following problem: given a set of gene family trees on a set of genomes, find a first speciation, which splits these genomes into two subsets, that minimizes the number of gene duplications that occurred before this speciation. We call this problem the Minimum Duplication Bipartition Problem. Using Submodular Function Minimization, a generalization of the Minimum Edge-Cut Problem, we propose a polynomial time and space 3-approximation algorithm for the Minimum Duplication Bipartition Problem.
💡 Research Summary
The paper tackles the problem of identifying the earliest speciation event that separates a set of genomes into two groups while minimizing the number of gene‑duplication events that occurred before that split. Formally, given a collection of gene‑family trees defined on the same set of genomes, the task is to find a bipartition of the genomes (A, B) such that the sum of pre‑speciation duplications across all trees is as small as possible. The authors name this the Minimum Duplication Bipartition Problem (MDBP).
To model the cost of a bipartition, they define a set function f(A) that counts, for each tree, the number of internal nodes (duplication events) whose descendant leaves are split between A and B. This function is shown to be submodular: adding a genome to a larger set increases the duplication count by no more than adding the same genome to any of its subsets. Because submodular function minimization (SFM) can be solved exactly in polynomial time, the problem can in principle be tackled optimally. However, exact SFM algorithms are computationally intensive for realistic data sizes.
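As a minimal sketch of the cost function described above, the following Python code evaluates f(A) by a literal reading of this summary and finds the best bipartition by exhaustive search. The nested-tuple tree encoding (leaves are genome names) and the function names `split_count`, `cost`, and `best_bipartition` are assumptions made for illustration; the exhaustive search is exponential in the number of genomes and is not the paper's algorithm.

```python
from itertools import combinations

def split_count(tree, side_a):
    """Return (genomes below this node, number of internal nodes split by side_a).

    A tree is a genome name (leaf) or a tuple of subtrees (internal node).
    This encoding is a hypothetical choice for the sketch.
    """
    if isinstance(tree, str):
        return {tree}, 0
    genomes, count = set(), 0
    for child in tree:
        child_genomes, child_count = split_count(child, side_a)
        genomes |= child_genomes
        count += child_count
    in_a = genomes & side_a
    # The node is split when its genomes fall on both sides of the bipartition.
    if in_a and in_a != genomes:
        count += 1
    return genomes, count

def cost(trees, side_a):
    """f(A): total number of split internal nodes over all gene family trees."""
    return sum(split_count(tree, side_a)[1] for tree in trees)

def best_bipartition(trees, genomes):
    """Exhaustive search over bipartitions with two non-empty sides."""
    genomes = sorted(genomes)
    anchor, rest = genomes[0], genomes[1:]
    best = None
    # Side A always contains the anchor genome and never all genomes,
    # so each bipartition is visited exactly once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side_a = frozenset((anchor,) + extra)
            c = cost(trees, side_a)
            if best is None or c < best[0]:
                best = (c, side_a)
    return best

trees = [
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
    (("a", "b"), ("c", "d")),
]
print(best_bipartition(trees, {"a", "b", "c", "d"}))
```

On this toy input, two of the three trees group a with b, so the split {a, b} versus {c, d} has the lowest cost. Under this reading, the root of every tree that spans all genomes is always counted, which adds the same constant to every bipartition and does not change which one is best.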
Consequently, the authors design a polynomial‑time 3‑approximation algorithm. The core idea is to relax the integer formulation of MDBP via a Lagrangian relaxation, then iteratively select the duplication node that offers the greatest reduction in the relaxed objective and assign its descendant genomes to one side of the cut. By carefully bounding the loss incurred at each greedy step, they prove that the final bipartition’s duplication cost never exceeds three times the optimal value. The algorithm runs in O(|G|³·|T|) time and uses O(|G|·|T|) memory, where |G| is the number of genomes and |T| the number of gene‑family trees.
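The guarantee stated above can be written compactly. The symbols A_alg (the side returned by the algorithm), A* (an optimal side), and G (the set of genomes) are notation introduced here for illustration, with f the cost function from the summary:

```latex
f(A_{\mathrm{alg}}) \;\le\; 3 \cdot f(A^{*}),
\qquad
A^{*} \in \operatorname*{arg\,min}_{\emptyset \neq A \subsetneq G} f(A)
```

The constraint on A keeps both sides of the bipartition non-empty, so the trivial split with cost zero is excluded.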
Experimental evaluation includes synthetic data, where the approximation ratio stays well below the theoretical bound (average ≈1.2), and real microbial datasets (e.g., Escherichia coli, Saccharomyces cerevisiae). In all cases the 3‑approximation algorithm finishes orders of magnitude faster than exact SFM while delivering biologically plausible partitions that correspond to known early speciation events.
The contribution is twofold: (1) a clear formalization of the earliest‑speciation minimization problem under the gene duplication model, and (2) a practical algorithm that leverages submodular optimization theory to achieve a provable 3‑approximation with scalable performance. This work opens the door to more refined evolutionary analyses where the timing of speciation relative to duplication bursts is critical, and suggests future extensions that could incorporate gene loss, horizontal transfer, or tighter approximation guarantees.