Motivation: mtDNA distance matrices are standard inputs for distance-based phylogeny, but computing all pairwise alignments is costly. Missing entries can degrade inferred topology and branch lengths, and generic matrix-completion methods may disrupt tree-like (ultrametric) structure.
Results: We propose Hyb-Adam-UM, which starts from an alignment-limited Needleman-Wunsch distance backbone and completes the matrix by minimizing a robust triplet ultrametric-violation functional. An Adam-style finite-difference optimizer updates only missing entries while enforcing symmetry, non-negativity, and a zero diagonal. From one complete reference matrix, we generate 20 masked instances at 30%, 50%, 65%, and 85% missingness. Hyb-Adam-UM consistently reduces ultrametric violations and achieves competitive reconstruction error, with improved topological accuracy and branch-length agreement relative to MW*/NJ* projection baselines (which exactly preserve observed distances) and Soft-Impute; gains are most pronounced at 85% missingness.
Availability and implementation: https://github.com/mitichya/hyb-adam-um/; Zenodo: https://doi.org/10.5281/zenodo.18609748
Supplementary information: Supplementary data available online.
Mitochondrial DNA (mtDNA) is a central data source in evolutionary biology, population genetics, and phylogenetics due to maternal inheritance, limited recombination, and relatively high mutation rates. These properties make mtDNA particularly useful for tracing maternal lineages, estimating divergence times, and reconstructing phylogenetic relationships among closely related species [1]. In many workflows, a pairwise distance matrix derived from mtDNA sequences is not only a summary representation of sequence divergence, but also the direct input to distance-based phylogenetic tree reconstruction procedures such as Neighbor-Joining and UPGMA [2,3]. Therefore, the quality of the distance matrix has immediate consequences for inferred tree topology and branch lengths.
The precision and reliability of distance-based inference depend strongly on the completeness and global consistency of the distance matrix. At the same time, building a complete matrix is computationally demanding: for n taxa, one must compute n(n − 1)/2 pairwise distances. When distances are computed via sequence alignment, each entry may require dynamic programming over the full sequence lengths, as in classical global alignment [4]. In our practical setting (human-length mtDNA of approximately 16,569 base pairs [5]), a Needleman-Wunsch global alignment can take on the order of minutes per pair in a straightforward implementation, which scales into hours or days for matrices of moderate size and into months for larger taxon sets. For example, a 30 × 30 matrix has 435 distinct off-diagonal distances; if each distance required only three minutes, the full matrix would require about 22 hours, that is, roughly one day of wall-clock time on a modern computer. For hundreds of taxa, complete matrix computation becomes infeasible without extensive parallel computing. Since distance-based methods typically assume a fully specified matrix (or behave unpredictably under missing entries), incomplete computation can translate into unstable or biased reconstructed trees.
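The cost estimate above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the three-minute per-pair cost is the figure assumed in the text, not a measured runtime.

```python
def pairwise_count(n):
    """Number of distinct off-diagonal distances for n taxa: n(n - 1)/2."""
    return n * (n - 1) // 2


def full_matrix_hours(n, minutes_per_pair=3.0):
    """Sequential wall-clock hours to align every pair once.

    minutes_per_pair is an assumed per-alignment cost (three minutes,
    as in the example in the text), not a benchmark.
    """
    return pairwise_count(n) * minutes_per_pair / 60.0


if __name__ == "__main__":
    for n in (15, 30, 300):
        print(n, pairwise_count(n), round(full_matrix_hours(n), 2))
```

Because the number of pairs grows quadratically, a tenfold increase in taxa raises the sequential cost by roughly a factor of one hundred under the same per-pair assumption.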
As a consequence, distance matrices used in practice are often incomplete. Incompleteness may arise because only a subset of pairwise alignments is computed (compute-budget constraints), because some sequences are missing or rejected by quality filters, or because certain alignments fail. Unfortunately, missing entries can severely affect downstream phylogenetic inference: even when tree construction remains feasible, naive handling (e.g. pairwise deletion or simplistic imputation) may introduce bias, distort branch lengths, and reduce resolution [6]. Therefore, robust restoration of incomplete distance matrices is an important enabling step for large-scale comparative genomics and reliable reconstruction of phylogenetic trees.
A further complication is that mtDNA distance matrices are not arbitrary numerical arrays. Their entries are symmetric and non-negative, although the triangle inequality may be violated. Nevertheless, such matrices often preserve a strong tree-like, approximately ultrametric signal: in many triplets of taxa, the two largest distances are nearly equal and exceed the third. Generic matrix completion methods (e.g. low-rank completion) can fill missing values effectively in a least-squares sense [7,8], yet may fail to preserve the hierarchical constraints that are important for phylogenetic interpretability.
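The triplet condition can be made concrete with a small check. The sketch below measures, for each triplet of taxa, the gap between the two largest of the three pairwise distances, which is zero for an exactly ultrametric matrix. It is a plain, non-robust illustration with hypothetical function names; it is not the robust functional used by Hyb-Adam-UM, whose exact form is not stated in this section.

```python
from itertools import combinations


def triplet_gap(d_ij, d_ik, d_jk):
    """Gap between the two largest distances of a triplet (0 if ultrametric)."""
    _, middle, largest = sorted((d_ij, d_ik, d_jk))
    return largest - middle


def mean_ultrametric_violation(D):
    """Average triplet gap over all triplets of a square distance matrix D."""
    n = len(D)
    gaps = [triplet_gap(D[i][j], D[i][k], D[j][k])
            for i, j, k in combinations(range(n), 3)]
    return sum(gaps) / len(gaps) if gaps else 0.0


if __name__ == "__main__":
    ultrametric = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
    perturbed = [[0, 2, 5], [2, 0, 6], [5, 6, 0]]
    print(mean_ultrametric_violation(ultrametric))
    print(mean_ultrametric_violation(perturbed))
```

A completion method that only minimizes entrywise error has no term of this kind, which is why it can produce filled-in values that fit the observed entries yet break the hierarchical structure.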
This paper addresses phylogenetic tree reconstruction from mtDNA sequences in settings where only a subset of pairwise distances can be computed. We propose a distance-matrix completion procedure that promotes tree-consistent geometry rather than optimizing only entrywise reconstruction error. Concretely, we introduce a hybrid method that couples the biological signal from alignment with an explicit global-consistency objective:
1. Warm-start via alignment: we use the Needleman-Wunsch distances [4] available under a limited alignment budget as an observed backbone and initialize the remaining missing entries (e.g., by the mean of the observed distances).
2. Ultrametric-aware optimization: we complete the missing entries by minimizing a triplet-based ultrametric-violation functional, using an Adam-style optimizer [9] with finite-difference gradients, while enforcing symmetry, non-negativity, and a zero diagonal.
Beyond matrix-level accuracy, we evaluate the agreement of downstream Neighbor-Joining trees with the reference tree (normalized Robinson-Foulds distance and patristic-distance agreement), showing that ultrametric-aware completion translates into improved tree topology and branch-length consistency under high missingness.
Starting from a single complete 15 × 15 mtDNA reference distance matrix, we generate 20 incomplete instances via symmetric random masking at missingness levels of 30%, 50%, 65%, and 85% (five independent masks per level), and benchmark against three baselines: MW⋆-proj [10], NJ⋆-proj (Neighbor-Joining-based completion) [2], and low-rank matrix completion via Soft-Impute [8].
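The pipeline described above (symmetric masking, mean warm start, Adam-style finite-difference updates restricted to missing entries) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the hyperparameters, the non-robust triplet-gap objective, and the choice to return the best iterate seen are assumptions made here for clarity.

```python
import math
import random
from itertools import combinations


def symmetric_mask(n, missing_frac, seed=0):
    """Choose unordered pairs (i < j) to hide; both (i, j) and (j, i) go missing."""
    pairs = list(combinations(range(n), 2))
    k = round(missing_frac * len(pairs))
    return set(random.Random(seed).sample(pairs, k))


def warm_start(D_ref, missing):
    """Keep observed distances; fill hidden ones with the observed mean."""
    n = len(D_ref)
    observed = [D_ref[i][j] for i, j in combinations(range(n), 2)
                if (i, j) not in missing]
    fill = sum(observed) / len(observed)
    D = [list(row) for row in D_ref]
    for i, j in missing:
        D[i][j] = D[j][i] = fill
    return D


def violation(D):
    """Assumed objective: mean gap between the two largest triplet distances."""
    gaps = []
    for i, j, k in combinations(range(len(D)), 3):
        _, mid, big = sorted((D[i][j], D[i][k], D[j][k]))
        gaps.append(big - mid)
    return sum(gaps) / len(gaps) if gaps else 0.0


def complete(D_init, missing, steps=100, lr=0.05, h=1e-3,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style updates on missing entries with central finite differences."""
    D = [list(row) for row in D_init]
    pairs = sorted(missing)
    m = dict.fromkeys(pairs, 0.0)   # first-moment estimates
    v = dict.fromkeys(pairs, 0.0)   # second-moment estimates
    best, best_val = [list(r) for r in D], violation(D)
    for t in range(1, steps + 1):
        grads = {}
        for i, j in pairs:
            x = D[i][j]
            D[i][j] = D[j][i] = x + h
            up = violation(D)
            D[i][j] = D[j][i] = x - h
            down = violation(D)
            D[i][j] = D[j][i] = x          # restore before the next probe
            grads[(i, j)] = (up - down) / (2 * h)
        for (i, j), g in grads.items():
            m[(i, j)] = beta1 * m[(i, j)] + (1 - beta1) * g
            v[(i, j)] = beta2 * v[(i, j)] + (1 - beta2) * g * g
            m_hat = m[(i, j)] / (1 - beta1 ** t)
            v_hat = v[(i, j)] / (1 - beta2 ** t)
            x = D[i][j] - lr * m_hat / (math.sqrt(v_hat) + eps)
            D[i][j] = D[j][i] = max(x, 0.0)  # symmetry and non-negativity
        val = violation(D)
        if val < best_val:                   # keep the best iterate (assumption)
            best, best_val = [list(r) for r in D], val
    return best
```

Observed entries and the diagonal are never written to, so the backbone is preserved exactly. Each finite-difference probe re-evaluates the objective over all triplets, which is acceptable for a 15 × 15 matrix but would call for restricting the evaluation to triplets containing the perturbed pair on larger inputs.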
Let D ∈ R