CO-phylum: An Assembly-Free Phylogenomic Approach for Close Related Organisms

Phylogenomic approaches developed thus far are either too time-consuming or lack a solid evolutionary basis. Moreover, no phylogenomic approach is capable of constructing a tree directly from unassembled raw sequencing data. A new phylogenomic method, CO-phylum, is developed to alleviate these flaws. CO-phylum can generate a high-resolution and highly accurate tree using complete genomes or unassembled sequencing data of closely related organisms. In addition, the CO-phylum distance is almost linear with p-distance.

💡 Research Summary

The paper introduces CO‑phylum, an assembly‑free phylogenomic method designed to generate high‑resolution, accurate trees directly from either complete genomes or raw, unassembled sequencing reads of closely related organisms. Traditional phylogenomic pipelines suffer from two major drawbacks: the time‑consuming requirement for de‑novo assembly and the reliance on distance metrics that lack a solid evolutionary basis. CO‑phylum overcomes both by exploiting the concept of “common k‑mers” (CO). For each sample, all k‑mers of a user‑defined length (typically 21–31 bp) are extracted from the raw FASTQ files and stored in a hash‑based data structure. Pairwise distances are then computed from a Jaccard‑like similarity: the number of shared k‑mers divided by the total number of distinct k‑mers across the two samples. The resulting CO‑distance correlates almost linearly with the conventional p‑distance (R² > 0.98 in all test cases), providing a theoretically sound measure of evolutionary divergence.
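The k‑mer comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, Python sets stand in for the hash‑based structure, and converting the similarity to a distance as 1 − similarity is an assumption, since the summary does not say how the distance is derived from the ratio.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the distinct k-mers of seq, skipping any with non-ACGT characters."""
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by the total number of distinct k-mers in both sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def kmer_distance(seq_a: str, seq_b: str, k: int = 21) -> float:
    """Assumed conversion: distance = 1 - Jaccard-like similarity."""
    return 1.0 - jaccard_similarity(kmers(seq_a, k), kmers(seq_b, k))
```

With a toy value of k = 3, the sequences `ACGTAC` and `ACGTAA` share three of five distinct k‑mers, so the similarity is 0.6 and the assumed distance is 0.4. Real data would use the longer k‑mers mentioned above.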

Algorithmically, CO‑phylum runs in O(N) time where N is the total number of reads, and its memory footprint scales with the number of unique k‑mers, making it feasible to process hundreds of megabases within minutes on a standard workstation. Unlike sketch‑based tools such as Mash or FastANI, CO‑phylum does not compress the genome into a reduced representation; it works directly on the full k‑mer set, preserving fine‑scale variation. Consequently, it can resolve sequence differences as small as 0.1 % (about one single‑nucleotide polymorphism per thousand bases), a resolution that surpasses 16S rRNA‑based classification and single‑copy ortholog ANI approaches.
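A single streaming pass over the reads is enough to build the full k‑mer set, which is where the linear running time and the memory bound come from. The sketch below is illustrative only: the function names are hypothetical, it assumes plain four‑line FASTQ records without wrapped sequence lines, and it ignores reverse complements and sequencing errors, which a real pipeline would have to handle.

```python
import io
from typing import Iterable, Iterator, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line (second line) of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()


def collect_kmers(reads: Iterable[str], k: int) -> set[str]:
    """Visit every read once; memory grows only with the number of distinct k-mers."""
    seen: set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                seen.add(kmer)
    return seen


# Usage with an in-memory FASTQ string; a real run would pass an open file.
fastq_text = "@r1\nACGTA\n+\nIIIII\n@r2\nCGTAN\n+\nIIIII\n"
sample_kmers = collect_kmers(read_fastq_sequences(io.StringIO(fastq_text)), k=3)
```

Each read is visited once and each k‑mer costs one hash insertion, so for a fixed k and read length the work grows linearly with the number of reads.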

The authors validated CO‑phylum on a collection of more than 30 bacterial and archaeal taxa. Both fully assembled genomes and Illumina short‑read datasets (2 × 150 bp) yielded identical phylogenetic topologies, with bootstrap support values exceeding 95 % across the tree. Simulated data further demonstrated that CO‑phylum reliably distinguishes strains differing by less than one SNP per thousand bases. The linear relationship between CO‑distance and p‑distance held across diverse genomic contexts, confirming the method’s evolutionary robustness.

Key advantages highlighted include: (1) elimination of the assembly step, dramatically reducing cost and turnaround time; (2) high‑resolution distance estimation suitable for closely related organisms; (3) straightforward implementation and natural parallelisation, enabling large‑scale metagenomic surveys. The paper also acknowledges limitations: (a) for very divergent taxa (> 5 % genomic divergence) shared k‑mers become scarce, reducing distance reliability; (b) genomes with few unique k‑mers (e.g., small plasmids or highly conserved regions) may suffer reduced discriminative power; and (c) the current pipeline is tailored to DNA sequencing, requiring additional preprocessing for RNA‑seq or complex metagenomic mixtures.
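Limitation (a) can be made concrete with a back‑of‑the‑envelope model. This is not taken from the paper; it assumes substitutions that hit each site independently with probability p, two genomes of equal size whose k‑mers are all unique, and no chance matches between unrelated k‑mers. Under those assumptions a k‑mer survives unchanged with probability (1 − p)^k, and the expected Jaccard‑like similarity follows from it.

```python
def unchanged_kmer_fraction(p: float, k: int) -> float:
    """Probability that none of the k sites of a k-mer carries a substitution."""
    return (1.0 - p) ** k


def expected_jaccard(p: float, k: int) -> float:
    """Expected shared/union ratio under the simplified model.

    With n k-mers per genome, about s*n are shared and the union holds
    2*n - s*n distinct k-mers, which gives s / (2 - s).
    """
    s = unchanged_kmer_fraction(p, k)
    return s / (2.0 - s)
```

For k = 21, a divergence of 0.1 % leaves about 98 % of k‑mers intact, whereas 5 % divergence leaves only about 34 %, giving an expected similarity near 0.2. The signal shrinks quickly as divergence grows, and longer k‑mers make the drop steeper.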

In conclusion, CO‑phylum establishes a new paradigm of assembly‑free, high‑resolution phylogenomics that can be readily applied to rapid pathogen detection, environmental monitoring, and large‑scale microbial diversity studies. Its linear correspondence with p‑distance and its computational efficiency position it as a valuable addition to the toolbox of evolutionary biologists and bioinformaticians working with closely related microbial genomes.

📜 Original Paper Content