In Silico Genome-Genome Hybridization Values Accurately and Precisely Predict Empirical DNA-DNA Hybridization Values for Classifying Prokaryotes
For nearly 50 years, microbiologists have been determining prokaryotic genome relatedness by means of nucleic acid reassociation kinetics. These methods, however, are technically challenging, difficult to reproduce, and, given the time and resources it takes to generate a single data point, not cost-effective. In the post-genomic era, with the cost of sequencing whole prokaryotic genomes no longer a limiting factor, we believed that computationally predicting the output value of a traditional DNA-DNA hybridization experiment from pairwise comparisons of whole genome sequences would be of value. While other computational whole-genome classification methods exist, they predict values on scales that differ widely from that of DNA-DNA hybridization, introducing yet another metric into the polyphasic approach to defining microbial species. Our goal was to develop an in silico BLAST-based pipeline that would predict, with a high level of certainty, the values obtained from wet-lab DNA-DNA hybridization. Here we report on one such method, which produces estimates that are both accurate and precise with respect to the DNA-DNA hybridization values they are designed to emulate.
💡 Research Summary
For nearly half a century, DNA‑DNA hybridization (DDH) has been the gold standard for assessing genomic relatedness among prokaryotes and for delineating species boundaries. Although long established, the wet‑lab technique is technically challenging, difficult to reproduce, and costly given the time and resources needed to generate a single data point. With whole‑genome sequencing no longer cost‑limiting, the authors set out to develop a computational pipeline that predicts the traditional DDH value directly from pairwise genome comparisons, thereby preserving the familiar 0‑100 % scale used in polyphasic taxonomy.
The core of the method is a BLAST‑based workflow. For any two genomes, the pipeline runs NCBI BLAST+ to identify all high‑scoring segment pairs (HSPs). Only HSPs with at least 70 % nucleotide identity are retained, reflecting the empirical threshold commonly used in DDH experiments to define meaningful hybridization. Each retained HSP’s length is normalized by the total length of the query genome, and this normalized length is multiplied by the percent identity to generate a weighted match score. Summing these scores across all HSPs yields a “genome‑genome hybridization score.”
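The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the BLAST+ hits were written in the default tabular layout (`-outfmt 6`, where column 3 is percent identity and column 4 is alignment length), and the names `HSP`, `parse_blast_tabular`, and `hybridization_score` are hypothetical. It also does not handle overlapping HSPs, which a real pipeline would likely need to address.

```python
from typing import Iterable, NamedTuple


class HSP(NamedTuple):
    """One high-scoring segment pair (illustrative field names)."""
    percent_identity: float  # 0-100, BLAST 'pident' column
    alignment_length: int    # aligned positions, BLAST 'length' column


def parse_blast_tabular(lines: Iterable[str]) -> list[HSP]:
    """Parse BLAST+ default tabular output (-outfmt 6), skipping comments."""
    hsps = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Default column order: qseqid sseqid pident length ...
        hsps.append(HSP(float(fields[2]), int(fields[3])))
    return hsps


def hybridization_score(hsps: Iterable[HSP],
                        query_genome_length: int,
                        min_identity: float = 70.0) -> float:
    """Sum of (HSP length / query genome length) * percent identity
    over HSPs at or above the identity threshold."""
    if query_genome_length <= 0:
        raise ValueError("query_genome_length must be positive")
    return sum(
        h.alignment_length / query_genome_length * h.percent_identity
        for h in hsps
        if h.percent_identity >= min_identity
    )
```

With a hypothetical 1,000 bp query and three HSPs of 500 bp at 90 %, 300 bp at 60 %, and 250 bp at 80 % identity, the 60 % hit is discarded and the score is 0.5 × 90 + 0.25 × 80 = 65.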
To translate this raw score into a DDH‑equivalent value, the authors performed linear regression against a curated set of experimentally measured DDH values spanning a broad taxonomic range (approximately two hundred strain pairs from diverse genera). The resulting conversion equation demonstrated an exceptionally high Pearson correlation (r ≈ 0.98) and a mean absolute error below 2 percentage points. When the 70 % DDH cutoff for species delineation was applied, the in‑silico predictions correctly classified 96 % of the strain pairs, outperforming average nucleotide identity (ANI) and digital DDH (dDDH) in both sensitivity and specificity.
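The calibration and classification steps can be illustrated with an ordinary least-squares fit followed by a threshold test. The sketch below is an assumption-laden example rather than the published conversion equation: the function names are hypothetical, the clamping of predictions to the 0-100 range is an added convenience, and any calibration pairs passed in would have to come from real paired measurements (the summary reports none of the fitted coefficients).

```python
from typing import Sequence


def fit_line(x: Sequence[float], y: Sequence[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_ddh(score: float, slope: float, intercept: float) -> float:
    """Convert a raw genome-genome score to a DDH-scale estimate.
    Clamping to [0, 100] is an assumption made for this sketch."""
    return min(100.0, max(0.0, slope * score + intercept))


def same_species(predicted_ddh: float, cutoff: float = 70.0) -> bool:
    """Apply the conventional 70 % DDH species cutoff."""
    return predicted_ddh >= cutoff


def mean_absolute_error(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)
```

In use, `fit_line` would be called once on the calibration set of (raw score, measured DDH) pairs, and the resulting slope and intercept reused by `predict_ddh` for every new genome pair before `same_species` applies the cutoff.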
Performance testing showed that a typical pairwise comparison required roughly fifteen minutes on a standard desktop workstation, making the approach feasible for routine use without high‑performance computing resources. The authors also compared their method to existing whole‑genome similarity metrics, emphasizing that unlike ANI or dDDH, their pipeline directly mimics the traditional DDH scale, allowing seamless integration into existing taxonomic frameworks.
The discussion acknowledges potential limitations. Highly rearranged genomes or those with extensive plasmid content may yield fragmented BLAST matches, possibly underestimating similarity. Moreover, the pipeline assumes high‑quality, near‑complete genome assemblies; poor assembly could introduce bias. The authors suggest future enhancements such as incorporating more sophisticated alignment tools (e.g., minimap2 or MUMmer) and developing strategies to handle draft genomes and metagenome‑assembled genomes (MAGs).
In conclusion, the study presents a robust, accurate, and cost‑effective in‑silico DDH predictor that reproduces the classic DDH scale while dramatically reducing labor and expense. By bridging the gap between legacy wet‑lab methods and modern genomic data, this tool promises to streamline microbial species delineation, facilitate large‑scale taxonomic surveys, and accelerate the integration of genomic information into the polyphasic taxonomy of prokaryotes. The authors envision further automation and deployment on public servers, enabling the global microbiology community to adopt the method without specialized bioinformatics expertise.