More Mouldy Data: Another mycoplasma gene jumps the silicon barrier into the human genome
The human genome sequence database contains DNA sequences very like those of mycoplasma moulds. It appears such moulds not only infect molecular biology laboratories but have also been picked up by experimenters from contaminated samples and inserted into GenBank as if they were human. At least one mouldy EST (Expressed Sequence Tag) has transferred from public databases to commercial tools (Affymetrix HG-U133 plus 2.0 microarrays). We report a second example (DA466599) and suggest there is a need to clean up genomic databases, but fear current tools will be inadequate to catch genes which have jumped the silicon barrier.
💡 Research Summary
The paper reports a second instance of mycoplasma-derived DNA contaminating human genomic databases, highlighting the persistent problem of “silicon‑barrier” gene jumps. The first documented case involved an expressed sequence tag (EST) that was mistakenly incorporated into the Affymetrix HG‑U133 plus 2.0 microarray platform. In the current study, the authors identify GenBank accession DA466599, originally submitted as a human cDNA sequence, as being virtually identical (≥99% similarity) to Mycoplasma fermentans. By re‑examining the original publication and performing BLAST searches, they demonstrate that the sequence likely originated from a mycoplasma‑contaminated sample rather than genuine human tissue.
The authors argue that laboratory contamination detection methods are often insufficiently sensitive or inconsistently applied, allowing mycoplasma‑laden material to pass through preprocessing steps. Moreover, existing automated annotation pipelines rely heavily on the “human” label without robust source verification, leading to systematic misannotation of non‑human sequences as human genes. The downstream impact is significant: once such a contaminant enters a public repository, it can be propagated into commercial tools, as shown by the inclusion of the first EST in an Affymetrix array, and now the DA466599 sequence in probe design pipelines. Consequently, downstream expression studies, biomarker discovery, and even clinical diagnostics may be compromised by false‑positive signals derived from bacterial DNA.
To mitigate this risk, the authors propose a multi‑tiered validation framework. First, they recommend routine mycoplasma‑specific PCR or quantitative PCR screening before any sequencing library construction. Second, they suggest integrating an automated BLAST‑based “contamination screening” module into the submission workflow of major databases such as GenBank and RefSeq. This module would flag sequences with high similarity to known bacterial genomes and alert submitters. Third, they advocate for periodic community‑driven audits of database entries, coupled with a transparent feedback mechanism that allows researchers to report suspect sequences for rapid curation.
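The BLAST-based screening step described above could be sketched roughly as follows. This is a minimal illustration, not the authors' tool: it assumes the submitted sequences have already been searched with BLAST against a database containing only bacterial genomes, and that the results are in the standard 12-column tabular format (`-outfmt 6`). The function names, the thresholds (99% identity, echoing the figure quoted for DA466599, and an arbitrary minimum alignment length of 100), and the sample rows are all hypothetical.

```python
from dataclasses import dataclass

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


@dataclass
class Hit:
    query: str      # submitted sequence identifier
    subject: str    # bacterial reference it aligned to
    pident: float   # percent identity of the alignment
    length: int     # alignment length in bases


def parse_tabular(text):
    """Parse BLAST tabular text into Hit records, skipping comments and short rows."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < len(FIELDS):
            continue
        row = dict(zip(FIELDS, cols))
        hits.append(Hit(row["qseqid"], row["sseqid"],
                        float(row["pident"]), int(row["length"])))
    return hits


def flag_suspects(hits, min_identity=99.0, min_length=100):
    """Return {query: best hit} for queries that closely match a bacterial reference.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    suspects = {}
    for h in hits:
        if h.pident >= min_identity and h.length >= min_length:
            best = suspects.get(h.query)
            if best is None or h.pident > best.pident:
                suspects[h.query] = h
    return suspects


# Made-up rows for illustration only; identifiers and values are hypothetical.
SAMPLE = "\n".join("\t".join(r) for r in [
    ("est_A", "bact_ref_1", "99.60", "500", "2", "0", "1", "500", "1200", "1699", "0.0", "900"),
    ("est_B", "bact_ref_2", "82.10", "120", "21", "1", "5", "124", "300", "419", "1e-20", "95"),
    ("est_C", "bact_ref_1", "99.90", "40", "0", "0", "1", "40", "10", "49", "1e-10", "70"),
])

suspects = flag_suspects(parse_tabular(SAMPLE))
for query, hit in suspects.items():
    print(f"{query}: {hit.pident:.1f}% identity over {hit.length} bases to {hit.subject}")
```

In this sketch only `est_A` is flagged: `est_B` falls below the identity threshold and `est_C` aligns over too short a region. A real submission filter would also need to handle genuine human sequences that resemble bacterial genes, which is why the output is better treated as an alert for the submitter than as an automatic rejection.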
The paper also critiques the limitations of current automated tools, noting that machine‑learning classifiers trained on existing data may fail to detect novel or low‑abundance contaminants. Therefore, continuous updating of reference bacterial genome collections and the incorporation of more sophisticated pattern‑recognition algorithms are essential. The authors emphasize that a collaborative effort among laboratory personnel, database curators, and commercial platform developers is required to maintain the integrity of genomic resources.
In conclusion, the presence of mycoplasma sequences in human genomic databases is not an isolated incident but a systemic vulnerability that threatens the reliability of large‑scale genomics research. Without improved laboratory practices, stricter submission standards, and more robust computational filters, “silicon‑barrier” jumps will continue to undermine data quality, potentially leading to erroneous biological interpretations and flawed clinical applications. The study serves as a call to action for the community to prioritize data cleanliness as a foundational aspect of modern genomics.