English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19

English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities. A key challenge for NLP is the extensive variation in terminology used to describe medical entities, which was especially pronounced for this newly emergent disease. Here we present an NLP toolbox comprising very large English dictionaries of synonyms for SARS-CoV-2 (including variant names) and COVID-19, which can be used with dictionary-based NLP tools. We also present a silver standard corpus generated with the dictionaries, and a gold standard corpus, consisting of PubMed abstracts manually annotated for disease, virus, symptom, protein/gene, cell type, chemical and species terms, which can be used to train and evaluate COVID-19-related NLP tools. Code for annotation, which can be used to expand the silver standard corpus or for text mining is also included. This toolbox is freely available on GitHub (on https://github.com/Aitslab/corona) and zenodo (https://doi.org/10.5281/zenodo.6642275). The toolbox can be used for a variety of text analytics tasks related to the COVID-19 crisis and has already been used to create a COVID-19 knowledge graph, study the variability and evolution of COVID-19-related terminology and develop and benchmark text mining tools.


💡 Research Summary

The paper presents a comprehensive NLP toolbox designed to support biomedical text mining related to SARS‑CoV‑2 and COVID‑19. The core components are four large English synonym dictionaries (virus, disease, variant, and mutation) and two annotated corpora—a silver‑standard corpus automatically generated from the CORD‑19 dataset and a manually curated gold‑standard corpus of PubMed abstracts.

Dictionary construction involved exhaustive harvesting of terms from public resources such as NCBI Taxonomy, Wikidata, MeSH, Disease Ontology, WHO, GISAID, Nextstrain, and Pango. Manual curation was combined with computational augmentation to generate variations (e.g., adding prefixes like “2019”, swapping “corona virus”/“coronavirus”, handling hyphenation, plural‑to‑singular conversion). Ambiguous abbreviations and nonsensical entries were filtered out. The final dictionaries contain 867 virus terms, 89 938 disease terms, 1 133 377 variant terms (after removing entries shorter than three characters), and 113 mutation terms. Scripts for reproducing and updating the dictionaries are provided.

For the silver corpus, the authors downloaded the final CORD‑19 release (June 2 2022) and extracted 764 398 abstracts. Using spaCy for sentence splitting and a customized EasyNER pipeline (which removes hyphens to improve matching), each abstract was annotated with the four dictionaries. The resulting Lund‑Annotated‑CORD‑19 corpus, stored in JSON, contains entity annotations for virus, disease, variants, and mutations. Due to CORD‑19 licensing, only 197 905 abstracts are publicly released, but full reconstruction instructions and code are supplied.

The gold corpus was built by selecting ten PubMed abstracts (published Dec 2019–Mar 2020) and manually annotating them with BioQRator for ten fine‑grained concepts: Virus_SARS‑CoV‑2, Virus_other, Virus_family, Cell, Protein/Gene, Disease_COVID‑19, Disease_other, Symptom, Species_human, Species_other. Abbreviations were annotated separately unless nested within a longer expression. The gold set is distributed in CSV, BioC XML, and BioC JSON formats.

The authors demonstrate the toolbox’s utility through several applications: (1) generating a COVID‑19 knowledge graph, (2) analyzing the evolution and variability of terminology across the pandemic literature, (3) providing training data for large language models and NER systems, and (4) enabling co‑mention and epidemiological analyses. Frequency analysis of the dictionaries on the CORD‑19 abstracts revealed over 180 distinct virus terms and more than 1 168 839 mentions of “COVID‑19” variants, underscoring the high lexical variability in pandemic texts.

In conclusion, this open‑source toolbox offers a scalable solution for handling the terminological heterogeneity inherent in COVID‑19 literature. By supplying both extensive synonym dictionaries and annotated corpora, along with reproducible scripts, the work facilitates rapid development, benchmarking, and adaptation of NLP tools for current and future biomedical crises. The modular design allows researchers to update variant and mutation lists as new strains emerge, ensuring the resource remains relevant as the pandemic evolves.


Comments & Academic Discussion

Loading comments...

Leave a Comment