English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19


Abstract

Background: Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceeds human processing capabilities.

Results: Here we present an NLP toolbox comprising COVID-19-related dictionaries and annotated corpora in English, together with code and workflows for updating and using them. Separate dictionaries contain terms referring to the COVID-19 disease, the SARS-CoV-2 virus, its variants, and common mutations. They were used together with the EasyNER NLP tool to annotate all 764 398 abstracts in the CORD-19 dataset, creating a very large silver standard corpus (named the Lund-Annotated-CORD-19 corpus). This was complemented with a small gold standard corpus consisting of PubMed abstracts manually annotated for key entity classes such as disease, virus, symptom, protein/gene, cell type, chemical and species terms. The toolbox can support various text analysis tasks related to COVID-19, such as named entity recognition and co-mention analysis. A preliminary version of the toolbox, released early in the pandemic, has already been used, for example, to create a COVID-19 knowledge graph and to study the evolution and variation of COVID-19-related terminology. In addition, the toolbox can be applied in the development of other NLP tools, for example to train and evaluate large language models. Analysis of the Lund-Annotated-CORD-19 corpus, which represents a large section of the coronavirus-related literature published until 2022, can provide both linguistic and medical insights. We observed matches for hundreds of SARS-CoV-2 and COVID-19 synonyms, indicating a high degree of term variability, which has also been reported for other datasets. Terms referring to the disease were by far the most frequent, followed by terms referring to the virus. We also found thousands of mentions of variants and mutations. However, most of these referred to a small group of highly studied variants and mutations, reflecting research biases and revealing understudied aspects of the virus.

Conclusions: The presented toolbox has a broad variety of NLP applications related to COVID-19. It is freely available on GitHub (https://github.com/Aitslab/Covid19) and Zenodo (https://doi.org/10.5281/zenodo.15395348).
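The dictionary-based annotation and co-mention analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the EasyNER implementation and does not use the toolbox's file formats: the four tiny dictionaries, the function names `build_pattern`, `annotate` and `co_mentions`, and the matching rules (case-insensitive, whole-term matches only) are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-dictionaries for illustration only; the dictionaries in
# the toolbox contain far more terms per entity class.
DICTIONARIES = {
    "disease": ["COVID-19", "coronavirus disease 2019"],
    "virus": ["SARS-CoV-2", "2019-nCoV"],
    "variant": ["Omicron", "B.1.1.7"],
    "mutation": ["D614G", "N501Y"],
}


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any dictionary term."""
    # Longest terms first, so a multi-word term wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # The lookarounds stop matches inside longer tokens (e.g. "XD614G").
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


PATTERNS = {label: build_pattern(terms) for label, terms in DICTIONARIES.items()}


def annotate(text):
    """Return (start, end, matched_text, entity_class) tuples sorted by offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(0), label))
    return sorted(spans)


def co_mentions(abstracts):
    """Count how many abstracts mention each pair of entity classes together."""
    counts = Counter()
    for text in abstracts:
        labels = sorted({label for _, _, _, label in annotate(text)})
        counts.update(combinations(labels, 2))
    return counts


if __name__ == "__main__":
    abstracts = [
        "The D614G mutation of SARS-CoV-2 was common in COVID-19 patients.",
        "Omicron spread quickly; covid-19 cases rose.",
    ]
    for abstract in abstracts:
        print(annotate(abstract))
    print(co_mentions(abstracts))
```

Counting each class pair once per abstract, rather than once per mention, keeps a single abstract that repeats a term many times from dominating the co-mention counts; a per-sentence or per-mention count would be an equally reasonable choice for other analyses.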
