English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19

Salma Kazemi Rashed
Rafsan Ahmed
Johan Frid
Sonja Aits

Abstract

Background: Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceeds human processing capabilities.

Results: Here we present an NLP toolbox comprising COVID-19-related dictionaries and annotated corpora in English, together with code and workflows for updating and using them. Separate dictionaries contain terms referring to the COVID-19 disease, the SARS-CoV-2 virus, its variants, and its common mutations. They were used together with the EasyNER NLP tool to annotate all 764 398 abstracts in the CORD-19 dataset, creating a very large silver standard corpus (named the Lund-Annotated-CORD-19 corpus). This was complemented with a small gold standard corpus consisting of PubMed abstracts manually annotated for key entity classes such as disease, virus, symptom, protein/gene, cell type, chemical and species terms. The toolbox can support various text analysis tasks related to COVID-19, such as named entity recognition and co-mention analysis. A preliminary version of the toolbox, released early in the pandemic, has already been used, for example, to create a COVID-19 knowledge graph and to study the evolution and variation of COVID-19-related terminology. In addition, the toolbox can be applied in the development of other NLP tools, for example to train and evaluate large language models. Analysis of the Lund-Annotated-CORD-19 corpus, which represents a large section of the coronavirus-related literature published until 2022, can provide both linguistic and medical insights. We observed matches for hundreds of SARS-CoV-2 and COVID-19 synonyms, indicating a high degree of term variability, which has also been reported for other datasets. Terms referring to the disease were by far the most frequent, followed by terms referring to the virus. We also found thousands of mentions of variants and mutations. However, most of these referred to a small group of highly studied variants and mutations, reflecting research biases and revealing understudied aspects of the virus.

Conclusions: The presented toolbox has a broad variety of NLP applications related to COVID-19. It is freely available on GitHub (https://github.com/Aitslab/Covid19) and Zenodo (https://doi.org/10.5281/zenodo.15395348).
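To make the idea of dictionary-based annotation and co-mention analysis concrete, the following is a minimal Python sketch. It is not the EasyNER API and not the authors' workflow: the four-entry dictionaries, the label names, and the functions `build_pattern`, `annotate` and `co_mentions` are all hypothetical stand-ins chosen for illustration, and the released dictionaries are far larger.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-dictionaries for illustration only (assumption);
# the real toolbox ships much larger term lists per entity class.
DICTIONARIES = {
    "disease": ["COVID-19", "coronavirus disease 2019"],
    "virus": ["SARS-CoV-2", "2019-nCoV"],
    "variant": ["Omicron", "B.1.1.7"],
    "mutation": ["D614G", "N501Y"],
}


def build_pattern(terms):
    """Compile one case-insensitive regex matching any term in the list."""
    # Longest terms first, so a long term wins over a shorter one
    # that starts at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds instead of \b, so terms containing hyphens or dots
    # are only matched when not embedded in a longer token.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


PATTERNS = {label: build_pattern(terms) for label, terms in DICTIONARIES.items()}


def annotate(text):
    """Return sorted (start, end, label, matched_text) tuples for one text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group(0)))
    return sorted(spans)


def co_mentions(texts):
    """Count in how many texts each pair of entity classes occurs together."""
    counts = Counter()
    for text in texts:
        labels = sorted({label for _, _, label, _ in annotate(text)})
        counts.update(combinations(labels, 2))
    return counts


if __name__ == "__main__":
    abstract = "The D614G mutation of SARS-CoV-2 was common in early COVID-19 cases."
    for span in annotate(abstract):
        print(span)
    print(co_mentions([abstract]))
```

Running the same matching over every abstract in a collection yields machine-generated (silver standard) annotations, whereas a gold standard corpus requires manual annotation; per-class term frequencies and co-mention counts can then be derived from the resulting spans.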

Version published to 10.21203/rs.3.rs-7481477/v1 on Research Square
Feb 23, 2026
