Multi-label topic classification for COVID-19 literature annotation using an ensemble model based on PubMedBERT

This article has been reviewed by the following groups

Abstract

BioCreative VII Track 5 called on participants to tackle a multi-label classification task for automated topic annotation of COVID-19 literature. In our participation, we evaluated several deep learning models built on PubMedBERT, a pre-trained language model, using different strategies to address the challenges of the task. Specifically, multi-instance learning was used to handle the large variation in article lengths, and a focal loss function was used to address the imbalance in the distribution of topics. We found that an ensemble model performed best among all the models we tested. Test results of our submissions showed that our approach achieved satisfactory performance, with an F1 score of 0.9247, significantly better than both the baseline model (F1 score: 0.8678) and the mean of all submissions (F1 score: 0.8931).
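The abstract names two concrete strategies: multi-instance learning for long articles and a focal loss for topic imbalance. Below is a minimal sketch of how these pieces could fit together on top of PubMedBERT. It is not the authors' released code; the checkpoint name, the max-pooling aggregation, and the gamma/alpha values are illustrative assumptions.

```python
# Sketch (assumptions noted inline): a PubMedBERT encoder with multi-instance
# pooling over article chunks, trained with a focal variant of binary
# cross-entropy for multi-label topic classification.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint

class MultiInstanceTopicClassifier(nn.Module):
    def __init__(self, num_topics: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(CHECKPOINT)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_topics)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_instances, seq_len) -- one long article split into chunks
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # [CLS] embedding per instance
        logits = self.head(cls)                 # (num_instances, num_topics)
        # Multi-instance aggregation: a topic fires if any chunk supports it.
        return logits.max(dim=0).values         # (num_topics,)

def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal binary cross-entropy: down-weights well-classified labels so
    training focuses on rare, hard topics. gamma/alpha are common defaults,
    not values reported in the paper."""
    bce = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    p_t = torch.exp(-bce)                       # probability assigned to the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```

A usage example under the same assumptions (seven topics matches the LitCovid label set used by Track 5):

```python
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
chunks = tokenizer(["first chunk of an article ...", "second chunk ..."],
                   padding=True, truncation=True, max_length=512, return_tensors="pt")
model = MultiInstanceTopicClassifier(num_topics=7)
logits = model(chunks["input_ids"], chunks["attention_mask"])
loss = focal_bce_loss(logits, torch.tensor([1., 0., 0., 1., 0., 0., 0.]))
```

The ensemble the abstract credits with the best result could then combine several such models, e.g. by averaging their sigmoid outputs; the paper does not specify the combination rule here.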

Article activity feed

  1. SciScore for 10.1101/2021.10.26.465946:

    Please note that not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to this paper type.

    Table 2: Resources

    Software and Algorithms

    Sentence: PubMedBERT was pre-trained from scratch on a corpus built from PubMed articles, and it consistently outperformed all other BERT models on most biomedical natural language processing tasks (5).
    Resource: PubMed (suggested: PubMed, RRID:SCR_004846)
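As a hedged aside on the sentence quoted above: because PubMedBERT's WordPiece vocabulary was learned from PubMed text rather than inherited from general-domain BERT, biomedical terms tend to survive as whole tokens instead of being fragmented. A quick way to see this (the checkpoint names are assumptions for illustration):

```python
# Compare how an in-domain and a general-domain vocabulary split biomedical
# terms; checkpoint names are assumed, not taken from the paper.
from transformers import AutoTokenizer

pubmed_tok = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
general_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for term in ["acetyltransferase", "thrombocytopenia"]:
    print(term, "->", pubmed_tok.tokenize(term), "vs", general_tok.tokenize(term))
```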

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers) and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinnings of the rigor criteria and the tools shown here, including references cited, please follow this link.