Classifying COVID-19 variants based on genetic sequences using deep learning models

Abstract

The COrona VIrus Disease (COVID-19) pandemic has seen several variants emerge over time, which has made understanding COVID-19 sequence data increasingly important. In this chapter, we propose an alignment-free, k-mer-based LSTM (Long Short-Term Memory) deep learning model that can classify 20 different variants of COVID-19. We handle class imbalance by sampling a fixed number of sequences for each class label. We handle the vanishing gradient problem that long sequences cause in LSTMs by dividing each sequence into fixed-length segments and obtaining results on individual runs. Compared with the multi-class classifier model, the one-vs-all classifiers reach test accuracies as high as 92.5% with tuned hyperparameters. Our experiments show higher overall accuracies for B.1.1.214, B.1.177.21, B.1.1.7, B.1.526, and P.1 on the one-vs-all classifiers, suggesting the presence of distinct mutations in these variants. Our results also show that changing the embedding vector size or the batch size yields no significant improvement in accuracy, whereas changing from 2-mers to 3-mers improves accuracy in most cases. Finally, we studied the individual runs and found that most accuracies improved after the 20th run, indicating that these sequence positions may contribute more to distinguishing among COVID-19 variants.
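The k-mer tokenization and fixed-length splitting described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the overlapping stride-1 window, the A/C/G/T alphabet, and the choice to map k-mers containing other characters (such as N) to 0 are all assumptions made for illustration.

```python
from itertools import product


def kmer_tokens(seq, k=3):
    """Slide a window of width k over the sequence, one base at a time.

    Assumption: overlapping k-mers with stride 1; the abstract does not
    say whether the windows overlap.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer to a positive integer (0 is kept free)."""
    return {"".join(p): i + 1 for i, p in enumerate(product(alphabet, repeat=k))}


def encode(seq, vocab, k=3):
    """Integer-encode a sequence; k-mers outside the vocabulary become 0."""
    return [vocab.get(tok, 0) for tok in kmer_tokens(seq, k)]


def split_into_runs(ids, run_length):
    """Cut one long encoded sequence into consecutive fixed-length segments.

    The last segment may be shorter and would need padding before training.
    """
    return [ids[i:i + run_length] for i in range(0, len(ids), run_length)]


vocab = build_vocab(k=3)
ids = encode("ACGTAC", vocab, k=3)
runs = split_into_runs(ids, run_length=3)
print(ids)   # [7, 28, 45, 50]
print(runs)  # [[7, 28, 45], [50]]
```

Each segment can then be fed to a separate model run, so that a single LSTM never has to propagate gradients across the full genome length. Moving from 2-mers to 3-mers grows the vocabulary from 16 to 64 tokens under this alphabet.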

SciScore for 10.1101/2021.06.29.450335:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
The sampled sequences were chosen randomly from each of the lineages using random.choice from numpy [56].	numpy suggested: (NumPy, RRID:SCR_008633)
We used integer encoding and padding prior to feeding in the sequences into our LSTM framework, all coded using keras [57] with a Tensorflow [58] backend in Python3.	Python3 suggested: None
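The two steps quoted in the table, random sampling per lineage and padding of integer-encoded sequences, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the use of NumPy's newer `Generator.choice` in place of `numpy.random.choice`, sampling without replacement, and zero post-padding are all choices made here, not details given in the source.

```python
import numpy as np


def sample_per_lineage(labels, n_per_class, seed=0):
    """Return indices of n_per_class randomly chosen sequences per lineage.

    Assumption: sampling without replacement, so every lineage must have
    at least n_per_class sequences available.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for lineage in np.unique(labels):
        idx = np.flatnonzero(labels == lineage)
        chosen.extend(rng.choice(idx, size=n_per_class, replace=False))
    return np.array(chosen)


def pad_to_length(encoded, max_len, pad_value=0):
    """Right-pad (or truncate) integer-encoded sequences to max_len.

    Assumption: padding at the end with 0; the source does not state
    which side was padded.
    """
    out = np.full((len(encoded), max_len), pad_value, dtype=np.int64)
    for row, ids in enumerate(encoded):
        ids = ids[:max_len]
        out[row, :len(ids)] = ids
    return out


# Hypothetical lineage labels with unequal class sizes.
labels = ["B.1.1.7"] * 5 + ["P.1"] * 3 + ["B.1.526"] * 10
balanced_idx = sample_per_lineage(labels, n_per_class=3)
print(len(balanced_idx))  # 9

padded = pad_to_length([[1, 2, 3], [4]], max_len=2)
print(padded.tolist())  # [[1, 2], [4, 0]]
```

Drawing the same number of sequences from every lineage gives each class equal weight during training, at the cost of discarding data from the larger lineages.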

Results from OddPub: Thank you for sharing your code.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
