A multi-class gene classifier for SARS-CoV-2 variants based on convolutional neural network

Abstract

Surveillance of circulating variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is of great importance in controlling the coronavirus disease 2019 (COVID-19) pandemic. We propose an alignment-free in silico approach for classifying SARS-CoV-2 variants based on their genomic sequences. A deep learning model was constructed utilizing a stacked 1-D convolutional neural network and multilayer perceptron (MLP). The pre-processed genomic sequencing data of the four SARS-CoV-2 variants were first fed to three stacked convolution-pooling nets to extract local linkage patterns in the sequences. Then a 2-layer MLP was used to compute the correlations between the input and output. Finally, a logistic regression model transformed the output and returned the probability values. Learning curves and stratified 10-fold cross-validation showed that the proposed classifier enables robust variant classification. External validation of the classifier showed an accuracy of 0.9962, precision of 0.9963, recall of 0.9963 and F1 score of 0.9962, outperforming other machine learning methods, including logistic regression, K-nearest neighbor, support vector machine, and random forest. By comparing our model with an MLP model without the convolution-pooling network, we demonstrate the essential role of convolution in extracting viral variant features. Thus, our results indicate that the proposed convolution-based multi-class gene classifier is efficient for the variant classification of SARS-CoV-2.
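The pipeline described in the abstract (three convolution-pooling stages, a 2-layer MLP, and a final layer that returns class probabilities) can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation, which the report below says was built with TensorFlow. The one-hot encoding, kernel width, filter counts, hidden sizes, pooling size, random weights, and the reading of the "logistic regression model" as a softmax output over four classes are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"  # assumption: one-hot encoding over the four bases


def one_hot(seq):
    """Encode a nucleotide string as rows of 4 values; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq.upper()]


def conv1d_relu(x, kernels, biases):
    """Valid 1-D convolution along the sequence, followed by ReLU.

    x: [length][in_channels]; kernels: [out_channels][width][in_channels].
    """
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for offset in range(width):
                for c, weight in enumerate(kernel[offset]):
                    total += weight * x[start + offset][c]
            row.append(max(0.0, total))
        out.append(row)
    return out


def max_pool1d(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    channels = len(x[0])
    return [
        [max(x[i + k][c] for k in range(size)) for c in range(channels)]
        for i in range(0, len(x) - size + 1, size)
    ]


def dense(vec, weights, biases, relu=True):
    """Fully connected layer; ReLU is applied unless relu=False."""
    out = [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def softmax(logits):
    """Numerically stable softmax (multinomial logistic output)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, *shape):
    """Nested lists of small random weights with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [random_matrix(rng, *shape[1:]) for _ in range(shape[0])]


def init_params(rng, seq_len, filters=(4, 4, 4), width=3, hidden=(8, 8), classes=4):
    """Untrained, randomly initialised weights; all sizes are illustrative assumptions."""
    conv, in_ch, length = [], len(NUCLEOTIDES), seq_len
    for out_ch in filters:
        conv.append((random_matrix(rng, out_ch, width, in_ch), [0.0] * out_ch))
        length = (length - width + 1) // 2  # valid convolution, then pooling of size 2
        in_ch = out_ch
    mlp, size = [], length * in_ch
    for h in hidden:
        mlp.append((random_matrix(rng, h, size), [0.0] * h))
        size = h
    return {"conv": conv, "mlp": mlp, "out": (random_matrix(rng, classes, size), [0.0] * classes)}


def predict(seq, params):
    """Forward pass: three conv-pool stages, two dense layers, softmax over variant classes."""
    x = one_hot(seq)
    for kernels, biases in params["conv"]:
        x = max_pool1d(conv1d_relu(x, kernels, biases), size=2)
    flat = [v for row in x for v in row]
    for weights, biases in params["mlp"]:
        flat = dense(flat, weights, biases)
    w_out, b_out = params["out"]
    return softmax(dense(flat, w_out, b_out, relu=False))


rng = random.Random(0)
sequence = "".join(rng.choice(NUCLEOTIDES) for _ in range(40))  # toy stand-in for a genome
params = init_params(rng, len(sequence))
probs = predict(sequence, params)  # one probability per variant class
```

Because the weights are random and untrained, the probabilities carry no biological meaning; the sketch only shows how data flows through the stages. Dropping the loop over `params["conv"]` and feeding the flattened one-hot input straight to the dense layers would correspond to the MLP-only baseline that the abstract compares against.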

Article activity feed

  1. SciScore for 10.1101/2021.11.22.469492:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    2.1 Dataset: The SARS-CoV-2 genomic sequences were collected from the NCBI SARS-CoV-2 Resources (SARS-CoV-2-Sequences, RRID: SCR_018319).
    The SARS-CoV-2
    detected: SARS-CoV-2-Sequences (RRID:SCR_018319)
    2.7 Implementation and evaluation: The experiments were carried out on Python 3.8 on an Intel® Core™ i7, 2.00 GHz with 8 GB RAM.
    Python
    suggested: (IPython, RRID:SCR_001658)
    Logistic regression (LR), K-nearest neighbor (KNN), support vector machine (SVM) and random forest (RF) are implemented with Scikit-learn (Pedregosa et al., 2011) while MLP and CNN were implemented with TensorFlow (Abadi et al., 2016).
    Scikit-learn
    suggested: (scikit-learn, RRID:SCR_002577)
    TensorFlow
    suggested: (tensorflow, RRID:SCR_016345)

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.