Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity

Abstract

Through the COVID-19 pandemic, SARS-CoV-2 has gained and lost multiple mutations in novel or unexpected combinations. Predicting how complex mutations affect COVID-19 disease severity is critical in planning public health responses as the virus continues to evolve. This paper presents a novel computational framework to complement conventional lineage classification and applies it to predict the severe disease potential of viral genetic variation. The transformer-based neural network model architecture has additional layers that provide sample embeddings and sequence-wide attention for interpretation and visualization. First, training a model to predict SARS-CoV-2 taxonomy validates the architecture’s interpretability. Second, an interpretable predictive model of disease severity is trained on spike protein sequence and patient metadata from GISAID. Confounding effects of changing patient demographics, increasing vaccination rates, and improving treatment over time are addressed by including demographics and case date as independent input to the neural network model. The resulting model can be interpreted to identify potentially significant virus mutations and proves to be a robust predictive tool. Although trained on sequence data obtained entirely before the availability of empirical data for Omicron, the model can predict Omicron’s reduced risk of severe disease, in accord with epidemiological and experimental data.

Article activity feed

  1. SciScore for 10.1101/2021.12.26.21268414:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    We then balance classes to account for those labels which have fewer than 300 representatives using class_weight.compute_class_weight in scikit-learn (sklearn) version 1.01, which obtains the weights of samples in a class by dividing the average number of samples in each of all classes by the number of samples in that class.
    scikit-learn
    suggested: (scikit-learn, RRID:SCR_002577)
    A Python (Google Colab) notebook including the preprocessing steps performed for Omicron variant sequence data is available for download at https://github.com/bahrad/Covid/blob/main/Covid_Predict_Omicron_Resistance.ipynb. Model Architecture: Fig. 1 shows an overview of our deep neural network model architecture.
    Python
    suggested: (IPython, RRID:SCR_001658)
    The hardware used for neural network model training and evaluation, as well as the data pre-processing described above, consists of Nvidia Tesla P80 GPUs (primarily) and Google Cloud Tensor Processing Units (TPUs) in Google’s Colab environment, running Tensorflow 2.70 and Python 3.7.12, and Nvidia Tesla V100-SXM2 GPUs on the Drexel University Research Computing Facility (URCF), running Tensorflow 2.4 and Python 3.8.
    Tensorflow
    suggested: (tensorflow, RRID:SCR_016345)
    For example, patient metadata classification trains on 44,003 samples, which requires 51 sec/epoch in the Google TPU environment, while on a GPU unit in the URCF, the time per epoch was 480 seconds, representing a 9.2-fold TPU speedup.
    Google
    suggested: (Google, RRID:SCR_017097)
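
The class-balancing sentence quoted in Table 2 describes weighting each class by the average class size divided by that class's size, which is what scikit-learn's "balanced" mode computes. The following is a minimal sketch of that calculation, not the authors' code: the toy labels "A", "B", "C" and their counts are hypothetical, and the 300-representative threshold mentioned in the sentence is not modeled here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels (not from the paper): three imbalanced classes.
y = np.array(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
classes = np.unique(y)

# "balanced" mode: n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same quantity written as the quoted sentence describes it:
# the average number of samples per class divided by each class's count.
counts = np.array([(y == c).sum() for c in classes])
manual = counts.mean() / counts

# A label -> weight mapping, the form Keras accepts as `class_weight` in fit().
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Rare classes receive proportionally larger weights, so each class contributes about equally to the training loss regardless of how many samples it has.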

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 23. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.