Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity

Abstract

Throughout the COVID-19 pandemic, SARS-CoV-2 has gained and lost multiple mutations in novel or unexpected combinations. Predicting how complex mutations affect COVID-19 disease severity is critical in planning public health responses as the virus continues to evolve. This paper presents a novel computational framework to complement conventional lineage classification and applies it to predict the severe disease potential of viral genetic variation. The transformer-based neural network model architecture has additional layers that provide sample embeddings and sequence-wide attention for interpretation and visualization. First, training a model to predict SARS-CoV-2 taxonomy validates the architecture’s interpretability. Second, an interpretable predictive model of disease severity is trained on spike protein sequence and patient metadata from GISAID. Confounding effects of changing patient demographics, increasing vaccination rates, and improving treatment over time are addressed by including demographics and case date as independent inputs to the neural network model. The resulting model can be interpreted to identify potentially significant virus mutations and proves to be a robust predictive tool. Although trained on sequence data obtained entirely before empirical data for Omicron became available, the model predicts Omicron’s reduced risk of severe disease, in accord with epidemiological and experimental data.

SciScore for 10.1101/2021.12.26.21268414:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
We then balance classes to account for those labels which have fewer than 300 representatives using class_weight.compute_class_weight in scikit-learn (sklearn) version 1.0.1, which obtains the weight of samples in a class by dividing the average number of samples per class by the number of samples in that class.	scikit-learn suggested: (scikit-learn, RRID:SCR_002577)
A Python (Google Colab) notebook including the preprocessing steps performed for Omicron variant sequence data is available for download at https://github.com/bahrad/Covid/blob/main/Covid_Predict_Omicron_Resistance.ipynb. Model Architecture: Fig. 1 shows an overview of our deep neural network model architecture.	Python suggested: (IPython, RRID:SCR_001658)
The hardware used for neural network model training and evaluation, as well as the data pre-processing described above, consists of Nvidia Tesla P80 GPUs (primarily) and Google Cloud Tensor Processing Units (TPUs) in Google’s Colab environment, running Tensorflow 2.7.0 and Python 3.7.12, and Nvidia Tesla V100-SXM2 GPUs on the Drexel University Research Computing Facility (URCF), running Tensorflow 2.4 and Python 3.8.	Tensorflow suggested: (tensorflow, RRID:SCR_016345)
For example, patient metadata classification trains on 44,003 samples, which requires 51 sec/epoch in the Google TPU environment, while on a GPU unit in the URCF the time per epoch was 480 seconds, representing a 9.2-fold TPU speedup.	Google suggested: (Google, RRID:SCR_017097)
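
The class-balancing rule quoted in the first row of the table (average number of samples per class divided by the number of samples in that class) can be illustrated with a minimal sketch. This is not the authors' code: the function name `balanced_class_weights` and the toy labels are hypothetical, and the sketch uses only the Python standard library. To the best of our knowledge, the same values are returned by scikit-learn's `compute_class_weight` when called with `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by (mean samples per class) / (samples in that class).

    Hypothetical helper for illustration; rare classes receive weights above 1,
    common classes receive weights below 1.
    """
    counts = Counter(labels)
    mean_count = len(labels) / len(counts)  # average number of samples per class
    return {cls: mean_count / n for cls, n in counts.items()}


# Toy labels invented for this example, not data from the paper.
labels = ["mild"] * 6 + ["severe"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With six "mild" and two "severe" samples, the average class size is four, so the under-represented class is weighted 4/2 and the over-represented class 4/6. Such a dictionary is the form typically passed to a training routine so that errors on rare classes contribute more to the loss.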

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 23. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Related articles

Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary

Unlocking the genomic landscape for antimicrobial domain discovery with a two-stage progressive residue-level annotation model

Rebuilding the Antibiotic Pipeline with Guided Generative Models