Model evolution in SARS-CoV-2 spike protein sequences using a generative neural network

Anup Kumar

Listed in

Evaluated articles (ScreenIT)

Abstract

Modelling the evolutionary elements inherent in protein sequences as the SARS-CoV-2 virus moves from one clade into another would improve our understanding of its impact on public health and may help in formulating better strategies to contain its spread. Deep learning methods have been used to model SARS-CoV-2 protein sequences. Significant drawbacks of these studies include not modelling protein sequences end to end, modelling only those genomic positions that show high activity, and upsampling the number of sequences at each genomic position to balance the frequency of mutations. To mitigate these drawbacks, the current approach uses a generative model, an encoder-decoder neural network, to learn the natural progression of spike protein sequences through adjacent clades of the Nextstrain phylogenetic tree. The encoder transforms a set of spike protein sequences from the source clade (20A) into a latent representation. The decoder uses this latent representation, along with Gaussian-distributed noise, to generate a different set of protein sequences that are closer to the target clade (20B). The source and target clades are adjacent nodes in the phylogenetic tree of the evolving SARS-CoV-2 clades. Amino acids are generated one genomic position at a time over the entire sequence length, each using the latent representation of the amino acid generated at the previous step. Using the trained models, protein sequences from the source clade are used to generate a collection of evolved sequences belonging to all child clades of the source clade. A comparison of this predicted evolution (between source and generated sequences) with the true evolution (between source and target sequences) shows a high Pearson correlation (> 0.7).
Moreover, the distributions of substitution frequencies per genomic position, including high- and low-frequency positions, in source-target and source-generated sequences closely resemble each other (Pearson correlation > 0.7). In addition, the model partially predicts a few substitutions at specific genomic positions for sequences of unseen clades (20J (Gamma)), at positions that show little activity during training. These outcomes show the potential of this approach for learning the latent mechanism of evolution of SARS-CoV-2 viral sequences.
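
The comparison of per-position substitution frequencies described in the abstract can be illustrated with a small sketch. This is not the authors' code. It assumes equal-length sequences paired one-to-one with their source, and the function names and toy sequences are hypothetical, chosen only to show how a Pearson correlation between "true" (source-target) and "predicted" (source-generated) substitution counts might be computed.

```python
from math import sqrt


def substitution_frequencies(sources, others):
    """Count, per position, how many paired sequences differ from their source."""
    length = len(sources[0])
    counts = [0] * length
    for src, other in zip(sources, others):
        if len(src) != length or len(other) != length:
            raise ValueError("all sequences must have the same length")
        for pos, (a, b) in enumerate(zip(src, other)):
            if a != b:
                counts[pos] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical, not real spike protein sequences).
source = ["MFVFLV", "MFVFLV", "MFVFLV"]
target = ["MFVYLV", "MFVYLV", "MLVFLV"]      # stands in for clade 20B
generated = ["MFVYLV", "MFVFLV", "MLVFLV"]   # stands in for model output

true_freq = substitution_frequencies(source, target)
pred_freq = substitution_frequencies(source, generated)
r = pearson(true_freq, pred_freq)
print(true_freq, pred_freq, round(r, 3))
```

Counting substitutions at every position, rather than only at highly active ones, mirrors the abstract's point that both high- and low-frequency positions enter the comparison.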

Codebase

https://github.com/anuprulez/clade_prediction

SciScore for 10.1101/2022.04.12.487999

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Nextclade (2) tool in Galaxy (13) is used to assign a clade to each protein sequence.	Galaxy suggested: (Galaxy, RRID:SCR_006281)
Software: TensorFlow 2.7.0 is used for creating the architecture of the encoder-decoder neural network using Python 3.9.7.	Python suggested: (IPython, RRID:SCR_001658)

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

However, this approach has a few limitations. One relates to data preparation. The sequences of clades 20A and 20B were chosen randomly from a pool of clade-labelled sequences. Because clade 20A emerged before 20B, it is important to choose 20A sequences that were collected before the 20B sequences. The assignment of protein sequences to clades by the Nextclade tool may contain errors. Moreover, the submitted sequences are unevenly distributed geographically: around half of the sequences submitted to GISAID for clades 20A and 20B come from the US, and all other nations together contribute the other half. In addition to mitigating such biases in the training data, it is also important to train models on different branches of the phylogenetic tree and compare their performance.
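
The date-ordering concern raised in this limitation can be made concrete with a minimal sketch. The paper does not describe such a filter, so everything here is an assumption for illustration: the record layout (sequence ID plus collection date), the function name, and the toy records are all hypothetical.

```python
from datetime import date


def pair_by_collection_date(source_records, target_records):
    """Pair each target sequence only with sources collected strictly earlier.

    Each record is a hypothetical (sequence_id, collection_date) tuple.
    """
    pairs = []
    for target_id, target_date in target_records:
        for source_id, source_date in source_records:
            if source_date < target_date:
                pairs.append((source_id, target_id))
    return pairs


# Toy records (hypothetical IDs and dates, not GISAID data).
clade_20a = [("A1", date(2020, 3, 1)), ("A2", date(2020, 6, 15))]
clade_20b = [("B1", date(2020, 4, 10)), ("B2", date(2020, 7, 1))]

pairs = pair_by_collection_date(clade_20a, clade_20b)
print(pairs)
```

A filter like this trades training-set size for temporal consistency: pairs in which the source was collected after the target are dropped rather than sampled at random.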

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Version published to 10.1101/2022.04.12.487999 on bioRxiv
Apr 12, 2022
