Analysis of SARS-CoV-2 Mutations Over Time Reveals Increasing Prevalence of Variants in the Spike Protein and RNA-Dependent RNA Polymerase
Listed in: Evaluated articles (ScreenIT)
Abstract
Amid the ongoing COVID-19 pandemic, it has become increasingly important to monitor the mutations that arise in the SARS-CoV-2 virus, to prepare public health strategies and guide the further development of vaccines and therapeutics. The spike (S) protein and the proteins comprising the RNA-Dependent RNA Polymerase (RdRP) are key vaccine and drug targets, respectively, making mutation surveillance of these proteins of great importance.
Full protein sequences for the spike protein and the RNA-dependent RNA polymerase proteins were downloaded from the GISAID database and aligned, and variants were identified. Polymorphisms in the protein sequences were investigated at the protein structural level and examined longitudinally to identify sequence and strain variants emerging over time. Our analysis revealed a group of variants in the spike protein and the polymerase complex that appeared in August and account for around five percent of the genomes analyzed up to the last week of October. A structural analysis also facilitated investigation of several unique variants in the receptor binding domain and the N-terminal domain of the spike protein, with high-frequency mutations occurring more commonly in these regions. The identification of new variants emphasizes the need for further study of the effects of these mutations and the implications of their increased prevalence, particularly as these mutations may impact vaccine or therapeutic efficacy.
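As a rough illustration of the longitudinal step described above, the sketch below tallies, for each ISO week, the fraction of sampled sequences that carry a given substitution. The record layout (a collection date paired with a set of variant labels), the function name `weekly_prevalence`, and the toy data with made-up labels such as `A10V` are assumptions for this example; none of them come from the authors' code or from GISAID.

```python
from collections import defaultdict
from datetime import date


def weekly_prevalence(records, variant):
    """Return {(iso_year, iso_week): fraction of sequences carrying `variant`}.

    `records` is an iterable of (collection_date, set_of_variant_labels)
    pairs. This layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)  # sequences sampled per week
    hits = defaultdict(int)    # sequences per week that carry the variant
    for collected, variants in records:
        iso = collected.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        totals[key] += 1
        if variant in variants:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in sorted(totals)}


# Made-up toy records; the variant labels are placeholders, not real mutations.
records = [
    (date(2020, 8, 3), {"K5R"}),
    (date(2020, 8, 4), {"K5R", "A10V"}),
    (date(2020, 8, 12), {"K5R", "A10V"}),
    (date(2020, 8, 13), {"A10V"}),
]

prevalence = weekly_prevalence(records, "A10V")
for (year, week), fraction in prevalence.items():
    print(f"{year}-W{week:02d}: {fraction:.0%}")
```

Grouping by ISO week is one possible binning; a different window (calendar month, rolling seven days) would need only a different key.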
Article activity feed
SciScore for 10.1101/2021.03.05.433666:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
- Institutional Review Board Statement: not detected.
- Randomization: not detected.
- Blinding: not detected.
- Power Analysis: not detected.
- Sex as a biological variable: not detected.
Table 2: Resources
Software and Algorithms
- Sentence: The reference genome used in our analysis was the Severe Acute Respiratory Syndrome Coronavirus 2 Isolate WIV04 (WIV04), sequenced in Wuhan, China on December 30th, 2019 [12]. The raw FASTA file was split by protein into 27 files using a Python script in Jupyter Notebook (version 6.1.4) [13], and each protein was processed separately through all subsequent steps.
  Resource: Python; suggested: (IPython, RRID:SCR_001658)
- Sentence: Filtering of Sequences: Sequences were filtered in Python using the Biopython SeqIO module.
  Resource: Biopython; suggested: (Biopython, RRID:SCR_007173)
- Sentence: Sequence Dereplication: In order to streamline our computational pipeline, identical sequences were condensed into clusters using USEARCH (version 11.0.667) [15]. Clusters, representing unique sequences, were written out to a FASTA file with the ID of the cluster and the number of sequences in the cluster.
  Resource: USEARCH; suggested: (mubiomics, RRID:SCR_006785)
- Sentence: Clustal Omega was selected based on the balance between alignment quality and speed.
  Resource: Clustal Omega; suggested: (Clustal Omega, RRID:SCR_001591)
- Sentence: Parsing of Multiple Sequence Alignment: A Python script was developed in Jupyter notebook to automatically parse the aligned sequences for variants given the ID of the cluster containing the reference sequence, which was determined by searching for “WIV04” in the cluster information file using RStudio (version 1.3.1093) [18]. The Python script scanned through the other clusters (Supplementary Figure S1), comparing each codon with the corresponding codon of the reference cluster.
  Resource: RStudio; suggested: (RStudio, RRID:SCR_000432)
- Sentence: Three-dimensional Visualization of Frequently Mutated Sites: Structures of the spike protein and the RNA-dependent RNA polymerase (RdRP) complex were downloaded from the Protein Data Bank (PDB) [20] and visualized using PyMOL.
  Resource: PyMOL; suggested: (PyMOL, RRID:SCR_000305)
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.
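The dereplication and alignment-parsing steps quoted under Table 2 can be pictured with a minimal, standard-library sketch. It is not the authors' pipeline: plain dictionary grouping stands in for USEARCH, the toy sequences are assumed to be already aligned (so the Clustal Omega step is skipped), and the function names, the `WIV04_demo` header, and the `A4V`/`K8del` label format are all hypothetical.

```python
from collections import Counter

# Made-up, pre-aligned toy sequences; '-' marks an alignment gap.
TOY_FASTA = """\
>WIV04_demo
MKTAYIAKQR
>seq_a
MKTAYIAKQR
>seq_b
MKTVYIAKQR
>seq_c
MKTVYIAKQR
>seq_d
MKTAYIA-QR
"""


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Group identical sequences: {sequence: [headers]} in first-seen order."""
    clusters = {}
    for header, seq in records:
        clusters.setdefault(seq, []).append(header)
    return clusters


def find_reference(clusters, tag="WIV04"):
    """Return the sequence of the cluster that has a member header containing `tag`."""
    for seq, headers in clusters.items():
        if any(tag in h for h in headers):
            return seq
    raise ValueError(f"no cluster contains a header matching {tag!r}")


def call_variants(reference, aligned):
    """List differences from the reference, numbered by reference position.

    Columns where the reference has a gap (insertions) are skipped here
    for simplicity; a gap in the query is reported as a deletion.
    """
    if len(reference) != len(aligned):
        raise ValueError("sequences must come from the same alignment")
    variants, pos = [], 0
    for ref_aa, aa in zip(reference, aligned):
        if ref_aa == "-":
            continue
        pos += 1
        if aa != ref_aa:
            variants.append(f"{ref_aa}{pos}{'del' if aa == '-' else aa}")
    return variants


clusters = dereplicate(parse_fasta(TOY_FASTA))
reference = find_reference(clusters)

# Weight each variant by the number of sequences in its cluster.
variant_counts = Counter()
for seq, headers in clusters.items():
    if seq != reference:
        for label in call_variants(reference, seq):
            variant_counts[label] += len(headers)

print(dict(variant_counts))
```

Comparing one representative per cluster and weighting by cluster size avoids re-scanning identical sequences, which is the stated motivation for dereplicating before parsing.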
