Analysis of SARS-CoV-2 Mutations Over Time Reveals Increasing Prevalence of Variants in the Spike Protein and RNA-Dependent RNA Polymerase

Abstract

Amid the ongoing COVID-19 pandemic, it has become increasingly important to monitor the mutations that arise in the SARS-CoV-2 virus, to prepare public health strategies and guide the further development of vaccines and therapeutics. The spike (S) protein and the proteins comprising the RNA-Dependent RNA Polymerase (RdRP) are key vaccine and drug targets, respectively, making mutation surveillance of these proteins of great importance.

Full protein sequences for the spike proteins and RNA-dependent RNA polymerase proteins were downloaded from the GISAID database and aligned, and the variants were identified. Polymorphisms in the protein sequence were investigated at the protein structural level and examined longitudinally to identify sequence and strain variants that are emerging over time. Our analysis revealed a group of variants in the spike protein and the polymerase complex that appeared in August and account for around five percent of the genomes analyzed up to the last week of October. A structural analysis also facilitated investigation of several unique variants in the receptor binding domain and the N-terminal domain of the spike protein, with high-frequency mutations occurring more commonly in these regions. The identification of new variants emphasizes the need for further study of the effects of these mutations and the implications of their increased prevalence, particularly as these mutations may impact vaccine or therapeutic efficacy.

Article activity feed

  1. SciScore for 10.1101/2021.03.05.433666:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    The reference genome used in our analysis was the Severe Acute Respiratory Syndrome Coronavirus 2 Isolate WIV04 (WIV04), sequenced in Wuhan, China on December 30th, 2019 [12]. The raw FASTA file was split by protein into 27 files using a Python script in Jupyter Notebook (version 6.1.4) [13], and each protein was processed separately through all subsequent steps.
    Python
    suggested: (IPython, RRID:SCR_001658)
    Filtering of Sequences: Sequences were filtered in Python using the Biopython SeqIO module.
    Biopython
    suggested: (Biopython, RRID:SCR_007173)
    Sequence Dereplication: In order to streamline our computational pipeline, identical sequences were condensed into clusters using USEARCH (version 11.0.667) [15]. Clusters, representing unique sequences, were written out to a FASTA file with the ID of the cluster and the number of sequences in the cluster.
    USEARCH
    suggested: (mubiomics, RRID:SCR_006785)
    Clustal Omega was selected based on the balance between alignment quality and speed.
    Clustal Omega
    suggested: (Clustal Omega, RRID:SCR_001591)
    Parsing of Multiple Sequence Alignment: A Python script was developed in Jupyter Notebook to automatically parse the aligned sequences for variants given the ID of the cluster containing the reference sequence, which was determined by searching for “WIV04” in the cluster information file using RStudio (version 1.3.1093) [18]. The Python script scanned through the other clusters (Supplementary Figure S1), comparing each codon with the corresponding codon of the reference cluster.
    RStudio
    suggested: (RStudio, RRID:SCR_000432)
    Three-dimensional Visualization of Frequently Mutated Sites: Structures of the spike protein and the RNA-dependent RNA polymerase (RdRP) complex were downloaded from the Protein Data Bank (PDB) [20] and visualized using PyMOL.
    PyMOL
    suggested: (PyMOL, RRID:SCR_000305)
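
The dereplication and alignment-parsing steps quoted in the table above can be illustrated with a minimal Python sketch. This is not the authors' code: it uses only the standard library in place of USEARCH, Clustal Omega, and Biopython, it compares aligned sequences position by position rather than codon by codon, and it assumes the input sequences are already aligned to equal length. The function names and the toy sequences in `EXAMPLE_FASTA` are hypothetical and do not represent real SARS-CoV-2 data.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def dereplicate(records):
    """Collapse identical sequences into clusters (sequence -> list of IDs)."""
    clusters = {}
    for header, seq in records:
        clusters.setdefault(seq, []).append(header)
    return clusters


def call_variants(reference, aligned):
    """Return (position, ref_residue, alt_residue) for each differing column.

    Positions are 1-based and count reference residues only, so a gap
    ("-") in the reference does not advance the position counter.
    """
    if len(reference) != len(aligned):
        raise ValueError("aligned sequences must have equal length")
    variants, pos = [], 0
    for ref_res, alt_res in zip(reference, aligned):
        if ref_res != "-":
            pos += 1
        if ref_res != alt_res:
            variants.append((pos, ref_res, alt_res))
    return variants


def variant_counts(fasta_text, reference_tag="WIV04"):
    """Count how many input sequences carry each variant.

    The reference cluster is the one whose IDs contain `reference_tag`;
    each variant is weighted by the size of the cluster that carries it.
    """
    clusters = dereplicate(parse_fasta(fasta_text))
    reference = next(
        seq for seq, ids in clusters.items()
        if any(reference_tag in seq_id for seq_id in ids)
    )
    counts = Counter()
    for seq, ids in clusters.items():
        for variant in call_variants(reference, seq):
            counts[variant] += len(ids)
    return counts


# Toy, already-aligned sequences (hypothetical, not real viral proteins).
EXAMPLE_FASTA = """>WIV04
MKTAYIAKQR
>seq_a
MKTAYIGKQR
>seq_b
MKTAYIGKQR
>seq_c
MKTAYIAKQR
"""

if __name__ == "__main__":
    for (pos, ref, alt), n in variant_counts(EXAMPLE_FASTA).items():
        print(f"{ref}{pos}{alt}: {n} sequence(s)")
```

Weighting each variant by cluster size keeps the counts equal to what a scan over every individual sequence would give, while each unique sequence is compared against the reference only once.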

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.