Mutation landscape of SARS-CoV-2 reveals five mutually exclusive clusters of leading and trailing single nucleotide substitutions


Abstract

The COVID-19 pandemic has spread across the globe at an alarming rate. However, unlike in any previous global outbreak, the availability of a large number of SARS-CoV-2 sequences provides a unique opportunity to understand viral evolution in real time. We analysed 1448 available full-length (>29000 nt) sequences and identified 40 single-nucleotide substitutions occurring in >1% of the genomes. The majority of the substitutions were C to T or G to A. We identify C/Gs with an upstream TTT trinucleotide motif as hotspots for mutations in the SARS-CoV-2 genome. Interestingly, three of the 40 substitutions occur within highly conserved secondary structures in the 5’ and 3’ regions of the genomic RNA that are critical for the virus life cycle. Furthermore, clustering analysis revealed a unique geographical distribution of SARS-CoV-2 variants defined by their mutation profile. Of note, we observed several co-occurring mutations that almost never occur individually. We define five mutually exclusive lineages (A1, B1, C1, D1 and E1) of SARS-CoV-2, which account for about three quarters of the genomes analysed. We identify lineage-defining leading mutations in the SARS-CoV-2 genome, which precede the occurrence of sub-lineage-defining trailing mutations. The identification of mutually exclusive lineage-defining mutations with geographically restricted patterns of distribution has potential implications for diagnosis, pathogenesis and vaccine design. Our work provides novel insights into the temporal evolution of SARS-CoV-2.
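
The >1% frequency threshold mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes the genomes are already aligned to the reference and have the same length, and the function name `frequent_substitutions`, its parameters, and the toy sequences are hypothetical.

```python
from collections import Counter


def frequent_substitutions(reference, genomes, min_fraction=0.01):
    """Return {(position, ref_base, alt_base): fraction} above the threshold.

    Assumptions (not from the source): every genome is pre-aligned to the
    reference and has the same length, positions are 1-based, and only
    A/C/G/T to A/C/G/T changes are counted (gaps and N are ignored).
    """
    bases = set("ACGT")
    counts = Counter()
    for genome in genomes:
        if len(genome) != len(reference):
            raise ValueError("genomes must be aligned to the reference")
        for pos, (ref_base, alt_base) in enumerate(zip(reference, genome), start=1):
            if ref_base != alt_base and ref_base in bases and alt_base in bases:
                counts[(pos, ref_base, alt_base)] += 1

    total = len(genomes)
    # Keep only substitutions seen in more than min_fraction of the genomes.
    return {key: n / total for key, n in counts.items() if n / total > min_fraction}


# Toy data only; the threshold is raised to 0.3 so the small example is readable.
reference = "ACGTTTCA"
genomes = ["ACGTTTCA", "ACGTTTTA", "ACGTTTTA", "ATGTTTCA"]
result = frequent_substitutions(reference, genomes, min_fraction=0.3)
print(result)
```

With the toy data, the C to T change at position 7 appears in two of four genomes and is kept, while the change at position 2 appears in only one and is dropped.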

Importance

The SARS-CoV-2 / COVID-19 pandemic has spread far and wide with high infectivity. However, the severity of the infection and the mortality rates differ greatly across geographic areas. Here we report high-frequency mutations in SARS-CoV-2 genomes, which reveal the presence of lineage-defining leading and trailing mutations. Moreover, we propose, for the first time, five mutually exclusive clusters of SARS-CoV-2 that account for 75% of the genomes analysed. This will have implications for diagnosis, pathogenesis and vaccine design.

Article activity feed

  1. SciScore for 10.1101/2020.05.07.082768:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    The alignment and refinement of the 1448 sequences with the SARS-CoV-2 reference genome were performed by using MUSCLE multiple sequence alignment software (45).
    MUSCLE
    suggested: (MUSCLE, RRID:SCR_011812)
    To understand the probability of finding TTT trinucleotide upstream of a random C/G position, we first mapped 10000 random positions (based on random numbers generated in MS Excel) on the SARS-CoV-2 genome and identified 3826 G/C residues.
    MS Excel
    suggested: None
    Clustering analysis and defining leading and trailing mutations: Clustering was performed on the 1448 SARS-CoV-2 sequences with a custom script written using Python programming language and the data was visualized using Seaborn Statistical Visualization Tool (https://seaborn.pydata.org/).
    Python
    suggested: (IPython, RRID:SCR_001658)
    Seaborn Statistical Visualization Tool
    suggested: (seaborn, RRID:SCR_018132)
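
The background estimate quoted in the table above (sampling random genome positions, keeping those that are G or C, and asking how often a TTT trinucleotide lies upstream) can be sketched in Python. This is a hedged illustration rather than the authors' procedure, which used random numbers generated in MS Excel. It assumes that "upstream" means the three bases immediately 5' of the position and that positions are sampled with replacement; the function names, parameters, and toy genome are hypothetical.

```python
import random


def has_upstream_ttt(genome, index):
    """True if the three bases immediately 5' of genome[index] are TTT.

    Assumption: "upstream" means directly adjacent; the source does not
    specify the exact offset. `index` is 0-based.
    """
    return index >= 3 and genome[index - 3:index] == "TTT"


def background_ttt_fraction(genome, n_positions=10000, seed=0):
    """Sample random positions and return (number of G/C hits, fraction with TTT).

    Positions are drawn with replacement (an assumption of this sketch).
    """
    rng = random.Random(seed)
    sampled = [rng.randrange(len(genome)) for _ in range(n_positions)]
    gc_hits = [i for i in sampled if genome[i] in "GC"]
    if not gc_hits:
        return 0, 0.0
    with_ttt = sum(has_upstream_ttt(genome, i) for i in gc_hits)
    return len(gc_hits), with_ttt / len(gc_hits)


# Toy genome in which every C is preceded by TTT, so the fraction must be 1.0.
toy_genome = "TTTC" * 100
n_gc, fraction = background_ttt_fraction(toy_genome, n_positions=1000, seed=1)
print(n_gc, fraction)
```

On a real genome, the returned fraction would serve as the baseline against which the share of mutated C/G sites with an upstream TTT could be compared.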

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.