Snapshot of the evolution and mutation patterns of SARS-CoV-2



Abstract

The COVID-19 pandemic is the most important public health threat in recent history. Here we study how its causal agent, SARS-CoV-2, has diversified genetically since its first emergence in December 2019. We have created a pipeline combining phylogenetic and structural analysis to identify mutations possibly related to human adaptation in a data set consisting of 4,894 complete SARS-CoV-2 genome sequences. Although the phylogenetic diversity of SARS-CoV-2 is low, the whole-genome phylogenetic tree can be divided into five clusters/clades based on the tree topology and the clustering of specific mutations, but its branches exhibit low genetic distance and low bootstrap support values. We also identified 11 residues with high-frequency substitutions, four of which currently show some signal of potential positive selection. These fast-evolving sites are in the non-structural proteins nsp2, nsp5 (3CL-protease), nsp6, nsp12 (polymerase) and nsp13 (helicase), in the accessory proteins ORF3a and ORF8, and in the structural proteins N and S. Temporal and spatial analysis of these potentially adaptive mutations revealed that the frequency of some of them declined after reaching an (often local) peak, whereas the frequency of others is continually increasing and they now exhibit a worldwide distribution. Structural analysis revealed that the mutations are located on the surface of the proteins that modulate biochemical properties. We speculate that this improves binding to cellular proteins and hence represents fine-tuning of adaptation to human cells. Our study has implications for the design of biochemical and clinical experiments to assess whether important properties of SARS-CoV-2 have changed during the epidemic.
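
As an illustration of what screening for high-frequency substitutions can look like, the following minimal Python sketch counts, for each aligned position, how often the most common non-reference residue occurs. It is not the authors' pipeline: the function names, the gap characters and the `threshold` value are assumptions made for this example, and the abstract does not state the cutoff that was actually used.

```python
from collections import Counter


def substitution_frequencies(reference, sequences, gap_chars="-X"):
    """Map 1-based position -> (most common non-reference residue, frequency).

    Assumes `reference` and every entry of `sequences` are aligned amino acid
    strings of identical length. Gaps and ambiguous residues are ignored.
    """
    result = {}
    for i, ref_residue in enumerate(reference):
        # Residues observed at this column, excluding gaps/ambiguity codes.
        column = [s[i] for s in sequences if s[i] not in gap_chars]
        if not column:
            continue
        counts = Counter(r for r in column if r != ref_residue)
        if not counts:
            continue  # no substitution observed at this position
        residue, n = counts.most_common(1)[0]
        result[i + 1] = (residue, n / len(column))
    return result


def high_frequency_sites(reference, sequences, threshold=0.05):
    """Keep only positions whose top substitution reaches `threshold`.

    The default threshold is a hypothetical value chosen for illustration.
    """
    freqs = substitution_frequencies(reference, sequences)
    return {pos: val for pos, val in freqs.items() if val[1] >= threshold}


if __name__ == "__main__":
    # Toy alignment, not real SARS-CoV-2 data.
    ref = "MDGL"
    seqs = ["MDGL", "MGGL", "MGGL", "MD-L"]
    print(high_frequency_sites(ref, seqs, threshold=0.25))
```

A frequency screen of this kind only flags candidate sites; whether a site is under positive selection needs a separate test such as the dN/dS analysis named in the resources table.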

Article activity feed

  1. SciScore for 10.1101/2020.07.04.187435:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    Sequence Alignment: The 4,894 complete genome sequences of SARS-CoV-2 were aligned with MAFFT v7 [21] using the G-INS-i strategy and manually revised by using MEGA 7.0 [22].
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    MEGA
    suggested: (Mega BLAST, RRID:SCR_011920)
    For the next analysis, the twelve amino acid sequence sets were aligned by MEGA 7.0 using MUSCLE (Codons).
    MUSCLE
    suggested: (MUSCLE, RRID:SCR_011812)
    In order to perform a more thorough search through tree space, we used both default IQ-TREE settings as well as additional parameters to intensify the search algorithm (allnni, ntop=100 and nbest=20).
    IQ-TREE
    suggested: (IQ-TREE, RRID:SCR_017254)
    Evolutionary analysis by determination of the ratio of non-synonymous versus synonymous nucleotide substitutions: The HyPhy software package was used to estimate the ratio of non-synonymous substitutions versus synonymous substitutions (dN/dS) and to identify the sites that are subjected to potential positive selection [26].
    HyPhy
    suggested: (HyPhy, RRID:SCR_016162)
    Structural analysis: The software PyMol (https://pymol.org/2/) was used to create the figures from the pdb files.
    PyMol
    suggested: (PyMOL, RRID:SCR_000305)
    Prediction of phosphorylation sites was done with the NetPhos 3.1 tool (http://www.cbs.dtu.dk/services/NetPhos/).
    NetPhos
    suggested: (NetPhos, RRID:SCR_017975)
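
The alignment and tree-search steps quoted in the table above could be invoked roughly as in the sketch below, which only assembles the command lines and does not run them. It assumes that MAFFT's G-INS-i strategy is requested with `--globalpair --maxiterate 1000` and that the quoted IQ-TREE parameters map to the IQ-TREE 1.x options `-allnni`, `-ntop` and `-nbest` (IQ-TREE 2 spells these with a double dash). The file names and helper functions are hypothetical, and the authors' exact commands are not given in the text.

```python
import shlex


def mafft_ginsi_command(input_fasta):
    """Argument list for a MAFFT G-INS-i run (MAFFT writes the alignment to stdout)."""
    return ["mafft", "--globalpair", "--maxiterate", "1000", input_fasta]


def iqtree_command(alignment, intensive=False):
    """Argument list for IQ-TREE 1.x.

    With intensive=True, add the search-intensifying parameters quoted in the
    table (allnni, ntop=100, nbest=20); otherwise use the default settings.
    """
    cmd = ["iqtree", "-s", alignment]
    if intensive:
        cmd += ["-allnni", "-ntop", "100", "-nbest", "20"]
    return cmd


def render(cmd):
    """Render an argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Hypothetical file names.
    print(render(mafft_ginsi_command("genomes.fasta")))
    print(render(iqtree_command("genomes.aln.fasta")))
    print(render(iqtree_command("genomes.aln.fasta", intensive=True)))
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` later, redirecting MAFFT's standard output to the alignment file.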

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Although these sites have selection signals, the selection analysis has some limitations and hence we consider the results as preliminary. It has been reported that some mutations that seemingly arise multiple times along the phylogenetic tree may be caused by sequencing error and/or are the result of either artefactual lab recombination or potential hypermutation [53] (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473). In addition, during an ongoing pandemic, purifying selection signals occur frequently and recombination events between viruses might obscure the signal [53]. Furthermore, we cannot exclude that some of the exchanges are the result of non-selective conditions, i.e. due to a “genetic bottleneck”. A patient releases droplets that contain only a limited virus population, which is not representative of the whole virus “swarm” replicating in the patient's body. One droplet that by chance does not contain a single particle representing the original strain might then infect another person, where it creates a new and different virus population by a ‘founder’ effect. However, since the pandemic continues and new viruses will be sequenced, it is worthwhile to analyze whether any of the sites becomes positively selected. In summary, our study revealed that in the early large genome of SARS-CoV-2 only a few amino acids are exchanged and hence the selection pressure is low, which is consistent with the conclusion in the good sequence analysis reported by MacLean et al. [53]...
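
The bottleneck argument above can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each transmitted particle is drawn independently from a large within-host population in which the original strain has a given frequency; the function name and the numbers are hypothetical and do not come from the study.

```python
def prob_original_strain_lost(original_frequency, particles_transmitted):
    """Probability that none of the transmitted particles carries the original strain.

    Simplifying assumption: particles are sampled independently from a large
    population, so each one carries the original strain with probability
    `original_frequency`.
    """
    if not 0.0 <= original_frequency <= 1.0:
        raise ValueError("original_frequency must be between 0 and 1")
    if particles_transmitted < 1:
        raise ValueError("particles_transmitted must be at least 1")
    return (1.0 - original_frequency) ** particles_transmitted


if __name__ == "__main__":
    # Hypothetical case: the original strain makes up 90% of the donor's population.
    for n in (1, 3, 10):
        print(n, prob_original_strain_lost(0.9, n))
```

Under these assumptions the chance of losing the original strain falls off quickly as more particles are transmitted, so a founder effect of this kind is most plausible when very few particles start the new infection.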

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.