Distinct mutations and lineages of SARS‐CoV‐2 virus in the early phase of COVID‐19 pandemic and subsequent 1‐year global expansion

Abstract

A novel coronavirus, SARS‐CoV‐2, has caused over 274 million cases and over 5.3 million deaths worldwide since it emerged in December 2019 in Wuhan, China. Here we conceptualized the temporospatial evolutionary and expansion dynamics of SARS‐CoV‐2 by taking a series of cross‐sectional views of viral genomes, from the early outbreak in January 2020 in Wuhan, to the early phase of global ignition in early April, and finally to the subsequent global expansion by late December 2020. Based on the phylogenetic analysis of the early patients in Wuhan, Wuhan/WH04/2020 is proposed as a more appropriate reference genome for SARS‐CoV‐2 than the first sequenced genome, Wuhan‐Hu‐1. By scrutinizing the cases from the very early outbreak, we found that a viral genotype from the Seafood Market in Wuhan, characterized by two concurrent mutations (i.e., M type), had become the overwhelmingly dominant genotype (95.3%) of the pandemic 1 year later. By analyzing 4013 SARS‐CoV‐2 genomes collected from different continents by early April, we were able to interrogate the viral genomic composition dynamics of the initial phase of global ignition over a time span of 14 weeks. Eleven major viral genotypes with unique geographic distributions were also identified. WE1 type, a descendant of M predominantly observed in western Europe, accounted for half of all the cases (50.2%) at the time. The mutations of major genotypes at the same hierarchical level were mutually exclusive, which implies that the various genotypes bearing these specific mutations were propagated through human‐to‐human transmission rather than arising from the accumulation of hot‐spot mutations during the replication of individual viral genomes. As the pandemic unfolded, we used the same approach to analyze 261 323 SARS‐CoV‐2 genomes from around the world collected since the outbreak in Wuhan (i.e., all the publicly available viral genomes) to recapitulate our findings over a 1‐year time span. By December 25, 2020, 95.3% of global cases were M type and 93.0% of M‐type cases were WE1. In fact, at present all five variants of concern (VOC) are descendants of WE1 type. This study demonstrates that viral genotypes can be used as molecular barcodes, in combination with epidemiologic data, to monitor the spreading routes of the pandemic and evaluate the effectiveness of control measures. Moreover, the dynamics of the viral mutational spectrum described in the study may help the early identification of new strains in patients to reduce further spread of infection, guide the development of molecular diagnostics and vaccines against COVID‐19, and help assess their accuracy and efficacy in the real world in real time.
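
The abstract's claim that mutations of genotypes at the same hierarchical level are mutually exclusive can be illustrated with a small check. The following is only a minimal sketch on toy data, not the authors' method: the genotype names (`type_1`, ...), the mutation labels (`mut_a`, ...), and the genome identifiers are all hypothetical placeholders, since the abstract does not list the actual defining mutations.

```python
# Hypothetical defining mutations for sibling genotypes
# (placeholders for illustration, not taken from the paper).
sibling_genotypes = {
    "type_1": {"mut_a"},
    "type_2": {"mut_b"},
    "type_3": {"mut_c", "mut_d"},
}

# Toy genomes, each represented by the set of mutations it carries.
genomes = {
    "g1": {"mut_a", "mut_x"},
    "g2": {"mut_b"},
    "g3": {"mut_c", "mut_d", "mut_y"},
    "g4": {"mut_x"},
}


def carried_genotypes(mutations, genotypes):
    """Return the genotypes whose defining mutations are all present."""
    return [name for name, defining in genotypes.items() if defining <= mutations]


def exclusivity_violations(genomes, genotypes):
    """Map each genome matching more than one sibling genotype to its matches."""
    violations = {}
    for genome_id, mutations in genomes.items():
        matched = carried_genotypes(mutations, genotypes)
        if len(matched) > 1:
            violations[genome_id] = matched
    return violations


violations = exclusivity_violations(genomes, sibling_genotypes)
print(violations)  # an empty dict means the sibling genotypes never co-occur
```

In this toy setting, an empty result corresponds to the mutually exclusive pattern described in the abstract; a genome carrying the defining mutations of two sibling genotypes would show up as a violation.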

Article activity feed

  1. SciScore for 10.1101/2021.01.05.425339:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: If we assume sequencing errors occurred randomly along the viral genome, the maximum sequencing error rate for each base per genome can be calculated as 10/2/4013 = 0.00125.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.
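
The arithmetic in the sentence flagged under Randomization can be reproduced directly. This is a minimal sketch that only re-evaluates the quoted expression; the variable names are illustrative, and the meaning of the values 10 and 2 is not stated in this report, so they are carried over as given.

```python
# Values copied from the quoted expression 10/2/4013.
# What 10 and 2 represent is not stated in this report; the names are placeholders.
numerator = 10
divisor = 2
n_genomes = 4013  # number of genomes analysed in the early-phase data set

rate = numerator / divisor / n_genomes
rate_rounded = f"{rate:.5f}"
print(rate_rounded)  # 0.00125
```

Evaluated left to right, the expression is 5/4013, which rounds to the 0.00125 reported in the sentence.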

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources

    After filtering, all the remaining sequences were mapped to the reference genome by a dual alignment software MAFFT (v7.450) which takes into consideration of both amino acid or nucleotide sequences.
        MAFFT, suggested: (MAFFT, RRID:SCR_011811)

    2.4 Phylogenetic tree analysis: In order to find evolutionarily related coronavirus with SARS-CoV-2, the reference genome sequences (Genbank ID: MN908947.3) was used to perform BLAST via NCBI betacoronavirus sequence dataset (https://blast.ncbi.nlm.nih.gov/Blast.cgi).
        BLAST, suggested: (BLASTX, RRID:SCR_001653)
        https://blast.ncbi.nlm.nih.gov/Blast.cgi, suggested: (TBLASTX, RRID:SCR_011823)

    After mutation detection, the matrix of mutations for all samples was used to perform the unsupervised cluster analysis via Pheatmap (v1.0.12) package of R. 2.6 Strain of Origin (SOO) algorithm: 19 genotypes were selected from clustering analysis and defined in the Pedigree chart (Fig. 3b).
        Pheatmap, suggested: (pheatmap, RRID:SCR_016418)
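
The quoted sentences describe aligning sequences to a reference and then building a matrix of mutations for all samples. The sketch below illustrates what such a sample-by-mutation matrix could look like; it is an assumption-laden toy example, not the authors' pipeline. It does not call MAFFT or pheatmap, it assumes the sequences are already aligned to the reference with equal length, and the sequences, sample names, and function names are invented for illustration.

```python
def call_mutations(reference, aligned):
    """Return substitutions as 'RefPosAlt' labels (1-based positions).

    Assumes `aligned` is already aligned to `reference` (same length).
    Gaps and ambiguous bases such as N are skipped, not called as mutations.
    """
    if len(reference) != len(aligned):
        raise ValueError("sequences must have equal aligned length")
    mutations = set()
    for pos, (ref_base, alt_base) in enumerate(zip(reference, aligned), start=1):
        if ref_base in "ACGT" and alt_base in "ACGT" and ref_base != alt_base:
            mutations.add(f"{ref_base}{pos}{alt_base}")
    return mutations


def mutation_matrix(reference, samples):
    """Build a binary sample-by-mutation matrix (1 = mutation present)."""
    calls = {name: call_mutations(reference, seq) for name, seq in samples.items()}
    # Order columns by genomic position, then by label for stable output.
    columns = sorted({m for muts in calls.values() for m in muts},
                     key=lambda m: (int(m[1:-1]), m))
    matrix = {name: [1 if m in muts else 0 for m in columns]
              for name, muts in calls.items()}
    return columns, matrix


# Toy reference and samples (hypothetical, 10 bases long).
reference = "ACGTACGTAC"
samples = {
    "s1": "ACGTACGTAC",
    "s2": "ATGTACGTAC",
    "s3": "ATGTACGCAC",
    "s4": "ACGTNCG-AC",
}

columns, matrix = mutation_matrix(reference, samples)
print(columns)
for name, row in matrix.items():
    print(name, row)
```

A matrix of this shape is the kind of input that a hierarchical clustering or heatmap tool would take; samples sharing the same row pattern carry the same set of mutations and would group together.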

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.