Controlling the SARS-CoV-2 outbreak, insights from large scale whole genome sequences generated across the world

Abstract

Background

SARS-CoV-2 most likely evolved from a bat beta-coronavirus and started infecting humans in December 2019. Since then it has rapidly infected people around the world, with more than 4.5 million confirmed cases by the middle of May 2020. Early genome sequencing of the virus has enabled the development of molecular diagnostics and the commencement of therapy and vaccine development. The analysis of the early sequences showed relatively few evolutionary selection pressures. However, with the rapid worldwide expansion into diverse human populations, significant genetic variations are becoming increasingly likely. The current limitations on social movement between countries also offer the opportunity for these viral variants to become distinct strains, with potential implications for diagnostics, therapies and vaccines.

Methods

We used the current sequencing archives (NCBI and GISAID) to investigate 15,487 whole genomes, looking for evidence of strain diversification and selective pressure.

Results

We used 6,294 SNPs to build a phylogenetic tree of SARS-CoV-2 diversity and noted strong evidence for the existence of two major clades and six sub-clades, unevenly distributed across the world. We also noted that convergent evolution has potentially occurred across several locations in the genome, showing selection pressures, including on the spike glycoprotein where we noted a potentially critical mutation that could affect its binding to the ACE2 receptor. We also report on mutations that could prevent current molecular diagnostics from detecting some of the sub-clades.

Conclusion

The worldwide whole genome sequencing effort is revealing the challenge of developing SARS-CoV-2 containment tools suitable for everyone and the need for data to be continually evaluated to ensure accuracy in outbreak estimations.

Article activity feed

  1. SciScore for 10.1101/2020.04.28.066977:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    Software and Algorithms
    Sentence: Full SARS-CoV-2 genome sequences were downloaded from the GISAID [6] and NCBI [5], covering isolates collected between December 24, 2019 and April 6, 2020.
    Resource: NCBI; suggested: (NCBI, RRID:SCR_006472)

    Sentence: Sequences were aligned to the reference genome (NC_045512.2) using mafft software [17].
    Resource: mafft; suggested: (MAFFT, RRID:SCR_011811)

    Sentence: IQ-TREE (v1.6.12) [18] and BEAST (v1.10.4) [19] software were used to reconstruct the phylogeny tree.
    Resource: IQ-TREE; suggested: (IQ-TREE, RRID:SCR_017254)
    Resource: BEAST; suggested: (BEAST, RRID:SCR_010228)

    Sentence: The number of independent acquisitions of the mutation was counted using empirical Bayesian ancestral reconstruction methods and custom scripts utilising the ete3 python library [20].
    Resource: python; suggested: (IPython, RRID:SCR_001658)
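
The last sentence quoted above describes counting how many times a mutation was acquired independently on the phylogeny. The authors' scripts are not available here, so the following is only a minimal sketch of the general idea. It substitutes Fitch parsimony for the empirical Bayesian ancestral reconstruction named in the paper, uses plain nested tuples instead of ete3, and the tree, the sample names (`s1` to `s7`), the alleles and the function names are all hypothetical.

```python
from typing import Dict, Tuple, Union

# A rooted binary tree: a leaf is a sample name, an internal node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def annotate(tree: Tree, leaf_alleles: Dict[str, str]):
    """Fitch bottom-up pass: return (candidate allele set, annotated children)."""
    if isinstance(tree, str):
        return ({leaf_alleles[tree]}, ())
    left, right = (annotate(child, leaf_alleles) for child in tree)
    shared = left[0] & right[0]
    # Keep the intersection when children agree, otherwise the union.
    return (shared or (left[0] | right[0]), (left, right))


def count_acquisitions(tree: Tree, leaf_alleles: Dict[str, str],
                       ancestral: str, derived: str) -> int:
    """Count ancestral-to-derived changes in one most-parsimonious labelling."""
    root = annotate(tree, leaf_alleles)
    # Assumption: the root carries the ancestral (reference) allele when possible.
    root_state = ancestral if ancestral in root[0] else min(root[0])
    gains = 0
    stack = [(root, root_state)]
    while stack:
        (_, children), state = stack.pop()
        for child in children:
            # Fitch top-down pass: inherit the parent's state when allowed.
            child_state = state if state in child[0] else min(child[0])
            if child_state == derived and state != derived:
                gains += 1
            stack.append((child, child_state))
    return gains


# Hypothetical toy data: the derived allele G appears in two separate lineages.
toy_tree: Tree = ((("s1", "s2"), ("s3", "s4")), (("s5", "s6"), "s7"))
toy_alleles = {"s1": "G", "s2": "G", "s3": "A", "s4": "A",
               "s5": "A", "s6": "G", "s7": "A"}

print(count_acquisitions(toy_tree, toy_alleles, ancestral="A", derived="G"))  # prints 2
```

A count above one marks a candidate homoplasy site. Parsimony gives only a minimum number of changes and ignores branch lengths, so a model-based reconstruction such as the one the authors describe may yield different counts on real data.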

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    The limitations of our study are the relatively small proportion of sampled sequences (>15,000) compared to infections (>4,000,000), an over-reliance on sequences from relatively severe cases, and a potential bias towards collecting more isolates early in an outbreak when most genomes are still very similar. Similarly, much of the sequencing to date has been performed in the USA, UK and Australia. If not accounted for, this bias can lead to false inferences on the impact of particular mutations on transmission. One such example is the D614G mutation [14] which was proposed as leading to strains becoming more transmissible, but further work has identified as a homoplasy site not associated with transmission [15]. Our work has shown this site to be introduced in the C2.1 clade and all subclades, and although it does occur in other clades due to homoplasy, it fails to reach our significance cutoff, with the bulk of the isolates harbouring this mutation arising from a single mutation event in a founder strain. The resulting bias in cluster size means we cannot infer that larger clades are more virulent or more transmissible. However, we could use the transmissibility phenotype, or others involving laboratory-based virulence or patient disease severity, to identify related causal mutations using GWAS or convergent evolution techniques [16]. One roadblock is the small number of polymorphisms observed in SARS-CoV-2, and the unknown contributing role of host genetics and immune syste...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.