Emerging SARS-CoV-2 Diversity Revealed by Rapid Whole-Genome Sequence Typing


Abstract

Discrete classification of SARS-CoV-2 viral genotypes can identify emerging strains and detect geographic spread, viral diversity, and transmission events. We developed a tool (GNU-based Virus IDentification [GNUVID]) that integrates whole-genome multilocus sequence typing and a supervised machine learning random forest-based classifier. We used GNUVID to assign sequence type (ST) profiles to all high-quality genomes available from GISAID. STs were clustered into clonal complexes (CCs) and then used to train a machine learning classifier. We used this tool to detect potential introduction and exportation events and to estimate effective viral diversity across locations and over time in 16 US states. GNUVID is a highly scalable tool for viral genotype classification (https://github.com/ahmedmagds/GNUVID) that can quickly classify hundreds of thousands of genomes in a way that is consistent with phylogeny. Our ST/CC genotyping analysis uncovered dynamic local changes in ST/CC prevalence and diversity, including multiple replacement events in different states, and an average of 20.6 putative introductions and 7.5 exportations per state over the period analyzed. We introduce the use of effective diversity metrics (Hill numbers) that can be used to estimate the impact of interventions (e.g., travel restrictions, vaccine uptake, mask mandates) on the variation in circulating viruses. Our classification tool uncovered multiple introduction and exportation events, as well as waves of expansion and replacement of SARS-CoV-2 genotypes in different states. GNUVID classification lends itself to measures of ecological diversity, and, with systematic genomic sampling, it could be used to track circulating viral diversity and identify emerging clones and hotspots.
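
The effective diversity metric named in the abstract can be illustrated with a short sketch. It uses the standard definition of the Hill number of order q (the effective number of equally common types) applied to counts of genomes per CC. The function name, the CC counts, and the choice of orders 0, 1, and 2 are illustrative assumptions and are not taken from the article or from GNUVID.

```python
import math
from collections import Counter


def hill_number(counts, q):
    """Hill number of order q (effective number of types) from raw counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    # Relative abundances of the types that were actually observed.
    props = [c / total for c in counts if c > 0]
    if math.isclose(q, 1.0):
        # The limit q -> 1 is the exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical CC labels for genomes sampled in one state and time window.
labels = ["CC258"] * 6 + ["CC255"] * 3 + ["CC4"] * 1
counts = list(Counter(labels).values())
for q in (0, 1, 2):
    print(f"q={q}: {hill_number(counts, q):.2f}")
```

Order 0 counts every observed CC equally (richness), while higher orders give more weight to the dominant CCs, so a window in which one CC has largely replaced the others yields a value close to 1.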

Article activity feed

  1. SciScore for 10.1101/2020.12.28.424582:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentence: The 15,136 variant positions (features) matrix of the 53,565 CC-labelled genomes were then one-hot encoded, in which each SNP is replaced with a binary vector, and were used to train a random forest classifier in Scikit-learn (Pedregosa, et al. 2011).
    Resource: Scikit-learn
    Suggested: (scikit-learn, RRID:SCR_002577)

    Sentence: The dates of state-wide mask mandates were the dates when face covering was required in indoor public spaces and in outdoor public spaces when social distancing is not possible (Abbott 2020; Allen 2020; Angell 2020; Baker 2020; Cuomo 2020; Edwards 2020; Evers 2020; Hogan 2020; Inslee 2020; Kunkel 2020; Lamont 2020; Northam 2020; Saunders 2020; Walz 2020; Whitmer 2020).
    Resource: Abbott
    Suggested: (Abbott, RRID:SCR_010477)
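
The one-hot encoding step quoted in the Scikit-learn entry above, in which each SNP is replaced with a binary vector, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the four-letter alphabet, the all-zero handling of other symbols, the function name, and the toy genotypes are hypothetical, and the quoted sentence does not say how the authors encoded gaps or ambiguous bases.

```python
# Hypothetical alphabet; the quoted sentence does not specify the allele set.
ALPHABET = ("A", "C", "G", "T")


def one_hot_encode(genotypes, alphabet=ALPHABET):
    """Replace each allele with a binary vector of length len(alphabet)."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for row in genotypes:
        bits = []
        for allele in row:
            vector = [0] * len(alphabet)
            # Alleles outside the alphabet (e.g. "N") stay all-zero (assumption).
            if allele in index:
                vector[index[allele]] = 1
            bits.extend(vector)
        encoded.append(bits)
    return encoded


# Toy data: 3 genomes x 2 variant positions (the quoted matrix is 53,565 x 15,136).
snps = [["A", "T"], ["G", "T"], ["A", "N"]]
features = one_hot_encode(snps)
for row in features:
    print(row)
```

Each variant position expands into four binary columns here. A binary matrix of this kind, paired with one CC label per genome, is the sort of input that scikit-learn's `RandomForestClassifier.fit(X, y)` accepts.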

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    One limitation of our classification strategy, as with many schemes that operate in real time, is that paraphyletic groups can occur as a new ST arises from an older ST (e.g., CC258 and CC768 emerged from CC255 and CC258, making CC255 and CC258 paraphyletic, respectively) (Supplementary Figure 1). While this means that not all ST/CC groups will be monophyletic, this property of the nomenclature may be helpful in gauging emergence and replacement of an ancestral form. When the global region of origin for each genome sequence was mapped to each CC, there was a strong association of later-emerging CCs with certain geographical locations, possibly reflecting relative containment after international travel restrictions (Figure 2). To obtain an up-to-date picture of virus diversity in the US, we analyzed 107,414 high-coverage genomes (isolation dates between December 2019 and October 20th, 2020) from GISAID (Supplementary Table 1). There were 26,528 genomes isolated in the US in this dataset that belong to 87 of 154 CCs. Strikingly, 35% of the genomes belong to CC258 (GISAID clade GH) and 75% of the genomes are represented by just 10 CCs (CC4, 255, 256, 258, 300, 498, 768, 3530, 10221, 21210). Moreover, 72% (63/87) of the CCs (representing 82% of the genomes) had the spike D614G mutation that has been associated with increased spread (Korber, et al. 2020). Interestingly, none of the US genomes were associated with any of the 12 CCs (26377, 26754, 27693, 27950, 28012, 28825, 29259, 2...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.