Unsupervised classification of SARS-CoV-2 genomic sequences uncovers hidden genetic diversity and suggests an efficient strategy for genomic surveillance

Abstract

Accurate and timely monitoring of emerging genomic diversity is crucial for limiting the spread of potentially more transmissible/pathogenic strains of SARS-CoV-2. At the time of writing, over 1.8M distinct viral genome sequences have been made publicly available, and a sophisticated nomenclature system based on phylogenetic evidence and expert manual curation has allowed the relatively rapid classification of emerging lineages of potential concern.

Here, we propose a complementary approach that integrates fine-grained spatiotemporal estimates of allele frequency with unsupervised clustering of viral haplotypes, and demonstrate that multiple highly frequent genetic variants, arising within large and/or rapidly expanding SARS-CoV-2 lineages, have highly biased geographic distributions and are not adequately captured by current SARS-CoV-2 nomenclature standards.

Our results argue for a partial revision of current methods used to track SARS-CoV-2 genomic diversity and highlight the importance of strategies based on the systematic analysis and integration of regional data. We provide a complementary, completely automated and reproducible framework for mapping genetic diversity over time and across geographic regions, and for prioritizing virus variants of potential concern. We believe that the approach outlined in this study will contribute relevant advances to current genomic surveillance methods.
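To make the idea of "fine-grained spatiotemporal estimates of allele frequency" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each genome has already been reduced to a region, a collection date and a set of allele labels; the function name `allele_frequencies`, the 10-day window and the labels `m1` to `m3` are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from datetime import date


def allele_frequencies(genomes, window_days=10):
    """Return {(region, window_index, allele): frequency}.

    genomes: list of (region, collection_date, set_of_alleles) tuples.
    window_days: width of each time window (an illustrative assumption).
    """
    origin = min(day for _, day, _ in genomes)
    totals = defaultdict(int)   # genomes per (region, window)
    counts = defaultdict(int)   # genomes carrying an allele per (region, window)
    for region, day, alleles in genomes:
        window = (day - origin).days // window_days
        totals[(region, window)] += 1
        for allele in alleles:
            counts[(region, window, allele)] += 1
    return {key: n / totals[key[:2]] for key, n in counts.items()}


# Hypothetical toy data: region names and allele labels are placeholders.
genomes = [
    ("Italy", date(2021, 1, 1), {"m1"}),
    ("Italy", date(2021, 1, 3), {"m1", "m2"}),
    ("Brazil", date(2021, 1, 2), {"m3"}),
    ("Italy", date(2021, 1, 15), {"m2"}),
]
freqs = allele_frequencies(genomes, window_days=10)
```

An allele that is frequent in one region and window but rare elsewhere would stand out in `freqs` even if its global frequency is low, which is the kind of regional signal the abstract refers to.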

Article activity feed

  1. SciScore for 10.1101/2021.06.23.449558:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Ethics: not detected.
    Sex as a biological variable: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.

    Table 2: Resources

    Software and Algorithms
    Sentence: A standalone Galaxy implementation is available at: http://corgat.cloud.ba.infn.it/galaxy under Tools/utilities for Haplogroup assignment.
    Resource: Galaxy
    Suggested: (Galaxy, RRID:SCR_006281)

    Sentence: Haplogroups were established by hierarchical clustering of phenetic profiles of presence/absence of high frequency alleles, by applying the hclust function from the R standard libraries (Maechler et al., 2019).
    Resource: hclust
    Suggested: (HCLUST, RRID:SCR_009154)

    Sentence: Identification of sites under selection was performed by applying the MEME and FEL methods, as implemented in the Hyphy package [23], to the phylogeny and the concatenated alignment of protein-coding sequences.
    Resource: Hyphy
    Suggested: (HyPhy, RRID:SCR_016162)

    Sentence: Clustering of mutation patterns of SARS-CoV-2 lineages/HGs was performed by means of the Phenograph algorithm as implemented by the RPhenograph package [47].
    Resource: Phenograph
    Suggested: (Phenograph, RRID:SCR_016919)
    Resource: RPhenograph
    Suggested: None
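The haplogroup step quoted above (hierarchical clustering of presence/absence profiles of high-frequency alleles) can be illustrated with a small pure-Python sketch. This is a stand-in for R's `hclust`, not the authors' code: the Hamming distance, the complete-linkage rule, the `max_distance` cut-off and the genome ids `g1` to `g5` are all assumptions made for illustration, since the quoted sentence does not state the distance, linkage or cut height used.

```python
from itertools import combinations


def hamming(a, b):
    """Number of alleles whose presence/absence differs between two profiles."""
    return sum(x != y for x, y in zip(a, b))


def cluster_profiles(profiles, max_distance):
    """Agglomerative clustering with complete linkage (assumed, see lead-in).

    profiles: dict mapping genome id -> tuple of 0/1 flags, one per allele.
    Merging stops once the closest pair of clusters is farther apart
    than max_distance. Returns a sorted list of sorted lists of genome ids.
    """
    clusters = [[name] for name in sorted(profiles)]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most dissimilar members.
        return max(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical presence/absence profiles over four high-frequency alleles.
profiles = {
    "g1": (1, 1, 0, 0),
    "g2": (1, 1, 0, 0),
    "g3": (1, 1, 1, 0),
    "g4": (0, 0, 1, 1),
    "g5": (0, 0, 1, 1),
}
haplogroups = cluster_profiles(profiles, max_distance=1)
```

With these toy profiles, genomes that differ by at most one allele end up in the same group, while the two unrelated profiles form a separate group.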

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Additionally, we highlight possible limitations of Pango, the current standard for the classification and nomenclature of SARS-CoV-2, which may prevent the rapid and unsupervised identification of emergent “regional” genomic diversity. By applying a revised implementation of the strategy proposed in Chiara et al. [19], based on relaxed filters for the inclusion of low-quality genomic assemblies and on the incorporation of regional estimates of allele frequencies, we present a novel, completely automated system for the monitoring of this genomic diversity. Importantly, we observe that our revised approach can correctly associate one or more related haplogroups to all current Variants of Concern (B.1.1.7, P.1, B.1.617.2, B.1.351) and Variants of Interest defined by international health authorities (Table 3). Our approach requires, on average, 20 days between the availability of the first genomic sequence associated with a VOC/VOI and the formation of a corresponding HG. This timeframe is completely in line with that observed for the reporting of current VOCs and VOIs, although no manual intervention is required by our framework. Since B.1.1.7 is currently the most prevalent and rapidly expanding lineage of SARS-CoV-2, the observation that a substantial proportion of high-frequency mutations in this lineage, emerging at different geographic locations, is not captured by current nomenclature standards bears important implications for the accurate and rapid tracking of other mutat...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.