Unsupervised classification of SARS-CoV-2 genomic sequences uncovers hidden genetic diversity and suggests an efficient strategy for genomic surveillance
Listed in
- Evaluated articles (ScreenIT)
Abstract
Accurate and timely monitoring of emerging genomic diversity is crucial for limiting the spread of potentially more transmissible/pathogenic strains of SARS-CoV-2. At the time of writing, over 1.8M distinct viral genome sequences have been made publicly available, and a sophisticated nomenclature system based on phylogenetic evidence and expert manual curation has allowed the relatively rapid classification of emerging lineages of potential concern.
Here, we propose a complementary approach that integrates fine-grained spatiotemporal estimates of allele frequency with unsupervised clustering of viral haplotypes, and demonstrate that multiple highly frequent genetic variants, arising within large and/or rapidly expanding SARS-CoV-2 lineages, have highly biased geographic distributions and are not adequately captured by current SARS-CoV-2 nomenclature standards.
Our results argue for a partial revision of the methods currently used to track SARS-CoV-2 genomic diversity and highlight the importance of strategies based on the systematic analysis and integration of regional data. Here we provide a complementary, completely automated and reproducible framework for the mapping of genetic diversity in time and across different geographic regions, and for the prioritization of virus variants of potential concern. We believe that the approach outlined in this study will contribute to relevant advances in current genomic surveillance methods.
Article activity feed
SciScore for 10.1101/2021.06.23.449558:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
- Ethics: not detected.
- Sex as a biological variable: not detected.
- Randomization: not detected.
- Blinding: not detected.
- Power Analysis: not detected.
Table 2: Resources
Software and Algorithms
- Sentence: A standalone Galaxy implementation is available at: http://corgat.cloud.ba.infn.it/galaxy under Tools/utilities for Haplogroup assignment.
  Resource: Galaxy, suggested: (Galaxy, RRID:SCR_006281)
- Sentence: Haplogroups were established by hierarchical clustering of phenetic profiles of presence/absence of high frequency alleles, by applying the hclust function from the R standard libraries (Maechler et al, 2019).
  Resource: hclust, suggested: (HCLUST, RRID:SCR_009154)
- Sentence: Identification of sites under selection was performed by applying the MEME and FEL methods, as implemented in the Hyphy package (ref. 23), to the phylogeny and the concatenated alignment of protein-coding sequences.
  Resource: Hyphy, suggested: (HyPhy, RRID:SCR_016162)
- Sentence: Clustering of mutation patterns of SARS-CoV-2 lineages/HGs was performed by means of the Phenograph algorithm as implemented by the RPhenograph package (ref. 47).
  Resources: Phenograph, suggested: (Phenograph, RRID:SCR_016919); RPhenograph, suggested: None
Results from OddPub: Thank you for sharing your code and data.
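The Resources entries above mention hierarchical clustering of presence/absence profiles of high frequency alleles with R's hclust. As a rough illustration of that idea only, here is a minimal Python sketch; it is not the authors' R code. The Hamming distance, the complete-linkage rule (the hclust default), the cut-off `max_dist`, and the toy genome names are all assumptions that do not come from the preprint.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two presence/absence profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def complete_linkage_clusters(profiles, max_dist):
    """Agglomerative clustering of binary profiles (assumed setup).

    profiles: dict mapping a genome name to a tuple of 0/1 flags, one flag
              per high-frequency allele (1 = allele present).
    max_dist: hypothetical cut-off; merging stops once the closest pair of
              clusters is farther apart than this under complete linkage.
    """
    clusters = [[name] for name in sorted(profiles)]

    def dist(c1, c2):
        # Complete linkage: distance between the two most dissimilar members.
        return max(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        if dist(clusters[i], clusters[j]) > max_dist:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Toy profiles over five hypothetical allele positions.
toy_profiles = {
    "g1": (1, 1, 0, 0, 0),
    "g2": (1, 1, 0, 0, 0),
    "g3": (1, 1, 1, 0, 0),
    "g4": (0, 0, 0, 1, 1),
    "g5": (0, 0, 0, 1, 1),
}
toy_groups = complete_linkage_clusters(toy_profiles, max_dist=1)
```

In this toy setup each resulting group plays the role of a haplogroup. How the cut-off is chosen in the actual framework is not stated in this report, so the value used here is purely illustrative.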
Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study: Additionally, we highlight possible limitations of Pango, the current standard for the classification and nomenclature of SARS-CoV-2, which may prevent the rapid and unsupervised identification of emergent "regional" genomic diversity. By applying a revised implementation of the strategy proposed in Chiara et al. (ref. 19), based on relaxed filters for the inclusion of low-quality genomic assemblies and on the incorporation of regional estimates of allele frequencies, we present a novel, completely automated system for the monitoring of this genomic diversity. Importantly, we observe that our revised approach can correctly associate one or more related haplogroups to all current Variants of Concern (B.1.1.7, P.1, B.1.617.2, B.1.351) and Variants of Interest defined by international health authorities (Table 3). Our approach requires, on average, 20 days between the availability of the first genomic sequence associated with a VOC/VOI and the formation of a corresponding HG. This timeframe is completely in line with that observed for the reporting of current VOCs and VOIs, although no manual intervention is required by our framework. Since B.1.1.7 is currently the most prevalent and rapidly expanding lineage of SARS-CoV-2, the observation that a substantial proportion of high frequency mutations in this lineage, emerging at different geographic locations, is not captured by current nomenclature standards, bears important implications for the accurate and rapid tracking of other mutat...
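The passage above refers to regional estimates of allele frequencies. A minimal Python sketch of what such an estimate could look like follows, assuming genomes are binned by region and ISO week. The binning scheme, the `min_freq` threshold, and the placeholder names (`var1`, `Region A`, and so on) are hypothetical and are not taken from the preprint.

```python
from collections import defaultdict
from datetime import date


def regional_allele_frequencies(records):
    """Allele frequency per (region, ISO year, ISO week) bin.

    records: iterable of (region, collection_date, alleles) tuples, where
             alleles lists the variants carried by one genome.
    Returns {(region, (iso_year, iso_week)): {allele: frequency}}.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for region, collected, alleles in records:
        iso = collected.isocalendar()
        key = (region, (iso[0], iso[1]))
        totals[key] += 1
        for allele in set(alleles):  # count each allele once per genome
            counts[key][allele] += 1
    return {
        key: {allele: n / totals[key] for allele, n in counts[key].items()}
        for key in totals
    }


def high_frequency_alleles(freqs, min_freq):
    """Alleles reaching min_freq in at least one region/week bin."""
    return sorted(
        {a for by_allele in freqs.values() for a, f in by_allele.items() if f >= min_freq}
    )


# Toy input: placeholder allele labels, all sampled in ISO week 9 of 2021.
toy_records = [
    ("Region A", date(2021, 3, 1), ["var1", "var2"]),
    ("Region A", date(2021, 3, 2), ["var1"]),
    ("Region B", date(2021, 3, 1), ["var1", "var3"]),
    ("Region B", date(2021, 3, 3), ["var1"]),
    ("Region B", date(2021, 3, 4), ["var1"]),
    ("Region B", date(2021, 3, 5), ["var1"]),
]
toy_freqs = regional_allele_frequencies(toy_records)
toy_high = high_frequency_alleles(toy_freqs, min_freq=0.5)
```

In the toy data, `var2` reaches the threshold only in Region A. Pooling all six genomes would put it at 1/6, which is the kind of regionally biased signal that a per-region estimate is meant to preserve.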
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.