Unsupervised classification of SARS-CoV-2 genomic sequences uncovers hidden genetic diversity and suggests an efficient strategy for genomic surveillance

Abstract

Accurate and timely monitoring of emerging genomic diversity is crucial for limiting the spread of potentially more transmissible/pathogenic strains of SARS-CoV-2. At the time of writing, over 1.8M distinct viral genome sequences have been made publicly available, and a sophisticated nomenclature system based on phylogenetic evidence and expert manual curation has allowed the relatively rapid classification of emerging lineages of potential concern.

Here, we propose a complementary approach that integrates fine-grained spatiotemporal estimates of allele frequency with unsupervised clustering of viral haplotypes, and demonstrate that multiple highly frequent genetic variants, arising within large and/or rapidly expanding SARS-CoV-2 lineages, have highly biased geographic distributions and are not adequately captured by current SARS-CoV-2 nomenclature standards.

Our results argue for a partial revision of the methods currently used to track SARS-CoV-2 genomic diversity and highlight the importance of strategies based on the systematic analysis and integration of regional data. We provide a completely automated and reproducible framework for mapping genetic diversity over time and across geographic regions, and for prioritizing virus variants of potential concern. We believe that the approach outlined in this study will contribute to relevant advances in current genomic surveillance methods.

SciScore for 10.1101/2021.06.23.449558:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Ethics	not detected.
Sex as a biological variable	not detected.
Randomization	not detected.
Blinding	not detected.
Power Analysis	not detected.

Table 2: Resources

Software and Algorithms
Sentences	Resources
A standalone Galaxy implementation is available at: http://corgat.cloud.ba.infn.it/galaxy under Tools/utilities for Haplogroup assignment.	Galaxy suggested: (Galaxy, RRID:SCR_006281)
Haplogroups were established by hierarchical clustering of phenetic profiles of presence/absence of high frequency alleles, by applying the hclust function from the R standard libraries (Maechler et al, 2019).	hclust suggested: (HCLUST, RRID:SCR_009154)
Identification of sites under selection was performed by applying the MEME and FEL methods, as implemented in the Hyphy package [23], to the phylogeny and the concatenated alignment of protein-coding sequences.	Hyphy suggested: (HyPhy, RRID:SCR_016162)
Clustering of mutation patterns of SARS-CoV-2 lineages/HGs was performed by means of the Phenograph algorithm as implemented by the RPhenograph package [47].	Phenograph suggested: (Phenograph, RRID:SCR_016919) RPhenograph suggested: None
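
The haplogroup step quoted above (hierarchical clustering of presence/absence profiles of high-frequency alleles) can be illustrated with a small sketch. This is not the authors' implementation, which uses the R hclust function; it is a plain-Python approximation under stated assumptions: Hamming distance between profiles, complete linkage, and a fixed distance cut-off (`max_distance`). The genome ids, profiles and cut-off are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two presence/absence profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_profiles(profiles, max_distance):
    """Naive agglomerative clustering with complete linkage.

    profiles:     dict mapping genome id -> tuple of 0/1 flags, one per allele.
    max_distance: hypothetical cut-off; merging stops once the closest pair
                  of clusters is farther apart than this value.
    Returns a sorted list of clusters, each a sorted list of genome ids.
    """
    clusters = [[name] for name in sorted(profiles)]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most dissimilar members.
        return max(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical profiles: each column is one high-frequency allele.
profiles = {
    "g1": (1, 1, 0, 0, 0),
    "g2": (1, 1, 0, 0, 0),
    "g3": (1, 1, 1, 0, 0),
    "g4": (0, 0, 0, 1, 1),
    "g5": (0, 0, 0, 1, 1),
}
haplogroups = cluster_profiles(profiles, max_distance=1)
print(haplogroups)
```

With a cut-off of 1, genomes that differ by at most one allele are grouped together; a cut-off of 0 would keep `g3` apart from `g1` and `g2`. The choice of distance, linkage and cut-off strongly affects the resulting groups, so these parameters should be read as placeholders rather than as the values used in the study.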

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Additionally, we highlight possible limitations of Pango, the current standard for the classification and nomenclature of SARS-CoV-2, which may prevent the rapid and unsupervised identification of emergent “regional” genomic diversity. By applying a revised implementation of the strategy proposed in Chiara et al. [19], based on relaxed filters for the inclusion of low-quality genomic assemblies and on the incorporation of regional estimates of allele frequencies, we present a novel, completely automated system for the monitoring of this genomic diversity. Importantly, we observe that our revised approach can correctly associate one or more related haplogroups to all current Variants of Concern (B.1.1.7, P.1, B.1.617.2, B.1.351) and Variants of Interest defined by international health authorities (Table 3). Our approach requires, on average, 20 days between the availability of the first genomic sequence associated with a VOC/VOI and the formation of a corresponding HG. This timeframe is completely in line with that observed for the reporting of current VOCs and VOIs, although no manual intervention is required by our framework. Since B.1.1.7 is currently the most prevalent and rapidly expanding lineage of SARS-CoV-2, the observation that a substantial proportion of high frequency mutations in this lineage, emerging at different geographic locations, is not captured by current nomenclature standards, bears important implications for the accurate and rapid tracking of other mutat...
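
The idea of regional allele frequency estimates mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the window length (`window_days`), the frequency threshold (`min_frequency`), the region names and the allele labels are all hypothetical. The sketch shows why regional estimates matter: an allele can be rare globally and still dominate within one region and time window.

```python
from collections import defaultdict
from datetime import date


def regional_allele_frequencies(genomes, window_days=10):
    """Allele frequencies per (region, time window).

    genomes: list of (region, collection_date, set_of_alleles) tuples.
    Returns {(region, window_index): {allele: frequency}}, where the
    window index counts windows of `window_days` from the earliest date.
    """
    origin = min(day for _, day, _ in genomes)
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for region, day, alleles in genomes:
        key = (region, (day - origin).days // window_days)
        totals[key] += 1
        counts[key]  # make sure windows without any allele are still listed
        for allele in alleles:
            counts[key][allele] += 1
    return {
        key: {allele: n / totals[key] for allele, n in per_allele.items()}
        for key, per_allele in counts.items()
    }


def high_frequency_alleles(frequencies, min_frequency):
    """Alleles reaching `min_frequency` in at least one region and window."""
    return {
        allele
        for per_allele in frequencies.values()
        for allele, freq in per_allele.items()
        if freq >= min_frequency
    }


# Hypothetical input: region, collection date, alleles carried by the genome.
genomes = [
    ("region_1", date(2021, 1, 1), {"mut_A", "mut_B"}),
    ("region_1", date(2021, 1, 3), {"mut_A"}),
    ("region_2", date(2021, 1, 2), {"mut_A", "mut_C"}),
    ("region_2", date(2021, 1, 4), {"mut_C"}),
    ("region_1", date(2021, 1, 15), {"mut_A"}),
]
frequencies = regional_allele_frequencies(genomes, window_days=10)
selected = high_frequency_alleles(frequencies, min_frequency=0.75)
print(sorted(selected))
```

In this toy input `mut_C` is carried by only two of five genomes overall, but by every genome sampled in `region_2` during the first window, so a regional threshold selects it while a global one of the same value would not.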

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Related articles

Overview of SARS-CoV-2 Genomic Surveillance in Central America and the Dominican Republic from February 2020 to January 2023: The Impact of PAHO and COMISCA's Collaborative Efforts

Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary

Genomic characterization of SARS-CoV-2 variants circulating in the population of Bangui, Central African Republic (CAR) in 2022.