Unsupervised clustering analysis of SARS-Cov-2 population structure reveals six major subtypes at early stage across the world

Abstract

Identifying the population structure of the newly emerged coronavirus SARS-CoV-2 has significant potential to inform public health management and diagnosis. As SARS-CoV-2 sequencing data accrue, grouping the sequences into clusters is important for organizing the landscape of the virus's population structure. Because prior information on the newly emerged coronavirus was limited, we used four different clustering algorithms to group 16,873 SARS-CoV-2 strains, which enabled automatic identification of spatial structure for SARS-CoV-2. A total of six distinct genomic clusters were identified using mutation profiles as input features. Comparison of the clustering results reveals that the four algorithms produced highly consistent results, but the state-of-the-art unsupervised deep learning clustering algorithm performed best and produced the smallest intra-cluster pairwise genetic distances. The varied proportions of the six clusters within different continents revealed specific geographical distributions. In particular, our analysis found that Oceania was the only continent on which the strains were dispersed across all six clusters. In summary, this study provides a concrete framework for using clustering methods to study the global population structure of SARS-CoV-2. In addition, clustering methods can be used in future studies of variant population structures in specific regions for these fast-growing viruses.
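
The abstract compares algorithms by their intra-cluster pairwise genetic distances. The following is a minimal sketch of one way such a comparison could be computed, assuming each strain is encoded as a binary mutation profile and that Hamming distance stands in for genetic distance. The function names, the encoding, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length mutation profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_intra_cluster_distance(profiles, labels):
    """Mean pairwise Hamming distance within each cluster.

    profiles: list of equal-length 0/1 sequences (1 = substitution present).
    labels:   cluster label for each profile, in the same order.
    Returns {label: mean distance}; single-member clusters map to 0.0.
    """
    clusters = {}
    for profile, label in zip(profiles, labels):
        clusters.setdefault(label, []).append(profile)

    result = {}
    for label, members in clusters.items():
        pairs = list(combinations(members, 2))
        result[label] = (
            sum(hamming(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        )
    return result


# Toy data invented for illustration; not taken from the study.
profiles = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
labels = ["A", "A", "A", "B", "B"]
distances = mean_intra_cluster_distance(profiles, labels)
```

Under this assumed metric, a smaller mean within a cluster indicates more genetically homogeneous members, which is the sense in which one clustering can be said to be tighter than another.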

Article activity feed

SciScore for 10.1101/2020.09.04.283358:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Multiple sequence alignments and pairwise alignments were constructed using CLUSTALW 2.1 (21).	CLUSTALW suggested: (ClustalW, RRID:SCR_017277)
We used substitutions as features to reconstruct the phylogenetic tree using FastTree 2 (22).	FastTree suggested: (FastTree, RRID:SCR_015501)
The phylogeny is rooted following Nextstrain pipeline using FigTree v1.4.4 (23).	FigTree suggested: (FigTree, RRID:SCR_008515)
Other figures and statistical analyses were generated by the ggplot2 library in R 3.6.1, the seaborn package in Python 3.7.6 and GraphPad Prism 8.0.2.	ggplot2 suggested: (ggplot2, RRID:SCR_014601) GraphPad Prism suggested: (GraphPad Prism, RRID:SCR_002798)
The models were implemented using the Python package sklearn with the KMeans function, AgglomerativeClustering function and Birch function, respectively.	Python suggested: (IPython, RRID:SCR_001658)
Inferring positive/purifying selection of individual sites: To test which position was under selective pressure, we used a set of programs available in HyPhy (28) to calculate nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis to infer pervasive selection.	HyPhy suggested: (HyPhy, RRID:SCR_016162)
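
The Resources table names three scikit-learn models: KMeans, AgglomerativeClustering and Birch. The sketch below shows how these could be run on a binary mutation-profile matrix with six clusters and then compared for consistency. The synthetic data, the hyperparameters, and the use of the adjusted Rand index as an agreement measure are assumptions made for illustration; the fourth, deep learning algorithm mentioned in the abstract is not covered.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for a strains x sites mutation matrix (assumption):
# six groups, each defined by its own block of ten substitutions.
rng = np.random.default_rng(0)
n_clusters, per_cluster, n_sites = 6, 30, 60
truth = np.repeat(np.arange(n_clusters), per_cluster)
X = np.zeros((n_clusters * per_cluster, n_sites))
for k in range(n_clusters):
    X[truth == k, 10 * k:10 * (k + 1)] = 1.0

# Flip about 2% of entries to mimic sporadic substitutions.
noise = (rng.random(X.shape) < 0.02).astype(float)
X = np.abs(X - noise)

# The three scikit-learn models named in the table, each asked for six clusters.
# All other settings are library defaults or illustrative choices.
models = {
    "KMeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=n_clusters),
    "Birch": Birch(n_clusters=n_clusters),
}
results = {name: model.fit_predict(X) for name, model in models.items()}

# Pairwise agreement between algorithms (1.0 means identical partitions).
names = list(results)
agreement = {
    (a, b): adjusted_rand_score(results[a], results[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, score in agreement.items():
    print(pair, round(score, 3))
```

The adjusted Rand index is invariant to how cluster labels are numbered, so it can compare partitions from different algorithms directly.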

Results from OddPub: Thank you for sharing your data.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al. (2015).

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Related articles

Overview of SARS-CoV-2 Genomic Surveillance in Central America and the Dominican Republic from February 2020 to January 2023: The Impact of PAHO and COMISCA's Collaborative Efforts

Genomic characterization of SARS-CoV-2 variants circulating in the population of Bangui, Central African Republic (CAR) in 2022.

Reemergence of chikungunya in Mauritius driven by a novel lineage with pandemic potential