Data-driven approaches for genetic characterization of SARS-CoV-2 lineages
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
The genome of the Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2), the pathogen that causes coronavirus disease 2019 (COVID-19), has been sequenced at an unprecedented scale, producing a tremendous amount of viral genome sequencing data. To understand the evolution of this virus in humans, and to assist in tracing infection pathways and designing preventive strategies, we present a set of computational tools that span phylogenomics, population genetics and machine learning approaches. To illustrate the utility of this toolbox, we detail an in-depth analysis of the genetic diversity of SARS-CoV-2 in the first year of the COVID-19 pandemic, using 329,854 high-quality consensus sequences published in the GISAID database during the pre-vaccination phase. We demonstrate that, compared to standard phylogenetic approaches, haplotype networks can be computed efficiently on much larger datasets, enabling real-time analyses. Furthermore, the change in Tajima’s D over time provides a powerful metric of population expansion. Unsupervised learning techniques further highlight key steps in variant detection and facilitate the study of the role of this genomic variation in the context of SARS-CoV-2 infection, with the Multiscale PHATE methodology identifying fine-scale structure in the SARS-CoV-2 genetic data that underlies the emergence of key lineages. The computational framework presented here is useful for real-time genomic surveillance of SARS-CoV-2 and could be applied to any pathogen that threatens the health of worldwide populations of humans and other organisms.
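As an illustration of the Tajima's D statistic mentioned in the abstract, the following is a minimal sketch of the standard estimator (Tajima, 1989) for a single set of aligned sequences. It is not the authors' pipeline: the function name `tajimas_d` and the toy sequences are hypothetical, and the code assumes equal-length, gap-free sequences with no ambiguous bases. A time series would be obtained by applying the same function to sequences grouped by sampling window.

```python
from collections import Counter
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (illustrative sketch).

    Assumes no gaps or ambiguous bases; returns NaN when no site is
    segregating, because D is undefined in that case.
    """
    n = len(seqs)
    if n < 4:
        # The variance term is zero for n < 4, so D cannot be computed.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    seg_sites = 0        # S: number of segregating sites
    pair_diffs = 0.0     # total pairwise differences summed over sites
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Pairs that differ at this site: (n^2 - sum of squared counts) / 2
            pair_diffs += (n * n - sum(c * c for c in counts.values())) / 2

    if seg_sites == 0:
        return float("nan")

    pi = pair_diffs / (n * (n - 1) / 2)  # mean pairwise differences

    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy example (hypothetical data): every variant is carried by one sequence.
singleton_sample = ["AAAA", "AAAT", "AATA", "ATAA"]
print(tajimas_d(singleton_sample))
```

In the toy sample every variant is a singleton, so D is negative. An excess of rare variants of this kind is the signature usually associated with a recently expanding population.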
Article activity feed
SciScore for 10.1101/2021.09.28.462270:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
- Sentence: The phylogenetic tree is computed using FastTree v2.1.11 (47) using a GTR + Gamma model.
  Resource: FastTree; suggested: (FastTree, RRID:SCR_015501)
- Sentence: The root-to-tip distance was computed using TempEst v1.5.3 (50) and tree visualization was made using ggtree (51).
  Resource: TempEst; suggested: (TempEst, RRID:SCR_017304)
- Sentence: Further improvement can be done to the alignment by removing poorly aligned regions using Gblocks program (52).
  Resource: Gblocks; suggested: (Gblocks, RRID:SCR_015945)
- Sentence: Uniform Manifold Approximation and Projection (UMAP) is a python library that was also used on the data projected on the 20 first components of the PCA using the default parameters of the algorithm.
  Resource: python; suggested: (IPython, RRID:SCR_001658)
Results from OddPub: Thank you for sharing your code and data.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.