SARS-CoV-2 sequence typing, evolution and signatures of selection using CoVa, a Python-based command-line utility

Abstract

The current global pandemic COVID-19, caused by SARS-CoV-2, has resulted in millions of infections worldwide in a few months. Global efforts to tackle this situation have produced a tremendous body of genomic data, which can be used for tracing transmission routes, characterizing isolates, and monitoring variants with potential for unusual virulence. Several groups have analyzed these genomes using different approaches. However, as new data become available, the research community needs a pipeline that performs a set of routine analyses and can quickly incorporate new genome sequences and update the analysis reports. We developed a programmatic tool, CoVa, with this objective. It is a fast, accurate and user-friendly utility to perform a variety of genome analyses on hundreds of SARS-CoV-2 sequences. Using CoVa, we define a modified sequence typing nomenclature and identify sites under positive selection. Further analysis identified some peptides and sites showing geographical patterns of selection. Specifically, we show differences in sequence type distribution between sequences from India and those from the rest of the world. We also show that several sites show signatures of positive selection uniquely in sequences from India. Preliminary evolutionary analysis, using features that will be incorporated into CoVa in the near future, shows a mutation rate of 7.4 × 10⁻⁴ substitutions/site/year, confirms a temporal signal with a November 2019 origin of SARS-CoV-2, and reveals heterogeneity in the geographical distribution of Indian samples.
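
To put the per-site rate quoted in the abstract on a more intuitive scale, the short sketch below converts it to expected substitutions per genome. The genome length of 29,903 nt is an assumption taken from the RefSeq record NC_045512 (it is not stated in the text above), and the function name is purely illustrative; none of this is part of CoVa.

```python
# Convert a per-site substitution rate into an expected per-genome count.
# Assumption: genome length of 29,903 nt, as in RefSeq NC_045512.
GENOME_LENGTH_NT = 29903
RATE_PER_SITE_PER_YEAR = 7.4e-4  # rate quoted in the abstract


def substitutions_per_genome(rate_per_site_per_year, genome_length, years=1.0):
    """Expected number of substitutions across the whole genome over `years`."""
    return rate_per_site_per_year * genome_length * years


per_year = substitutions_per_genome(RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT)
per_month = per_year / 12.0

print("per genome per year:  %.1f" % per_year)
print("per genome per month: %.2f" % per_month)
```

Under these assumptions the rate corresponds to roughly 22 substitutions per genome per year, or a little under two per month.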

SciScore for 10.1101/2020.06.09.082834:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
NCBI Refseq accession NC_045512 is used as the variant calling reference in the pipeline.	Refseq suggested: (RefSeq, RRID:SCR_003496)
Similarly, CoVa limits split-support computation in FastTree to 100 runs for both speed and memory optimization without compromising on accuracy.	FastTree suggested: (FastTree, RRID:SCR_015501)
One of the key advantages of using MAFFT in CoVa is its ability to quickly incorporate new sequences to an existing MSA (6).	CoVa suggested: (COVA, RRID:SCR_005175)
Evolution of SARS-CoV-2: Two multiple sequence alignments built using - 1) only Indian samples and 2) samples across the globe (excluding Indian samples) were merged together as a single multiple sequence alignment (MSA) using the mafft --merge option (MAFFT reference).	MAFFT suggested: (MAFFT, RRID:SCR_011811)
This pipeline was created using python 2.7.	python suggested: (IPython, RRID:SCR_001658)
We used TempEst (12) to find the root of the tree such that it optimised for the temporal signal by trying all possible roots and chose the one that minimised the mean of the square of the residuals.	TempEst suggested: (TempEst, RRID:SCR_017304)
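
The rooting criterion described in the TempEst row above (try every candidate root, regress root-to-tip distance on sampling date, keep the root with the smallest mean squared residual) can be sketched as follows. This is a minimal illustration and not TempEst's implementation: the function names, the toy dates, and the precomputed root-to-tip distances for each candidate root are all hypothetical. A real analysis would derive the distances from the tree for each rooting position.

```python
from statistics import mean


def fit_line(dates, distances):
    """Ordinary least-squares fit: distance = slope * date + intercept."""
    x_bar, y_bar = mean(dates), mean(distances)
    sxx = sum((x - x_bar) ** 2 for x in dates)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dates, distances))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_mean_square(dates, distances):
    """Mean of the squared residuals around the fitted regression line."""
    slope, intercept = fit_line(dates, distances)
    return mean([(y - (slope * x + intercept)) ** 2
                 for x, y in zip(dates, distances)])


def best_root(dates, distances_by_root):
    """Return the candidate root whose regression has the smallest residual mean square."""
    return min(distances_by_root,
               key=lambda root: residual_mean_square(dates, distances_by_root[root]))


# Hypothetical toy data: sampling dates in decimal years and, for each
# candidate root, the root-to-tip distance of every tip (substitutions/site).
dates = [2020.0, 2020.1, 2020.2, 2020.3]
distances_by_root = {
    "root_a": [0.0001, 0.0002, 0.0003, 0.0004],  # clock-like
    "root_b": [0.0003, 0.0001, 0.0004, 0.0002],  # poor temporal signal
}

chosen = best_root(dates, distances_by_root)
slope, intercept = fit_line(dates, distances_by_root[chosen])
origin_estimate = -intercept / slope  # date at which the fitted distance is zero
print(chosen, slope, origin_estimate)
```

In this kind of regression the slope is an estimate of the substitution rate and the x-intercept is an estimate of the date of the most recent common ancestor, which is how a temporal signal and an origin date can be read off the same fit.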

Results from OddPub: Thank you for sharing your code.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 12. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Related articles

DIVERSITY AND CLINICAL CORRELATIONS OF SARS-CoV-2 VARIANT DURING THE INTRODUCTION OF THE DELTA VARIANT IN GUATEMALA

Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary

Genomic characterization of SARS-CoV-2 variants circulating in the population of Bangui, Central African Republic (CAR) in 2022.