SARS-CoV-2 sequence typing, evolution and signatures of selection using CoVa, a Python-based command-line utility

Abstract

The current global pandemic COVID-19, caused by SARS-CoV-2, has resulted in millions of infections worldwide within a few months. Global efforts to tackle it have produced a large body of genomic data, which can be used to trace transmission routes, characterize isolates, and monitor variants with potential for unusual virulence. Several groups have analyzed these genomes using different approaches. However, as new data become available, the research community needs a pipeline that performs a set of routine analyses and can quickly incorporate new genome sequences and update the analysis reports. We developed a programmatic tool, CoVa, with this objective. It is a fast, accurate and user-friendly utility for performing a variety of genome analyses on hundreds of SARS-CoV-2 sequences. Using CoVa, we define a modified sequence typing nomenclature and identify sites under positive selection. Further analysis identified some peptides and sites showing geographical patterns of selection. Specifically, we show differences in sequence type distribution between sequences from India and those from the rest of the world. We also show that several sites show signatures of positive selection uniquely in sequences from India. Preliminary evolutionary analysis, using features that will be incorporated into CoVa in the near future, shows a mutation rate of 7.4 × 10⁻⁴ substitutions/site/year, confirms a temporal signal with a November 2019 origin of SARS-CoV-2, and indicates heterogeneity in the geographical distribution of Indian samples.

Article activity feed

  1. SciScore for 10.1101/2020.06.09.082834:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    NCBI Refseq accession NC_045512 is used as the variant calling reference in the pipeline.
    Refseq
    suggested: (RefSeq, RRID:SCR_003496)
    Similarly, CoVa limits split-support computation in FastTree to 100 runs for both speed and memory optimization without compromising on accuracy.
    FastTree
    suggested: (FastTree, RRID:SCR_015501)
    One of the key advantages of using MAFFT in CoVa is its ability to quickly incorporate new sequences to an existing MSA (6).
    CoVa
    suggested: (COVA, RRID:SCR_005175)
    Evolution of SARS-CoV-2: Two multiple sequence alignments built using - 1) only Indian samples and 2) samples across the globe (excluding Indian samples) were merged together as a single multiple sequence alignment (MSA) using the mafft --merge option (MAFFT reference).
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    This pipeline was created using python 2.7.
    python
    suggested: (IPython, RRID:SCR_001658)
    We used TempEst (12) to find the root of the tree such that it optimised for the temporal signal by trying all possible roots and chose the one that minimised the mean of the square of the residuals.
    TempEst
    suggested: (TempEst, RRID:SCR_017304)
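
The rooting criterion quoted in the TempEst entry above (try each candidate root, regress root-to-tip distance on sampling date, keep the root with the smallest mean squared residual) can be illustrated with a minimal Python sketch. This is not TempEst's code or CoVa's implementation. The function names `root_to_tip_fit` and `best_root`, the candidate labels, and all numbers are hypothetical and made up for illustration. In practice the distances would come from re-rooting a real tree. Under these assumptions the regression slope plays the role of the substitution rate and the x-intercept the role of the estimated origin date.

```python
from statistics import mean


def root_to_tip_fit(dates, distances):
    """Least-squares fit of distance = rate * date + intercept.

    Returns (rate, intercept, mean squared residual).
    """
    mx, my = mean(dates), mean(distances)
    sxx = sum((x - mx) ** 2 for x in dates)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dates, distances))
    rate = sxy / sxx
    intercept = my - rate * mx
    residuals = [y - (rate * x + intercept) for x, y in zip(dates, distances)]
    return rate, intercept, mean(r * r for r in residuals)


def best_root(dates, candidates):
    """Pick the candidate root whose regression has the smallest
    mean squared residual.

    candidates maps a (hypothetical) root label to the root-to-tip
    distances of the tips, in the same order as dates.
    """
    fits = {name: root_to_tip_fit(dates, d) for name, d in candidates.items()}
    best = min(fits, key=lambda name: fits[name][2])
    return best, fits[best]


# Made-up sampling dates (decimal years) and root-to-tip distances
# (substitutions/site) for two hypothetical rootings of the same tree.
dates = [2020.0, 2020.1, 2020.2, 2020.3]
candidates = {
    "root_A": [8.0e-5, 1.6e-4, 2.4e-4, 3.2e-4],  # clock-like
    "root_B": [1.0e-4, 3.0e-4, 1.0e-4, 4.0e-4],  # noisy
}

best, (rate, intercept, msr) = best_root(dates, candidates)
origin = -intercept / rate  # date at which the fitted distance is zero
print(best, rate, origin, msr)
```

A non-positive slope under every rooting would suggest there is no usable temporal signal. The sketch does not check for that case.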

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 12. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.