Synonymous mutations and the molecular evolution of SARS-CoV-2 origins

Abstract

Human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is most closely related, by average genetic distance, to two coronaviruses isolated from bats, RaTG13 and RmYN02. However, there is a segment of high amino acid similarity between human SARS-CoV-2 and a pangolin-isolated strain, GD410721, in the receptor-binding domain (RBD) of the spike protein, a pattern that can be caused either by recombination or by convergent amino acid evolution driven by natural selection. We perform a detailed analysis of the synonymous divergence, which is less likely to be affected by selection than amino acid divergence, between human SARS-CoV-2 and related strains. We show that the synonymous divergence between the bat-derived viruses and SARS-CoV-2 is larger than between GD410721 and SARS-CoV-2 in the RBD, providing strong additional support for the recombination hypothesis. However, the synonymous divergence between the pangolin strain and SARS-CoV-2 is also relatively high, which is not consistent with a recent recombination between them; instead, it suggests a recombination into RaTG13. We also find a 14-fold increase in the dN/dS ratio from the lineage leading to SARS-CoV-2 to the strains of the current pandemic, suggesting that the vast majority of nonsynonymous mutations currently segregating within the human strains have a negative impact on viral fitness. Finally, we estimate that the time to the most recent common ancestor of SARS-CoV-2 and RaTG13 or RmYN02, based on synonymous divergence, is 51.71 years (95% CI, 28.11–75.31) and 37.02 years (95% CI, 18.19–55.85), respectively.
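
As a rough illustration of how a synonymous divergence can be turned into a divergence time, the sketch below uses a strict molecular clock, under which pairwise divergence accumulates along both lineages at once. The abstract states neither the substitution rate nor the estimator the authors used, so the function names, the rate, and the divergence value here are hypothetical placeholders rather than the paper's method. The last lines only check that the two reported 95% intervals are symmetric about their point estimates.

```python
# Sketch only: strict-clock conversion of synonymous divergence to time.
# The rate and divergence used in `example` are hypothetical placeholders;
# they are not taken from the paper.

def tmrca_years(d_s: float, rate_per_site_per_year: float) -> float:
    """Time to the most recent common ancestor under a strict clock.

    Two lineages diverge from their ancestor in parallel, so the pairwise
    synonymous divergence d_s grows at twice the per-lineage rate.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return d_s / (2.0 * rate_per_site_per_year)


def ci_half_width(lower: float, upper: float) -> float:
    """Half the width of a confidence interval."""
    return (upper - lower) / 2.0


# Hypothetical inputs, chosen only to show the arithmetic (about 50 years).
example = tmrca_years(d_s=0.1, rate_per_site_per_year=1e-3)

# Midpoints of the intervals reported in the abstract (years).
ratg13_midpoint = (28.11 + 75.31) / 2.0  # compare with the 51.71 estimate
rmyn02_midpoint = (18.19 + 55.85) / 2.0  # compare with the 37.02 estimate
```

Both midpoints coincide with the reported point estimates, which would be expected if the intervals were built as an estimate plus or minus a fixed margin; that reading is an inference from the numbers, not something the abstract states.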

Article activity feed

  1. SciScore for 10.1101/2020.04.20.052019:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    BLAST searches: Sequences for BLAST databases were downloaded on March 26, 2020 from the following sources: EMBL nucleotide libraries for virus (ftp://ftp.ebi.ac.uk/pub/databases/embl/release/std), NCBI Virus Genomes, NCBI Influenza Genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/INFLUENZA/), all Whole Genome Shotgun (https://www.ncbi.nlm.nih.gov/genbank/wgs/) assemblies under taxonomy ID 10239, along with GISAID Epiflu and EpiCoV databases.
    BLAST
    suggested: (BLASTX, RRID:SCR_001653)
    Influenza Genomes
    suggested: None
    https://www.ncbi.nlm.nih.gov/genbank/wgs/
    suggested: (Whole Genome Shotgun (WGS Project, RRID:SCR_016637)
    The genome alignments were performed using MAFFT (v7.450) (Katoh and Standley 2013) with parameters “--maxiterate 1000 --localpair”.
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    The coding sequences of each gene were aligned using PRANK (Loytynoja 2014) (v.170427) with parameters “-codon -F”.
    PRANK
    suggested: (prank, RRID:SCR_017228)
    The NJ tree was estimated using the ’neighbor’ software from the PHYLIP package (Felsenstein 2009).
    PHYLIP
    suggested: (PHYLIP, RRID:SCR_006244)
    Estimation of sequence divergence in 300-bp windows: dN and dS were estimated using two different methods implemented in the PAML package (Yang 2007)
    PAML
    suggested: (PAML, RRID:SCR_014932)
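
The windowed divergence estimate quoted in the table above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: PAML estimates dN and dS with codon-aware methods, whereas the stand-in below computes a plain proportion of differing sites (p-distance) per window. The function names and the choice of non-overlapping windows are assumptions; the quoted sentence gives only the 300-bp window size.

```python
from typing import Iterator, List, Tuple


def window_bounds(length: int, size: int = 300, step: int = 300) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for each full window; a trailing partial window is dropped.

    With size and step both multiples of 3, every window starts on a codon
    boundary of a codon-aligned sequence.
    """
    for start in range(0, length - size + 1, step):
        yield start, start + size


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, ignoring columns where either sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")
    return sum(x != y for x, y in pairs) / len(pairs)


def windowed_p_distance(a: str, b: str, size: int = 300, step: int = 300) -> List[Tuple[int, int, float]]:
    """Return (start, end, p-distance) for each window of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [(s, e, p_distance(a[s:e], b[s:e])) for s, e in window_bounds(len(a), size, step)]
```

For a real analysis each window's pair of codon alignments would be handed to a dN/dS estimator in place of `p_distance`.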

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.

  2. SciScore for 10.1101/2020.03.02.973255:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    We used TRIMMOMATIC (59) to trim the reads of those samples to 100 bp, with the following command line: We aligned the FASTQ files using Burrows-Wheeler Aligner (BWA) (60) using the official sequence of SARS-CoV-2 (NC_045512.2) as reference genome.
    TRIMMOMATIC
    suggested: (Trimmomatic, RRID:SCR_011848)
    After the alignments, BAM files were sorted using SAMtools (
    SAMtools
    suggested: (SAMTOOLS, RRID:SCR_002105)
    Due to a high error rate reported by QUALIMAP, samples SRR11059943 and SRR10971381 have been removed from the analysis.
    QUALIMAP
    suggested: (QualiMap, RRID:SCR_001209)
    To avoid potential artifacts due to strand bias, we used the AS_StrandOddsRatio parameter calculated following GATK guidelines (https://gatk.broadinstitute.org/hc/en-us/articles/360040507111-AS-StrandOddsRatio), and any mutation with an AS_StrandOddsRatio > 4 has been removed from the dataset.
    GATK
    suggested: (GATK, RRID:SCR_001876)
    Bcftools (61) has been used to calculate total allelic depths on the forward and reverse strand (ADF, ADR) for AS_StrandOddsRatio calculation, with the following command line: Mutations common to the datasets generated by Reditools 2 and JACUSA were considered (n = 910, Fig.
    Reditools
    suggested: (REDItools, RRID:SCR_012133)
    Data manipulation: R packages (Biostrings, Rsamtools, ggseqlogo, ggplot2, splitstackshape) and custom Perl scripts were used to handle the data.
    ggplot2
    suggested: (ggplot2, RRID:SCR_014601)
    SARS-CoV-2, SARS and MERS genomic data were prepared for the Logi alignment using the GenomicRanges R package (63)
    GenomicRanges
    suggested: (GenomicRanges, RRID:SCR_000025)
    Consensus sequences of SARS and MERS genomes were built using the “cons” tool from the EMBOSS suite (http://bioinfo.nhri.org.tw/gui/) with default settings.
    EMBOSS
    suggested: (EMBOSS, RRID:SCR_008493)
    SARS-CoV-2 genomic sequences were downloaded from GISAID (https://www.gisaid.org/) and aligned with MUSCLE (64).
    MUSCLE
    suggested: (MUSCLE, RRID:SCR_011812)
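
The strand-bias filter quoted in the table above can be sketched as follows. The formula is the symmetric odds ratio as described in GATK's StrandOddsRatio documentation, to the best of my understanding. The pseudocount of 1 per cell, the function names, and the `(ref, alt)` layout assumed for the ADF and ADR depths are assumptions of this sketch rather than details confirmed by the quoted text. Only the cutoff of 4 comes from the source.

```python
import math
from typing import Sequence


def strand_odds_ratio(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Symmetric strand odds ratio from a 2x2 table of read counts."""
    # Add 1 to every cell so that empty cells do not cause division by zero.
    rf, rr, af, ar = (x + 1.0 for x in (ref_fwd, ref_rev, alt_fwd, alt_rev))
    r = (rf * ar) / (af * rr)
    symmetrical_ratio = r + 1.0 / r
    ref_ratio = min(rf, rr) / max(rf, rr)
    alt_ratio = min(af, ar) / max(af, ar)
    return math.log(symmetrical_ratio) + math.log(ref_ratio) - math.log(alt_ratio)


def passes_strand_filter(adf: Sequence[int], adr: Sequence[int], threshold: float = 4.0) -> bool:
    """Keep a biallelic site unless its strand odds ratio exceeds the threshold.

    adf and adr are assumed to hold (ref, alt) depths on the forward and
    reverse strands respectively.
    """
    sor = strand_odds_ratio(adf[0], adr[0], adf[1], adr[1])
    return sor <= threshold
```

A site whose alternate allele is seen almost only on one strand scores well above 4 and is dropped, while a site with balanced counts scores near ln 2 and is kept.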

    Results from OddPub: Thank you for sharing your data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.
