Large-scale population analysis of SARS-CoV-2 whole genome sequences reveals host-mediated viral evolution with emergence of mutations in the viral Spike protein associated with elevated mortality rates

Abstract

Background

We aimed to characterize in depth the intra-host variation and founder variants of SARS-CoV-2 worldwide up to August 2020 by examining more than 94,000 SARS-CoV-2 viral sequences, in order to understand how SARS-CoV-2 variants evolve and arise and to identify any increased mortality associated with them.

Methods and Findings

We combined worldwide sequencing data from the GISAID and Sequence Read Archive (SRA) repositories and discovered SARS-CoV-2 hypermutation occurring in less than 2% of COVID-19 patients, likely caused by host mechanisms involving APOBEC3G complexes and intra-host microdiversity. Most of this intra-host variation in SARS-CoV-2 is predicted to change viral proteins with defined variant signatures, demonstrating that SARS-CoV-2 can be actively shaped by the host immune system to varying degrees. At the global population level, several SARS-CoV-2 proteins, such as Nsp2, 3C-like proteinase, ORF3a and ORF8, are under active evolution, as evidenced by their increased πN/πS ratios per geographical region. Importantly, two emergent variants are associated with high fatality rates and are increasingly spreading throughout the world: V1176F, in co-occurrence with the D614G mutation in the viral Spike protein, and S477N, located in the Receptor Binding Domain (RBD) of the Spike protein. The S477N variant arose quickly in Australia, and experimental data support that this variant increases Spike protein fitness and its binding to ACE2.

Conclusions

SARS-CoV-2 is evolving non-randomly, and human hosts shape emergent variants with positive fitness that can easily spread into the population. We propose that the V1176F and S477N variants in the Spike protein are two novel SARS-CoV-2 mutations that may pose significant public health concerns in the future.

Author Summary

We have developed an efficient bioinformatics pipeline that has allowed us to obtain the most complete picture to date of how the SARS-CoV-2 virus has changed during the last eight months of the global pandemic and how it will continue to change in the near future. We characterized the importance of the host immune response in shaping viral variants to different degrees, as evidenced by hypermutation responses on SARS-CoV-2 in less than 2% of infections and by positive selection of several viral proteins by geographical region. We underscore how human hosts are shaping emergent variants with positive fitness that can easily spread into the population, as evidenced by variants V1176F and S477N, located in the stalk and receptor binding domains of the Spike protein, respectively. Variant V1176F is associated with increased mortality rates in Brazil, and variant S477N is associated with increased mortality rates worldwide. In addition, it has been experimentally demonstrated that the S477N variant increases the fitness of the Spike protein and its binding to ACE2, and it is thus predicted to increase the virulence of SARS-CoV-2. This limits the concept of ‘herd immunity’ proposals and re-emphasizes the need to limit the spread of the virus to avoid the emergence of more virulent forms of SARS-CoV-2 that can spread worldwide.

Article activity feed

  1. SciScore for 10.1101/2020.10.23.20218511:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Organisms/Strains
    Sentences | Resources
    Then, we modelled the variant and calculated the free energy upon amino acid changes as follows: foldx --command=BuildModel --pdb=name-of-protein.pdb --mutant-file=individual_list.txt --ionStrength=0.05 --pH=7 --water=CRYSTAL --vdwDesign=2 --out-pdb=true --pdbHydrogens=false --numberOfRuns=30, where individual_list.txt contains the amino acid change (for example, for a serine-to-asparagine change occurring in the three chains of the Spike protein trimer: SA477N,SB477N,SC477N;).
    SB477N
    suggested: None
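
The mutant-file format quoted above (one entry per chain, comma-separated, with the line terminated by a semicolon) can be generated programmatically. The following is a minimal sketch for illustration only; the function name `foldx_mutation_line` and its parameters are hypothetical and are not taken from the manuscript.

```python
def foldx_mutation_line(wild_type: str, position: int, mutant: str,
                        chains: str = "ABC") -> str:
    """Build one line of a FoldX individual-list mutant file.

    Each entry is <wild-type residue><chain><position><mutant residue>;
    entries are comma-separated and the line ends with a semicolon.
    """
    entries = [f"{wild_type}{chain}{position}{mutant}" for chain in chains]
    return ",".join(entries) + ";"


if __name__ == "__main__":
    # Serine-to-asparagine change at position 477 in all three Spike chains.
    print(foldx_mutation_line("S", 477, "N"))
```

Writing the returned string to `individual_list.txt` reproduces the example given in the quoted sentence; the default of three chains (A, B, C) is an assumption matching the trimer described there.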
    Software and Algorithms
    Sentences | Resources
    The resulting BAM files were sorted and indexed by using Samtools [22].
    Samtools
    suggested: (SAMTOOLS, RRID:SCR_002105)
    Then, the Jacquard program (https://jacquard.readthedocs.io/en/v0.42/index.html) in a Python environment [24] merges every VCF file containing the variants associated with each BAM file into a single VCF file containing the aggregated variants from all genomes.
    python
    suggested: (IPython, RRID:SCR_001658)
    Then, the Minimap2 aligner with preset -ax asm5 [21] aligns every FASTA genome against the SARS-CoV-2 reference genome.
    Minimap2
    suggested: (Minimap2, RRID:SCR_018550)
    The Freebayes variant caller with the --min-alternate-count 1 (-C 1) option (https://github.com/ekg/freebayes) performs variant calling on each BAM file, outputting variants in VCF format.
    https://github.com/ekg/freebayes
    suggested: (FreeBayes, RRID:SCR_010761)
    Figure 1A graph was constructed by using variants per genome, reported in the output file “logfile_variants_GISAID_freebayes”, inputted into the GraphPad Prism 8 software.
    GraphPad
    suggested: (GraphPad Prism, RRID:SCR_002798)
    SnpEff annotation: Merged variants from GISAID genomes (n=76563) were annotated by using a repurposed version of the SnpEff program, available in the Galaxy server [30-32].
    SnpEff
    suggested: (SnpEff, RRID:SCR_005191)
    Then, we parsed genomes and associated metadata by country (in particular, deceased and released cases) by using a combination of standard UNIX tools, vcflib (https://github.com/vcflib/vcflib) and BEDOPS [34].
    BEDOPS
    suggested: (BEDOPS, RRID:SCR_012865)
    After these steps, we uploaded the resulting output per country (Deceased-Released.subset file) to the Galaxy server (https://usegalaxy.org/) [31, 35] and performed Fisher’s exact test to identify variants with a significant difference in viral frequencies between the groups (snpFreq program, https://rdrr.io/github/lvclark/SNPfreq/).
    Galaxy
    suggested: (Galaxy, RRID:SCR_006281)
    P values from Fisher’s exact test were converted to their negative base-10 logarithm by using R version 3.6.3 (https://www.r-project.org/).
    https://www.r-project.org/
    suggested: (R Project for Statistical Computing, RRID:SCR_001905)
    Fasttree version 2.1 [38] was used to infer an approximately-maximum-likelihood phylogenetic tree from the aligned sequences in FASTA format, by using a heuristic neighbor-joining clustering method [38] and the Jukes-Cantor model of evolution [39].
    Fasttree
    suggested: (FastTree, RRID:SCR_015501)
    These models were generated by the C-I-TASSER pipeline [42-45].
    C-I-TASSER
    suggested: None
    The full Spike protein trimer was obtained from I-TASSER and the variant V1176F was modelled by using Foldx5, as previously described in the Free energy estimation calculations section (--command=BuildModel, first outputted model).
    I-TASSER
    suggested: (I-TASSER, RRID:SCR_014627)
    Statistical analysis: All statistical analyses were carried out by using GraphPad Prism 8 software (https://www.graphpad.com/scientific-software/prism/).
    GraphPad Prism
    suggested: (GraphPad Prism, RRID:SCR_002798)
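
The per-variant comparison quoted in the entries above (Fisher’s exact test on deceased versus released cases, followed by a negative base-10 logarithm transformation of the P values) can be illustrated with a small, self-contained sketch. The manuscript reports using the snpFreq program on Galaxy and R 3.6.3; the Python below is a stand-in written purely for illustration. The function names, the table layout and the example counts are all hypothetical.

```python
from math import comb, log10


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are variant present / absent, columns are
    deceased / released. The P value sums the probabilities of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(p, 1.0)


def neg_log10(p: float) -> float:
    """Convert a P value to its negative base-10 logarithm."""
    return -log10(p)


if __name__ == "__main__":
    # Hypothetical counts: variant carriers (12 deceased, 30 released)
    # versus non-carriers (20 deceased, 200 released).
    p_value = fisher_exact_two_sided(12, 30, 20, 200)
    print(p_value, neg_log10(p_value))
```

The exact test is implemented with the standard library only, so that the sketch has no external dependencies; an established statistics package would normally be preferred in practice.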

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on page 49. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.