Mapping the Evolutionary Space of SARS-CoV-2 Variants to Anticipate Emergence of Subvariants Resistant to COVID-19 Therapeutics

Abstract

New sublineages of SARS-CoV-2 variants-of-concern (VOCs) continuously emerge with mutations in the spike glycoprotein. In most cases, the sublineage-defining mutations vary between the VOCs. It is unclear whether these differences reflect lineage-specific likelihoods for mutations at each spike position or the stochastic nature of their appearance. Here we show that SARS-CoV-2 lineages have distinct evolutionary spaces (a probabilistic definition of the sequence states that can be occupied by expanding virus subpopulations). This space can be accurately inferred from the patterns of amino acid variability at the whole-protein level. Robust networks of co-variable sites identify the highest-likelihood mutations in new VOC sublineages and predict remarkably well the emergence of subvariants with resistance mutations to COVID-19 therapeutics. Our studies reveal the contribution of low frequency variant patterns at heterologous sites across the protein to accurate prediction of the changes at each position of interest.

Article activity feed

  1. SciScore for 10.1101/2022.02.01.478697:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Recombinant DNA
    Sentences | Resources
    Thus, each cluster is assigned a 1,273-feature vector that describes the absence or presence of volatility at each position of spike.
    1,273-feature
    suggested: None
    Software and Algorithms
    Sentences | Resources
    The following processing steps and analyses were performed within the Galaxy web platform (46).
    Galaxy
    suggested: (Galaxy, RRID:SCR_006281)
    To facilitate alignment of sequences that contain more nucleotides than those corresponding to the spike gene, we trimmed excess bases with Cutadapt, using 5’-ATGTTTGTT-3’ and 3’-TACACATAA-5’ “adapters” that flank the spike gene.
    Cutadapt
    suggested: (cutadapt, RRID:SCR_011841)
    Sequences that cause frameshift mutations were excluded using Transeq.
    Transeq
    suggested: (Transeq, RRID:SCR_015647)
    Nucleotide sequences were then translated with Transeq and amino acid sequences were aligned with MAFFT, FFT-NS-2 (47).
    MAFFT
    suggested: (MAFFT, RRID:SCR_011811)
    Phylogenetic tree construction and analyses: A maximum-likelihood tree was constructed for the aligned compressed nucleotide sequences using the generalized time-reversible model with CAT approximation (GTR-CAT) nucleotide evolution model with FASTTREE (49).
    FASTTREE
    suggested: (FastTree, RRID:SCR_015501)
    To divide the tree into “Groups” of sequences, we used an in-house code in Python (see link to GitHub repository in the Data Availability section).
    Python
    suggested: (IPython, RRID:SCR_001658)
    Network structure was visualized using the open-source software Gephi (51).
    Gephi
    suggested: (Gephi, RRID:SCR_004293)
    To determine robustness of network structure, we randomly deleted 10, 20 or 30 percent of all edges for each of the networks, and network topological properties were computed using the Cytoscape Network Analyzer tool (52).
    Cytoscape
    suggested: (Cytoscape, RRID:SCR_003032)
    Maximum Likelihood computations of dN and dS were conducted using the HyPhy software package (56).
    HyPhy
    suggested: (HyPhy, RRID:SCR_016162)
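
The robustness check quoted above (randomly deleting 10, 20 or 30 percent of all edges and recomputing topological properties) can be illustrated with a minimal sketch. This is not the authors' code and does not use Cytoscape; the toy network, the function names, and the two properties computed here (mean degree and number of connected components) are assumptions chosen only to show the idea.

```python
import random


def delete_random_edges(edges, fraction, seed=0):
    """Return a copy of `edges` with roughly `fraction` of them removed at random.

    Hypothetical helper: the seed is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    n_remove = round(len(edges) * fraction)
    removed = set(rng.sample(range(len(edges)), n_remove))
    return [edge for i, edge in enumerate(edges) if i not in removed]


def mean_degree(edges, nodes):
    """Average number of edges per node in an undirected network."""
    return 2 * len(edges) / len(nodes) if nodes else 0.0


def count_components(edges, nodes):
    """Count connected components with a simple union-find."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(node) for node in nodes})


# Toy stand-in network (assumption): a complete graph on 5 nodes, 10 edges.
nodes = list(range(5))
edges = [(i, j) for i in nodes for j in nodes if i < j]

for fraction in (0.1, 0.2, 0.3):
    kept = delete_random_edges(edges, fraction)
    print(
        f"deleted {fraction:.0%}: {len(kept)} edges left, "
        f"mean degree {mean_degree(kept, nodes):.2f}, "
        f"components {count_components(kept, nodes)}"
    )
```

A network whose properties change little across the three deletion levels would be considered robust in this sense. In practice the deletion would be repeated with many random seeds, and the properties of interest would be whichever ones the analysis tool reports.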

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.