Predictive profiling of SARS-CoV-2 variants by deep mutational learning

Abstract

The continual evolution of the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and the emergence of variants that show resistance to vaccines and neutralizing antibodies (1–4) threaten to prolong the coronavirus disease 2019 (COVID-19) pandemic (5). Selection and emergence of SARS-CoV-2 variants are driven in part by mutations within the viral spike protein and in particular the ACE2 receptor-binding domain (RBD), a primary target site for neutralizing antibodies. Here, we develop deep mutational learning (DML), a machine learning-guided protein engineering technology, which is used to interrogate a massive sequence space of combinatorial mutations, representing billions of RBD variants, by accurately predicting their impact on ACE2 binding and antibody escape. A highly diverse landscape of possible SARS-CoV-2 variants is identified that could emerge from a multitude of evolutionary trajectories. DML may be used for predictive profiling on current and prospective variants, including highly mutated variants such as Omicron (B.1.1.529), thus supporting decision making for public health as well as guiding the development of therapeutic antibody treatments and vaccines for COVID-19.

Article activity feed

  1. SciScore for 10.1101/2021.12.07.471580:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Antibodies
    Sentences | Resources
    Cells expressing RBD that maintained antibody-binding (IgG+/FLAG+) or showed a complete loss of antibody binding (escape) (IgG-/FLAG+) were sorted by FACS (BD Aria Fusion or Sony MA800 instrument).
    antibody-binding (IgG+/FLAG+
    suggested: None
    Antibody production and purification: Heavy chain and light chain inserts for REGN10933, REGN10987 (PDB: 6XDG) and LY-CoV16 (PDB: 7C01), LY-CoV555 (PDB: 7KMG) were cloned into pTwist transient expression vectors by Gibson Assembly.
    LY-CoV16
    suggested: None
    Cells were stained with biotinylated ACE2 or purified antibody as described above.
    ACE2
    suggested: None
    Experimental Models: Cell Lines
    Sentences | Resources
    30 mL cultures of Expi293 cells (Thermo, A14635) were transfected according to the manufacturer’s instructions.
    Expi293
    suggested: (RRID:CVCL_D615)
    Recombinant DNA
    Sentences | Resources
    Cloning and expression of RBD mutagenesis libraries for yeast surface display: For libraries 2C and 2CE, synthetic single-stranded oligonucleotides (ssODNs) (Integrated DNA Technologies ultramers or oPools) were designed with degenerate codons spanning the region of interest and encoding the desired library diversity, with 30 bp overhangs on each end that were homologous to the yeast display plasmid pYD1.
    pYD1
    suggested: (RRID:Addgene_73447)
    Experimental validation of selected RBD variants for ACE2-binding and antibody escape: Individual sequences for RBD variants were ordered as complementary forward and reverse primers (Integrated DNA Technologies) in 96-well plates. A single round of annealing and extension was used to produce double-stranded DNA with 14 bp of homology at the 5’ and 3’ ends to the pYD1-RBD entry vector, followed by Gibson Assembly with EcoRI-digested vector.
    pYD1-RBD
    suggested: None
    Software and Algorithms
    Sentences | Resources
    Populations were pooled at the desired ratios and sequenced using Illumina 2 x 250 PE or 2 x 150 PE protocols (MiSeq or NovaSeq instruments).
    MiSeq
    suggested: (A5-miseq, RRID:SCR_012148)
    Processing of deep sequencing data, statistical analysis and plots: Data preprocessing: Sequencing reads were paired, quality trimmed and assembled using Geneious and BBDuk, with a quality threshold of qphred ≥ 25.
    Geneious
    suggested: (Geneious, RRID:SCR_010519)
    Statistical analysis and plots: Statistical analysis was performed using R 4.0.1 (6) and Python 3.8.5 (7).
    Python
    suggested: (IPython, RRID:SCR_001658)
    Graphics were generated using the ggplot2 3.3.3 (8), ComplexHeatmap 2.4.3 (9), pheatmap 1.0.12 (10), igraph 1.2.6 (11), RCy3 2.8.1 (12), stringr 1.4.0 (13), dplyr 1.0.6 (14), and RColorBrewer 1.1-2 (15) R packages.
    ggplot2
    suggested: (ggplot2, RRID:SCR_014601)
    ComplexHeatmap
    suggested: (ComplexHeatmap, RRID:SCR_017270)
    Escape Networks: Network plots were generated using the igraph package 1.2.6 (11) and Cytoscape software 3.8.2 (16), with an edge drawn between every pair of amino acid sequences from ED 1 and 2 when the two sequences share a common mutation at the amino acid level.
    igraph
    suggested: (igraph, RRID:SCR_019225)
    Cytoscape
    suggested: (Cytoscape, RRID:SCR_003032)
    Data was prepared and visualized using numpy (1.19.2), matplotlib (3.3.4), and pandas (1.2.4).
    numpy
    suggested: (NumPy, RRID:SCR_008633)
    matplotlib
    suggested: (MatPlotLib, RRID:SCR_008624)
    Random Forest (RF) and other benchmarking ML models were built using Scikit-Learn (0.24.2), with an 80/20 random train-test data split to train baseline models and a 90/10 random train-test data split for the final RF and RNN models.
    Scikit-Learn
    suggested: (scikit-learn, RRID:SCR_002577)
    Structural Prediction of RBD variants by AlphaFold2: Structural predictions were generated with the AlphaFold v2.1.0 public iPython notebook using residues 331-530 of the spike protein.
    iPython
    suggested: (IPython, RRID:SCR_001658)
    Results were visualized and aligned in PyMol v2.2.3 (21).
    PyMol
    suggested: (PyMOL, RRID:SCR_000305)
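
The escape-network rule quoted in the Software and Algorithms table above (an edge between two sequences whenever they share a common amino acid mutation) can be illustrated with a minimal sketch. This is not the authors' code: they report using igraph and Cytoscape, whereas the sketch below uses only the Python standard library. The reference sequence, the variant names and sequences, and the helper names `mutations` and `escape_network_edges` are all hypothetical, and the restriction to sequences "from ED 1 and 2" is assumed to have been applied beforehand.

```python
from itertools import combinations


def mutations(seq, ref):
    """Return the set of (1-based position, new amino acid) substitutions in seq relative to ref."""
    if len(seq) != len(ref):
        raise ValueError("sequence and reference must have equal length")
    return {(i + 1, aa) for i, (aa, r) in enumerate(zip(seq, ref)) if aa != r}


def escape_network_edges(variants, ref):
    """List (name_a, name_b, shared_mutations) for every pair sharing at least one mutation."""
    muts = {name: mutations(seq, ref) for name, seq in variants.items()}
    edges = []
    for a, b in combinations(sorted(muts), 2):
        shared = muts[a] & muts[b]
        if shared:  # draw an edge only when a mutation is common to both
            edges.append((a, b, sorted(shared)))
    return edges


# Hypothetical toy data: a 4-residue "reference" and four variants.
reference = "NKYQ"
toy_variants = {
    "v1": "NRYQ",  # K2R
    "v2": "NRYH",  # K2R + Q4H
    "v3": "NKYH",  # Q4H
    "v4": "DKYQ",  # N1D, shares nothing with the others
}

edges = escape_network_edges(toy_variants, reference)
for a, b, shared in edges:
    print(a, b, shared)
```

In this toy case v1 and v2 are linked through the substitution at position 2 and v2 and v3 through the one at position 4, while v4 remains an isolated node. The resulting edge list could be handed to any graph library for layout and plotting.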

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on pages 25 and 21. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.
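
As a rough illustration of the JetFighter recommendation above, the sketch below draws a heatmap with a perceptually uniform colormap ("viridis") instead of "jet". It assumes the affected figures are matplotlib heatmaps, which the report does not state; the data are random placeholders and the helper name `plot_heatmap` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np


def plot_heatmap(data, cmap="viridis"):
    """Draw a 2-D array as a heatmap with the given colormap and a colorbar."""
    fig, ax = plt.subplots()
    image = ax.imshow(data, cmap=cmap)
    fig.colorbar(image, ax=ax, label="value (placeholder units)")
    return fig, image


# Placeholder data standing in for whatever the original figure shows.
rng = np.random.default_rng(0)
fig, image = plot_heatmap(rng.random((10, 10)))
# fig.savefig("heatmap_viridis.png", dpi=150)  # uncomment to write the figure to disk
```

Other perceptually uniform options that ship with matplotlib, such as "cividis", can be passed through the `cmap` argument in the same way.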


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.