Predictive profiling of SARS-CoV-2 variants by deep mutational learning

Abstract

The continual evolution of the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and the emergence of variants that show resistance to vaccines and neutralizing antibodies (1–4) threaten to prolong the coronavirus disease 2019 (COVID-19) pandemic (5). Selection and emergence of SARS-CoV-2 variants are driven in part by mutations within the viral spike protein and in particular the ACE2 receptor-binding domain (RBD), a primary target site for neutralizing antibodies. Here, we develop deep mutational learning (DML), a machine learning-guided protein engineering technology, which is used to interrogate a massive sequence space of combinatorial mutations, representing billions of RBD variants, by accurately predicting their impact on ACE2 binding and antibody escape. A highly diverse landscape of possible SARS-CoV-2 variants is identified that could emerge from a multitude of evolutionary trajectories. DML may be used for predictive profiling of current and prospective variants, including highly mutated variants such as omicron (B.1.1.529), thus supporting decision making for public health as well as guiding the development of therapeutic antibody treatments and vaccines for COVID-19.

SciScore for 10.1101/2021.12.07.471580:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Antibodies
Sentences	Resources
Cells expressing RBD that maintained antibody-binding (IgG+/FLAG+) or showed a complete loss of antibody binding (escape) (IgG-/FLAG+) were sorted by FACS (BD Aria Fusion or Sony MA800 instrument).	antibody-binding (IgG+/FLAG+ suggested: None
Antibody production and purification: Heavy chain and light chain inserts for REGN10933, REGN10987 (PDB: 6XDG) and LY-CoV16 (PDB: 7C01), LY-CoV555 (PDB: 7KMG) were cloned into pTwist transient expression vectors by Gibson Assembly.	LY-CoV16 suggested: None
Cells were stained with biotinylated ACE2 or purified antibody as described above.	ACE2 suggested: None
Experimental Models: Cell Lines
Sentences	Resources
30 mL cultures of Expi293 cells (Thermo, A14635) were transfected according to the manufacturer’s instructions.	Expi293 suggested: RRID:CVCL_D615
Recombinant DNA
Sentences	Resources
Cloning and expression of RBD mutagenesis libraries for yeast surface display: For libraries 2C and 2CE, synthetic single-stranded oligonucleotides (ssODNs) (Integrated DNA Technologies ultramers or oPools) were designed with degenerate codons spanning the region of interest and encoding the desired library diversity, with 30 bp overhangs on each end that were homologous to the yeast display plasmid pYD1.	pYD1 suggested: RRID:Addgene_73447
Experimental validation of selected RBD variants for ACE2-binding and antibody escape: Individual sequences for RBD variants were ordered as complementary forward and reverse primers (Integrated DNA Technologies) in 96-well plates. A single round of annealing and extension was used to produce double-stranded DNA with 14 bp of homology at the 5’ and 3’ ends to the pYD1-RBD entry vector, followed by Gibson Assembly with EcoRI-digested vector.	pYD1-RBD suggested: None
Software and Algorithms
Sentences	Resources
Populations were pooled at the desired ratios and sequenced using Illumina 2 x 250 PE or 2 x 150 PE protocols (MiSeq or NovaSeq instruments).	MiSeq suggested: (A5-miseq, RRID:SCR_012148)
Processing of deep sequencing data, statistical analysis and plots: Data preprocessing: Sequencing reads were paired, quality trimmed and assembled using Geneious and BBDuk, with a quality threshold of qphred ≥ 25.	Geneious suggested: (Geneious, RRID:SCR_010519)
Statistical analysis and plots: Statistical analysis was performed using R 4.0.1 (6) and Python 3.8.5 (7).	Python suggested: (IPython, RRID:SCR_001658)
Graphics were generated using the ggplot2 3.3.3 (8), ComplexHeatmap 2.4.3 (9), pheatmap 1.0.12 (10), igraph 1.2.6 (11), RCy3 2.8.1 (12), stringr 1.4.0 (13), dplyr 1.0.6 (14), and RColorBrewer 1.1-2 (15) R packages.	ggplot2 suggested: (ggplot2, RRID:SCR_014601) ComplexHeatmap suggested: (ComplexHeatmap, RRID:SCR_017270)
Escape Networks: Network plots were generated using the igraph package 1.2.6 (11) and Cytoscape software 3.8.2 (16), with edges drawn between every pair of amino acid sequences from ED 1 and 2 when the two sequences share a common mutation at the amino acid level.	igraph suggested: (igraph, RRID:SCR_019225) Cytoscape suggested: (Cytoscape, RRID:SCR_003032)
Data was prepared and visualized using numpy (1.19.2), matplotlib (3.3.4), and pandas (1.2.4).	numpy suggested: (NumPy, RRID:SCR_008633) matplotlib suggested: (MatPlotLib, RRID:SCR_008624)
Random Forest (RF) and other benchmarking ML models were built using Scikit-Learn (0.24.2), an 80/20 train-test data split (random split) to train baseline models, and a 90/10 train-test data split (random split) for final RF and RNN models.	Scikit-Learn suggested: (scikit-learn, RRID:SCR_002577)
Structural Prediction of RBD variants by AlphaFold2: Structural predictions were generated with the AlphaFold v2.1.0 public iPython notebook using residues 331-530 of the spike protein.	iPython suggested: (IPython, RRID:SCR_001658)
Results were visualized and aligned in PyMol v2.2.3 (21).	PyMol suggested: (PyMOL, RRID:SCR_000305)
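
The edge rule quoted in the Escape Networks row above (an edge between every pair of sequences that share a common amino acid mutation) can be illustrated with a minimal Python sketch. This is not the authors' igraph/Cytoscape workflow. The four-residue reference `WILD_TYPE`, the toy variants, and the helper names `mutations` and `escape_network_edges` are all hypothetical, chosen only to make the rule concrete.

```python
from itertools import combinations

# Hypothetical 4-residue reference sequence, for illustration only
# (not a real RBD segment).
WILD_TYPE = "NKGQ"


def mutations(seq, ref=WILD_TYPE):
    """Return the set of substitutions in seq relative to ref, e.g. {'K2E'}."""
    if len(seq) != len(ref):
        raise ValueError("sequence and reference must have equal length")
    return {
        f"{r}{pos}{s}"
        for pos, (r, s) in enumerate(zip(ref, seq), start=1)
        if r != s
    }


def escape_network_edges(seqs, ref=WILD_TYPE):
    """Connect every pair of sequences sharing at least one substitution."""
    muts = {s: mutations(s, ref) for s in seqs}
    return [(a, b) for a, b in combinations(seqs, 2) if muts[a] & muts[b]]


# Toy variants carrying one or two substitutions each (assumed data).
variants = ["NEGQ", "NEGR", "TKGQ", "TKGR"]
edges = escape_network_edges(variants)
```

Here `NEGQ` and `NEGR` are linked through the shared substitution K2E, while `NEGQ` and `TKGQ` share nothing and stay unconnected. The resulting edge list could then be passed to a graph library for layout and plotting.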

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: Please consider improving the rainbow (“jet”) colormap(s) used on pages 25 and 21. At least one figure is not accessible to readers with colorblindness and/or is not true to the data, i.e. not perceptually uniform.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
