Machine Learning Driven Simulations of the SARS-CoV-2 Fitness Landscape from Deep Mutational Scanning Experiments
Abstract
Predicting protein variant effects is a key challenge in preparing for pathogenic viral strains, understanding mutation-linked diseases, and designing new proteins. Protein sequence-structure-function relationships are difficult to model because of complex allosteric and epistatic effects. To investigate efficient modeling strategies, we trained supervised machine learning (ML) models on deep mutational scanning (DMS) libraries of SARS-CoV-2 receptor binding domain (RBD) sequences labeled with angiotensin-converting enzyme 2 (ACE2) binding affinity. These models outperform adding or averaging the effects of point mutations when predicting combinatorial mutation effects, and they show strong extrapolative performance in ranking Omicron variants when trained only on wild type (WT) variants. We characterize the RBD fitness landscape by combining ML with Markov chain Monte Carlo simulations to predict evolutionary patterns from the WT sequence, and we generate sequence profiles comparable to those of high-fitness sequences in DMS data, predicting mutations in unseen Omicron variants. These models provide insight into the relationship between RBD sequence elements and offer a new perspective on the use of DMS to predict emerging viral strains, which we anticipate will be applicable to other evolutionary prediction tasks. To facilitate application and future development of this strategy, we introduce Mavenets: https://github.com/SztainLab/mavenets.
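As a rough illustration of how a learned fitness function can be combined with Markov chain Monte Carlo sampling, the sketch below runs a Metropolis walk over sequences starting from a reference sequence. It is not the Mavenets API, and nothing in it is taken from the paper's implementation: the single-site proposal scheme, the acceptance rule, the temperature, and the names `toy_fitness` and `metropolis_walk` are all assumptions made for illustration. The toy scoring function stands in for a trained ML model that would predict ACE2 binding affinity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq, target):
    """Hypothetical stand-in for a trained model.

    Returns the fraction of positions at which seq matches target.
    A real application would call a model trained on DMS data instead.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def metropolis_walk(start, fitness, n_steps=1000, temperature=0.05, seed=0):
    """Metropolis sampling over sequences with single-site substitutions.

    Assumed scheme: pick a position and a residue uniformly at random
    (a symmetric proposal), then accept with probability
    min(1, exp((f_new - f_old) / temperature)).
    Returns a list of (sequence, fitness) pairs, one per step plus the start.
    """
    rng = random.Random(seed)
    current = list(start)
    current_fit = fitness("".join(current))
    trajectory = [("".join(current), current_fit)]
    for _ in range(n_steps):
        # Propose a single-site substitution.
        pos = rng.randrange(len(current))
        proposal = current.copy()
        proposal[pos] = rng.choice(AMINO_ACIDS)
        proposal_fit = fitness("".join(proposal))
        # Metropolis criterion: always accept improvements,
        # accept decreases with Boltzmann-like probability.
        delta = proposal_fit - current_fit
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = proposal, proposal_fit
        trajectory.append(("".join(current), current_fit))
    return trajectory


if __name__ == "__main__":
    # Arbitrary toy sequences, not real RBD fragments.
    start_seq = "ACDEFGHIKL"
    target_seq = "ACDKFGHIRL"
    walk = metropolis_walk(start_seq, lambda s: toy_fitness(s, target_seq))
    print(walk[0], walk[-1])
```

Lower values of the assumed `temperature` parameter make the walk greedier, while higher values let it cross low-fitness intermediates; per-position residue frequencies collected along such a trajectory give a sequence profile of the kind the abstract compares against high-fitness DMS sequences.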