SoluProtMut: Siamese Deep Learning for Predicting Solubility Effects of Protein Mutations with Experimental Validation
Abstract
Protein solubility is an attractive engineering target because it critically influences both the scalability of protein production and the success of therapeutic proteins in biomedical applications. However, predicting solubility changes upon mutation in silico is challenging due to data heterogeneity and protein bias. Here, we explore how different sources of solubility data can be used for machine learning and present SoluProtMut, a Siamese deep geometric neural network trained to predict the impact of mutations on protein solubility. Our final model was trained exclusively on deep mutational scanning data. We compare our model with five established solubility prediction methods. The model achieves state-of-the-art performance on an independent dataset of various proteins, especially in predicting the effects of multipoint mutations (informedness of 26.5 %). Our findings also reaffirm that the scarcity of solubility data continues to hamper progress in this field. To address this limitation, we experimentally quantified solubility changes for hundreds of single-point and multipoint mutants of haloalkane dehalogenase. We used these measurements, together with recent deep mutational scanning data on myoglobin, for external validation. Although generalization to unseen proteins remains limited, our findings demonstrate the potential of integrating high-throughput assays with deep learning to improve the accuracy and scope of solubility prediction.
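For readers unfamiliar with the metric, informedness (Youden's J) for a binary classifier is sensitivity plus specificity minus one, so 0 corresponds to chance level and 1 to perfect prediction. The following is a minimal sketch of that definition only. It assumes labels are binarized with 1 meaning "solubility increases" and 0 meaning "does not increase"; that encoding, the function name, and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
def informedness(y_true, y_pred):
    """Informedness (Youden's J) = sensitivity + specificity - 1.

    Assumption: labels are 1 (solubility increases) or 0 (does not).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


# Toy labels, invented purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
toy_j = informedness(y_true, y_pred)  # sensitivity 0.75, specificity 0.5
```

Unlike plain accuracy, this score does not reward a predictor for always guessing the majority class, which matters when one class of mutations dominates a dataset.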
Highlights
We present a novel anti-symmetric Siamese architecture for predicting mutational effects from protein structures, built on graph-based convolutional neural networks that ensure SE(3) invariance.
We demonstrate that a model trained only on single-point mutants of a single protein derived from high-throughput experiments generalizes to multipoint mutants of unseen proteins, achieving a state-of-the-art informedness of 26.5 % in binary prediction.
We address the key limitation in the domain by extending the available data with 277 single-point and multipoint mutants of haloalkane dehalogenase labelled in-house and a selection of 1037 recently published single-point mutants of myoglobin.
We systematically study how different data subsets affect the performance of the trained models, revealing that including yeast-derived high-throughput data in training hampers generalization to low-throughput assays but recovers the performance on the yeast-derived myoglobin dataset.
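The anti-symmetry property named in the first highlight can be illustrated with a toy example: if one shared encoder embeds both the wild type and the mutant, and the score is a bias-free linear function of the difference of the two embeddings, then swapping the inputs flips the sign of the prediction. The sketch below shows only this general construction. The dense `tanh` encoder, the feature vectors, and all names are hypothetical stand-ins; this is not the SoluProtMut architecture, which the text describes as operating on protein structures with graph-based convolutional networks.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Toy shared encoder: one random dense layer with tanh (a stand-in only)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return encode


def make_antisymmetric_scorer(in_dim, emb_dim=4, seed=0):
    """Siamese scorer: the same encoder is applied to both inputs."""
    encode = make_encoder(in_dim, emb_dim, seed)
    rng = random.Random(seed + 1)
    head = [rng.uniform(-1, 1) for _ in range(emb_dim)]  # linear head, no bias

    def score(wild_type, mutant):
        # A bias-free linear map of the embedding difference is an odd
        # function of its input, so score(a, b) == -score(b, a).
        diff = [m - w for m, w in zip(encode(mutant), encode(wild_type))]
        return sum(h * d for h, d in zip(head, diff))

    return score


score = make_antisymmetric_scorer(in_dim=3)
wt = [0.2, -0.5, 1.0]   # hypothetical wild-type features
mut = [0.3, -0.4, 0.7]  # hypothetical mutant features
forward = score(wt, mut)  # predicted change for wt -> mut
reverse = score(mut, wt)  # predicted change for mut -> wt
```

By construction, a protein compared with itself scores zero, and the reverse mutation receives the negated prediction, which is the consistency an anti-symmetric design is meant to guarantee.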