SoluProtMut: Siamese Deep Learning for Predicting Solubility Effects of Protein Mutations with Experimental Validation

Jan Velecký
Hana Faldynová
Pedro Hermosilla
Nela Sendlerová
Mark Doerr
Sára Egersdorfová
Uwe Bornscheuer
Jiří Damborský
Zbyněk Prokop
Stanislav Mazurenko


Abstract

Protein solubility is an attractive engineering target because it critically influences the scalability of protein production and the success of therapeutic proteins in biomedical applications. However, predicting solubility changes upon mutation in silico is challenging due to data heterogeneity and protein bias. Here, we explore how different sources of solubility data can be used for machine learning and present SoluProtMut, a Siamese deep geometric neural network trained to predict the impact of mutations on protein solubility. Our final model was trained exclusively on deep mutational scanning data. We compare our model with five established solubility prediction methods. The model achieves state-of-the-art performance on an independent dataset of various proteins, especially in predicting the effects of multipoint mutations (informedness of 26.5 %). Our findings also reaffirm that the scarcity of solubility data continues to hamper progress in this field. To address this limitation, we experimentally quantified solubility changes for hundreds of single-point and multipoint mutants of haloalkane dehalogenase. We used these measurements, together with recent deep-mutational-scanning data on myoglobin, for external validation. Although generalization to unseen proteins remains limited, our findings demonstrate the potential of integrating high-throughput assays with deep learning to improve the accuracy and scope of solubility prediction.
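The abstract reports performance as informedness without defining it. A minimal sketch follows, assuming the standard definition for binary classification (Youden's J: sensitivity plus specificity minus one) and assuming labels of 1 for solubility-increasing and 0 for solubility-decreasing mutations. The function name and the toy labels are illustrative, not taken from the article.

```python
def informedness(y_true, y_pred):
    """Binary informedness (Youden's J) = sensitivity + specificity - 1.

    Assumed convention: 1 = solubility-increasing, 0 = solubility-decreasing.
    A random or constant predictor scores 0; a perfect one scores 1.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Toy labels (made up for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
score = informedness(y_true, y_pred)      # 0.75 + 0.75 - 1
always_positive = informedness(y_true, [1] * 8)  # 1.0 + 0.0 - 1
```

Under this definition, a value of 26.5 % would mean the classifier sits roughly a quarter of the way from chance-level to perfect separation of the two classes, independent of class imbalance.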

Highlights

We present a novel anti-symmetric Siamese architecture for structure-based prediction of mutation effects, built on graph-based convolutional neural networks that ensure SE(3) invariance.

We demonstrate that a model trained only on single-point mutants of a single protein derived from high-throughput experiments generalizes to multipoint mutants of unseen proteins, achieving a state-of-the-art binary-prediction informedness of 26.5 %.

We address the key limitation in the domain by extending the available data with 277 single-point and multipoint mutants of haloalkane dehalogenase labelled in-house and a selection of 1037 recently published single-point mutants of myoglobin.

We systematically study how different data subsets affect the performance of the trained models, revealing that including yeast-derived high-throughput data in training hampers generalization to low-throughput assays but recovers the performance on the yeast-derived myoglobin dataset.
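The first highlight combines two properties: anti-symmetry (swapping wild type and mutant flips the sign of the prediction) and SE(3) invariance (rotating or translating the structure leaves the prediction unchanged). The toy sketch below illustrates both ideas only; it is not the SoluProtMut network. The encoder, its weights, and the coordinates are hypothetical, and invariance is obtained here simply by using pairwise distances instead of graph convolutions.

```python
import math
from itertools import combinations


def encode(coords, weights=(0.5, -0.25, 0.1)):
    """Hypothetical shared encoder: 3D points -> scalar.

    Uses only pairwise distances, which do not change under rotation
    or translation, so the output is invariant to rigid motions.
    """
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    feats = (sum(dists) / len(dists), min(dists), max(dists))
    return math.tanh(sum(w * f for w, f in zip(weights, feats)))


def siamese_delta(wild_type, mutant):
    """Anti-symmetric score: the same encoder is applied to both inputs
    and the outputs are subtracted, so swapping the inputs negates it."""
    return encode(mutant) - encode(wild_type)


# Made-up coordinates standing in for two structures.
wt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mut = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

forward = siamese_delta(wt, mut)
reverse = siamese_delta(mut, wt)

# Translate both structures by the same vector; the score should not move.
shift = (1.0, 2.0, 3.0)
wt_shifted = [tuple(c + s for c, s in zip(p, shift)) for p in wt]
mut_shifted = [tuple(c + s for c, s in zip(p, shift)) for p in mut]
shifted = siamese_delta(wt_shifted, mut_shifted)
```

Sharing one encoder between the two branches is what makes the construction Siamese; taking the difference of the two outputs guarantees that a mutation and its reverse receive opposite scores and that an unmutated protein scores exactly zero.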

Version published to 10.1101/2025.09.26.676459 on bioRxiv
Sep 27, 2025

