Pro4S: prediction of protein solubility by fusing sequence, structure, and surface
Abstract
Protein solubility is a critical physicochemical property influencing protein stability, therapeutic efficacy, and overall developability in drug discovery. However, traditional experimental methods for assessing solubility are often resource-intensive and time-consuming. To address these limitations, computational approaches leveraging artificial intelligence have emerged, yet current models generally treat qualitative classification and quantitative regression as separate tasks and rely predominantly on sequence-based information, neglecting crucial structural and surface characteristics. Here, we introduce Pro4S, a novel multimodal predictive model that integrates protein language models, structural data, and surface descriptors using advanced contrastive learning techniques. Our unified framework achieves significant improvements in prediction accuracy, robustness, and generalizability for both qualitative and quantitative solubility assessments. Benchmark comparisons demonstrate that Pro4S consistently outperforms existing state-of-the-art predictors across diverse datasets. Furthermore, by applying Pro4S to the emerging area of de novo protein design, we validated a strong correlation between predicted solubility and experimental expression levels, reducing the proportion of non-expressed proteins by 52.7% while retaining 96.7% of highly expressed proteins. This highlights Pro4S’s potential to serve as a reliable upfront screening tool for increasing expression success rates and accelerating rational protein engineering.