Data-efficient protein mutational effect prediction with weak supervision by molecular simulation and protein language models


Abstract

Machine learning-based prediction of protein mutational effects is widely used in protein engineering and pathogenicity prediction, but scarcity of training data remains a major challenge because experimental measurements are costly. A previous study proposed augmenting the training data with computational estimates from molecular simulation (1). However, that approach has been limited to predicting mutational effects on thermostability. Here, we present a new data augmentation method that combines molecular simulation with zero-shot predictions from protein language models. These computational estimates serve as “weak” training data that supplement the experimental training data. Our method dynamically adjusts both the weight of the weak training data and whether they are included at all, based on the amount of experimental training data available. This reduces the potential negative impact of weak training data while extending applicability to diverse protein properties such as binding affinity and enzymatic activity. Benchmark tests demonstrate that our method improves prediction accuracy, particularly when experimental training data are scarce. These results indicate that our approach can advance protein engineering and pathogenicity prediction in small-data regimes.
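The abstract does not specify how the weight and inclusion of weak training data are adjusted, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a linear ridge model, a hypothetical linear weight schedule (`weak_weight`), and hypothetical parameters `n_cutoff` and `w_max`; none of these names or values come from the article.

```python
import numpy as np

def weak_weight(n_exp, n_cutoff=200, w_max=0.5):
    """Hypothetical schedule (an assumption, not from the article):
    the weight on weak labels shrinks linearly from w_max to 0 as the
    number of experimental samples approaches n_cutoff."""
    if n_exp >= n_cutoff:
        return 0.0  # enough experimental data: exclude weak data entirely
    return w_max * (1.0 - n_exp / n_cutoff)

def fit_weighted_ridge(X_exp, y_exp, X_weak, y_weak,
                       alpha=1.0, n_cutoff=200, w_max=0.5):
    """Fit ridge regression on experimental data (weight 1) plus
    weak computational estimates (weight w <= w_max)."""
    w = weak_weight(len(y_exp), n_cutoff, w_max)
    if w == 0.0:
        # Inclusion decision: drop the weak data.
        X, y = X_exp, y_exp
        sample_w = np.ones(len(y_exp))
    else:
        X = np.vstack([X_exp, X_weak])
        y = np.concatenate([y_exp, y_weak])
        sample_w = np.concatenate([np.ones(len(y_exp)),
                                   np.full(len(y_weak), w)])
    # Closed-form weighted ridge: (X^T W X + alpha I)^{-1} X^T W y
    XtW = X.T * sample_w
    coef = np.linalg.solve(XtW @ X + alpha * np.eye(X.shape[1]), XtW @ y)
    return coef, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_coef = np.array([1.0, -2.0, 0.5])
    # Synthetic stand-ins: a few exact "experimental" labels and many
    # noisy "weak" labels (e.g. simulation or language-model scores).
    X_exp = rng.normal(size=(10, 3))
    y_exp = X_exp @ true_coef
    X_weak = rng.normal(size=(50, 3))
    y_weak = X_weak @ true_coef + rng.normal(scale=0.5, size=50)
    coef, w = fit_weighted_ridge(X_exp, y_exp, X_weak, y_weak)
    print("weak-data weight:", w)
    print("fitted coefficients:", coef)
```

In this sketch the feature matrices stand in for any mutation encoding, and the weak labels stand in for simulation-derived or zero-shot scores. Both the schedule and the cutoff would need to be chosen or tuned for a real dataset.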
