Data-efficient protein mutational effect prediction with weak supervision by molecular simulation and protein language models

Teppei Deguchi
Nur Syatila Ab Ghani
Yoichi Kurumida
Shinji Iida
Kaito Kobayashi
Yutaka Saito

Abstract

Machine learning-based protein mutational effect prediction is widely used in protein engineering and pathogenicity prediction, but training data scarcity remains a major challenge due to high costs of experimental measurements. A previous study proposed data augmentation using computational estimates by molecular simulation. However, this approach has been limited to predicting mutational effects on thermostability. Here, we present a new data augmentation method that combines molecular simulation with zero-shot prediction computed by protein language models. These computational estimates serve as “weak” training data to supplement experimental training data. Our method dynamically adjusts the weight and inclusion of weak training data based on available experimental training data. This reduces potential negative impacts of weak training data while extending applicability to diverse protein properties such as binding affinity and enzymatic activity. Benchmark tests demonstrate that our method improves prediction accuracy particularly when experimental training data are scarce. These results indicate the capability of our approach to advance protein engineering and pathogenicity prediction in small data regimes.
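
The abstract does not spell out how the weight and inclusion of weak training data are adjusted, so the following is only a minimal sketch of one way such a scheme could look, not the authors' implementation. It assumes a weighted ridge regression over precomputed feature vectors and picks a single weight for all weak samples by cross-validation on the experimental data alone. The function names (fit_weighted_ridge, select_weak_weight), the candidate weights, and the choice of ridge regression with mean squared error are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_weighted_ridge(X, y, w, lam=1.0):
    """Closed-form ridge regression with per-sample weights w.

    Solves (X^T W X + lam * I) beta = X^T W y, where W = diag(w).
    """
    Xw = X * w[:, None]                        # each row scaled by its weight
    A = X.T @ Xw + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, Xw.T @ y)


def select_weak_weight(X_exp, y_exp, X_weak, y_weak,
                       candidates=(0.0, 0.1, 0.3, 1.0),
                       n_folds=5, lam=1.0, seed=0):
    """Choose the weak-data weight by cross-validation on experimental data.

    A candidate weight of 0.0 corresponds to excluding weak data entirely.
    Validation error is always measured on held-out experimental samples,
    so weak data is kept only if it helps predict real measurements.
    Assumes n_folds <= number of experimental samples.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_exp)), n_folds)
    best_alpha, best_err = candidates[0], np.inf
    for alpha in candidates:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Experimental samples get weight 1, weak samples get weight alpha.
            X = np.vstack([X_exp[trn], X_weak])
            y = np.concatenate([y_exp[trn], y_weak])
            w = np.concatenate([np.ones(len(trn)),
                                np.full(len(y_weak), alpha)])
            beta = fit_weighted_ridge(X, y, w, lam=lam)
            errs.append(np.mean((X_exp[val] @ beta - y_exp[val]) ** 2))
        err = float(np.mean(errs))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```

Under these assumptions, X_weak and y_weak would hold features and computational estimates (for example, simulation scores or language-model zero-shot scores) for variants without measurements, and the selected weight would then be used to refit on all experimental data plus the weighted weak data. Because the weight is chosen against experimental measurements only, it can fall to zero when the weak estimates do not help predict them.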

Version published to 10.1101/2025.04.08.647800 on bioRxiv
Apr 14, 2025

