Fitness translocation: improving variant effect prediction with biologically-grounded data augmentation

Adrien Mialland
Shuzo Fukunaga
Riku Katsuki
Yunfei Dong
Hideki Yamaguchi
Yutaka Saito


Abstract

Predicting the functional effects of protein variants (variant effect prediction) is essential in protein engineering but remains challenging due to the scarcity of fitness data for training prediction models. To address this limitation, we introduce a data augmentation strategy called fitness translocation, which leverages variant fitness data from homologous proteins to enhance prediction models for a target protein. Using embeddings from protein language models, our method computes the differences between a homolog's wild type and its variants, then applies these differences to the target wild type to generate synthetic variants of the target in the embedding space. We evaluate this approach on three protein families (IGPS, GFP, and SARS-CoV-2 spike proteins) under various prediction models and training data sizes. Fitness translocation consistently improves prediction accuracy, especially under limited training data. Moreover, accuracy improvement is observed even between remote homologs with sequence identity as low as 35%. These results highlight the potential of data-efficient protein engineering by reusing fitness data previously accumulated in homologs. The code is available at https://github.com/adrienmialland/ProtFitTrans.
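The embedding-space step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function name `translocate_fitness` is hypothetical, random vectors stand in for protein language model embeddings, and reusing the homolog fitness values unchanged as labels for the synthetic variants is an assumption made only for illustration.

```python
import numpy as np


def translocate_fitness(homolog_wt, homolog_variants, target_wt):
    """Hypothetical helper: move homolog variant embeddings onto the target.

    homolog_wt:       (dim,) embedding of the homolog wild type
    homolog_variants: (n_variants, dim) embeddings of the homolog variants
    target_wt:        (dim,) embedding of the target wild type
    Returns (n_variants, dim) synthetic target-variant embeddings.
    """
    homolog_wt = np.asarray(homolog_wt, dtype=float)
    homolog_variants = np.asarray(homolog_variants, dtype=float)
    target_wt = np.asarray(target_wt, dtype=float)
    # Difference between each homolog variant and the homolog wild type.
    deltas = homolog_variants - homolog_wt
    # Apply the same differences to the target wild type.
    return target_wt + deltas


# Placeholder data: random vectors instead of real language-model embeddings.
rng = np.random.default_rng(0)
dim, n_variants = 8, 5
homolog_wt = rng.normal(size=dim)
homolog_variants = homolog_wt + 0.1 * rng.normal(size=(n_variants, dim))
homolog_fitness = rng.normal(size=n_variants)
target_wt = rng.normal(size=dim)

synthetic_X = translocate_fitness(homolog_wt, homolog_variants, target_wt)
# Assumption for illustration only: homolog fitness values are reused as labels.
synthetic_y = homolog_fitness
```

Under these assumptions, `synthetic_X` and `synthetic_y` would be appended to whatever measured variants exist for the target before fitting a prediction model.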

Version published to 10.1101/2024.12.17.628831 on bioRxiv
Dec 20, 2024

Related articles

Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods

This article has 1 author:
1. Hayden Farquhar
This article has no evaluations. Latest version: Feb 4, 2026

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluations. Latest version: Dec 24, 2025

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

This article has 7 authors:
1. Valentina Carbonari
2. Annamaria Defilippo
3. Ugo Lomoio
4. Caterina Francesca Perri
5. Barbara Puccio
6. Pierangelo Veltri
7. Pietro Hiram Guzzi
This article has no evaluations. Latest version: Dec 23, 2025
