Fitness translocation: improving variant effect prediction with biologically-grounded data augmentation

Abstract

Predicting the functional effects of protein variants (variant effect prediction) is essential in protein engineering but remains challenging due to the scarcity of fitness data for training prediction models. To address this limitation, we introduce a data augmentation strategy called fitness translocation, which leverages variant fitness data from homologous proteins to enhance prediction models for a target protein. Using embeddings from protein language models, our method computes the differences between a homolog's wild type and its variants, then applies these differences to the target wild type to generate synthetic variants of the target in the embedding space. We evaluate this approach on three protein families (IGPS, GFP, and SARS-CoV-2 spike proteins) across various prediction models and training data sizes. Fitness translocation consistently improves prediction accuracy, especially when training data are limited. Moreover, accuracy improves even between remote homologs with sequence identity as low as 35%. These results highlight the potential for data-efficient protein engineering by reusing fitness data previously accumulated in homologs. The code is available at https://github.com/adrienmialland/ProtFitTrans.
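
The embedding-space step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that): the function name `translocate`, the toy three-dimensional vectors, and the choice to carry each homolog variant's fitness label over unchanged to its synthetic counterpart are all assumptions made for illustration. In practice the vectors would come from a protein language model.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def translocate(
    homolog_wt: Sequence[float],
    homolog_variants: Sequence[Sequence[float]],
    homolog_fitness: Sequence[float],
    target_wt: Sequence[float],
) -> Tuple[List[Vector], List[float]]:
    """Hypothetical helper: build synthetic target-variant embeddings.

    For each homolog variant, compute its offset from the homolog wild type
    and add that offset to the target wild-type embedding.
    """
    if len(homolog_variants) != len(homolog_fitness):
        raise ValueError("need one fitness value per homolog variant")
    if len(homolog_wt) != len(target_wt):
        raise ValueError("wild-type embeddings must share a dimension")

    synthetic: List[Vector] = []
    for variant in homolog_variants:
        if len(variant) != len(homolog_wt):
            raise ValueError("variant embedding has the wrong dimension")
        # Difference between the homolog variant and the homolog wild type.
        delta = [v - w for v, w in zip(variant, homolog_wt)]
        # Apply the same difference to the target wild type.
        synthetic.append([t + d for t, d in zip(target_wt, delta)])

    # Assumption: labels are reused as-is; the abstract does not specify this.
    return synthetic, list(homolog_fitness)


# Toy embeddings standing in for protein language model outputs.
homolog_wt = [1.0, 2.0, 3.0]
homolog_variants = [[1.5, 2.0, 2.0], [1.0, 3.0, 3.0]]
homolog_fitness = [0.8, 0.25]
target_wt = [0.0, 1.0, 0.5]

synthetic_embeddings, synthetic_fitness = translocate(
    homolog_wt, homolog_variants, homolog_fitness, target_wt
)
```

The resulting synthetic embeddings and labels would then be appended to the target protein's own training set before fitting a prediction model.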
